<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:blogger="http://schemas.google.com/blogger/2008" xmlns:georss="http://www.georss.org/georss" xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr="http://purl.org/syndication/thread/1.0" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" gd:etag="W/&quot;A08ERHYyeCp7ImA9WhFSEk0.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257</id><updated>2013-06-14T04:36:45.890-07:00</updated><title>Probably Overthinking It</title><subtitle type="html">A blog by Allen Downey.</subtitle><link rel="http://schemas.google.com/g/2005#feed" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/posts/default" /><link rel="alternate" type="text/html" href="http://allendowney.blogspot.com/" /><link rel="next" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default?start-index=26&amp;max-results=25&amp;redirect=false&amp;v=2" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><generator version="7.00" uri="http://www.blogger.com">Blogger</generator><openSearch:totalResults>70</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/ProbablyOverthinkingIt" /><feedburner:info uri="probablyoverthinkingit" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" 
rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:emailServiceId>ProbablyOverthinkingIt</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><entry gd:etag="W/&quot;A0MHSHk4eSp7ImA9WhBaGU4.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-9064051158910643324</id><published>2013-05-30T11:43:00.004-07:00</published><updated>2013-05-30T11:43:59.731-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2013-05-30T11:43:59.731-07:00</app:edited><title>Belly Button Biodiversity: The End Game</title><content type="html">In the &lt;a href="http://allendowney.blogspot.com/2013/03/belly-button-biodiversity-part-four.html"&gt;previous installment of this saga&lt;/a&gt;, I admitted that my predictions had completely failed, and I outlined the debugging process I began. &amp;nbsp;Then the semester happened, so I didn't get to work on it again until last week.&lt;br /&gt;
&lt;br /&gt;
It turns out that there were several problems, but the algorithm is now calibrating and validating! &amp;nbsp;Before I proceed, I should explain how I am using these words:&lt;br /&gt;
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;Calibrate: Generate fake data from the same model the analysis is based on. &amp;nbsp;Run the analysis on fake data and generate predictive distributions. &amp;nbsp;Check whether the predictive distributions are correct; that is, whether their credible intervals contain the known true values about as often as they should.&lt;/li&gt;
&lt;li&gt;Validate: Using real data, generate a rarefied sample. &amp;nbsp;Run the analysis on the sample and generate predictive distributions. &amp;nbsp;Check whether the predictive distributions are correct, using the values observed in the full dataset as the actual values.&lt;/li&gt;
&lt;/ul&gt;
&lt;br /&gt;
If the analysis calibrates, but fails to validate, that suggests that there is some difference between the model and reality that is causing a problem. &amp;nbsp;And that turned out to be the case.&lt;br /&gt;
&lt;br /&gt;
Here are the problems I discovered, and what I had to do to fix them:&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;The prior distribution of prevalences&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
For the prior I used a Dirichlet distribution with all parameters set to 1. &amp;nbsp;I neglected to consider the "concentration parameter," which represents the prior belief about how uniform or concentrated the prevalences are. &amp;nbsp;As the concentration parameter approaches 0, prevalences tend to be close to 1 or 0; that is, there tends to be one dominant species and many species with small prevalences. &amp;nbsp;As the concentration parameter gets larger, all species tend to have the same prevalence. &amp;nbsp;It turns out that a concentration parameter near 0.1 yields a distribution of prevalences that resembles real data.&lt;br /&gt;
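&lt;br /&gt;
To make the effect of the concentration parameter concrete, here is a minimal sketch that uses only the Python standard library. &amp;nbsp;It is not the thinkbayes code used in the analysis; the function name random_prevalences and the values of n and conc are assumptions chosen for illustration. &amp;nbsp;It draws prevalences from a symmetric Dirichlet distribution by normalizing gamma variates and prints the share of the most prevalent species.&lt;br /&gt;
&lt;pre&gt;
```python
import random

def random_prevalences(n, conc, rng):
    """Draw n prevalences from a symmetric Dirichlet(conc, ..., conc)."""
    # normalizing independent Gamma(conc, 1) variates yields a Dirichlet draw
    gammas = [rng.gammavariate(conc, 1.0) for _ in range(n)]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(17)  # fixed seed so the run is repeatable
for conc in [0.1, 1.0, 10.0]:
    prevalences = random_prevalences(100, conc, rng)
    # share of the single most prevalent species
    print(conc, round(max(prevalences), 3))
```
&lt;/pre&gt;
With a small concentration, the largest share tends to be far bigger than 1/n; with a large concentration, all shares tend to be close to 1/n.&lt;br /&gt;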
&lt;br /&gt;
&lt;b&gt;The prior distribution of n&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
With a smaller concentration parameter, there are more species with small prevalences, so I had to increase the range of n (the number of species). &amp;nbsp;The prior distribution for n is uniform up to an upper bound, where I choose the upper bound to be big enough to avoid cutting off non-negligible probability. &amp;nbsp;I had to increase this upper bound to 1000, which slows the analysis down a little, but it still takes only a few seconds per subject (on my not-very-fast computer).&lt;br /&gt;
&lt;br /&gt;
Up to this point I hadn't discovered any real errors; it was just a matter of choosing appropriate prior distributions, which is ordinary work for Bayesian analysis.&lt;br /&gt;
&lt;br /&gt;
But it turns out there were two legitimate errors.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Bias due to the definition of "unseen"&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
I was consistently underestimating the prevalence of unseen species because of a bias that underlies the definition of "unseen." &amp;nbsp;To see the problem, consider a simple scenario where there are two species, A and B, with equal prevalence. &amp;nbsp;If I only collect one sample, I get A or B with equal probability.&lt;br /&gt;
&lt;br /&gt;
Suppose I am trying to estimate the prevalence of A. &amp;nbsp;If my sample is A, the posterior marginal distribution for the prevalence of A is Beta(2, 1), which has mean 2/3. &amp;nbsp;If the sample is B, the posterior is Beta(1, 2), which has mean 1/3. &amp;nbsp;So the expected posterior mean is the average of 2/3 and 1/3, which is 1/2. &amp;nbsp;That is the actual prevalence of A, so this analysis is unbiased.&lt;br /&gt;
&lt;br /&gt;
But now suppose I am trying to estimate the prevalence of the unseen species. &amp;nbsp;If I draw A, the unseen species is B and the posterior mean is 1/3. &amp;nbsp;If I draw B, the unseen species is A and the posterior mean is 1/3. &amp;nbsp;So either way I believe that the prevalence of the unseen species is 1/3, but it is actually 1/2. &amp;nbsp;Since I did not specify in advance which species is unseen, the result is biased.&lt;br /&gt;
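&lt;br /&gt;
The arithmetic in this two-species example can be checked with exact fractions. &amp;nbsp;This is only a sketch of the example above, assuming a uniform Beta(1, 1) prior and a single read; the helper beta_mean is introduced here for illustration.&lt;br /&gt;
&lt;pre&gt;
```python
from fractions import Fraction

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return Fraction(a, a + b)

half = Fraction(1, 2)  # probability of drawing A (or B) on the single read

# Species named in advance (A): Beta(2, 1) if the read is A, Beta(1, 2) if it is B.
expected_mean_a = half * beta_mean(2, 1) + half * beta_mean(1, 2)

# "The unseen species": whichever one was not drawn, so always Beta(1, 2).
expected_mean_unseen = half * beta_mean(1, 2) + half * beta_mean(1, 2)

print(expected_mean_a)       # 1/2
print(expected_mean_unseen)  # 1/3
```
&lt;/pre&gt;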
&lt;br /&gt;
This seems obvious in retrospect. &amp;nbsp;So that's embarrassing (the price I pay for this experiment in Open Science), but it is easy to fix:&lt;br /&gt;
&lt;br /&gt;
a) The posterior distribution I generate has the right relative prevalences for the seen species (based on the data) and the right relative prevalences for the unseen species (all the same), but the total prevalence for the unseen species (which I call q) is too low.&lt;br /&gt;
&lt;br /&gt;
b) Fortunately, there is only one part of the analysis where this bias is a problem: when I draw a sample from the posterior distribution. &amp;nbsp;To fix it, I can draw a value of q from the correct posterior distribution (just by running a forward simulation) and then unbias the posterior distribution with the selected value of q.&lt;br /&gt;
&lt;br /&gt;
Here's the code that generates q:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; def RandomQ(self, n):&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; # generate random prevalences&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; dirichlet = thinkbayes.Dirichlet(n, conc=self.conc)&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; prevalences = dirichlet.Random()&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; # generate a simulated sample&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; pmf = thinkbayes.MakePmfFromItems(enumerate(prevalences))&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; cdf = pmf.MakeCdf()&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; sample = cdf.Sample(self.num_reads)&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; seen = set(sample)&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; # add up the prevalence of unseen species&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; q = 0&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; for species, prev in enumerate(prevalences):&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; if species not in seen:&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; q += prev&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; return q&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
n is the hypothetical number of species. &amp;nbsp;conc is the concentration parameter. &amp;nbsp;RandomQ creates a Dirichlet distribution, draws a set of prevalences from it, then draws a simulated sample with the appropriate number of reads, and adds up the total prevalence of the species that don't appear in the sample.&lt;br /&gt;
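&lt;br /&gt;
For readers without thinkbayes installed, the same forward simulation can be sketched with the standard library alone. &amp;nbsp;This is an approximation of the idea, not the code from the analysis: the standalone function random_q, its explicit conc, num_reads and rng arguments, and the example values are all assumptions.&lt;br /&gt;
&lt;pre&gt;
```python
import random

def random_q(n, conc, num_reads, rng):
    """Simulate one dataset and return the total prevalence of unseen species."""
    # generate random prevalences from a symmetric Dirichlet
    gammas = [rng.gammavariate(conc, 1.0) for _ in range(n)]
    total = sum(gammas)
    prevalences = [g / total for g in gammas]

    # generate a simulated sample of species indices
    sample = rng.choices(range(n), weights=prevalences, k=num_reads)
    seen = set(sample)

    # add up the prevalence of species that never appear in the sample
    return sum(prev for species, prev in enumerate(prevalences)
               if species not in seen)

rng = random.Random(5)
print(random_q(n=200, conc=0.1, num_reads=400, rng=rng))
```
&lt;/pre&gt;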
&lt;br /&gt;
And here's the code that unbiases the posterior:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; def Unbias(self, n, m, q_desired):&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; params = self.params[:n].copy()&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; x = sum(params[:m])&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; y = sum(params[m:])&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; a = x + y&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; g = q_desired * a / y&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; f = (a - g * y) / x&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; params[:m] *= f&lt;/span&gt;&lt;br /&gt;
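&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; params[m:] *= g&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; return params&lt;/span&gt;&lt;br /&gt;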
&lt;br /&gt;
&lt;br /&gt;
n is the hypothetical number of species; m is the number seen in the actual data.&lt;br /&gt;
&lt;br /&gt;
x is the total prevalence of the seen species; y is the total prevalence of the unseen species. &amp;nbsp;f and g are the factors we have to multiply by so that the corrected prevalence of unseen species is q_desired.&lt;br /&gt;
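&lt;br /&gt;
To see why these factors work: after scaling, the unseen total is g * y = q_desired * a, and the seen total is f * x = a - g * y, so the overall total is still a and the unseen share is exactly q_desired. &amp;nbsp;The sketch below checks this on plain lists; the standalone function unbias and the example parameters are assumptions for illustration, and it requires at least one unseen species so that y is not zero.&lt;br /&gt;
&lt;pre&gt;
```python
def unbias(params, m, q_desired):
    """Rescale parameters so the unseen share of the total equals q_desired.

    params[:m] belong to seen species, params[m:] to unseen species.
    """
    x = sum(params[:m])   # total for seen species
    y = sum(params[m:])   # total for unseen species (must be nonzero)
    a = x + y

    g = q_desired * a / y   # factor for unseen species
    f = (a - g * y) / x     # factor for seen species, keeps the total at a
    return [p * f for p in params[:m]] + [p * g for p in params[m:]]

params = [5.0, 3.0, 1.0, 1.0]   # hypothetical: 2 seen species, 2 unseen
new_params = unbias(params, m=2, q_desired=0.4)
print(sum(new_params))                       # total is unchanged
print(sum(new_params[2:]) / sum(new_params)) # unseen share
```
&lt;/pre&gt;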
&lt;br /&gt;
After fixing this error, I find that the analysis calibrates nicely.&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-MNkthILZa2E/UadvD8qx8yI/AAAAAAAABE8/3IjI3v2lq7Y/s1600/species5-cal.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://2.bp.blogspot.com/-MNkthILZa2E/UadvD8qx8yI/AAAAAAAABE8/3IjI3v2lq7Y/s400/species5-cal.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
From each predictive distribution I generate credible intervals with ideal percentages 10, 20, ... 90, and then count how often the actual value falls in each interval.&lt;br /&gt;
&lt;br /&gt;
For example, the blue line is the calibration curve for n, the number of species. &amp;nbsp;After 100 runs, the 10% credible interval contains the actual value 9.5% of the time. &amp;nbsp;The 50% credible interval contains the actual value 51.5% of the time. &amp;nbsp;And the 90% credible interval contains the actual value 88% of the time. &amp;nbsp;These results show that the posterior distribution for n is, in fact, the posterior distribution for n.&lt;br /&gt;
&lt;br /&gt;
The results are similar for q, the prevalence of unseen species, and l, the predicted number of new species seen after additional sampling.&lt;br /&gt;
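&lt;br /&gt;
A calibration curve of this kind could be computed roughly as follows. &amp;nbsp;This is a sketch under assumptions, not the code behind the figure: it assumes each predictive distribution is available as a list of samples, it uses central credible intervals, and the names central_interval and calibration_curve are made up for the example. &amp;nbsp;The fake runs draw the actual value from the same distribution as the predictive samples, so the observed percentages should land near the ideal ones.&lt;br /&gt;
&lt;pre&gt;
```python
import random

def central_interval(sample, percent):
    """Return (low, high) bounds of the central interval covering percent of sample."""
    ordered = sorted(sample)
    tail = (100 - percent) / 200.0
    last = len(ordered) - 1
    return ordered[int(tail * last)], ordered[int((1 - tail) * last)]

def calibration_curve(runs, percents):
    """For each ideal percentage, the percentage of runs whose interval holds the actual value.

    runs is a list of (predictive_sample, actual_value) pairs.
    """
    curve = []
    for percent in percents:
        hits = 0
        for sample, actual in runs:
            low, high = central_interval(sample, percent)
            if high >= actual >= low:
                hits += 1
        curve.append((percent, 100.0 * hits / len(runs)))
    return curve

# Fake runs in which the actual value really comes from the predictive distribution.
rng = random.Random(42)
runs = [([rng.gauss(0, 1) for _ in range(500)], rng.gauss(0, 1))
        for _ in range(200)]
for percent, observed in calibration_curve(runs, range(10, 100, 10)):
    print(percent, observed)
```
&lt;/pre&gt;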
&lt;br /&gt;
To check whether the analysis validates, I used the dataset collected by the Belly Button Biodiversity project. &amp;nbsp;For each subject with more than 400 reads, I chose a random subset of 100 reads, ran the analysis, and checked the predictive distributions for q and l. &amp;nbsp;I can't check the predictive distribution of n, because I don't know the actual value.&lt;br /&gt;
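&lt;br /&gt;
The subsampling step might look like the sketch below. &amp;nbsp;The list of species labels is made-up data, and the function rarefy is an assumption for illustration, not the project's code. &amp;nbsp;It splits the reads into a random subsample and a held-out remainder, then counts the species that appear only in the held-out reads, which plays the role of the actual value of l.&lt;br /&gt;
&lt;pre&gt;
```python
import random

def rarefy(reads, k, rng):
    """Split reads into a random subsample of size k and the held-out rest."""
    shuffled = list(reads)
    rng.shuffle(shuffled)
    return shuffled[:k], shuffled[k:]

# hypothetical subject: each read is a species label
reads = ['A'] * 300 + ['B'] * 80 + ['C'] * 15 + ['D'] * 4 + ['E']

subsample, held_out = rarefy(reads, 100, random.Random(3))
seen = set(subsample)

# species that show up only after the additional sampling
new_species = set(held_out) - seen
print(len(seen), len(new_species))
```
&lt;/pre&gt;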
&lt;br /&gt;
Sadly, the analysis does not validate with the collected data. &amp;nbsp;The reason is:&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;The data do not fit the model&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
The data deviate substantially from the model that underlies the analysis. &amp;nbsp;To see this, I tried this experiment:&lt;br /&gt;
&lt;br /&gt;
a) Use the data to estimate the parameters of the model.&lt;br /&gt;
b) Generate fake samples from the model.&lt;br /&gt;
c) Compare the fake samples to the real data.&lt;br /&gt;
&lt;br /&gt;
Here's a typical result:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-91lHHbiXJPk/UaeGjug4xJI/AAAAAAAABFM/Yp4Zh4rMFME/s1600/species-cdf-B1558.G.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://3.bp.blogspot.com/-91lHHbiXJPk/UaeGjug4xJI/AAAAAAAABFM/Yp4Zh4rMFME/s400/species-cdf-B1558.G.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
The blue line is the CDF of prevalences, in order by rank. &amp;nbsp;The top-ranked species makes up about 27% of the sample. &amp;nbsp;The top 10 species make up about 75%, and the top 100 species make up about 90%.&lt;br /&gt;
&lt;br /&gt;
The green lines show CDFs from 10 fake samples. &amp;nbsp;The model is a good match for the data for the first 10-20 species, but then it deviates substantially. &amp;nbsp;The prevalence of rare species is higher in the data than in the model.&lt;br /&gt;
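&lt;br /&gt;
A rank-ordered CDF like the ones in this figure can be computed from a list of reads as sketched below. &amp;nbsp;The function name rank_cdf and the toy reads are assumptions for illustration; this is not the plotting code behind the figure.&lt;br /&gt;
&lt;pre&gt;
```python
from collections import Counter
from itertools import accumulate

def rank_cdf(reads):
    """Cumulative fraction of reads covered by the top 1, 2, 3, ... species."""
    counts = sorted(Counter(reads).values(), reverse=True)
    total = sum(counts)
    return [c / total for c in accumulate(counts)]

reads = ['A'] * 5 + ['B'] * 3 + ['C'] * 2   # toy data
print(rank_cdf(reads))  # [0.5, 0.8, 1.0]
```
&lt;/pre&gt;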
&lt;br /&gt;
The problem is that the real data seem to come from a mixture of two distributions, one for dominant species and one for rare species. &amp;nbsp;Among the dominant species the concentration parameter is near 0.1. &amp;nbsp;For the rare species, it is much higher; that is, the rare species all have about the same prevalence.&lt;br /&gt;
&lt;br /&gt;
There are two possible explanations: this effect might be real or it might be an artifact of errors in identifying reads. &amp;nbsp;If it's real, I would have to extend my model to account for it. &amp;nbsp;If it is due to errors, it might be possible to clean the data.&lt;br /&gt;
&lt;br /&gt;
I have heard from biologists that when a large sample yields only a single read of a particular species, that read is likely to be in error; that is, the identified species might not actually be present.&lt;br /&gt;
&lt;br /&gt;
So I explored a simple error model with the following features:&lt;br /&gt;
&lt;br /&gt;
1) If a species appears only once after r reads, the probability that the read is bogus is p = (1 - alpha/r), where alpha is a parameter.&lt;br /&gt;
&lt;br /&gt;
2) If a species appears k times after r reads, the probability that all k reads are bogus is p^k.&lt;br /&gt;
&lt;br /&gt;
To clean the data, I compute the probability that each observed species is bogus, and then delete it with the computed probability.&lt;br /&gt;
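&lt;br /&gt;
The cleaning step could be sketched like this. &amp;nbsp;It follows the two rules above, but the function names, the dictionary of counts, and the clamp that keeps p from going negative when alpha exceeds the number of reads are assumptions added for the example.&lt;br /&gt;
&lt;pre&gt;
```python
import random

def prob_bogus(k, r, alpha):
    """Probability that all k reads of a species are bogus, given r total reads."""
    p = max(0.0, 1 - alpha / r)   # clamp is an assumption for alpha > r
    return p ** k

def clean(counts, alpha, rng):
    """Delete each species with probability prob_bogus; counts maps species to reads."""
    r = sum(counts.values())
    cleaned = {}
    for species, k in counts.items():
        # keep the species unless the random draw falls below its bogus probability
        if rng.random() >= prob_bogus(k, r, alpha):
            cleaned[species] = k
    return cleaned

counts = {'A': 300, 'B': 50, 'C': 1, 'D': 1}   # hypothetical subject
print(clean(counts, alpha=50, rng=random.Random(9)))
```
&lt;/pre&gt;
Species with many reads are almost never deleted, because p^k shrinks quickly as k grows; singletons are deleted with probability p.&lt;br /&gt;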
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-dUBoDWG8GtY/UaeKoJS-neI/AAAAAAAABFc/Q9g9nNL-v4U/s1600/species-cdf-B1558.G.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-dUBoDWG8GtY/UaeKoJS-neI/AAAAAAAABFc/Q9g9nNL-v4U/s320/species-cdf-B1558.G.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
With cleaned data (alpha=50), the model fits very nicely. &amp;nbsp;And since the model fits, and the analysis calibrates, we expect the analysis to validate. &amp;nbsp;And it does.&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-81--OM9Q4eY/UaeUPsiWrwI/AAAAAAAABFs/E59AEGbYTyE/s1600/species5-val.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-81--OM9Q4eY/UaeUPsiWrwI/AAAAAAAABFs/E59AEGbYTyE/s320/species5-val.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
For n there is no validation curve because we don't know the actual values. &lt;br /&gt;
&lt;br /&gt;
For q the validation curve is a little off because we only have a lower bound for the prevalence of unseen species, so the actual values used for validation are too high.&lt;br /&gt;
&lt;br /&gt;
But for l the validation curve is quite good, and that's what we are actually trying to predict, after all.&lt;br /&gt;
&lt;br /&gt;
At this point the analysis depends on two free parameters, the concentration parameter and the cleaning parameter, alpha, which controls how much of the data gets discarded as erroneous.&lt;br /&gt;
&lt;br /&gt;
So the next step is to check whether these parameters cross-validate. &amp;nbsp;That is, if we tune the parameters based on a training set, how well do those values do on a test set?&lt;br /&gt;
&lt;br /&gt;
Another next step is to improve the error model. &amp;nbsp;I chose something very simple, and it does a nice job of getting the data to conform to the analysis model, but it is not well motivated. &amp;nbsp;If I can get more information about where the errors are coming from, I could take a Bayesian approach (what else?) and compute the probability that each datum is legit or bogus.&lt;br /&gt;
&lt;br /&gt;
Or if the data are legit and the prevalences are drawn from a mixture of Dirichlet distributions with different concentrations, I will have to extend the analysis accordingly.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Summary&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
There were four good reasons my predictions failed:&lt;br /&gt;
&lt;br /&gt;
1) The prior distribution of prevalences had the wrong concentration parameter.&lt;br /&gt;
&lt;br /&gt;
2) The prior distribution of n was too narrow.&lt;br /&gt;
&lt;br /&gt;
3) I neglected an implicit bias due to the definition of "unseen species."&lt;br /&gt;
&lt;br /&gt;
4) The data deviate from the model the analysis is based on. &amp;nbsp;If we "clean" the data, they fit the model and the analysis validates, but the cleaning process is a bit of a hack.&lt;br /&gt;
&lt;br /&gt;
I was able to solve these problems, but I had to introduce two free parameters, so my algorithm is not as versatile as I hoped. &amp;nbsp;However, it seems like it should be possible to choose these parameters automatically, which would be an improvement.&lt;br /&gt;
&lt;br /&gt;
And now I have to stop, incorporate these corrections into &lt;i&gt;&lt;a href="http://thinkbayes.com/"&gt;Think Bayes&lt;/a&gt;&lt;/i&gt;, and then finish the manuscript!&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/FqdxDWsr1nc" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/9064051158910643324/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2013/05/belly-button-biodiversity-end-game.html#comment-form" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/9064051158910643324?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/9064051158910643324?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/FqdxDWsr1nc/belly-button-biodiversity-end-game.html" title="Belly Button Biodiversity: The End Game" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/-MNkthILZa2E/UadvD8qx8yI/AAAAAAAABE8/3IjI3v2lq7Y/s72-c/species5-cal.png" height="72" width="72" /><thr:total>1</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2013/05/belly-button-biodiversity-end-game.html</feedburner:origLink></entry><entry gd:etag="W/&quot;C0MGSXw-eip7ImA9WhBaGUs.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-2135202565084357532</id><published>2013-05-28T13:09:00.000-07:00</published><updated>2013-05-30T17:50:28.252-07:00</updated><app:edited 
xmlns:app="http://www.w3.org/2007/app">2013-05-30T17:50:28.252-07:00</app:edited><title>Python Epistemology at PyCon Taiwan</title><content type="html">This weekend I gave a talk entitled "Python Epistemology" for PyCon Taiwan 2013. &amp;nbsp;I would have loved to be in Taipei for the talk, but sadly I was in an empty room in front of a teleconference screen.&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Python Epistemology: PyCon Taiwan 2013&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;div class="sites-embed-align-left-wrapping-off"&gt;
&lt;div class="sites-embed-border-on sites-embed" style="width: 410px;"&gt;
&lt;div class="sites-embed-object-title" style="display: none;"&gt;
&lt;span style="font-size: small;"&gt;Python Epistemology: PyCon Taiwan 2013&lt;/span&gt;&lt;/div&gt;
&lt;div class="sites-embed-content sites-embed-type-punch"&gt;
&lt;iframe allowfullscreen="true" frameborder="0" height="337" id="375413249" mozallowfullscreen="true" src="https://docs.google.com/presentation/d/1xEim-cnkUORU_tLBT1P-wnJ78xU_lbydAOdkrszps_M/embed?authuser=0&amp;amp;hl=en&amp;amp;size=s" title="Python Epistemology: PyCon Taiwan 2013" webkitallowfullscreen="true" width="410"&gt;&lt;/iframe&gt;&lt;/div&gt;
&lt;div class="sites-embed-footer"&gt;
&lt;div class="sites-embed-footer-icon sites-punch-icon"&gt;
&lt;/div&gt;
&lt;span style="font-size: small;"&gt;&lt;a href="https://docs.google.com/presentation/d/1xEim-cnkUORU_tLBT1P-wnJ78xU_lbydAOdkrszps_M/edit?authuser=0" target="_blank"&gt;Open &lt;i&gt;Python Epistemology: PyCon Taiwan 2013&lt;/i&gt;&lt;/a&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;span style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
As I explained, the title is more grandiose than accurate. &amp;nbsp;In general, epistemology is the theory of knowledge: how we know what we think we know, etc. &amp;nbsp;This talk is mostly about what Python has taught me about programming, and how programming in Python has changed the way I learn and the way I think.&lt;br /&gt;
&lt;br /&gt;
About programming, I wrote:&lt;br /&gt;
&lt;br /&gt;
&lt;b id="docs-internal-guid-6076abdb-eca0-300d-3d0d-049e0325f9fa" style="font-weight: normal;"&gt;&lt;span style="color: #0b5394; font-family: Arial; font-size: 40px; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;
&lt;div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt;"&gt;
&lt;b id="docs-internal-guid-6076abdb-eca0-300d-3d0d-049e0325f9fa" style="font-weight: normal;"&gt;&lt;span style="color: #0b5394;"&gt;&lt;span style="font-family: Arial; vertical-align: baseline; white-space: pre-wrap;"&gt;Programming is not about translating a well-known solution into code, it is about &lt;/span&gt;&lt;span style="font-family: Arial; vertical-align: baseline; white-space: pre-wrap;"&gt;discovering solutions&lt;/span&gt;&lt;span style="font-family: Arial; vertical-align: baseline; white-space: pre-wrap;"&gt; by writing code, and then &lt;/span&gt;&lt;span style="font-family: Arial; vertical-align: baseline; white-space: pre-wrap;"&gt;creating the language &lt;/span&gt;&lt;span style="font-family: Arial; vertical-align: baseline; white-space: pre-wrap;"&gt;to ex&lt;/span&gt;&lt;/span&gt;&lt;span style="color: #0b5394; font-family: Arial; vertical-align: baseline; white-space: pre-wrap;"&gt;press them.&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
I gave an example using the Counter data structure to check for anagrams:&lt;br /&gt;
&lt;br /&gt;
&lt;b id="docs-internal-guid-49cec35d-eca5-1727-986a-2455e93a534d" style="font-weight: normal;"&gt;&lt;/b&gt;&lt;br /&gt;
&lt;div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt;"&gt;
&lt;b id="docs-internal-guid-49cec35d-eca5-1727-986a-2455e93a534d" style="font-weight: normal;"&gt;&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; vertical-align: baseline; white-space: pre-wrap;"&gt;from collections import Counter&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;
&lt;b id="docs-internal-guid-49cec35d-eca5-1727-986a-2455e93a534d" style="font-weight: normal;"&gt;&lt;span style="font-family: Courier New, Courier, monospace;"&gt;&lt;br /&gt;&lt;span style="color: #38761d; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;
&lt;div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt;"&gt;
&lt;b id="docs-internal-guid-49cec35d-eca5-1727-986a-2455e93a534d" style="font-weight: normal;"&gt;&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; vertical-align: baseline; white-space: pre-wrap;"&gt;def is_anagram(word1, word2):&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;
&lt;b id="docs-internal-guid-49cec35d-eca5-1727-986a-2455e93a534d" style="font-weight: normal;"&gt;&lt;span style="font-family: Courier New, Courier, monospace;"&gt;
&lt;/span&gt;&lt;div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt;"&gt;
&lt;span style="font-family: Courier New, Courier, monospace;"&gt;&lt;span style="color: #38761d; vertical-align: baseline; white-space: pre-wrap;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp;return Counter(word1) == &lt;/span&gt;&lt;b id="docs-internal-guid-49cec35d-eca5-1727-986a-2455e93a534d" style="font-weight: normal;"&gt;&lt;span style="color: #38761d; vertical-align: baseline; white-space: pre-wrap;"&gt;Counter(word2)&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This is a nice solution because it is concise and demonstrably correct, but I suggested that one limitation is that it does not extend easily to handle "The Scrabble Problem": given a set of tiles, check to see whether you can spell a given word.&lt;br /&gt;
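&lt;br /&gt;
A quick illustration of that limitation, using example words chosen here: Counter equality answers the anagram question, but it says nothing useful when the tiles include letters the word does not need.&lt;br /&gt;
&lt;pre&gt;
```python
from collections import Counter

def is_anagram(word1, word2):
    return Counter(word1) == Counter(word2)

print(is_anagram('listen', 'silent'))  # True
print(is_anagram('cat', 'tacos'))      # False, although the tiles t-a-c-o-s can spell "cat"
```
&lt;/pre&gt;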
&lt;br /&gt;
We can define a new class, called Multiset, that extends Counter and provides&amp;nbsp;&lt;b id="docs-internal-guid-49cec358-eca8-4c62-68e7-1bd391b04a37" style="font-weight: normal;"&gt;&lt;/b&gt;&lt;br /&gt;
&lt;div dir="ltr" style="display: inline !important; line-height: 1; margin-bottom: 0pt; margin-top: 6pt;"&gt;
&lt;b id="docs-internal-guid-49cec358-eca8-4c62-68e7-1bd391b04a37" style="font-weight: normal;"&gt;&lt;span style="color: #38761d; font-family: 'Ubuntu Mono'; vertical-align: baseline; white-space: pre-wrap;"&gt;is_subset&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;
:&lt;br /&gt;
&lt;br /&gt;
&lt;b id="docs-internal-guid-49cec358-eca8-4c62-68e7-1bd391b04a37" style="font-weight: normal;"&gt;&lt;/b&gt;&lt;br /&gt;
&lt;div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt;"&gt;
&lt;b id="docs-internal-guid-49cec358-eca8-4c62-68e7-1bd391b04a37" style="font-weight: normal;"&gt;&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; vertical-align: baseline; white-space: pre-wrap;"&gt;class Multiset(Counter):&lt;br class="kix-line-break" /&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp;"""A set with repeated elements."""&lt;br class="kix-line-break" /&gt;&lt;br class="kix-line-break" /&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp;def is_subset(self, other):&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;
&lt;div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt;"&gt;
&lt;b id="docs-internal-guid-49cec358-eca8-4c62-68e7-1bd391b04a37" style="font-weight: normal;"&gt;&lt;span style="color: #38761d; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span style="font-family: Courier New, Courier, monospace;"&gt;&lt;br class="kix-line-break" /&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;for char, count in self.items():&lt;br class="kix-line-break" /&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;if other[char] &amp;lt; count:&lt;br class="kix-line-break" /&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;return False&lt;br class="kix-line-break" /&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;return True&lt;/span&gt;&lt;span style="font-family: 'Ubuntu Mono'; font-size: x-large;"&gt;&lt;br class="kix-line-break" /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b id="docs-internal-guid-49cec358-eca8-4c62-68e7-1bd391b04a37" style="font-weight: normal;"&gt;&lt;span style="color: #38761d; font-family: 'Ubuntu Mono'; font-size: 32px; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div&gt;
Now we can write&amp;nbsp;&lt;b id="docs-internal-guid-49cec31a-eca9-dd70-d516-a8c4ac4562cd" style="font-weight: normal;"&gt;&lt;span style="color: #38761d; font-family: 'Ubuntu Mono'; vertical-align: baseline; white-space: pre-wrap;"&gt;can_spell&lt;/span&gt;&lt;/b&gt;&amp;nbsp;concisely:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b id="docs-internal-guid-49cec31a-eca9-dd70-d516-a8c4ac4562cd" style="font-weight: normal;"&gt;&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; vertical-align: baseline; white-space: pre-wrap;"&gt;def can_spell(word, tiles):&lt;br class="kix-line-break" /&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp;return Multiset(word).is_subset(Multiset(tiles))&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;
&lt;br /&gt;
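As a quick check, here is a minimal, self-contained sketch that restates the two definitions above and tries them on a few inputs. &amp;nbsp;The sample words and tiles are illustrative assumptions, not examples from the talk, and the loop is condensed into a single expression that should behave the same way.&lt;br /&gt;

```python
from collections import Counter


class Multiset(Counter):
    """A set with repeated elements."""

    def is_subset(self, other):
        # Every element of self must appear in other at least as often.
        # A Counter returns 0 for a missing key, so no KeyError can occur.
        return all(other[char] >= count for char, count in self.items())


def can_spell(word, tiles):
    return Multiset(word).is_subset(Multiset(tiles))


# Hypothetical inputs chosen for illustration.
results = [
    can_spell("tab", "abstract"),  # all three letters are available
    can_spell("bat", "tar"),       # no "b" among the tiles
    can_spell("aa", "a"),          # needs two "a" tiles, only one present
]
print(results)
```

The third case is the one that a plain set comparison would get wrong, which is why the counts matter.&lt;br /&gt;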
&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="font-family: inherit;"&gt;I summarized by quoting Paul Graham:&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b id="docs-internal-guid-44258302-ecaa-03f6-5c1a-4a23594cbb05" style="font-weight: normal;"&gt;&lt;/b&gt;&lt;br /&gt;
&lt;div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt;"&gt;
&lt;b id="docs-internal-guid-44258302-ecaa-03f6-5c1a-4a23594cbb05" style="font-weight: normal;"&gt;&lt;span style="color: #0b5394; font-family: Arial; font-style: italic; vertical-align: baseline; white-space: pre-wrap;"&gt;"... you do&lt;/span&gt;&lt;span style="color: #0b5394;"&gt;&lt;span style="font-family: Arial; font-style: italic; vertical-align: baseline; white-space: pre-wrap;"&gt;n't just write your program down toward the language, you also &lt;/span&gt;&lt;span style="font-family: Arial; font-style: italic; vertical-align: baseline; white-space: pre-wrap;"&gt;build the language up&lt;/span&gt;&lt;span style="font-family: Arial; font-style: italic; vertical-align: baseline; white-space: pre-wrap;"&gt; toward your program.&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;
&lt;b id="docs-internal-guid-44258302-ecaa-03f6-5c1a-4a23594cbb05" style="font-weight: normal;"&gt;
&lt;span style="color: #0b5394;"&gt;&lt;br /&gt;&lt;span style="font-family: Arial; font-style: italic; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;div dir="ltr" style="line-height: 1; margin-bottom: 0pt; margin-top: 6pt;"&gt;
&lt;span style="color: #0b5394;"&gt;&lt;span style="font-family: Arial; font-style: italic; vertical-align: baseline; white-space: pre-wrap;"&gt;"In the end your program will look as if the language had been designed for it. And ... you end up with code which is &lt;/span&gt;&lt;span style="font-family: Arial; font-style: italic; vertical-align: baseline; white-space: pre-wrap;"&gt;clear, small, and efficient&lt;/span&gt;&lt;span style="font-family: Arial; font-style: italic; vertical-align: baseline; white-space: pre-wrap;"&gt;."&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;br /&gt;&lt;span style="color: #0b5394; font-family: Arial; font-style: italic; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;span style="color: #0b5394; font-family: Arial; vertical-align: baseline; white-space: pre-wrap;"&gt;Paul Graham, "Programming Bottom Up," 1993.&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="font-family: inherit;"&gt;In the second half of the talk, I suggested that Python and other modern programming languages provide a new approach to solving problems. &amp;nbsp;Traditionally, we tend to think and explore using natural language, do analysis and solve problems using mathematical notation, and then translate solutions from math notation into programming languages.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="font-family: inherit;"&gt;In some sense, we are always doing two translations, from natural language to math and from math to a computer program. &amp;nbsp;With the previous generation of programming languages, this process was probably necessary (for reasons I explained), but I claim that it is less necessary now, and that it might be possible and advantageous to skip the intermediate mathematics and do analysis and problem-solving directly in programming languages.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="font-family: inherit;"&gt;After the talk, I got two interesting questions by email. &amp;nbsp;Yung-Cheng Lin suggested that although programming languages are more precise than natural language, they might not be sufficiently precise to replace mathematical notation, and he asked if I think that using programming to teach mathematical concepts might cause misunderstandings for students.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="font-family: inherit;"&gt;I replied:&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="background-color: white; font-family: arial, sans-serif;"&gt;&lt;i&gt;&lt;span style="color: #0b5394;"&gt;I understand what you mean when you say that programming languages are less rigorous that mathematical notation. &amp;nbsp;I think many people have the same impression, but I wonder if it is a real difference or a bias we have.&lt;/span&gt;&lt;/i&gt;&lt;/span&gt;&lt;br /&gt;
&lt;div style="background-color: white; font-family: arial, sans-serif;"&gt;
&lt;i&gt;&lt;span style="color: #0b5394;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/i&gt;&lt;/div&gt;
&lt;div style="background-color: white; font-family: arial, sans-serif;"&gt;
&lt;i&gt;&lt;span style="color: #0b5394;"&gt;I would argue that programming languages and math notation are similar in the sense that they are both formal languages designed by people to express particular ideas concisely and precisely.&lt;/span&gt;&lt;/i&gt;&lt;/div&gt;
&lt;div style="background-color: white; font-family: arial, sans-serif;"&gt;
&lt;i&gt;&lt;span style="color: #0b5394;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/i&gt;&lt;/div&gt;
&lt;div style="background-color: white; font-family: arial, sans-serif;"&gt;
&lt;i&gt;&lt;span style="color: #0b5394;"&gt;There are some kinds of work that are easier to do in math notation, like algebraic manipulation, but other kinds of work that are easier in programming languages, like specifying computations, especially computations with state.&lt;/span&gt;&lt;/i&gt;&lt;/div&gt;
&lt;div style="background-color: white; font-family: arial, sans-serif;"&gt;
&lt;i&gt;&lt;span style="color: #0b5394;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/i&gt;&lt;/div&gt;
&lt;div style="background-color: white; font-family: arial, sans-serif;"&gt;
&lt;i&gt;&lt;span style="color: #0b5394;"&gt;You asked if there is a danger that students might misunderstand mathematical ideas if they come to them through programming, rather than mathematical instruction. &amp;nbsp;I'm sure it's possible, but I don't think the danger is specific to the programming approach.&lt;/span&gt;&lt;/i&gt;&lt;/div&gt;
&lt;div style="background-color: white; font-family: arial, sans-serif;"&gt;
&lt;i&gt;&lt;span style="color: #0b5394;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/i&gt;&lt;/div&gt;
&lt;div style="background-color: white; font-family: arial, sans-serif;"&gt;
&lt;i&gt;&lt;span style="color: #0b5394;"&gt;And on the other side, I think a computational approach to mathematical topics creates opportunities for deeper understanding by running experiments, and (as I said in the talk) by getting your ideas out of your head and into a program so that, by debugging the program, you are also debugging your own understanding.&lt;/span&gt;&lt;/i&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 12.800000190734863px;"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style="background-color: white;"&gt;
&lt;div style="color: black;"&gt;
&lt;span style="font-family: inherit;"&gt;In response to some of my comments about pseudocode, A. T. Cheng wrote:&lt;/span&gt;&lt;/div&gt;
&lt;div style="color: black; font-family: 'Times New Roman'; font-size: medium;"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style="font-family: 'Times New Roman';"&gt;
&lt;span style="font-family: arial, sans-serif;"&gt;&lt;i&gt;&lt;span style="color: #0b5394;"&gt;When we do algorithms or pseudocodes in the traditional way, we used to figure out the time complexity at the same time. But the Python examples you showed us, it seems not so easy to learn the time complexity in the first place. So, does it mean that when we think Python, we don't really care about the time complexity that much?&lt;/span&gt;&lt;/i&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="color: #222222; font-family: arial, sans-serif; font-size: 12.800000190734863px;"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style="color: #222222;"&gt;
&lt;div style="color: black;"&gt;
&lt;span style="font-family: inherit;"&gt;I replied:&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div style="color: #222222; font-family: arial, sans-serif; font-size: 12.800000190734863px;"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style="font-family: arial, sans-serif;"&gt;
&lt;div&gt;
&lt;i&gt;&lt;span style="color: #0b5394;"&gt;You are right that it can be more difficult to analyze a Python program; you have to know a lot about how the Python data structures are implemented. &amp;nbsp;And there are some gotchas; for example, it takes constant time to add elements to the end of a list, but linear time to add elements in the beginning or the middle.&lt;/span&gt;&lt;/i&gt;&lt;/div&gt;
&lt;div&gt;
&lt;i&gt;&lt;span style="color: #0b5394;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/i&gt;&lt;/div&gt;
&lt;div&gt;
&lt;i&gt;&lt;span style="color: #0b5394;"&gt;It would be better if Python made these performance characteristics part of the interface, but they are not. &amp;nbsp;In fact, some implementations have changed over time; for example, the += operator on lists used to create a new list. &amp;nbsp;Now it is equivalent to append.&lt;/span&gt;&lt;/i&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div style="color: #222222; font-family: arial, sans-serif; font-size: 12.800000190734863px;"&gt;
&lt;br /&gt;&lt;/div&gt;
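To make the point about += concrete, here is a small sketch (the variable names are mine, for illustration only). &amp;nbsp;It suggests that, for the built-in list type in current Python, += modifies the list in place, as the extend method does, while a = a + b builds a new list.&lt;br /&gt;

```python
# Illustrative names; the behavior shown is for the built-in list type.
a = [1, 2]
alias = a              # a second reference to the same list object
a += [3]               # in place: behaves like a.extend([3])
in_place = alias is a  # still the same object

b = [1, 2]
alias_b = b
b = b + [3]                 # builds a new list and rebinds the name b
rebound = alias_b is not b  # the old list is left untouched

print(in_place, rebound, alias, alias_b)
```

Checking object identity like this is one way to find out which operations copy, since the interface itself does not say.&lt;br /&gt;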
&lt;div style="color: #222222;"&gt;
&lt;div style="color: black;"&gt;
&lt;span style="font-family: inherit;"&gt;Thanks to both of my correspondents for these questions (and for permission to quote them). &amp;nbsp;And thanks to the organizers of PyCon Taiwan, especially Albert Chun-Chieh Huang, for inviting me to speak. &amp;nbsp;I really enjoyed it.&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div style="color: #222222; font-family: arial, sans-serif; font-size: 12.800000190734863px;"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/iyyxDy25910" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/2135202565084357532/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2013/05/python-epistemology-at-pycon-taiwan.html#comment-form" title="5 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/2135202565084357532?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/2135202565084357532?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/iyyxDy25910/python-epistemology-at-pycon-taiwan.html" title="Python Epistemology at PyCon Taiwan" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><thr:total>5</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2013/05/python-epistemology-at-pycon-taiwan.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CkUBQXY5fCp7ImA9WhBbEk8.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-3870167850612108696</id><published>2013-05-09T10:35:00.002-07:00</published><updated>2013-05-10T14:10:50.824-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2013-05-10T14:10:50.824-07:00</app:edited><title>The Red Line problem</title><content type="html">I've just added a new chapter to &lt;a href="http://thinkbayes.com/"&gt;&lt;i&gt;Think Bayes&lt;/i&gt;&lt;/a&gt;; it is a case study based on a class project two of my students worked on this semester. 
&amp;nbsp;It presents "The Red Line Problem," which is the problem of predicting the time until the next train arrives, based on the number of passengers on the platform.&lt;br /&gt;
&lt;br /&gt;
Here's the introduction:&lt;br /&gt;
&lt;br /&gt;
&lt;blockquote class="tr_bq"&gt;
&lt;i&gt;In Boston, the Red Line is a subway that runs north-south from Cambridge to Boston. &amp;nbsp;When I was working in Cambridge I took the Red Line from Kendall Square to South Station and caught the commuter rail to Needham. &amp;nbsp;During rush hour Red Line trains run every 7-8 minutes, on average.&lt;/i&gt;&lt;/blockquote&gt;
&lt;blockquote class="tr_bq"&gt;
&lt;i&gt;When I arrived at the station, I could estimate the time until the next train based on the number of passengers on the platform. &amp;nbsp;If there were only a few people, I inferred that I just missed a train and expected to wait about 7 minutes. &amp;nbsp;If there were more passengers, I expected the train to arrive sooner. &amp;nbsp;But if there were a large number of passengers, I suspected that trains were not running on schedule, so I would go back to the street level and get a taxi.&lt;/i&gt;&amp;nbsp;&lt;/blockquote&gt;
&lt;blockquote class="tr_bq"&gt;
&lt;i&gt;While I was waiting for trains, I thought about how Bayesian estimation could help predict my wait time and decide when I should give up and take a taxi. &amp;nbsp;This chapter presents the analysis I came up with.&lt;/i&gt;&lt;/blockquote&gt;
&lt;br /&gt;
&lt;br /&gt;
Sadly, this problem has been overtaken by history: the Red Line now provides real-time estimates for the arrival of the next train. &amp;nbsp;But I think the analysis is interesting, and still applies for subway systems that don't provide estimates.&lt;br /&gt;
&lt;br /&gt;
One interesting tidbit:&lt;br /&gt;
&lt;br /&gt;
&lt;blockquote class="tr_bq"&gt;
&lt;i&gt;As it turns out, the average time between trains, as seen by a random passenger, is substantially higher than the true average.&lt;/i&gt;&amp;nbsp;&lt;/blockquote&gt;
&lt;blockquote class="tr_bq"&gt;
&lt;i&gt;Why? Because a passenger is more likely to arrive during a large interval than a small one. Consider a simple example: suppose that the time between trains is either 5 minutes or 10 minutes with equal probability. In that case the average time between trains is 7.5 minutes.&lt;br /&gt;But a passenger is more likely to arrive during a 10 minute gap than a 5 minute gap; in fact, twice as likely. If we surveyed arriving passengers, we would find that 2/3 of them arrived during a 10 minute gap, and only 1/3 during a 5 minute gap. So the average time between trains, as seen by an arriving passenger, is 8.33 minutes.&lt;/i&gt;&amp;nbsp;&lt;/blockquote&gt;
&lt;blockquote class="tr_bq"&gt;
&lt;i&gt;This kind of &lt;b&gt;observer bias&lt;/b&gt; appears in many contexts. Students think that classes are bigger than they are, because more of them are in the big classes. Airline passengers think that planes are fuller than they are, because more of them are on full flights.&lt;/i&gt;&amp;nbsp;&lt;/blockquote&gt;
&lt;blockquote class="tr_bq"&gt;
&lt;i&gt;In each case, values from the actual distribution are oversampled in proportion to their value. In the Red Line example, a gap that is twice as big is twice as likely to be observed.&lt;/i&gt;&lt;/blockquote&gt;
&lt;br /&gt;
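The arithmetic in the 5-minute and 10-minute example can be reproduced with a short sketch. &amp;nbsp;This is not the code from the chapter; the variable names are assumptions made for illustration, and exact fractions are used so the 1/3 and 2/3 weights show up directly.&lt;br /&gt;

```python
from fractions import Fraction

# Gap lengths in minutes and their probabilities, from the example above.
gaps = {5: Fraction(1, 2), 10: Fraction(1, 2)}

# Mean gap computed over trains (the "true" average).
true_mean = sum(gap * p for gap, p in gaps.items())

# A passenger lands in a gap with probability proportional to gap * p,
# so each gap is oversampled in proportion to its length.
weights = {gap: gap * p for gap, p in gaps.items()}
total = sum(weights.values())
biased = {gap: w / total for gap, w in weights.items()}

# Mean gap as seen by an arriving passenger.
biased_mean = sum(gap * p for gap, p in biased.items())

print(float(true_mean), biased, float(biased_mean))
```

The same reweighting step should apply to any discrete distribution of gaps, not just this two-valued one.&lt;br /&gt;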
&lt;br /&gt;
The data for the Red Line are close to this example. &amp;nbsp;The actual average time between trains is 7.6 minutes (based on 45 trains that arrived at Kendall Square between 4pm and 6pm so far this week). &amp;nbsp;The average gap as seen by random passengers is 8.3 minutes.&lt;br /&gt;
&lt;br /&gt;
Interestingly, &lt;a href="http://www.mbta.com/uploadedFiles/Documents/Schedules_and_Maps/Subway/frequency-schedule.pdf"&gt;the Red Line schedule&lt;/a&gt; reports that trains run every 9 minutes during peak times. This is close to the average seen by passengers, but higher than the true average. I wonder if they are deliberately reporting the mean as seen by passengers in order to forestall complaints.&lt;br /&gt;
&lt;br /&gt;
You can &lt;a href="http://www.greenteapress.com/thinkbayes/html/thinkbayes009.html"&gt;read the rest of the chapter here&lt;/a&gt;. &amp;nbsp;One of the figures there didn't render very well. &amp;nbsp;Here is a prettier version:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-b5Z0llS4T30/UYvbOzE5PVI/AAAAAAAABD4/YzzlFHzYD1o/s1600/redline4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://4.bp.blogspot.com/-b5Z0llS4T30/UYvbOzE5PVI/AAAAAAAABD4/YzzlFHzYD1o/s400/redline4.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
This figure shows the predictive distribution of wait times if you arrive and find 15 passengers on the platform. &amp;nbsp;Since we don't know the passenger arrival rate, we have to estimate it. &amp;nbsp;Each possible arrival rate yields one of the light blue lines; the dark blue line is the weighted mixture of the light blue lines.&lt;br /&gt;
&lt;br /&gt;
So in this scenario, you expect the next train in 5 minutes or less, with 80% confidence.&lt;br /&gt;
&lt;br /&gt;
UPDATE 10 May 2013: I got the following note from developer@mbta.com, confirming that their reported gap between trains is deliberately conservative:&lt;br /&gt;
&lt;br /&gt;
&lt;blockquote class="tr_bq"&gt;
&lt;i&gt;Thank you for writing to let us know about the Red Line case study in your book, and thank you for your question. You’re right that the scheduled time between trains listed on the subway schedule card for rush hour is greater than what you observed at Kendall Square. It’s meant as a slightly conservative simplification of the actual frequency of trains, which varies by time throughout rush hour – to provide maximum capacity during the very peak of rush hour when ridership is normally highest – as well as by location along the Red Line during those different times, since when trains begin to leave more frequently from Alewife it takes time for that frequency to “travel” down the line. So yes it is meant to be slightly conservative for that reason. We hope this information answers your question.&lt;br /&gt;&lt;br /&gt;Sincerely,&lt;br /&gt;developer@mbta.com&lt;/i&gt;&lt;/blockquote&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/KOkjuBPOPDQ" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/3870167850612108696/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2013/05/the-red-line-problem.html#comment-form" title="4 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/3870167850612108696?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/3870167850612108696?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/KOkjuBPOPDQ/the-red-line-problem.html" title="The Red Line problem" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://4.bp.blogspot.com/-b5Z0llS4T30/UYvbOzE5PVI/AAAAAAAABD4/YzzlFHzYD1o/s72-c/redline4.png" height="72" width="72" /><thr:total>4</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2013/05/the-red-line-problem.html</feedburner:origLink></entry><entry gd:etag="W/&quot;AkEBRXY8fip7ImA9WhBUGEs.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-2138959314852224027</id><published>2013-05-06T12:28:00.001-07:00</published><updated>2013-05-06T12:30:54.876-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2013-05-06T12:30:54.876-07:00</app:edited><title>Software engineering practices for graduate students</title><content type="html">Recently I was talking with an Olin student who will start 
graduate school in the fall, and I suggested a few things I wish I had done in grad school. &amp;nbsp;And then I thought I should write them down. &amp;nbsp;So here is my list of &lt;b&gt;Software Engineering Practices All Graduate Students Should Adopt&lt;/b&gt;:&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;&lt;a href="http://en.wikipedia.org/wiki/Revision_control"&gt;Version Control&lt;/a&gt;&lt;/b&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Every keystroke you type should be under version control from the time you initiate a project until you retire it. &amp;nbsp;Here are the reasons:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
1) Everything you do will be backed up. &amp;nbsp;But instead of organizing your backups by date (which is what most backup systems do) they are organized by revision. &amp;nbsp;So, for example, if you break something, you can roll back to an earlier working revision.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
2) When you are collaborating with other people, you can share repositories. &amp;nbsp;Version control systems are well designed for managing this kind of collaboration. &amp;nbsp;If you are emailing documents back and forth, you are doing it wrong.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
3) At various stages of the project, you can save a tagged copy of the repo. &amp;nbsp;For example, when you submit a paper for publication, make a tagged copy. &amp;nbsp;You can keep working on the trunk, and when you get reviewer comments (or a question 5 years later) you have something to refer back to.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
I use Subversion (SVN) primarily, so I keep many of my projects on Google Code (if they are open source) or on my own SVN server. &amp;nbsp;But these days it seems like all the cool kids are using Git and keeping their repositories on GitHub.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Either way, find a version control system you like, learn how to use it, and find someplace to host your repository.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;&lt;a href="http://en.wikipedia.org/wiki/Build_automation"&gt;Build Automation&lt;/a&gt;&lt;/b&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
This goes hand in hand with version control. &amp;nbsp;If someone checks out your repository, they should be able to rebuild your project by running a single command. &amp;nbsp;That means that everything someone needs to replicate your results should be in the repo, and you should have scripts that process the data, generate figures and tables, and integrate them into your papers, slides, and other documents.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
One simple tool for automating the build is Make. &amp;nbsp;Every directory in your project should contain a Makefile. &amp;nbsp;The top-level directory should contain the Makefile that runs all the others.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
If you use GUI-based tools to process data, it might not be easy to automate your build. &amp;nbsp;But it will be worth it. &amp;nbsp;The night before your paper is due, you will find a bug somewhere in your data flow. &amp;nbsp;If you've done things right, you should be able to rebuild the paper with just five keystrokes (m-a-k-e, and Enter).&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Also, put a README in the top-level directory that documents the directory structure and the build process. &amp;nbsp;If your build depends on other software, include it in the repo if practical; otherwise provide a list of required packages.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Or, if your software environment is not easy to replicate, put your whole development environment in a virtual machine and ship the VM.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;&lt;a href="http://en.wikipedia.org/wiki/Agile_software_development"&gt;Agile Development&lt;/a&gt;&lt;/b&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
For many people, the most challenging part of grad school is time management. &amp;nbsp;If you are an undergraduate taking 4-5 classes, you can do deadline-driven scheduling; that is, you can work on whatever task is due next and you will probably get everything done on time.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
In grad school, you have more responsibility for how you spend your time and fewer deadlines to guide you. &amp;nbsp;It is easy to lose track of what you are doing, waste time doing things that are not important (see &lt;a href="http://en.wiktionary.org/wiki/yak_shaving"&gt;Yak Shaving&lt;/a&gt;), and neglect the things that move you toward the goal of graduation.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
One of the purposes of agile development tools is to help people decide what to do next. &amp;nbsp;They provide several features that apply to grad school as well as software development:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
1) They encourage planners to divide large tasks into smaller tasks that have a clearly-defined end condition.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
2) They maintain a priority-ranking of tasks so that when you complete one you can start work on the next, or one of the next few.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
3) They provide mechanisms for collaborating with a team and for getting feedback from an adviser.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
4) They involve planning on at least two time scales. &amp;nbsp;On a daily basis you decide what to work on by selecting tasks from the backlog. &amp;nbsp;On a weekly (or longer) basis, you create and reorder tasks, and decide which ones you should work on during the next cycle.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
If you use GitHub or Google Code for version control, you get an issue tracker as part of the deal. &amp;nbsp;You can use issue trackers for agile planning, but there are other tools, like Pivotal Tracker, that have more of the agile methodology built in. &amp;nbsp;I suggest you start with Pivotal Tracker because it has excellent documentation, but you might have to try out a few tools to find one you like.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Do these things -- Version Control, Build Automation, and Agile Development -- and you will get through grad school in less than the average time, with less than the average drama.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/16B2seXKV7Y" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/2138959314852224027/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2013/05/software-engineering-practices-for.html#comment-form" title="3 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/2138959314852224027?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/2138959314852224027?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/16B2seXKV7Y/software-engineering-practices-for.html" title="Software engineering practices for graduate students" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><thr:total>3</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2013/05/software-engineering-practices-for.html</feedburner:origLink></entry><entry gd:etag="W/&quot;C0QDQ3o8cSp7ImA9WhBVGEw.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-8340678536826082311</id><published>2013-04-24T06:49:00.000-07:00</published><updated>2013-04-24T06:49:32.479-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2013-04-24T06:49:32.479-07:00</app:edited><title>The Price is Right Problem: Part Two</title><content type="html">&lt;br /&gt;
This article is an excerpt from&amp;nbsp;&lt;i&gt;Think Bayes&lt;/i&gt;, a book I am working on. &amp;nbsp;The entire current draft is available from&amp;nbsp;&lt;a href="http://thinkbayes.com/"&gt;http://thinkbayes.com&lt;/a&gt;. &amp;nbsp;I welcome comments and suggestions.&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
In &lt;a href="http://allendowney.blogspot.com/2013/04/the-price-is-right-problem.html"&gt;the previous article&lt;/a&gt;, I described presented &lt;i&gt;The Price is Right&lt;/i&gt; problem and a Bayesian approach to estimating the value of a showcase of prizes. &amp;nbsp;This article picks up from there...&lt;/div&gt;
&lt;br /&gt;
&lt;span style="font-size: large;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/span&gt;
&lt;span style="font-size: large;"&gt;&lt;b&gt;Optimal bidding&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
Now that we have a posterior distribution, we can use it to
compute the optimal bid, which I define as the bid that maximizes
expected gain.&lt;br /&gt;
&lt;br /&gt;
To compute optimal bids, I wrote a class called &lt;tt&gt;GainCalculator&lt;/tt&gt;:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;&lt;span style="color: #38761d;"&gt;class GainCalculator(object):

    def __init__(self, player, opponent):
        self.player = player
        self.opponent = opponent&lt;/span&gt;
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;&lt;/pre&gt;
&lt;tt&gt;player&lt;/tt&gt; and &lt;tt&gt;opponent&lt;/tt&gt; are &lt;tt&gt;Player&lt;/tt&gt; objects.&lt;br /&gt;
&lt;br /&gt;
&lt;tt&gt;GainCalculator&lt;/tt&gt; provides &lt;tt&gt;ExpectedGains&lt;/tt&gt;, which
computes a sequence of bids and the expected gain for each
bid:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;    &lt;span style="color: #38761d;"&gt;def ExpectedGains(self, low=0, high=75000, n=101):
        bids = numpy.linspace(low, high, n)

        gains = [self.ExpectedGain(bid) for bid in bids]

        return bids, gains&lt;/span&gt;
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;&lt;/pre&gt;
&lt;tt&gt;low&lt;/tt&gt; and &lt;tt&gt;high&lt;/tt&gt; specify the range of possible bids;
&lt;tt&gt;n&lt;/tt&gt; is the number of bids to try. Here is the function
that computes expected gain for a given bid:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;    &lt;span style="color: #38761d;"&gt;def ExpectedGain(self, bid):
        suite = self.player.posterior
        total = 0
        for price, prob in suite.Items():
            gain = self.Gain(bid, price)
            total += prob * gain
        return total&lt;/span&gt;
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;&lt;/pre&gt;
&lt;tt&gt;ExpectedGain&lt;/tt&gt; loops through the values in the posterior
and computes the gain for the given bid under each possible price of
the showcase. It weights each gain by the corresponding
probability and returns the total.&lt;br /&gt;
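&lt;br /&gt;
To make the weighted sum concrete, here is a minimal standalone sketch. It is not part of &lt;tt&gt;thinkbayes&lt;/tt&gt;: a plain dictionary stands in for the posterior &lt;tt&gt;Suite&lt;/tt&gt;, and &lt;tt&gt;toy_gain&lt;/tt&gt; is a hypothetical gain function that ignores the opponent entirely.&lt;br /&gt;
&lt;br /&gt;
```python
def expected_gain(bid, posterior, gain_func):
    """Weight the gain at each possible price by its posterior probability."""
    total = 0.0
    for price, prob in posterior.items():
        total += prob * gain_func(bid, price)
    return total


def toy_gain(bid, price):
    """Hypothetical gain: win the showcase unless the bid exceeds the price."""
    if bid > price:
        return 0
    return price


# Toy posterior mapping price to probability (made-up values that sum to 1).
posterior = {20000: 0.25, 25000: 0.5, 30000: 0.25}

low_bid = expected_gain(18000, posterior, toy_gain)
high_bid = expected_gain(26000, posterior, toy_gain)
```
&lt;br /&gt;
With these made-up numbers, the low bid never goes over, so its expected gain is the posterior mean of the price, 25000. The high bid wins only when the price is 30000, so its expected gain is 7500.&lt;br /&gt;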
&lt;tt&gt;&lt;br /&gt;&lt;/tt&gt;
&lt;tt&gt;Gain&lt;/tt&gt; takes a bid and an actual price and returns 
the expected gain:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;    &lt;span style="color: #38761d;"&gt;def Gain(self, bid, price):
        if bid &amp;gt; price:
            return 0

        diff = price - bid
        prob = self.ProbWin(diff)

        if diff &amp;lt;= 250:
            return 2 * price * prob
        else:
            return price * prob
&lt;/span&gt;&lt;/pre&gt;
&lt;pre class="verbatim"&gt;&lt;/pre&gt;
If you overbid, you get nothing. Otherwise we compute 
the difference between your bid and the price, which determines
your probability of winning.&lt;br /&gt;
&lt;br /&gt;
If &lt;tt&gt;diff&lt;/tt&gt; is $250 or less, you win both showcases. For
simplicity, I assume that both showcases have the same price. Since
this outcome is rare, it doesn’t make much difference.&lt;br /&gt;
&lt;br /&gt;
Finally, we have to compute the probability of winning based
on &lt;tt&gt;diff&lt;/tt&gt;:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;    &lt;span style="color: #38761d;"&gt;def ProbWin(self, diff):
        prob = (self.opponent.ProbOverbid() + 
                self.opponent.ProbWorseThan(diff))
        return prob&lt;/span&gt;
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;&lt;/pre&gt;
If your opponent overbids, you win. Otherwise, you have to hope
that your opponent is off by more than &lt;tt&gt;diff&lt;/tt&gt;. &lt;tt&gt;Player&lt;/tt&gt;
provides methods to compute both probabilities:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;&lt;span style="color: #38761d;"&gt;# class Player:

    def ProbOverbid(self):
        return self.cdf_diff.Prob(-1)

    def ProbWorseThan(self, diff):
        return 1 - self.cdf_diff.Prob(diff)&lt;/span&gt;
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;&lt;/pre&gt;
This code might be confusing because the computation is now from
the point of view of the opponent, who is computing, “What is
the probability that I overbid?” and “What is the probability
that my bid is off by more than &lt;tt&gt;diff&lt;/tt&gt;?”&lt;br /&gt;
&lt;br /&gt;
Both answers are based on the CDF of &lt;tt&gt;diff&lt;/tt&gt;&amp;nbsp;[&lt;a href="http://www.greenteapress.com/thinkbayes/html/thinkbayes004.html#toc22"&gt;CDFs are described here&lt;/a&gt;]. &amp;nbsp;If your opponent’s
&lt;tt&gt;diff&lt;/tt&gt; is less than or equal to -1, you win. If your opponent’s
&lt;tt&gt;diff&lt;/tt&gt; is worse than yours, you win. Otherwise you lose.&lt;br /&gt;
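&lt;br /&gt;
The two probabilities can also be read directly off a list of past diffs. The following is a sketch under that assumption, using NumPy rather than the &lt;tt&gt;Cdf&lt;/tt&gt; class; the function names and the opponent history are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
import numpy


def prob_overbid(diffs):
    """Fraction of past diffs that are negative (the bid was above the price)."""
    diffs = numpy.asarray(diffs)
    return numpy.mean(0 > diffs)


def prob_worse_than(diffs, diff):
    """Fraction of past diffs strictly greater than diff."""
    diffs = numpy.asarray(diffs)
    return numpy.mean(diffs > diff)


def prob_win(opponent_diffs, diff):
    """You win if the opponent overbids or underbids by more than diff."""
    return prob_overbid(opponent_diffs) + prob_worse_than(opponent_diffs, diff)


# Hypothetical opponent history of price - bid, in dollars.
opponent_diffs = [-1500, -200, 300, 900, 2500, 4000, 7000, 12000]

p = prob_win(opponent_diffs, 1000)
```
&lt;br /&gt;
In this made-up history the opponent overbids 2 times out of 8 and is off by more than $1,000 on the low side 4 times out of 8, so &lt;tt&gt;p&lt;/tt&gt; is 0.75. The two events cannot both happen, which is why their probabilities can be added.&lt;br /&gt;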
&lt;br /&gt;
Finally, here’s the code that computes optimal bids:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;&lt;span style="color: #38761d;"&gt;# class Player:

    def OptimalBid(self, guess, opponent):
        self.MakeBeliefs(guess)
        calc = GainCalculator(self, opponent)
        bids, gains = calc.ExpectedGains()
        gain, bid = max(zip(gains, bids))
        return bid, gain&lt;/span&gt;
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;&lt;/pre&gt;
Given a guess and an opponent, &lt;tt&gt;OptimalBid&lt;/tt&gt; computes
the posterior distribution, instantiates a &lt;tt&gt;GainCalculator&lt;/tt&gt;,
computes expected gains for a range of bids and returns
the optimal bid and expected gain. Whew!&lt;br /&gt;
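&lt;br /&gt;
The search itself is a plain grid search. Here is a minimal sketch of that step on its own, using &lt;tt&gt;numpy.argmax&lt;/tt&gt; instead of &lt;tt&gt;max(zip(...))&lt;/tt&gt;; &lt;tt&gt;toy_expected_gain&lt;/tt&gt; is a hypothetical curve chosen only so that the answer is easy to check.&lt;br /&gt;
&lt;br /&gt;
```python
import numpy


def optimal_bid(expected_gain, low=0, high=75000, n=101):
    """Evaluate expected_gain on an evenly spaced grid and return the best bid."""
    bids = numpy.linspace(low, high, n)
    gains = numpy.array([expected_gain(bid) for bid in bids])
    best = numpy.argmax(gains)
    return bids[best], gains[best]


def toy_expected_gain(bid):
    """Hypothetical curve that peaks at a bid of 21000."""
    return 16700 - ((bid - 21000) / 100.0) ** 2


bid, gain = optimal_bid(toy_expected_gain)
```
&lt;br /&gt;
With the default arguments the grid spacing is $750, so a search like this can only locate the optimum to within that resolution; a larger &lt;tt&gt;n&lt;/tt&gt; gives a finer answer at a higher cost.&lt;br /&gt;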
&lt;br /&gt;
&lt;br /&gt;
&lt;blockquote class="figure"&gt;
&lt;div class="center"&gt;
&lt;hr size="2" width="80%" /&gt;
&lt;/div&gt;
&lt;div class="center"&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-2WRT18H1eog/UXfhdweDVgI/AAAAAAAABDY/LgBiODDR1zM/s1600/price5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://1.bp.blogspot.com/-2WRT18H1eog/UXfhdweDVgI/AAAAAAAABDY/LgBiODDR1zM/s400/price5.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="caption"&gt;
&lt;table cellpadding="0" cellspacing="6" style="text-align: center;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style="text-align: left;" valign="top"&gt;&lt;div style="text-align: center;"&gt;
Figure 6.4&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;a href="http://www.blogger.com/blogger.g?blogID=6894866515532737257" name="fig.price5"&gt;&lt;/a&gt;&lt;br /&gt;
&lt;div class="center"&gt;
&lt;hr size="2" style="text-align: center;" width="80%" /&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
&lt;br /&gt;
&lt;br /&gt;
Figure&amp;nbsp;&lt;a href="http://www.blogger.com/blogger.g?blogID=6894866515532737257#fig.price5"&gt;6.4&lt;/a&gt; shows the results for both players,
based on a scenario where Player 1’s best guess is $20,000
and Player 2’s best guess is $40,000.&lt;br /&gt;
&lt;br /&gt;
For Player 1 the optimal bid is $21,000, yielding an expected
return of almost $16,700. This is a case (which turns out
to be unusual) where the optimal bid is actually higher than
the contestant’s best guess.&lt;br /&gt;
&lt;br /&gt;
For Player 2 the optimal bid is $31,500, yielding an expected
return of almost $19,400. This is the more typical case where
the optimal bid is less than the best guess.&lt;br /&gt;
&lt;br /&gt;
&lt;span style="font-size: large;"&gt;&lt;b&gt;Discussion&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-size: large;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/span&gt;
One of the most useful features of Bayesian estimation is that the
result comes in the form of a posterior distribution. Classical
estimation usually generates a single point estimate or a confidence
interval, which is sufficient if estimation is the last step in the
process, but if you want to use an estimate as an input to a
subsequent analysis, point estimates and intervals are often not much
help.&lt;br /&gt;
&lt;br /&gt;
In this example, the Bayesian analysis yields a posterior distribution
we can use to compute an optimal bid. The gain function is asymmetric
and discontinuous (if you overbid, you lose), so it would be hard to
solve this problem analytically. But it is relatively simple to do
computationally.&lt;br /&gt;
&lt;br /&gt;
Newcomers to Bayesian thinking are often tempted to summarize the
posterior distribution by computing the mean or the maximum
likelihood estimate. These summaries can be useful, but if that’s
all you need, then you probably don’t need Bayesian methods in the
first place.&lt;br /&gt;
&lt;br /&gt;
Bayesian methods are most useful when you can carry the posterior
distribution into the next step of the process to perform some
kind of optimization, as we did in this chapter, or some kind of
prediction, as we will see in the next chapter [&lt;a href="http://www.greenteapress.com/thinkbayes/html/thinkbayes008.html"&gt;which you can read here&lt;/a&gt;].&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/UDB0hCM2mr0" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/8340678536826082311/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2013/04/the-price-is-right-problem-part-two.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/8340678536826082311?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/8340678536826082311?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/UDB0hCM2mr0/the-price-is-right-problem-part-two.html" title="The Price is Right Problem: Part Two" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/-2WRT18H1eog/UXfhdweDVgI/AAAAAAAABDY/LgBiODDR1zM/s72-c/price5.png" height="72" width="72" /><thr:total>0</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2013/04/the-price-is-right-problem-part-two.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CU4ERXg-eSp7ImA9WhBVGEQ.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-8682325935361655867</id><published>2013-04-22T11:36:00.001-07:00</published><updated>2013-04-25T05:45:04.651-07:00</updated><app:edited 
xmlns:app="http://www.w3.org/2007/app">2013-04-25T05:45:04.651-07:00</app:edited><title>The Price is Right Problem</title><content type="html">This article is an excerpt from &lt;i&gt;Think Bayes&lt;/i&gt;, a book I am working on. &amp;nbsp;The entire current draft is available from &lt;a href="http://thinkbayes.com/"&gt;http://thinkbayes.com&lt;/a&gt;. &amp;nbsp;I welcome comments and suggestions.&lt;br /&gt;
&lt;b&gt;&lt;span style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;
&lt;b&gt;&lt;span style="font-size: large;"&gt;&lt;i&gt;The Price is Right&lt;/i&gt; Problem&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
On November 1, 2007, contestants named Letia and Nathaniel appeared
on &lt;i&gt;The Price is Right&lt;/i&gt;, an American game show. They competed in
a game called the Showcase, where the objective is to guess the price
of a showcase of prizes. The contestant who comes closest to the
actual price of the showcase, without going over, wins the prizes.&lt;br /&gt;
&lt;br /&gt;
Nathaniel went first. His showcase included a dishwasher, a wine
cabinet, a laptop computer, and a car. He bid $26,000.&lt;br /&gt;
&lt;br /&gt;
Letia’s showcase included a pinball machine, a video arcade game, a
pool table, and a cruise of the Bahamas. She bid $21,500.&lt;br /&gt;
&lt;br /&gt;
The actual price of Nathaniel’s showcase was $25,347. His bid
was too high, so he lost.&lt;br /&gt;
&lt;br /&gt;
The actual price of Letia’s showcase was $21,578. She was only
off by $78, so she won her showcase and, because
her bid was off by less than $250, she also won Nathaniel’s
showcase.&lt;br /&gt;
&lt;br /&gt;
For a Bayesian thinker, this scenario suggests several questions:&lt;br /&gt;
&lt;ol class="enumerate" type="1"&gt;
&lt;li class="li-enumerate"&gt;Before seeing the prizes, what prior beliefs should the
contestant have about the price of the showcase?&lt;/li&gt;
&lt;li class="li-enumerate"&gt;After seeing the prizes, how should the contestant update
those prior beliefs?&lt;/li&gt;
&lt;li class="li-enumerate"&gt;Based on the posterior distribution, what should the
contestant bid?&lt;/li&gt;
&lt;/ol&gt;
The third question demonstrates a common use of Bayesian analysis:
optimization. Given a posterior distribution, we can choose
the bid that maximizes the contestant’s expected return.&lt;br /&gt;
&lt;br /&gt;
This problem is inspired by an example in Cameron Davidson-Pilon’s
book, &lt;i&gt;Bayesian Methods for Hackers&lt;/i&gt;.&lt;br /&gt;
&lt;h2 class="section"&gt;
&lt;span style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/h2&gt;
&lt;h2 class="section"&gt;
&lt;span style="font-size: large;"&gt;&lt;a href="http://www.blogger.com/blogger.g?blogID=6894866515532737257" name="toc36"&gt;&lt;/a&gt;The prior&lt;/span&gt;&amp;nbsp;&lt;/h2&gt;
&lt;div class="separator" style="clear: both; text-align: left;"&gt;
To choose a prior distribution of prices, we can take advantage
of data from previous episodes. Fortunately, fans of the show
keep detailed records. When I corresponded with Mr.&amp;nbsp;Davidson-Pilon
about his book, he sent me data collected by Steve Gee at
&lt;tt&gt;&lt;a href="http://tpirsummaries.8m.com/"&gt;http://tpirsummaries.8m.com&lt;/a&gt;&lt;/tt&gt;. It includes the price of
each showcase from the 2011 and 2012 seasons, and the bids
offered by the contestants.&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-8gpay15rCVc/UXWAd2k0IWI/AAAAAAAABC4/KDePtDVKh4o/s1600/price1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://2.bp.blogspot.com/-8gpay15rCVc/UXWAd2k0IWI/AAAAAAAABC4/KDePtDVKh4o/s400/price1.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;blockquote class="figure"&gt;
&lt;div class="caption"&gt;
&lt;table cellpadding="0" cellspacing="6"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align="left" valign="top"&gt;Figure 6.1: Distribution of prices for showcases on&amp;nbsp;&lt;i&gt;The Price is Right&lt;/i&gt;, 2011-12.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;a href="http://www.blogger.com/blogger.g?blogID=6894866515532737257" name="fig.price1"&gt;&lt;/a&gt;&lt;br /&gt;
&lt;div class="center"&gt;
&lt;hr size="2" width="80%" /&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
&lt;br /&gt;
Figure&amp;nbsp;&lt;a href="http://www.blogger.com/blogger.g?blogID=6894866515532737257#fig.price1"&gt;6.1&lt;/a&gt; shows the distribution of prices for these
showcases. The most common value for both showcases is around
$28,000, but the first showcase has a second mode near $50,000,
and the second showcase is occasionally worth more than $70,000.&lt;br /&gt;
&lt;br /&gt;
These distributions are based on actual data, but they
have been smoothed by Gaussian kernel density estimation (KDE).
So before we go on, I want to take a detour to talk about 
probability density functions and KDE.&lt;br /&gt;
&lt;h2 class="section"&gt;
&lt;span style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/h2&gt;
&lt;h2 class="section"&gt;
&lt;span style="font-size: large;"&gt;Probability density functions&lt;/span&gt;&lt;/h2&gt;
So far [in &lt;i&gt;Think Bayes&lt;/i&gt;, that is] we have been working with probability mass functions, or PMFs.
A PMF is a mapping from each possible value to its probability. In my
implementation, a &lt;tt&gt;Pmf&lt;/tt&gt; object provides a method named &lt;tt&gt;Prob&lt;/tt&gt; that
takes a value and returns a probability, also known as a “probability
mass.”&lt;br /&gt;
&lt;br /&gt;
A probability density function, or PDF, is the continuous version of a
PMF, where the possible values make up a continuous range rather than
a discrete set. &lt;br /&gt;
&lt;br /&gt;
In mathematical notation, PDFs are usually written as functions; for
example, here is the PDF of a Gaussian distribution with
mean 0 and standard deviation 1:
&lt;br /&gt;
&lt;table class="display dcenter"&gt;&lt;tbody&gt;
&lt;tr valign="middle"&gt;&lt;td class="dcell"&gt;&lt;i&gt;f&lt;/i&gt;(&lt;i&gt;x&lt;/i&gt;)&amp;nbsp;=&amp;nbsp;(1&amp;nbsp;/&amp;nbsp;√(2π))&amp;nbsp;exp(−&lt;i&gt;x&lt;/i&gt;&lt;sup&gt;2&lt;/sup&gt;&amp;nbsp;/&amp;nbsp;2)&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;br /&gt;
To represent this PDF in Python, I could define a class like this:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;import math

class StandardGaussianPdf(object):

    def Density(self, x):
        # Gaussian density with mean 0 and standard deviation 1
        return math.exp(-x**2 / 2.0) / math.sqrt(2 * math.pi)
&lt;/pre&gt;
&lt;tt&gt;&lt;br /&gt;&lt;/tt&gt;
&lt;tt&gt;Density&lt;/tt&gt; takes a value, &lt;tt&gt;x&lt;/tt&gt;, and returns the probability
density evaluated at &lt;tt&gt;x&lt;/tt&gt;. &lt;br /&gt;
&lt;br /&gt;
A probability density is similar
to a probability mass in one way: higher density indicates that a
value is more likely.&lt;br /&gt;
&lt;br /&gt;
But a density is not a probability. If you integrate a density
over a continuous range, the result is a probability. But 
for the applications in this book we seldom have to do that.&lt;br /&gt;
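&lt;br /&gt;
To illustrate the difference between a density and a probability, here is a small sketch that integrates the standard Gaussian density, written out with its normalizing constant, over the range from -1 to 1. The midpoint-rule helper &lt;tt&gt;integrate&lt;/tt&gt; is a hypothetical stand-in for a proper numerical integration routine.&lt;br /&gt;
&lt;br /&gt;
```python
import math


def standard_gaussian_density(x):
    """Density of a Gaussian with mean 0 and standard deviation 1."""
    return math.exp(-x**2 / 2.0) / math.sqrt(2 * math.pi)


def integrate(func, low, high, n=10000):
    """Approximate the integral of func on [low, high] with the midpoint rule."""
    width = (high - low) / n
    total = 0.0
    for i in range(n):
        total += func(low + (i + 0.5) * width)
    return total * width


density_at_zero = standard_gaussian_density(0)
prob_within_one_sigma = integrate(standard_gaussian_density, -1, 1)
```
&lt;br /&gt;
The density at 0 is about 0.399, which is not the probability of any event. The integral is about 0.683, which is the probability that a value falls within one standard deviation of the mean.&lt;br /&gt;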
&lt;br /&gt;
In this book we primarily use probability densities as part
of a &lt;tt&gt;Likelihood&lt;/tt&gt; function. We will see an example soon.&lt;br /&gt;
&lt;h2 class="section"&gt;
&lt;span style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/h2&gt;
&lt;h2 class="section"&gt;
&lt;span style="font-size: large;"&gt;Representing PDFs&lt;/span&gt;&lt;/h2&gt;
Before we get back to &lt;i&gt;The Price is Right&lt;/i&gt;, I want to
present a more general way to represent PDFs.&lt;br /&gt;
&lt;tt&gt;thinkbayes.py&lt;/tt&gt; provides a class named &lt;tt&gt;Pdf&lt;/tt&gt; that defines
two functions, &lt;tt&gt;Density&lt;/tt&gt; and &lt;tt&gt;MakePmf&lt;/tt&gt;:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;class Pdf(object):

    def Density(self, x):
        raise UnimplementedMethodException()

    def MakePmf(self, xs):
        pmf = Pmf()
        for x in xs:
            pmf.Set(x, self.Density(x))
        pmf.Normalize()
        return pmf
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;&lt;/pre&gt;
&lt;tt&gt;Pdf&lt;/tt&gt; is an &lt;b&gt;abstract type&lt;/b&gt;, which means that it defines
the interface a Pdf is supposed to have, but does not provide
a complete implementation. Specifically, &lt;tt&gt;Pdf&lt;/tt&gt; provides
&lt;tt&gt;MakePmf&lt;/tt&gt; but not &lt;tt&gt;Density&lt;/tt&gt;. &amp;nbsp;[PMFs are described in &lt;a href="http://www.greenteapress.com/thinkbayes/html/thinkbayes003.html#toc10"&gt;Chapter 2 of Think Bayes&lt;/a&gt;.]&lt;br /&gt;
&lt;br /&gt;
A &lt;b&gt;concrete type&lt;/b&gt; is a class that extends an abstract parent
class and provides an implementation of the missing methods.&lt;br /&gt;
&lt;br /&gt;
For example, &lt;tt&gt;GaussianPdf&lt;/tt&gt; extends &lt;tt&gt;Pdf&lt;/tt&gt; and provides
&lt;tt&gt;Density&lt;/tt&gt;:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;class GaussianPdf(Pdf):

    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma
        
    def Density(self, x):
        density = scipy.stats.norm.pdf(x, 
                                       loc=self.mu, 
                                       scale=self.sigma)
        return density
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;&lt;/pre&gt;
&lt;code&gt;__init__&lt;/code&gt; takes &lt;tt&gt;mu&lt;/tt&gt; and &lt;tt&gt;sigma&lt;/tt&gt;, which are
the mean and standard deviation of the distribution.&lt;br /&gt;
&lt;tt&gt;Density&lt;/tt&gt; uses a function from &lt;tt&gt;scipy&lt;/tt&gt; to evaluate the
Gaussian PDF.&lt;br /&gt;
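&lt;br /&gt;
As a rough illustration of what &lt;tt&gt;MakePmf&lt;/tt&gt; does with a concrete density, the sketch below evaluates a Gaussian density on a grid and normalizes the result. The helper names &lt;tt&gt;make_pmf&lt;/tt&gt; and &lt;tt&gt;gaussian_density&lt;/tt&gt; are hypothetical, and a dictionary stands in for the &lt;tt&gt;Pmf&lt;/tt&gt; class.&lt;br /&gt;
&lt;br /&gt;
```python
import numpy
import scipy.stats


def make_pmf(density, xs):
    """Evaluate density at each x and normalize so the values sum to 1."""
    values = numpy.array([density(x) for x in xs])
    probs = values / values.sum()
    return dict(zip(xs, probs))


def gaussian_density(x, mu=0.0, sigma=1.0):
    """Gaussian density evaluated with scipy.stats.norm.pdf."""
    return scipy.stats.norm.pdf(x, loc=mu, scale=sigma)


# Discretize the density on an evenly spaced grid from -4 to 4.
xs = numpy.linspace(-4, 4, 81)
pmf = make_pmf(gaussian_density, xs)
total = sum(pmf.values())
most_likely = max(pmf, key=pmf.get)
```
&lt;br /&gt;
After normalization the values sum to 1, so they can be treated as probability masses, and the most likely grid point is the one closest to the mean.&lt;br /&gt;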
&lt;br /&gt;
The Gaussian PDF is defined by a simple mathematical function,
so it is easy to evaluate. And it is useful because many
quantities in the real world have distributions that are
approximately Gaussian.&lt;br /&gt;
But with real data, there is no guarantee that the PDF
is Gaussian, or any other simple mathematical function. In
that case we can use a data sample to estimate the PDF of
the whole population.&lt;br /&gt;
&lt;br /&gt;
For example, in &lt;i&gt;The Price is Right&lt;/i&gt; data, we have
313 prices for the first showcase. We can think of these
values as a sample from the population of all possible showcase
prices.&lt;br /&gt;
&lt;br /&gt;
Near the middle of the distribution, we see the following values:
&lt;br /&gt;
&lt;br /&gt;
&lt;table class="display dcenter"&gt;&lt;tbody&gt;
&lt;tr valign="middle"&gt;&lt;td class="dcell"&gt;28800,&amp;nbsp;28868,&amp;nbsp;28941,&amp;nbsp;28957,&amp;nbsp;28958&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;br /&gt;
In the sample, no values appear between 28801 and 28867, but
there is no reason to think that these values are impossible.
Based on our background information, we would expect all
values in this range to be equally likely. In other words,
we expect the PDF to be reasonably smooth.&lt;br /&gt;
&lt;br /&gt;
Kernel density estimation (KDE) is an algorithm that takes
a sample of values and finds an appropriately smooth PDF that fits 
the data. You can read about the details at
&lt;tt&gt;&lt;a href="http://en.wikipedia.org/wiki/Kernel_density_estimation"&gt;http://en.wikipedia.org/wiki/Kernel_density_estimation&lt;/a&gt;&lt;/tt&gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;tt&gt;scipy&lt;/tt&gt; provides an implementation of KDE. &lt;tt&gt;thinkbayes&lt;/tt&gt;
provides a class called &lt;tt&gt;EstimatedPdf&lt;/tt&gt; that extends &lt;tt&gt;Pdf&lt;/tt&gt;
and uses KDE:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;class EstimatedPdf(Pdf):

    def __init__(self, sample):
        xs = numpy.array(sample, dtype=numpy.double)
        self.kde = scipy.stats.gaussian_kde(xs)

    def Density(self, x):
        return self.kde.evaluate(x)
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;&lt;/pre&gt;
&lt;code&gt;__init__&lt;/code&gt; takes a sample, converts it to a NumPy array,
and computes a kernel density estimate. The result is a
&lt;code&gt;gaussian_kde&lt;/code&gt; object that provides an &lt;tt&gt;evaluate&lt;/tt&gt;
method.&lt;br /&gt;
&lt;br /&gt;
&lt;tt&gt;Density&lt;/tt&gt; takes a value, calls &lt;code&gt;gaussian_kde.evaluate&lt;/code&gt;,
and returns the resulting density.&lt;br /&gt;
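&lt;br /&gt;
Here is a short sketch of the smoothing effect, calling &lt;tt&gt;scipy.stats.gaussian_kde&lt;/tt&gt; directly on the five neighboring prices quoted earlier. A sample this small is only for illustration; the default bandwidth chosen by SciPy is an assumption of the sketch, not a recommendation.&lt;br /&gt;
&lt;br /&gt;
```python
import numpy
import scipy.stats

# The five neighboring prices quoted in the text, used here as a tiny sample.
sample = numpy.array([28800, 28868, 28941, 28957, 28958], dtype=numpy.double)

kde = scipy.stats.gaussian_kde(sample)

# 28830 never appears in the sample, but its estimated density is positive.
density_in_gap = kde.evaluate([28830])[0]
density_at_obs = kde.evaluate([28941])[0]
```
&lt;br /&gt;
The estimated density is positive in the gap between 28801 and 28867, which is the sense in which the estimated PDF is smooth.&lt;br /&gt;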
Finally, here’s an outline of the code I used to generate
Figure&amp;nbsp;&lt;a href="http://www.blogger.com/blogger.g?blogID=6894866515532737257#fig.price1"&gt;6.1&lt;/a&gt;:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;    prices = ReadData()
    kde = thinkbayes.EstimatedPdf(prices)

    low, high = 0, 75000
    n = 101
    xs = numpy.linspace(low, high, n) 
    pmf = kde.MakePmf(xs)

    myplot.Pmf(pmf)
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;&lt;/pre&gt;
And now back to &lt;i&gt;The Price is Right&lt;/i&gt;.&lt;br /&gt;
&lt;h2 class="section"&gt;
&lt;span style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/h2&gt;
&lt;h2 class="section"&gt;
&lt;span style="font-size: large;"&gt;&lt;a href="http://www.blogger.com/blogger.g?blogID=6894866515532737257" name="toc39"&gt;&lt;/a&gt;Modeling the contestants&lt;/span&gt;&lt;/h2&gt;
The PDFs in Figure&amp;nbsp;&lt;a href="http://www.blogger.com/blogger.g?blogID=6894866515532737257#fig.price1"&gt;6.1&lt;/a&gt; estimate the
distribution of possible prices for each showcase.
If you were a contestant on the show, you could use this
distribution to quantify
your prior belief about the price of the showcases
(before you see the prizes).&lt;br /&gt;
&lt;br /&gt;
To update these priors, you have to answer these questions:&lt;br /&gt;
&lt;ol class="enumerate" type="1"&gt;
&lt;li class="li-enumerate"&gt;What data should we consider and how should we quantify it?&lt;/li&gt;
&lt;li class="li-enumerate"&gt;Can we compute a &lt;tt&gt;Likelihood&lt;/tt&gt; function; that is,
for each hypothetical value of &lt;tt&gt;price&lt;/tt&gt;, can we compute
the conditional likelihood of the data?&lt;/li&gt;
&lt;/ol&gt;
To answer these questions, I am going to model the contestant
as a price-guessing instrument with known error characteristics.
In other words, when the contestant sees the prizes, he or she
guesses the price of each prize—ideally without taking into
consideration the fact that the prize is part of a showcase—and
adds up the prices. Let’s call this total &lt;tt&gt;guess&lt;/tt&gt;.&lt;br /&gt;
&lt;br /&gt;
Under this model, the question we have to answer is, “If the
actual price is &lt;tt&gt;price&lt;/tt&gt;, what is the likelihood that the
contestant’s total estimate would be &lt;tt&gt;guess&lt;/tt&gt;?”&lt;br /&gt;
&lt;br /&gt;
Or if we define
&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;    error = price - guess
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;&lt;/pre&gt;
then we could ask, “What is the likelihood
that the contestant’s estimate is off by &lt;tt&gt;error&lt;/tt&gt;?”&lt;br /&gt;
To answer this question, we can use the historical data again.
Figure&amp;nbsp;&lt;a href="http://www.blogger.com/blogger.g?blogID=6894866515532737257#fig.price2"&gt;6.2&lt;/a&gt; shows the cumulative distribution of &lt;tt&gt;diff&lt;/tt&gt;,
the difference between the contestant’s bid and the actual price
of the showcase.&lt;br /&gt;
The definition of &lt;tt&gt;diff&lt;/tt&gt; is
&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;    diff = price - bid
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;&lt;/pre&gt;
When &lt;tt&gt;diff&lt;/tt&gt; is negative, the bid is too high. As an
aside, we can use this CDF to compute the probability that the
contestants overbid: the first contestant overbids 25% of the
time; the second contestant overbids 29% of the time.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;blockquote class="figure"&gt;
&lt;div class="center"&gt;
&lt;hr size="2" width="80%" /&gt;
&lt;/div&gt;
&lt;div class="center"&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-QXeTAoqXa4I/UXWAlVyUJNI/AAAAAAAABDA/Y1cAtwP-Ycs/s1600/price2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://1.bp.blogspot.com/-QXeTAoqXa4I/UXWAlVyUJNI/AAAAAAAABDA/Y1cAtwP-Ycs/s400/price2.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="caption"&gt;
&lt;table cellpadding="0" cellspacing="6"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align="left" valign="top"&gt;Figure 6.2: Cumulative distribution (CDF) of the difference between the contestant’s bid and the actual price.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;a href="http://www.blogger.com/blogger.g?blogID=6894866515532737257" name="fig.price2"&gt;&lt;/a&gt;&lt;br /&gt;
&lt;div class="center"&gt;
&lt;hr size="2" width="80%" /&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
&lt;br /&gt;
&lt;br /&gt;
We can also use this distribution to estimate the reliability of
the contestants’ guesses. This step is a little tricky because
we don’t actually know the contestants’ guesses; we only know
what they bid.&lt;br /&gt;
In Figure&amp;nbsp;&lt;a href="http://www.blogger.com/blogger.g?blogID=6894866515532737257#fig.price2"&gt;6.2&lt;/a&gt; we can see that the bids are biased;
that is, they are more likely to be too low than too high. And
that makes sense, given the rules of the game.&lt;br /&gt;
&lt;br /&gt;
So we’ll have to make some assumptions. Specifically, I assume
that the distribution of &lt;tt&gt;error&lt;/tt&gt; is Gaussian with mean 0
and the same variance as &lt;tt&gt;diff&lt;/tt&gt;.&lt;br /&gt;
&lt;br /&gt;
The &lt;tt&gt;Player&lt;/tt&gt; class implements this model:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;class Player(object):

    def __init__(self, price, bid, diff):
        self.price = price
        self.bid = bid
        self.diff = diff

        self.pdf_price = thinkbayes.EstimatedPdf(price)
        self.cdf_diff = thinkbayes.MakeCdfFromList(diff)

        mu = 0
        sigma = numpy.std(self.diff)
        self.pdf_error = thinkbayes.GaussianPdf(mu, sigma)
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;&lt;/pre&gt;
&lt;tt&gt;price&lt;/tt&gt; is a sequence of showcase prices, &lt;tt&gt;bid&lt;/tt&gt; is a
sequence of bids, and &lt;tt&gt;diff&lt;/tt&gt; is a sequence of diffs, where
again &lt;tt&gt;diff = price - bid&lt;/tt&gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;pdf_price&lt;/code&gt; is the smoothed PDF of prices, estimated by KDE.
&lt;code&gt;cdf_diff&lt;/code&gt; is the cumulative distribution of &lt;tt&gt;diff&lt;/tt&gt;,
which we saw in Figure&amp;nbsp;&lt;a href="http://www.blogger.com/blogger.g?blogID=6894866515532737257#fig.price2"&gt;6.2&lt;/a&gt;. And &lt;code&gt;pdf_error&lt;/code&gt;
is the PDF that characterizes the distribution of errors, where
&lt;tt&gt;error = price - guess&lt;/tt&gt;.&lt;br /&gt;
&lt;br /&gt;
Again, we use the variance of &lt;tt&gt;diff&lt;/tt&gt; to estimate the variance of
&lt;tt&gt;error&lt;/tt&gt;. But contestants’ bids are sometimes strategic; for
example, if Player 2 thinks that Player 1 has overbid, Player 2 might
make a very low bid. In that case &lt;tt&gt;diff&lt;/tt&gt; does not reflect &lt;tt&gt;error&lt;/tt&gt;. If this strategy is common, the observed variance in &lt;tt&gt;diff&lt;/tt&gt; might overestimate the variance in &lt;tt&gt;error&lt;/tt&gt;. Nevertheless,
I think it is a reasonable modeling decision.&lt;br /&gt;
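&lt;br /&gt;
The modeling assumption can be sketched in a few lines with NumPy and SciPy. The list of diffs below is hypothetical, not taken from the show data.&lt;br /&gt;
&lt;br /&gt;
```python
import numpy
import scipy.stats

# Hypothetical history of diff = price - bid for one contestant, in dollars.
diffs = numpy.array([-1500, -200, 300, 900, 2500, 4000, 7000, 12000])

# Modeling assumption from the text: error has mean 0 and the same
# standard deviation as diff, even though diff itself is not centered at 0.
mean_diff = numpy.mean(diffs)
sigma = numpy.std(diffs)
error_dist = scipy.stats.norm(loc=0, scale=sigma)

density_small = error_dist.pdf(500)
density_large = error_dist.pdf(10000)
```
&lt;br /&gt;
In this made-up history the mean of &lt;tt&gt;diff&lt;/tt&gt; is positive, reflecting the tendency to underbid, but the error model is centered at 0, so a small error gets a higher density than a large one regardless of sign.&lt;br /&gt;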
&lt;br /&gt;
As an alternative, someone preparing to appear on the show could
estimate their own distribution of &lt;tt&gt;error&lt;/tt&gt; by watching previous shows
and recording their guesses and the actual prices.&lt;br /&gt;
&lt;h2 class="section"&gt;
&lt;span style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/h2&gt;
&lt;h2 class="section"&gt;
&lt;span style="font-size: large;"&gt;Likelihood&lt;/span&gt;&lt;/h2&gt;
Now we are ready to write the likelihood function. As usual,
I define a new class that extends&amp;nbsp;&lt;tt&gt;thinkbayes.Suite&lt;/tt&gt;:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;class Price(thinkbayes.Suite):

    def __init__(self, pmf, player):
        thinkbayes.Suite.__init__(self)

        for price, prob in pmf.Items():
            self.Set(price, prob)

        self.player = player
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;&lt;/pre&gt;
&lt;tt&gt;pmf&lt;/tt&gt; represents the prior distribution. The &lt;tt&gt;for&lt;/tt&gt;
loop copies the values and probabilities from &lt;tt&gt;pmf&lt;/tt&gt; into
the new &lt;tt&gt;Suite&lt;/tt&gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;tt&gt;player&lt;/tt&gt; is a Player object as described in the previous
section. &amp;nbsp;And here’s &lt;tt&gt;Likelihood&lt;/tt&gt;:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;    def Likelihood(self, hypo, data):
        price = hypo
        guess = data

        error = price - guess
        like = self.player.ErrorDensity(error)

        return like
&lt;/pre&gt;
&lt;tt&gt;hypo&lt;/tt&gt; is the hypothetical price of the showcase. &lt;tt&gt;data&lt;/tt&gt;
is the contestant’s best guess at the price. &lt;tt&gt;error&lt;/tt&gt; is
the difference, and &lt;tt&gt;like&lt;/tt&gt; is the likelihood of the data,
given the hypothesis.&lt;br /&gt;
&lt;br /&gt;
&lt;tt&gt;ErrorDensity&lt;/tt&gt; is defined in &lt;tt&gt;Player&lt;/tt&gt;:&lt;br /&gt;
&lt;pre class="verbatim"&gt;# class Player:

    def ErrorDensity(self, error):
        return self.pdf_error.Density(error)
&lt;/pre&gt;
&lt;tt&gt;ErrorDensity&lt;/tt&gt; works by evaluating &lt;code&gt;pdf_error&lt;/code&gt; at
the given value of &lt;tt&gt;error&lt;/tt&gt;.&lt;br /&gt;
&lt;br /&gt;
The result is a probability density, which means we can’t treat it as
a probability. But remember that &lt;tt&gt;Likelihood&lt;/tt&gt; does not really
need to compute a probability; it only has to compute something &lt;em&gt;proportional&lt;/em&gt; to a probability. As long as the constant of
proportionality is the same for all likelihoods, it gets cancelled out
when we normalize the posterior distribution.&lt;br /&gt;
&lt;br /&gt;
And therefore, a probability density is a perfectly good likelihood.&lt;br /&gt;
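&lt;br /&gt;
To see the cancellation concretely, here is a small sketch that updates the same prior twice: once with a properly normalized Gaussian density and once with the same curve missing its normalizing constant. The function names, the three-price prior, and the value of &lt;tt&gt;SIGMA&lt;/tt&gt; are assumptions chosen for illustration:&lt;br /&gt;
&lt;br /&gt;
```python
import math

SIGMA = 3000.0  # assumed spread of the contestant's errors, for illustration


def gaussian_density(error):
    """Normalized Gaussian density with mean 0 and standard deviation SIGMA."""
    return math.exp(-0.5 * (error / SIGMA) ** 2) / (SIGMA * math.sqrt(2 * math.pi))


def gaussian_kernel(error):
    """The same curve without the normalizing constant."""
    return math.exp(-0.5 * (error / SIGMA) ** 2)


def posterior(prior, like_func, guess):
    """Multiply each prior probability by a likelihood, then renormalize."""
    unnorm = {price: p * like_func(price - guess) for price, p in prior.items()}
    total = sum(unnorm.values())
    return {price: w / total for price, w in unnorm.items()}


prior = {20000: 0.25, 25000: 0.5, 30000: 0.25}  # made-up prior over three prices
post_density = posterior(prior, gaussian_density, 20000)
post_kernel = posterior(prior, gaussian_kernel, 20000)
```
The two posteriors agree up to floating-point rounding, because the constant factor appears in every term and divides out when we normalize.&lt;br /&gt;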
&lt;h2 class="section"&gt;
&lt;span style="font-size: large;"&gt;Update&lt;/span&gt;&lt;/h2&gt;
&lt;tt&gt;Player&lt;/tt&gt; provides a method that takes the contestant’s
guess and computes the posterior distribution:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;    def MakeBeliefs(self, guess):
        pmf = self.PmfPrice()
        self.prior = Price(pmf, self, name='prior')
        self.posterior = self.prior.Copy(name='posterior')
        self.posterior.Update(guess)
&lt;/pre&gt;
&lt;tt&gt;PmfPrice&lt;/tt&gt; evaluates &lt;code&gt;pdf_price&lt;/code&gt; at an equally-spaced
series of values:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;    def PmfPrice(self):
        return self.pdf_price.MakePmf(self.price_xs)
&lt;/pre&gt;
The result is a new &lt;tt&gt;Pmf&lt;/tt&gt; object, which we use to construct
the prior. To construct the posterior, we make a copy of the
prior and then invoke &lt;tt&gt;Update&lt;/tt&gt;, which invokes &lt;tt&gt;Likelihood&lt;/tt&gt;
for each hypothesis, multiplies the priors by the likelihoods,
and then renormalizes.&lt;br /&gt;
&lt;br /&gt;
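For readers who want to see the mechanics, here is a rough stand-in for what &lt;tt&gt;Update&lt;/tt&gt; does. &lt;tt&gt;MiniSuite&lt;/tt&gt; and &lt;tt&gt;ToyPrice&lt;/tt&gt; are hypothetical classes written for this sketch, not the actual &lt;tt&gt;thinkbayes&lt;/tt&gt; code, and the prior and the error scale of 5000 are made up:&lt;br /&gt;
&lt;br /&gt;
```python
import math


class MiniSuite:
    """Hypothetical stand-in for a suite of hypotheses (illustration only)."""

    def __init__(self, pmf):
        self.probs = dict(pmf)  # maps hypothesis to probability

    def Likelihood(self, hypo, data):
        raise NotImplementedError

    def Update(self, data):
        # Multiply each prior probability by the likelihood of the data.
        for hypo in self.probs:
            self.probs[hypo] *= self.Likelihood(hypo, data)
        # Renormalize so the probabilities add up to 1 again.
        total = sum(self.probs.values())
        for hypo in self.probs:
            self.probs[hypo] /= total


class ToyPrice(MiniSuite):
    def Likelihood(self, hypo, data):
        error = hypo - data  # price minus guess
        return math.exp(-0.5 * (error / 5000.0) ** 2)  # unnormalized Gaussian


suite = ToyPrice({20000: 0.2, 25000: 0.3, 30000: 0.5})
suite.Update(20000)
```
After the update the probabilities still add up to 1, and the hypothesis farthest from the guess has lost most of its weight.&lt;br /&gt;
&lt;br /&gt;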
So let’s get back to the original scenario. Suppose you are
Player 1 and when you see your showcase, your best guess is
that the total price of the prizes is $20,000.&lt;br /&gt;
&lt;br /&gt;
The following code constructs and plots your prior and
posterior beliefs about the actual price:&lt;br /&gt;
&lt;br /&gt;
&lt;pre class="verbatim"&gt;    player1.MakeBeliefs(20000)
    myplot.Pmf(player1.prior)
    myplot.Pmf(player1.posterior)
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;&lt;blockquote class="figure" style="font-family: 'Times New Roman'; white-space: normal;"&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;/div&gt;
&lt;hr size="2" width="80%" /&gt;
&lt;div class="center"&gt;
&lt;a href="http://2.bp.blogspot.com/-bFWt2tVWOCA/UXWBNPWVP7I/AAAAAAAABDI/ihbZT_wtmSI/s1600/price3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"&gt;&lt;img border="0" height="300" src="http://2.bp.blogspot.com/-bFWt2tVWOCA/UXWBNPWVP7I/AAAAAAAABDI/ihbZT_wtmSI/s400/price3.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="caption"&gt;
&lt;table cellpadding="0" cellspacing="6"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align="left" valign="top"&gt;Figure 6.3: Prior and posterior distributions for Player 1, based on a best guess of $20,000.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;a href="http://www.blogger.com/blogger.g?blogID=6894866515532737257" name="fig.price3"&gt;&lt;/a&gt;

&lt;br /&gt;
&lt;div class="center"&gt;
&lt;hr size="2" width="80%" /&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
&lt;/pre&gt;
Figure&amp;nbsp;&lt;a href="http://www.blogger.com/blogger.g?blogID=6894866515532737257#fig.price3"&gt;6.3&lt;/a&gt; shows the results. The value of your guess
is on the low end of the prior range, so the posterior is shifted
to the left. The mean of the posterior is $25,096; the most
likely value is $24,000.&lt;br /&gt;
&lt;br /&gt;
On one level, this result makes sense. The most likely value
in the prior is $27,750. Your best guess is $20,000. And the
most likely value in the posterior is about halfway in between.&lt;br /&gt;
&lt;br /&gt;
On another level, you might find this result bizarre, because it
suggests that if you think the price is $20,000, then you
should believe the price is $24,000.&lt;br /&gt;
&lt;br /&gt;
To resolve this apparent paradox, remember that you are combining two
sources of information: historical data about past showcases and
guesses about the prizes you see.&lt;br /&gt;
&lt;br /&gt;
We are treating the historical data as the prior and updating it
based on your guesses, but we could equivalently use your guess
as a prior and update it based on historical data. &lt;br /&gt;
&lt;br /&gt;
If you think of it that way, maybe it is less surprising that the
most likely value in the posterior is not your original guess.&lt;br /&gt;
&lt;br /&gt;
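The symmetry is easy to check numerically. In this sketch the function &lt;tt&gt;combine&lt;/tt&gt; and both sets of weights are made up for illustration; the only point is that the order of the two sources does not matter:&lt;br /&gt;
&lt;br /&gt;
```python
def combine(source_a, source_b):
    """Multiply two sets of weights over the same prices and renormalize."""
    product = {price: source_a[price] * source_b[price] for price in source_a}
    total = sum(product.values())
    return {price: w / total for price, w in product.items()}


# Made-up weights over three candidate prices.
from_history = {20000: 0.2, 24000: 0.5, 28000: 0.3}
from_guess = {20000: 0.6, 24000: 0.3, 28000: 0.1}

history_first = combine(from_history, from_guess)  # history as the prior
guess_first = combine(from_guess, from_history)    # guess as the prior
```
Either way the result is the same distribution, and with these made-up weights its most likely value is 24000, between what the two sources favor on their own.&lt;br /&gt;
&lt;br /&gt;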
In &lt;a href="http://allendowney.blogspot.com/2013/04/the-price-is-right-problem-part-two.html"&gt;the next installment&lt;/a&gt;, we'll use the posterior distribution to compute the optimal bid for each player.&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/_HkNJxIOLTU" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/8682325935361655867/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2013/04/the-price-is-right-problem.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/8682325935361655867?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/8682325935361655867?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/_HkNJxIOLTU/the-price-is-right-problem.html" title="The Price is Right Problem" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/-8gpay15rCVc/UXWAd2k0IWI/AAAAAAAABC4/KDePtDVKh4o/s72-c/price1.png" height="72" width="72" /><thr:total>0</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2013/04/the-price-is-right-problem.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CEEBQHw4fSp7ImA9WhBVFks.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-8372321236041352456</id><published>2013-04-11T08:55:00.001-07:00</published><updated>2013-04-22T13:30:51.235-07:00</updated><app:edited 
xmlns:app="http://www.w3.org/2007/app">2013-04-22T13:30:51.235-07:00</app:edited><title>The price is right</title><content type="html">On the reddit statistics forum, I recently posted a link to &lt;a href="https://sites.google.com/site/simplebayes/home/pycon-2013"&gt;my tutorial on Bayesian statistics&lt;/a&gt;. &amp;nbsp;One of my fellow redditors drew my attention to &lt;a href="http://camdp.com/blogs/how-solve-price-rights-showdown"&gt;this article&lt;/a&gt;, which uses pymc to do a Bayesian analysis of &lt;a href="http://en.wikipedia.org/wiki/The_Price_Is_Right"&gt;&lt;i&gt;The Price is Right&lt;/i&gt;&lt;/a&gt;. &amp;nbsp;He or she asked how I would solve this problem using the framework in &lt;i&gt;&lt;a href="http://thinkbayes.com/"&gt;Think Bayes&lt;/a&gt;&lt;/i&gt;. &amp;nbsp;So, here goes.&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
First, we have to define a Suite of hypotheses. &amp;nbsp;In this example, each hypothesis represents a belief about the total price of the showcase. &amp;nbsp;We are told, based on data from previous shows, that a reasonable prior distribution is normal with mean 35000 dollars and standard deviation 7500.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
So here's the code that creates the Suite:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;class Price(thinkbayes.Suite):&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; def __init__(self, error_sigma):&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; """Constructs the suite.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; error_sigma: standard deviation of the distribution of error&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; """&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; thinkbayes.Suite.__init__(self)&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; pmf = thinkbayes.MakeGaussianPmf(35000, 7500, num_sigmas=4)&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; # copy items from pmf to self&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; for val, prob in pmf.Items():&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; self.Set(val, prob)&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; # store error_sigma for use in Likelihood&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; self.error_sigma = error_sigma&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
MakeGaussianPmf makes a Pmf (probability mass function) that approximates a normal distribution. &amp;nbsp;It truncates the range of the distribution 4 sigmas in each direction from the mean.&lt;/div&gt;
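&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
As a rough idea of what a function like this might do, here is a standard-library sketch. It is not the thinkbayes implementation: the name make_gaussian_pmf, the default of 101 points, and the use of a plain dictionary are assumptions for illustration.&lt;/div&gt;
```python
import math


def make_gaussian_pmf(mu, sigma, num_sigmas=4, n=101):
    """Approximate a normal distribution with n equally spaced points.

    The range is cut off num_sigmas standard deviations on either side
    of the mean, and the weights are normalized to add up to 1.
    """
    low = mu - num_sigmas * sigma
    step = 2 * num_sigmas * sigma / (n - 1)
    xs = [low + i * step for i in range(n)]
    weights = [math.exp(-0.5 * ((x - mu) / sigma) ** 2) for x in xs]
    total = sum(weights)
    return {x: w / total for x, w in zip(xs, weights)}


pmf = make_gaussian_pmf(35000, 7500)
```
&lt;div&gt;
With mean 35000 and standard deviation 7500, the hypotheses in this sketch run from 5000 to 65000.&lt;/div&gt;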
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
I'll explain error_sigma in a minute.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Now that we have a suite of hypotheses, we think about how to represent the data. &amp;nbsp;In this case, the "data" is my best guess about the total value of the showcase, which we can think of as a measurement produced by a somewhat unreliable instrument, my brain.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
According to the problem statement, my best guess is 3000 for the price of the snowmobile, 12000 for the price of the trip, so 15000 for the total price.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Now we need a Likelihood function that takes a hypothetical price for the showcase, and my best guess, and returns the likelihood of the data given the hypothesis. &amp;nbsp;That is, if the actual price of the showcase is X, what is the probability that I would guess 15000?&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
To answer that question, we need some information about how good I am at guessing prices, and that's where error_sigma comes in.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
We are told that my uncertainty about the price of the snowmobile can be captured by a normal distribution with sigma=500. &amp;nbsp;And my uncertainty about the price of the trip is normal with sigma=3000. &amp;nbsp;So let's compute the standard deviation of my total error:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; error_snowmobile = 500&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; error_trip = 3000&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; error_total = math.sqrt(error_snowmobile**2 + error_trip**2)&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Now we can create the Suite:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; suite = Price(error_total)&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
and update it with my guess:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; my_guess = 15000&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; suite.Update(my_guess)&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
When we invoke Update, it invokes Likelihood once for each hypothetical showcase price. &amp;nbsp;Here's the Likelihood function:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; def Likelihood(self, hypo, data):&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; """Computes the likelihood of the data under the hypothesis.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; hypo: actual price&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; data: my guess&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; """&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; actual_price = hypo&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; my_guess = data&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; error = my_guess - actual_price&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; like = thinkbayes.EvalGaussianPdf(&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; mu=0,&amp;nbsp;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; sigma=self.error_sigma,&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; x=error)&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; return like&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;x&lt;/span&gt; is the error; that is, how much my guess is off by. &amp;nbsp;&lt;span style="font-family: Courier New, Courier, monospace; font-size: x-small;"&gt;like&lt;/span&gt; is the likelihood that I would be off by that much, computed by evaluating the density function of the Gaussian with sigma=error_sigma.&lt;/div&gt;
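&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
For reference, evaluating a Gaussian density takes only a few lines. The function below is a hypothetical stand-in with its own name and argument order, not the thinkbayes code, and the two hypothetical prices (17000 and 35000) are picked only to show the contrast:&lt;/div&gt;
```python
import math


def eval_gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma, at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Standard deviation of the total error, as computed in the post.
error_sigma = math.sqrt(500 ** 2 + 3000 ** 2)

my_guess = 15000
like_close = eval_gaussian_pdf(my_guess - 17000, 0, error_sigma)  # price near the guess
like_far = eval_gaussian_pdf(my_guess - 35000, 0, error_sigma)    # price at the prior mean
```
&lt;div&gt;
A price close to the guess gets a much higher likelihood than a price at the prior mean, which is why the update pulls the posterior so far toward the guess.&lt;/div&gt;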
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
And we're done. &amp;nbsp;Let's see what the results look like.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-2pkRtn4QY1w/UWbYb1pRzTI/AAAAAAAABCk/aVQjW4aGpVM/s1600/price1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://3.bp.blogspot.com/-2pkRtn4QY1w/UWbYb1pRzTI/AAAAAAAABCk/aVQjW4aGpVM/s400/price1.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The mean of the posterior distribution is 17800, substantially less than the prior mean (35000). &amp;nbsp;It is also substantially less than the result reported in the original article, near 28000. &amp;nbsp;So we are left with a suite of three hypotheses:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
1) There is a mistake in my implementation.&lt;/div&gt;
&lt;div&gt;
2) There is a mistake in the other author's implementation.&lt;/div&gt;
&lt;div&gt;
3) We are actually making subtly different modeling assumptions.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Option #3 is possible because I am not sure I understand the model in the original article. &amp;nbsp;The author describes my beliefs about the snowmobile and the trip as "priors", which suggests that they are going to get updated. &amp;nbsp;In contrast, I am treating my guess about the prices as data (that is, a summary of what I learned by seeing the contents of the showcase), but I am also modeling myself as a measurement instrument with a characteristic distribution of errors.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Under my interpretation, the posterior shown above makes sense. &amp;nbsp;For example, if my guess is 15000, and the standard deviation of my guesses is 3050, then it is very unlikely that I am off by 4 standard deviations, so the upper bound of the posterior should be around 15000&amp;nbsp;+ 4 * 3050 = 27200. &amp;nbsp;That makes the mean reported in the original article (28000) seem too high.&lt;/div&gt;
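&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
As a cross-check on the posterior mean reported above, we can use the fact that a normal prior combined with a normal error model gives a normal posterior whose mean is a precision-weighted average. This sketch assumes the truncation of the prior at 4 sigmas is negligible; the function name normal_posterior is made up for illustration:&lt;/div&gt;
```python
import math


def normal_posterior(prior_mu, prior_sigma, guess, error_sigma):
    """Closed-form posterior for a normal prior and a normal error model."""
    prior_precision = 1 / prior_sigma ** 2
    data_precision = 1 / error_sigma ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mu = post_var * (prior_precision * prior_mu + data_precision * guess)
    return post_mu, math.sqrt(post_var)


error_sigma = math.sqrt(500 ** 2 + 3000 ** 2)
post_mu, post_sigma = normal_posterior(35000, 7500, 15000, error_sigma)
```
&lt;div&gt;
The closed-form mean comes out near 17800, in line with the mean of the posterior computed on the grid.&lt;/div&gt;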
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
But maybe I am not interpreting the statement of the problem (or the model) as intended. &amp;nbsp;I will check in with my correspondent and update this article when we have an explanation!&lt;br /&gt;
&lt;br /&gt;
UPDATE April 18, 2013: &amp;nbsp;I exchanged a few emails with the author of the original article, Cameron Davidson-Pilon. &amp;nbsp;He found a bug in his code that explains at least part of the difference between his results and mine. &amp;nbsp;So I think he is planning to update his article and the book he is working on.&lt;br /&gt;
&lt;br /&gt;
He also sent me some data on the value of recent showcases on &lt;i&gt;The Price is Right&lt;/i&gt; and the bids offered by the contestants. &amp;nbsp;The data were collected by Steve Gee and posted at &lt;a href="http://tpirsummaries.8m.com/"&gt;The Price is Right Stats&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
I have written code that uses this data to form the prior distribution of prices, and also to estimate the distribution of errors for the contestants. &amp;nbsp;And I am writing it up as a chapter in &lt;i&gt;&lt;a href="http://thinkbayes.com/"&gt;Think Bayes&lt;/a&gt;&lt;/i&gt;. &amp;nbsp;I'll post the new chapter here when it is done!&lt;br /&gt;
&lt;br /&gt;
UPDATE April 22, 2013: I have added a chapter to &lt;i&gt;Think Bayes&lt;/i&gt;, and I am publishing it as a two-part series, &lt;a href="http://allendowney.blogspot.com/2013/04/the-price-is-right-problem.html"&gt;starting here&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/gs20IrwUR6U" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/8372321236041352456/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2013/04/the-price-is-right.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/8372321236041352456?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/8372321236041352456?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/gs20IrwUR6U/the-price-is-right.html" title="The price is right" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://3.bp.blogspot.com/-2pkRtn4QY1w/UWbYb1pRzTI/AAAAAAAABCk/aVQjW4aGpVM/s72-c/price1.png" height="72" width="72" /><thr:total>0</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2013/04/the-price-is-right.html</feedburner:origLink></entry><entry gd:etag="W/&quot;A04MQXwyeCp7ImA9WhBWFU8.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-6755063576816842369</id><published>2013-04-09T10:53:00.000-07:00</published><updated>2013-04-09T10:53:00.290-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2013-04-09T10:53:00.290-07:00</app:edited><title>Freshman hordes regress to the mean</title><content type="html">&lt;div&gt;
&lt;b&gt;More nones, &lt;a href="http://www.latimes.com/news/custom/timespoll/la-940221nunpoll,0,3477112.story"&gt;no nuns&lt;/a&gt;&lt;/b&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
For several years I have been following one of the most under-reported stories of the decade: the fraction of college freshmen who report no religious preference has tripled since 1985, from 8% to 24%, and the trend is accelerating.&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
Two years ago I wrote&amp;nbsp;&lt;a href="http://allendowney.blogspot.com/2011/03/freshman-hordes-more-godless-than-ever.html"&gt;&lt;i&gt;Freshman hordes more godless than ever&lt;/i&gt;&lt;/a&gt;; last year I updated it with&amp;nbsp;&lt;i&gt;&lt;a href="http://allendowney.blogspot.com/2012/01/freshman-hordes-even-more-godless.html"&gt;Freshman hordes even more godless&lt;/a&gt;&lt;/i&gt;. &amp;nbsp;Each year, the number of students with no religious preference increased, and the number attending religious services decreased.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
In last year's installment, I made the bold prediction that the trend would continue, and that the students starting college in 2012 would, again, be the most godless ever. &amp;nbsp;It turns out I was wrong: attendance went up slightly, and the fraction of "Nones" dropped slightly, in both cases reverting toward long-term trends.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="font-family: inherit;"&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;My analysis is based on survey results from the&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;a href="http://www.gseis.ucla.edu/heri/cirpoverview.php"&gt;Cooperative Institutional Research Program (CIRP)&lt;/a&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&amp;nbsp;of the&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px;"&gt;&lt;a href="http://www.gseis.ucla.edu/heri/index.php"&gt;Higher Education Research Institute (HERI)&lt;/a&gt;. &amp;nbsp;In 2012, more than 190,000 students at 283 colleges and universities completed the CIRP Freshman Survey, which&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; font-family: inherit;"&gt;includes questions about students’ backgrounds, activities, and attitudes.&lt;/span&gt;&lt;br /&gt;
&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style="font-family: inherit;"&gt;In one question, students select their “current religious preference,” from a choice of seventeen common religions, “Other religion,” or “None.”&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style="font-family: inherit;"&gt;Another question asks students how often they “attended a religious service” in the last year. The choices are “Frequently,” “Occasionally,” and “Not at all.” Students are instructed to select “Occasionally” if they attended one or more times.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The following figure shows the fraction of Nones over more than 40 years of the survey:&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-dblpEylzrcA/UWQ9mAQFFoI/AAAAAAAABBk/lvdeVHz08hs/s1600/heri12.1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://2.bp.blogspot.com/-dblpEylzrcA/UWQ9mAQFFoI/AAAAAAAABBk/lvdeVHz08hs/s400/heri12.1.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The blue line shows actual data through 2011; the red line shows a quadratic fit to the data. &amp;nbsp;The dark gray region shows a 90% confidence interval, taking into account sampling error, so it reflects uncertainty about the parameters of the fit.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The light gray region shows a 90% confidence interval taking into account both sampling error and residual error. &amp;nbsp;So it reflects total uncertainty about the predicted value, including uncertainty due to random variation from year to year.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
We expect the new data point from 2012, shown as a blue square, to fall within the light gray interval, and it does. &amp;nbsp;In fact, at 23.8% it falls almost exactly on the fitted curve.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Here is the corresponding plot for attendance at religious services:&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-7Y1ZDWBjy4o/UWQ-4S2x7vI/AAAAAAAABBs/mihvN5uDLp4/s1600/heri12.2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://4.bp.blogspot.com/-7Y1ZDWBjy4o/UWQ-4S2x7vI/AAAAAAAABBs/mihvN5uDLp4/s400/heri12.2.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Again, the new data point for 2012, 26.8%, &amp;nbsp;falls comfortably in the predicted range. &amp;nbsp;Don't listen to Nate Silver; prediction is easy :)&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;Predictions for 2013&lt;/b&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Using the new 2012 data, we can generate predictions for 2013. &amp;nbsp;Here is the revised plot for "Nones":&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-EC2ytw8BqzE/UWRLuQoY2sI/AAAAAAAABB8/HLeXqRIoMzA/s1600/heri13.1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://1.bp.blogspot.com/-EC2ytw8BqzE/UWRLuQoY2sI/AAAAAAAABB8/HLeXqRIoMzA/s400/heri13.1.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: left;"&gt;
The prediction for next year is that the fraction of Nones will hit a new all-time high at 25% (up from 23.8%).&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: left;"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: left;"&gt;
And here is the prediction for "No attendance":&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-y3Qy-hAmCzI/UWRMWntdqnI/AAAAAAAABCE/Z_MS55mhZj4/s1600/heri13.2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://3.bp.blogspot.com/-y3Qy-hAmCzI/UWRMWntdqnI/AAAAAAAABCE/Z_MS55mhZj4/s400/heri13.2.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The prediction for 2013 is a small decrease to 26.6% (from 26.8%). &amp;nbsp;I'll be back next year to check on these predictions.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;Other updates&lt;/b&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
1) This year the survey repeated two questions from 2010, asking students if they consider themselves "Born again Christian" or "Evangelical". &amp;nbsp;The fraction reporting "Born again" dropped from 22.8% to 20.2%. &amp;nbsp;The fraction who consider themselves Evangelical dropped from 8.9% to 8.5%. &amp;nbsp;But it's too early to declare a trend.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
2) As always, more males than females report no religious preference. &amp;nbsp;The gender gap increased this year, but still falls in the predicted range, as shown in the following plot:&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-thDD7FSNjzs/UWRTRxJC3CI/AAAAAAAABCU/DCGntNFg3Bk/s1600/heri2.2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://2.bp.blogspot.com/-thDD7FSNjzs/UWRTRxJC3CI/AAAAAAAABCU/DCGntNFg3Bk/s400/heri2.2.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div&gt;
Evidence that the gender gap is increasing is strong. &amp;nbsp;The p-value of the slope of the fitted curve is less than 10e-5.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;Data Source&lt;/b&gt;&lt;br /&gt;
&lt;b&gt;&lt;br /&gt;&lt;/b&gt;
&lt;br /&gt;
&lt;i&gt;&lt;a href="http://heri.ucla.edu/monographs/TheAmericanFreshman2012.pdf"&gt;The American Freshman: National Norms Fall 2012&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;
Pryor, J.H., Eagan, K., Palucki Blake, L., Hurtado, S., Berdan, J., Case, M.H.&lt;br /&gt;
ISBN: 978-1-878477-22-4 &amp;nbsp; &amp;nbsp; 90 pages.&lt;br /&gt;
Jan 2013&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;span style="background-color: white; font-family: inherit;"&gt;This and all previous reports are available from the&amp;nbsp;&lt;a href="http://www.heri.ucla.edu/tfsPublications.php"&gt;HERI publications page&lt;/a&gt;.&lt;/span&gt;&lt;/div&gt;
&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/mjfsHiK1u0Q" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/6755063576816842369/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2013/04/freshman-hordes-regress-to-mean.html#comment-form" title="3 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/6755063576816842369?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/6755063576816842369?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/mjfsHiK1u0Q/freshman-hordes-regress-to-mean.html" title="Freshman hordes regress to the mean" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/-dblpEylzrcA/UWQ9mAQFFoI/AAAAAAAABBk/lvdeVHz08hs/s72-c/heri12.1.png" height="72" width="72" /><thr:total>3</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2013/04/freshman-hordes-regress-to-mean.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DUEGQXc9cSp7ImA9WhBQGUo.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-1282899499724205263</id><published>2013-03-22T11:40:00.002-07:00</published><updated>2013-03-22T11:40:20.969-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2013-03-22T11:40:20.969-07:00</app:edited><title>Belly Button Biodiversity: Part Four</title><content type="html">March 22, 2013&lt;br /&gt;
&lt;br /&gt;
Well, I've started testing the predictions I made in &lt;a href="http://allendowney.blogspot.com/2013/02/belly-button-biodiversity-part-three.html"&gt;my previous post&lt;/a&gt;, and exactly as I expected and deserved, I am getting killed. &amp;nbsp;The actual results pretty consistently show more species than I predicted, sometimes way more.&lt;br /&gt;
&lt;br /&gt;
I have started the process of debugging the problem. &amp;nbsp;Of course, now that I have looked at the right answers, I can no longer use this data set for validation, especially since I plan to bang on my algorithm until it produces the right answers.&lt;br /&gt;
&lt;br /&gt;
But in the interest of transparent science, I will at least document the debugging process. &amp;nbsp;My first step was to review the most recent (and least-tested) code for obvious bugs, and I found one. &amp;nbsp;I made an error parsing one of the data files, which had the effect of double-counting the total reads for each subject. &amp;nbsp;I fixed that, but it didn't help the results much.&lt;br /&gt;
&lt;br /&gt;
To start the debugging process, I am looking at the various places where my predictions could go wrong:&lt;br /&gt;
&lt;br /&gt;
0) The data could be wrong. &amp;nbsp;In particular, I assume that the rarefacted data I got from one data file is consistent with the complete dataset I got from another.&lt;br /&gt;
&lt;br /&gt;
1) The posterior distribution could be right, but the predictive distribution could be wrong.&lt;br /&gt;
&lt;br /&gt;
2) The posterior distribution might be wrong because of modeling errors.&lt;br /&gt;
&lt;br /&gt;
3) The posterior distribution might be wrong because of implementation errors.&lt;br /&gt;
&lt;br /&gt;
To check (0), I used the complete dataset to generate a few re-rarefacted datasets to see if my rarefaction process looks like theirs. &amp;nbsp;It does, so I accept the data, at least for now.&lt;br /&gt;
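&lt;br /&gt;
A minimal sketch of that re-rarefaction step, assuming the complete dataset for one subject is available as a map from species to read count. The function name and the example counts are hypothetical, made up for illustration.&lt;br /&gt;
&lt;pre&gt;
```python
import random
from collections import Counter

def rarefy(species_counts, num_reads=400, seed=None):
    """Draw num_reads reads without replacement and tally them by species.

    species_counts maps a species label to its read count in the
    complete dataset for one subject.
    """
    rng = random.Random(seed)
    # Expand the counts into one entry per read, then subsample.
    reads = [sp for sp, count in species_counts.items() for _ in range(count)]
    return Counter(rng.sample(reads, num_reads))

# Hypothetical subject: a few common species and a long tail of rare ones.
counts = [600, 300, 150, 60] + [2] * 40
complete = {"sp%d" % i: count for i, count in enumerate(counts)}
for trial in range(3):
    subset = rarefy(complete, seed=trial)
    print(trial, len(subset), "species in 400 reads")
```
&lt;/pre&gt;
Comparing the species tallies from a few such subsets with the published 400-read data is one way to see whether the two files are consistent.&lt;br /&gt;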
&lt;br /&gt;
To check (1), I used the posterior distribution to generate a 90% credible interval for the total number of species. &amp;nbsp;Since the number of observed species in the complete dataset is necessarily less than the total number of species, the actual values should fall in or below the CIs, but in fact they often exceed the CIs, meaning that there are just more species than my algorithm expects.&lt;br /&gt;
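&lt;br /&gt;
As a sketch of this check, assuming the posterior is stored as a map from number of species to (possibly unnormalized) probability: the function name, the toy posterior, and the observed count below are all made up for illustration.&lt;br /&gt;
&lt;pre&gt;
```python
def credible_interval(pmf, percent):
    """Central credible interval from a PMF given as {value: probability}."""
    tail = (100 - percent) / 200.0
    total = sum(pmf.values())
    cumulative = 0.0
    low = high = None
    for value in sorted(pmf):
        cumulative += pmf[value] / total
        if low is None and cumulative >= tail:
            low = value
        if high is None and cumulative >= 1 - tail:
            high = value
    if high is None:
        # Guard against floating-point shortfall at the top of the CDF.
        high = max(pmf)
    return low, high

# Toy posterior on the total number of species (weights sum to 20).
posterior = {50: 1, 51: 3, 52: 6, 53: 6, 54: 3, 55: 1}
low, high = credible_interval(posterior, 80)
print(low, high)  # 51 54

# The observed count is a lower bound on the total, so it should not
# exceed the upper end of the interval.
observed_in_complete = 57  # hypothetical
print("exceeds interval:", observed_in_complete > high)
```
&lt;/pre&gt;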
&lt;br /&gt;
While investigating (1) I discovered one problem. &amp;nbsp;In my prior distribution on the number of species, I was setting the upper bound too low, cutting off some values of &lt;i&gt;n&lt;/i&gt; with non-negligible probability. &amp;nbsp;So I cranked it up high enough that any additional increase has no further effect on the results. &amp;nbsp;That helps, but my predictions are still too low.&lt;br /&gt;
&lt;br /&gt;
The next step is to test (2). &amp;nbsp;I will generate simulated datasets, generate predictions, and then validate them. &amp;nbsp;Since the simulated data come straight from the model, there can be no modeling errors. &amp;nbsp;If I can't validate on simulated data, the problem has to be the algorithm or the implementation, not the model.&lt;br /&gt;
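&lt;br /&gt;
A rough sketch of what generating one simulated subject could look like. It assumes, purely for illustration, that species prevalences are drawn from a Dirichlet distribution with all parameters equal to 1 and that reads are independent draws from those prevalences; the function name and parameter values are hypothetical.&lt;br /&gt;
&lt;pre&gt;
```python
import random

def simulate_subject(n_species, total_reads, subset_reads=400, seed=None):
    """Simulate one subject; return (species seen in the subset,
    additional species seen only in the complete set of reads)."""
    rng = random.Random(seed)
    # Dirichlet(1, ..., 1) prevalences, built from normalized gamma draws.
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(n_species)]
    norm = sum(draws)
    prevalences = [d / norm for d in draws]
    reads = rng.choices(range(n_species), weights=prevalences, k=total_reads)
    # Reads are independent, so the first subset_reads are a fair random subset.
    seen_in_subset = set(reads[:subset_reads])
    seen_in_all = set(reads)
    return len(seen_in_subset), len(seen_in_all) - len(seen_in_subset)

observed, additional = simulate_subject(n_species=80, total_reads=1500, seed=1)
print(observed, "species in the subset,", additional, "more in the complete data")
```
&lt;/pre&gt;
Because the true number of species and the number of additional species are known for each simulated subject, predictions made from the 400-read subset can be scored without touching the real data.&lt;br /&gt;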
&lt;br /&gt;
Of course, I should have done all this first, before blowing my testing data.&lt;br /&gt;
&lt;br /&gt;
I won't have a chance to get back to this for a little while, but I'll update this post when I do.&lt;br /&gt;
&lt;br /&gt;&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/BkQrGknsjGo" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/1282899499724205263/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2013/03/belly-button-biodiversity-part-four.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/1282899499724205263?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/1282899499724205263?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/BkQrGknsjGo/belly-button-biodiversity-part-four.html" title="Belly Button Biodiversity: Part Four" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><thr:total>0</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2013/03/belly-button-biodiversity-part-four.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DU4BQ3c-fip7ImA9WhBSFUg.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-5942125760585287432</id><published>2013-02-18T11:35:00.000-08:00</published><updated>2013-02-22T10:32:32.956-08:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2013-02-22T10:32:32.956-08:00</app:edited><title>Belly Button Biodiversity: Part Three</title><content type="html">This is part three of a series of articles about a Bayesian solution to the&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/Species_discovery_curve"&gt;Unseen Species&lt;/a&gt;&amp;nbsp;problem, 
applied to data from the&amp;nbsp;&lt;a href="http://bbdata.yourwildlife.org/"&gt;Belly Button Biodiversity&lt;/a&gt;&amp;nbsp;project.&lt;br /&gt;
&lt;br /&gt;
In&amp;nbsp;&lt;a href="http://allendowney.blogspot.com/2013/02/belly-button-biodiversity-part-one.html"&gt;Part One&lt;/a&gt;&amp;nbsp;I presented the simplest version of the algorithm, which I think is easy to understand, but slow. &amp;nbsp;In&amp;nbsp;&lt;i&gt;&lt;a href="http://www.greenteapress.com/thinkbayes/html/thinkbayes013.html"&gt;Think Bayes&lt;/a&gt;&lt;/i&gt;&amp;nbsp;I present some ways to optimize it. &amp;nbsp;In &lt;a href="http://allendowney.blogspot.com/2013/02/belly-button-biodiversity-part-two.html"&gt;Part Two&lt;/a&gt; I apply the algorithm to real data and generate predictive distributions. &amp;nbsp;Now in Part Three, as promised, I publish the predictions the algorithm generates. &amp;nbsp;In Part Four I will compare the predictions to actual data.&lt;br /&gt;
&lt;br /&gt;
Background: Belly Button Biodiversity 2.0 (BBB2) is a nation-wide citizen science project with the goal of identifying bacterial species that can be found in human navels (&lt;a href="http://bbdata.yourwildlife.org/"&gt;http://bbdata.yourwildlife.org&lt;/a&gt;).&lt;br /&gt;
&lt;br /&gt;
&lt;h4&gt;
Transparent science&lt;/h4&gt;
In an effort to explore the limits of transparent science, I have started publishing my research in this blog as I go along. &amp;nbsp;This past summer I wrote a &lt;a href="http://allendowney.blogspot.com/2012/07/secularization-in-america-part-seven.html"&gt;series of articles&lt;/a&gt; exploring the relationship between Internet use and religious disaffiliation. &amp;nbsp;This "publish as you go" model should help keep researchers honest. &amp;nbsp;Among other things, it might mitigate publication bias due to the "&lt;a href="http://en.wikipedia.org/wiki/Publication_bias#File_drawer_effect"&gt;file drawer effect&lt;/a&gt;." &amp;nbsp;And if the data and code are published along with the results, that should help make experiments more &lt;a href="http://en.wikipedia.org/wiki/Reproducibility"&gt;reproducible&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
Toward that end, I will now subject myself to public humiliation by generating a set of predictions using my almost-entirely-unvalidated solution to the Unseen Species problem. &amp;nbsp;In the next installment I will publish the correct answers and score my predictions. &amp;nbsp;Here are the details:&lt;br /&gt;
&lt;br /&gt;
1) I am working with data from the Belly Button Biodiversity project; this data was used in a paper published in&amp;nbsp;&lt;a href="http://www.plosone.org/static/information"&gt;PLOS ONE&lt;/a&gt; and made available on the web pages of the &lt;a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0047712"&gt;journal&lt;/a&gt; and the &lt;a href="http://bbdata.yourwildlife.org/download-the-data/"&gt;researchers&lt;/a&gt;. &amp;nbsp;The data consists of rDNA "reads" from 60 subjects. &amp;nbsp;In order to facilitate comparisons between subjects, the researchers chose subjects with at least 400 reads, and for each subject they chose a random subset of 400 reads. &amp;nbsp;The data for the other reads was not published.&lt;br /&gt;
&lt;br /&gt;
2) For each subject, I know the results of the 400 selected reads, and the total number of reads. &amp;nbsp;I will use my algorithm to generate a "prediction" for each subject, which is the number of additional species in the complete dataset.&lt;br /&gt;
&lt;br /&gt;
3) Specifically, for each subject I will generate 9 posterior credible intervals (CIs) for the number of additional species: a 10% CI, a 20% CI, and so on up to a 90% CI.&lt;br /&gt;
&lt;br /&gt;
4) To validate my predictions, I will count the number of CIs that contain the actual, correct value. &amp;nbsp;Ideally, 10% of the correct values should fall in the 10% CIs, 20% should fall in the 20% CIs, and so on. &amp;nbsp;Since the predictions and actual values are integers, a value that hits one end of a predicted CI counts as a half-hit.&lt;br /&gt;
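&lt;br /&gt;
The scoring rule in (4) can be sketched as follows. The predicted intervals are copied from the first two rows of the table below; the "actual" values are made up for illustration, the function names are hypothetical, and treating a zero-width interval that is hit exactly as a half-hit is an assumption.&lt;br /&gt;
&lt;pre&gt;
```python
def score_interval(interval, actual):
    """1 for a value strictly inside the interval, 0.5 on an endpoint, else 0."""
    low, high = interval
    if actual in (low, high):
        return 0.5
    if actual > low and high > actual:
        return 1.0
    return 0.0

def coverage(predictions, actuals):
    """predictions: {code: {percent: (low, high)}}; actuals: {code: value}.

    Returns {percent: fraction of subjects whose interval was hit}.
    """
    totals = {}
    for code, by_percent in predictions.items():
        for percent, interval in by_percent.items():
            hit = score_interval(interval, actuals[code])
            totals[percent] = totals.get(percent, 0.0) + hit
    return {p: t / len(predictions) for p, t in totals.items()}

predictions = {
    "B1234": {10: (4, 4), 50: (3, 6), 90: (1, 9)},
    "B1235": {10: (11, 12), 50: (8, 14), 90: (5, 19)},
}
actuals = {"B1234": 6, "B1235": 20}  # made-up values, not the real answers
print(coverage(predictions, actuals))  # {10: 0.0, 50: 0.25, 90: 0.5}
```
&lt;/pre&gt;
If the predictions are well calibrated, the fraction reported for each level should be close to the level itself, for example about 0.5 for the 50% CIs.&lt;br /&gt;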
&lt;br /&gt;
&lt;h4&gt;
Predictions&lt;/h4&gt;
And now, without further ado, here are my predictions. &amp;nbsp;The columns labelled 10, 20, etc. are 10% credible intervals, 20% CIs, and so on.&lt;br /&gt;
&lt;span style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;
&lt;br /&gt;
&lt;table border="1" cellpadding="4" style="border-collapse: collapse; border: 1px solid #000000;"&gt;
 &lt;tbody&gt;
&lt;tr&gt;
  &lt;th&gt;Code&lt;/th&gt;
  &lt;th&gt;# reads&lt;/th&gt;
  &lt;th&gt;# species&lt;/th&gt;
  &lt;th&gt;10&lt;/th&gt;
  &lt;th&gt;20&lt;/th&gt;
  &lt;th&gt;30&lt;/th&gt;
  &lt;th&gt;40&lt;/th&gt;
  &lt;th&gt;50&lt;/th&gt;
  &lt;th&gt;60&lt;/th&gt;
  &lt;th&gt;70&lt;/th&gt;
  &lt;th&gt;80&lt;/th&gt;
  &lt;th&gt;90&lt;/th&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1234&lt;/td&gt;
  &lt;td&gt;1392&lt;/td&gt;
  &lt;td&gt;48&lt;/td&gt;
  &lt;td&gt;(4, 4)&lt;/td&gt;
  &lt;td&gt;(4, 5)&lt;/td&gt;
  &lt;td&gt;(3, 5)&lt;/td&gt;
  &lt;td&gt;(3, 5)&lt;/td&gt;
  &lt;td&gt;(3, 6)&lt;/td&gt;
  &lt;td&gt;(2, 6)&lt;/td&gt;
  &lt;td&gt;(2, 7)&lt;/td&gt;
  &lt;td&gt;(1, 7)&lt;/td&gt;
  &lt;td&gt;(1, 9)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1235&lt;/td&gt;
  &lt;td&gt;2452&lt;/td&gt;
  &lt;td&gt;69&lt;/td&gt;
  &lt;td&gt;(11, 12)&lt;/td&gt;
  &lt;td&gt;(10, 12)&lt;/td&gt;
  &lt;td&gt;(10, 13)&lt;/td&gt;
  &lt;td&gt;(9, 13)&lt;/td&gt;
  &lt;td&gt;(8, 14)&lt;/td&gt;
  &lt;td&gt;(8, 15)&lt;/td&gt;
  &lt;td&gt;(7, 16)&lt;/td&gt;
  &lt;td&gt;(6, 17)&lt;/td&gt;
  &lt;td&gt;(5, 19)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1236&lt;/td&gt;
  &lt;td&gt;2964&lt;/td&gt;
  &lt;td&gt;45&lt;/td&gt;
  &lt;td&gt;(4, 5)&lt;/td&gt;
  &lt;td&gt;(4, 5)&lt;/td&gt;
  &lt;td&gt;(4, 6)&lt;/td&gt;
  &lt;td&gt;(4, 6)&lt;/td&gt;
  &lt;td&gt;(3, 7)&lt;/td&gt;
  &lt;td&gt;(3, 7)&lt;/td&gt;
  &lt;td&gt;(3, 8)&lt;/td&gt;
  &lt;td&gt;(2, 9)&lt;/td&gt;
  &lt;td&gt;(1, 10)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1237&lt;/td&gt;
  &lt;td&gt;3090&lt;/td&gt;
  &lt;td&gt;62&lt;/td&gt;
  &lt;td&gt;(9, 10)&lt;/td&gt;
  &lt;td&gt;(9, 11)&lt;/td&gt;
  &lt;td&gt;(8, 11)&lt;/td&gt;
  &lt;td&gt;(8, 11)&lt;/td&gt;
  &lt;td&gt;(7, 12)&lt;/td&gt;
  &lt;td&gt;(7, 12)&lt;/td&gt;
  &lt;td&gt;(6, 13)&lt;/td&gt;
  &lt;td&gt;(5, 14)&lt;/td&gt;
  &lt;td&gt;(4, 16)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1242&lt;/td&gt;
  &lt;td&gt;3056&lt;/td&gt;
  &lt;td&gt;61&lt;/td&gt;
  &lt;td&gt;(9, 9)&lt;/td&gt;
  &lt;td&gt;(8, 10)&lt;/td&gt;
  &lt;td&gt;(8, 10)&lt;/td&gt;
  &lt;td&gt;(7, 11)&lt;/td&gt;
  &lt;td&gt;(7, 11)&lt;/td&gt;
  &lt;td&gt;(6, 12)&lt;/td&gt;
  &lt;td&gt;(6, 14)&lt;/td&gt;
  &lt;td&gt;(5, 15)&lt;/td&gt;
  &lt;td&gt;(5, 16)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1243&lt;/td&gt;
  &lt;td&gt;1518&lt;/td&gt;
  &lt;td&gt;71&lt;/td&gt;
  &lt;td&gt;(10, 11)&lt;/td&gt;
  &lt;td&gt;(10, 12)&lt;/td&gt;
  &lt;td&gt;(9, 12)&lt;/td&gt;
  &lt;td&gt;(8, 13)&lt;/td&gt;
  &lt;td&gt;(8, 13)&lt;/td&gt;
  &lt;td&gt;(8, 14)&lt;/td&gt;
  &lt;td&gt;(7, 15)&lt;/td&gt;
  &lt;td&gt;(6, 16)&lt;/td&gt;
  &lt;td&gt;(5, 17)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1246&lt;/td&gt;
  &lt;td&gt;4230&lt;/td&gt;
  &lt;td&gt;91&lt;/td&gt;
  &lt;td&gt;(23, 24)&lt;/td&gt;
  &lt;td&gt;(22, 25)&lt;/td&gt;
  &lt;td&gt;(21, 26)&lt;/td&gt;
  &lt;td&gt;(21, 27)&lt;/td&gt;
  &lt;td&gt;(19, 28)&lt;/td&gt;
  &lt;td&gt;(18, 29)&lt;/td&gt;
  &lt;td&gt;(17, 30)&lt;/td&gt;
  &lt;td&gt;(16, 33)&lt;/td&gt;
  &lt;td&gt;(14, 35)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1253&lt;/td&gt;
  &lt;td&gt;1928&lt;/td&gt;
  &lt;td&gt;86&lt;/td&gt;
  &lt;td&gt;(16, 17)&lt;/td&gt;
  &lt;td&gt;(15, 18)&lt;/td&gt;
  &lt;td&gt;(14, 18)&lt;/td&gt;
  &lt;td&gt;(14, 20)&lt;/td&gt;
  &lt;td&gt;(13, 20)&lt;/td&gt;
  &lt;td&gt;(13, 21)&lt;/td&gt;
  &lt;td&gt;(12, 23)&lt;/td&gt;
  &lt;td&gt;(11, 24)&lt;/td&gt;
  &lt;td&gt;(10, 26)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1254&lt;/td&gt;
  &lt;td&gt;918&lt;/td&gt;
  &lt;td&gt;58&lt;/td&gt;
  &lt;td&gt;(5, 5)&lt;/td&gt;
  &lt;td&gt;(4, 6)&lt;/td&gt;
  &lt;td&gt;(4, 6)&lt;/td&gt;
  &lt;td&gt;(3, 6)&lt;/td&gt;
  &lt;td&gt;(3, 7)&lt;/td&gt;
  &lt;td&gt;(3, 7)&lt;/td&gt;
  &lt;td&gt;(2, 8)&lt;/td&gt;
  &lt;td&gt;(2, 9)&lt;/td&gt;
  &lt;td&gt;(1, 10)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1258&lt;/td&gt;
  &lt;td&gt;1350&lt;/td&gt;
  &lt;td&gt;87&lt;/td&gt;
  &lt;td&gt;(15, 16)&lt;/td&gt;
  &lt;td&gt;(14, 17)&lt;/td&gt;
  &lt;td&gt;(14, 17)&lt;/td&gt;
  &lt;td&gt;(13, 18)&lt;/td&gt;
  &lt;td&gt;(12, 19)&lt;/td&gt;
  &lt;td&gt;(11, 19)&lt;/td&gt;
  &lt;td&gt;(11, 20)&lt;/td&gt;
  &lt;td&gt;(10, 21)&lt;/td&gt;
  &lt;td&gt;(8, 24)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1259&lt;/td&gt;
  &lt;td&gt;1002&lt;/td&gt;
  &lt;td&gt;80&lt;/td&gt;
  &lt;td&gt;(10, 11)&lt;/td&gt;
  &lt;td&gt;(10, 12)&lt;/td&gt;
  &lt;td&gt;(10, 12)&lt;/td&gt;
  &lt;td&gt;(9, 13)&lt;/td&gt;
  &lt;td&gt;(9, 14)&lt;/td&gt;
  &lt;td&gt;(8, 14)&lt;/td&gt;
  &lt;td&gt;(7, 15)&lt;/td&gt;
  &lt;td&gt;(6, 16)&lt;/td&gt;
  &lt;td&gt;(6, 18)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1260&lt;/td&gt;
  &lt;td&gt;1944&lt;/td&gt;
  &lt;td&gt;96&lt;/td&gt;
  &lt;td&gt;(22, 23)&lt;/td&gt;
  &lt;td&gt;(21, 24)&lt;/td&gt;
  &lt;td&gt;(20, 25)&lt;/td&gt;
  &lt;td&gt;(19, 25)&lt;/td&gt;
  &lt;td&gt;(19, 26)&lt;/td&gt;
  &lt;td&gt;(18, 27)&lt;/td&gt;
  &lt;td&gt;(17, 29)&lt;/td&gt;
  &lt;td&gt;(15, 30)&lt;/td&gt;
  &lt;td&gt;(14, 32)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1264&lt;/td&gt;
  &lt;td&gt;1122&lt;/td&gt;
  &lt;td&gt;83&lt;/td&gt;
  &lt;td&gt;(12, 13)&lt;/td&gt;
  &lt;td&gt;(12, 14)&lt;/td&gt;
  &lt;td&gt;(11, 14)&lt;/td&gt;
  &lt;td&gt;(10, 15)&lt;/td&gt;
  &lt;td&gt;(10, 15)&lt;/td&gt;
  &lt;td&gt;(9, 16)&lt;/td&gt;
  &lt;td&gt;(8, 17)&lt;/td&gt;
  &lt;td&gt;(7, 18)&lt;/td&gt;
  &lt;td&gt;(6, 20)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1265&lt;/td&gt;
  &lt;td&gt;2928&lt;/td&gt;
  &lt;td&gt;85&lt;/td&gt;
  &lt;td&gt;(18, 19)&lt;/td&gt;
  &lt;td&gt;(17, 20)&lt;/td&gt;
  &lt;td&gt;(16, 21)&lt;/td&gt;
  &lt;td&gt;(16, 22)&lt;/td&gt;
  &lt;td&gt;(15, 23)&lt;/td&gt;
  &lt;td&gt;(14, 24)&lt;/td&gt;
  &lt;td&gt;(13, 25)&lt;/td&gt;
  &lt;td&gt;(12, 26)&lt;/td&gt;
  &lt;td&gt;(11, 28)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1273&lt;/td&gt;
  &lt;td&gt;2980&lt;/td&gt;
  &lt;td&gt;61&lt;/td&gt;
  &lt;td&gt;(9, 9)&lt;/td&gt;
  &lt;td&gt;(8, 10)&lt;/td&gt;
  &lt;td&gt;(8, 10)&lt;/td&gt;
  &lt;td&gt;(7, 11)&lt;/td&gt;
  &lt;td&gt;(7, 12)&lt;/td&gt;
  &lt;td&gt;(6, 12)&lt;/td&gt;
  &lt;td&gt;(6, 13)&lt;/td&gt;
  &lt;td&gt;(5, 14)&lt;/td&gt;
  &lt;td&gt;(4, 16)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1275&lt;/td&gt;
  &lt;td&gt;1672&lt;/td&gt;
  &lt;td&gt;85&lt;/td&gt;
  &lt;td&gt;(16, 17)&lt;/td&gt;
  &lt;td&gt;(15, 18)&lt;/td&gt;
  &lt;td&gt;(15, 19)&lt;/td&gt;
  &lt;td&gt;(14, 19)&lt;/td&gt;
  &lt;td&gt;(13, 20)&lt;/td&gt;
  &lt;td&gt;(13, 21)&lt;/td&gt;
  &lt;td&gt;(12, 22)&lt;/td&gt;
  &lt;td&gt;(11, 24)&lt;/td&gt;
  &lt;td&gt;(9, 25)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1278&lt;/td&gt;
  &lt;td&gt;1242&lt;/td&gt;
  &lt;td&gt;47&lt;/td&gt;
  &lt;td&gt;(4, 4)&lt;/td&gt;
  &lt;td&gt;(3, 4)&lt;/td&gt;
  &lt;td&gt;(3, 5)&lt;/td&gt;
  &lt;td&gt;(3, 5)&lt;/td&gt;
  &lt;td&gt;(2, 6)&lt;/td&gt;
  &lt;td&gt;(2, 6)&lt;/td&gt;
  &lt;td&gt;(2, 6)&lt;/td&gt;
  &lt;td&gt;(2, 7)&lt;/td&gt;
  &lt;td&gt;(1, 8)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1280&lt;/td&gt;
  &lt;td&gt;1772&lt;/td&gt;
  &lt;td&gt;46&lt;/td&gt;
  &lt;td&gt;(4, 4)&lt;/td&gt;
  &lt;td&gt;(4, 5)&lt;/td&gt;
  &lt;td&gt;(3, 5)&lt;/td&gt;
  &lt;td&gt;(3, 5)&lt;/td&gt;
  &lt;td&gt;(3, 6)&lt;/td&gt;
  &lt;td&gt;(2, 6)&lt;/td&gt;
  &lt;td&gt;(2, 7)&lt;/td&gt;
  &lt;td&gt;(2, 8)&lt;/td&gt;
  &lt;td&gt;(1, 9)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1282&lt;/td&gt;
  &lt;td&gt;1132&lt;/td&gt;
  &lt;td&gt;67&lt;/td&gt;
  &lt;td&gt;(8, 9)&lt;/td&gt;
  &lt;td&gt;(7, 9)&lt;/td&gt;
  &lt;td&gt;(7, 10)&lt;/td&gt;
  &lt;td&gt;(6, 10)&lt;/td&gt;
  &lt;td&gt;(6, 11)&lt;/td&gt;
  &lt;td&gt;(6, 11)&lt;/td&gt;
  &lt;td&gt;(5, 12)&lt;/td&gt;
  &lt;td&gt;(5, 13)&lt;/td&gt;
  &lt;td&gt;(3, 15)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1283&lt;/td&gt;
  &lt;td&gt;1414&lt;/td&gt;
  &lt;td&gt;67&lt;/td&gt;
  &lt;td&gt;(8, 9)&lt;/td&gt;
  &lt;td&gt;(8, 10)&lt;/td&gt;
  &lt;td&gt;(7, 10)&lt;/td&gt;
  &lt;td&gt;(7, 11)&lt;/td&gt;
  &lt;td&gt;(7, 11)&lt;/td&gt;
  &lt;td&gt;(6, 12)&lt;/td&gt;
  &lt;td&gt;(5, 13)&lt;/td&gt;
  &lt;td&gt;(4, 14)&lt;/td&gt;
  &lt;td&gt;(3, 16)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1284&lt;/td&gt;
  &lt;td&gt;1158&lt;/td&gt;
  &lt;td&gt;91&lt;/td&gt;
  &lt;td&gt;(15, 16)&lt;/td&gt;
  &lt;td&gt;(14, 17)&lt;/td&gt;
  &lt;td&gt;(14, 17)&lt;/td&gt;
  &lt;td&gt;(13, 18)&lt;/td&gt;
  &lt;td&gt;(13, 19)&lt;/td&gt;
  &lt;td&gt;(12, 19)&lt;/td&gt;
  &lt;td&gt;(12, 20)&lt;/td&gt;
  &lt;td&gt;(10, 21)&lt;/td&gt;
  &lt;td&gt;(9, 23)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1285&lt;/td&gt;
  &lt;td&gt;2340&lt;/td&gt;
  &lt;td&gt;55&lt;/td&gt;
  &lt;td&gt;(7, 7)&lt;/td&gt;
  &lt;td&gt;(6, 8)&lt;/td&gt;
  &lt;td&gt;(6, 8)&lt;/td&gt;
  &lt;td&gt;(5, 8)&lt;/td&gt;
  &lt;td&gt;(5, 9)&lt;/td&gt;
  &lt;td&gt;(4, 9)&lt;/td&gt;
  &lt;td&gt;(4, 10)&lt;/td&gt;
  &lt;td&gt;(3, 12)&lt;/td&gt;
  &lt;td&gt;(2, 13)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1286&lt;/td&gt;
  &lt;td&gt;1728&lt;/td&gt;
  &lt;td&gt;66&lt;/td&gt;
  &lt;td&gt;(9, 10)&lt;/td&gt;
  &lt;td&gt;(9, 11)&lt;/td&gt;
  &lt;td&gt;(8, 11)&lt;/td&gt;
  &lt;td&gt;(8, 12)&lt;/td&gt;
  &lt;td&gt;(8, 12)&lt;/td&gt;
  &lt;td&gt;(7, 13)&lt;/td&gt;
  &lt;td&gt;(6, 14)&lt;/td&gt;
  &lt;td&gt;(6, 14)&lt;/td&gt;
  &lt;td&gt;(4, 16)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1288&lt;/td&gt;
  &lt;td&gt;1280&lt;/td&gt;
  &lt;td&gt;107&lt;/td&gt;
  &lt;td&gt;(23, 24)&lt;/td&gt;
  &lt;td&gt;(22, 25)&lt;/td&gt;
  &lt;td&gt;(21, 25)&lt;/td&gt;
  &lt;td&gt;(21, 26)&lt;/td&gt;
  &lt;td&gt;(20, 27)&lt;/td&gt;
  &lt;td&gt;(19, 27)&lt;/td&gt;
  &lt;td&gt;(18, 29)&lt;/td&gt;
  &lt;td&gt;(17, 31)&lt;/td&gt;
  &lt;td&gt;(15, 32)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1289&lt;/td&gt;
  &lt;td&gt;2054&lt;/td&gt;
  &lt;td&gt;103&lt;/td&gt;
  &lt;td&gt;(26, 27)&lt;/td&gt;
  &lt;td&gt;(25, 28)&lt;/td&gt;
  &lt;td&gt;(24, 29)&lt;/td&gt;
  &lt;td&gt;(23, 30)&lt;/td&gt;
  &lt;td&gt;(23, 30)&lt;/td&gt;
  &lt;td&gt;(22, 32)&lt;/td&gt;
  &lt;td&gt;(21, 33)&lt;/td&gt;
  &lt;td&gt;(20, 34)&lt;/td&gt;
  &lt;td&gt;(17, 36)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1291&lt;/td&gt;
  &lt;td&gt;1248&lt;/td&gt;
  &lt;td&gt;94&lt;/td&gt;
  &lt;td&gt;(17, 18)&lt;/td&gt;
  &lt;td&gt;(16, 19)&lt;/td&gt;
  &lt;td&gt;(16, 20)&lt;/td&gt;
  &lt;td&gt;(15, 20)&lt;/td&gt;
  &lt;td&gt;(15, 21)&lt;/td&gt;
  &lt;td&gt;(13, 22)&lt;/td&gt;
  &lt;td&gt;(13, 23)&lt;/td&gt;
  &lt;td&gt;(12, 25)&lt;/td&gt;
  &lt;td&gt;(10, 27)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1292&lt;/td&gt;
  &lt;td&gt;1864&lt;/td&gt;
  &lt;td&gt;82&lt;/td&gt;
  &lt;td&gt;(15, 16)&lt;/td&gt;
  &lt;td&gt;(14, 16)&lt;/td&gt;
  &lt;td&gt;(13, 17)&lt;/td&gt;
  &lt;td&gt;(13, 18)&lt;/td&gt;
  &lt;td&gt;(13, 19)&lt;/td&gt;
  &lt;td&gt;(12, 20)&lt;/td&gt;
  &lt;td&gt;(11, 21)&lt;/td&gt;
  &lt;td&gt;(10, 22)&lt;/td&gt;
  &lt;td&gt;(9, 24)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1293&lt;/td&gt;
  &lt;td&gt;1904&lt;/td&gt;
  &lt;td&gt;76&lt;/td&gt;
  &lt;td&gt;(13, 14)&lt;/td&gt;
  &lt;td&gt;(12, 14)&lt;/td&gt;
  &lt;td&gt;(12, 15)&lt;/td&gt;
  &lt;td&gt;(11, 16)&lt;/td&gt;
  &lt;td&gt;(11, 16)&lt;/td&gt;
  &lt;td&gt;(10, 17)&lt;/td&gt;
  &lt;td&gt;(9, 18)&lt;/td&gt;
  &lt;td&gt;(8, 19)&lt;/td&gt;
  &lt;td&gt;(7, 22)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1294&lt;/td&gt;
  &lt;td&gt;1784&lt;/td&gt;
  &lt;td&gt;78&lt;/td&gt;
  &lt;td&gt;(14, 15)&lt;/td&gt;
  &lt;td&gt;(13, 16)&lt;/td&gt;
  &lt;td&gt;(12, 16)&lt;/td&gt;
  &lt;td&gt;(12, 17)&lt;/td&gt;
  &lt;td&gt;(11, 18)&lt;/td&gt;
  &lt;td&gt;(11, 19)&lt;/td&gt;
  &lt;td&gt;(10, 19)&lt;/td&gt;
  &lt;td&gt;(9, 20)&lt;/td&gt;
  &lt;td&gt;(8, 23)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1295&lt;/td&gt;
  &lt;td&gt;1408&lt;/td&gt;
  &lt;td&gt;70&lt;/td&gt;
  &lt;td&gt;(10, 10)&lt;/td&gt;
  &lt;td&gt;(9, 11)&lt;/td&gt;
  &lt;td&gt;(9, 12)&lt;/td&gt;
  &lt;td&gt;(8, 12)&lt;/td&gt;
  &lt;td&gt;(8, 12)&lt;/td&gt;
  &lt;td&gt;(7, 13)&lt;/td&gt;
  &lt;td&gt;(7, 14)&lt;/td&gt;
  &lt;td&gt;(6, 15)&lt;/td&gt;
  &lt;td&gt;(4, 17)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1296&lt;/td&gt;
  &lt;td&gt;2034&lt;/td&gt;
  &lt;td&gt;55&lt;/td&gt;
  &lt;td&gt;(7, 7)&lt;/td&gt;
  &lt;td&gt;(6, 8)&lt;/td&gt;
  &lt;td&gt;(6, 8)&lt;/td&gt;
  &lt;td&gt;(6, 8)&lt;/td&gt;
  &lt;td&gt;(5, 9)&lt;/td&gt;
  &lt;td&gt;(4, 9)&lt;/td&gt;
  &lt;td&gt;(4, 10)&lt;/td&gt;
  &lt;td&gt;(4, 11)&lt;/td&gt;
  &lt;td&gt;(3, 12)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1298&lt;/td&gt;
  &lt;td&gt;1478&lt;/td&gt;
  &lt;td&gt;72&lt;/td&gt;
  &lt;td&gt;(10, 11)&lt;/td&gt;
  &lt;td&gt;(9, 12)&lt;/td&gt;
  &lt;td&gt;(9, 12)&lt;/td&gt;
  &lt;td&gt;(9, 13)&lt;/td&gt;
  &lt;td&gt;(8, 13)&lt;/td&gt;
  &lt;td&gt;(8, 14)&lt;/td&gt;
  &lt;td&gt;(7, 15)&lt;/td&gt;
  &lt;td&gt;(6, 16)&lt;/td&gt;
  &lt;td&gt;(5, 18)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1308&lt;/td&gt;
  &lt;td&gt;1160&lt;/td&gt;
  &lt;td&gt;58&lt;/td&gt;
  &lt;td&gt;(6, 6)&lt;/td&gt;
  &lt;td&gt;(5, 7)&lt;/td&gt;
  &lt;td&gt;(5, 7)&lt;/td&gt;
  &lt;td&gt;(5, 7)&lt;/td&gt;
  &lt;td&gt;(4, 8)&lt;/td&gt;
  &lt;td&gt;(4, 8)&lt;/td&gt;
  &lt;td&gt;(3, 9)&lt;/td&gt;
  &lt;td&gt;(3, 10)&lt;/td&gt;
  &lt;td&gt;(2, 11)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1310&lt;/td&gt;
  &lt;td&gt;1066&lt;/td&gt;
  &lt;td&gt;80&lt;/td&gt;
  &lt;td&gt;(11, 12)&lt;/td&gt;
  &lt;td&gt;(11, 13)&lt;/td&gt;
  &lt;td&gt;(10, 13)&lt;/td&gt;
  &lt;td&gt;(9, 14)&lt;/td&gt;
  &lt;td&gt;(9, 15)&lt;/td&gt;
  &lt;td&gt;(8, 15)&lt;/td&gt;
  &lt;td&gt;(7, 16)&lt;/td&gt;
  &lt;td&gt;(7, 17)&lt;/td&gt;
  &lt;td&gt;(5, 19)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B1374&lt;/td&gt;
  &lt;td&gt;2364&lt;/td&gt;
  &lt;td&gt;48&lt;/td&gt;
  &lt;td&gt;(5, 5)&lt;/td&gt;
  &lt;td&gt;(4, 6)&lt;/td&gt;
  &lt;td&gt;(4, 6)&lt;/td&gt;
  &lt;td&gt;(4, 6)&lt;/td&gt;
  &lt;td&gt;(3, 7)&lt;/td&gt;
  &lt;td&gt;(3, 7)&lt;/td&gt;
  &lt;td&gt;(3, 8)&lt;/td&gt;
  &lt;td&gt;(2, 9)&lt;/td&gt;
  &lt;td&gt;(2, 10)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B940&lt;/td&gt;
  &lt;td&gt;2874&lt;/td&gt;
  &lt;td&gt;93&lt;/td&gt;
  &lt;td&gt;(22, 24)&lt;/td&gt;
  &lt;td&gt;(21, 25)&lt;/td&gt;
  &lt;td&gt;(21, 25)&lt;/td&gt;
  &lt;td&gt;(20, 26)&lt;/td&gt;
  &lt;td&gt;(19, 27)&lt;/td&gt;
  &lt;td&gt;(19, 28)&lt;/td&gt;
  &lt;td&gt;(18, 30)&lt;/td&gt;
  &lt;td&gt;(16, 32)&lt;/td&gt;
  &lt;td&gt;(14, 33)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B941&lt;/td&gt;
  &lt;td&gt;2154&lt;/td&gt;
  &lt;td&gt;48&lt;/td&gt;
  &lt;td&gt;(5, 6)&lt;/td&gt;
  &lt;td&gt;(5, 6)&lt;/td&gt;
  &lt;td&gt;(4, 6)&lt;/td&gt;
  &lt;td&gt;(4, 7)&lt;/td&gt;
  &lt;td&gt;(4, 7)&lt;/td&gt;
  &lt;td&gt;(3, 7)&lt;/td&gt;
  &lt;td&gt;(3, 8)&lt;/td&gt;
  &lt;td&gt;(2, 9)&lt;/td&gt;
  &lt;td&gt;(2, 11)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B944&lt;/td&gt;
  &lt;td&gt;954&lt;/td&gt;
  &lt;td&gt;52&lt;/td&gt;
  &lt;td&gt;(4, 4)&lt;/td&gt;
  &lt;td&gt;(4, 5)&lt;/td&gt;
  &lt;td&gt;(3, 5)&lt;/td&gt;
  &lt;td&gt;(3, 5)&lt;/td&gt;
  &lt;td&gt;(3, 6)&lt;/td&gt;
  &lt;td&gt;(2, 6)&lt;/td&gt;
  &lt;td&gt;(2, 6)&lt;/td&gt;
  &lt;td&gt;(2, 7)&lt;/td&gt;
  &lt;td&gt;(1, 9)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B945&lt;/td&gt;
  &lt;td&gt;2390&lt;/td&gt;
  &lt;td&gt;67&lt;/td&gt;
  &lt;td&gt;(10, 11)&lt;/td&gt;
  &lt;td&gt;(10, 12)&lt;/td&gt;
  &lt;td&gt;(9, 12)&lt;/td&gt;
  &lt;td&gt;(9, 13)&lt;/td&gt;
  &lt;td&gt;(8, 13)&lt;/td&gt;
  &lt;td&gt;(8, 14)&lt;/td&gt;
  &lt;td&gt;(7, 15)&lt;/td&gt;
  &lt;td&gt;(7, 16)&lt;/td&gt;
  &lt;td&gt;(5, 17)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B946&lt;/td&gt;
  &lt;td&gt;5012&lt;/td&gt;
  &lt;td&gt;85&lt;/td&gt;
  &lt;td&gt;(20, 21)&lt;/td&gt;
  &lt;td&gt;(19, 22)&lt;/td&gt;
  &lt;td&gt;(19, 23)&lt;/td&gt;
  &lt;td&gt;(18, 24)&lt;/td&gt;
  &lt;td&gt;(18, 24)&lt;/td&gt;
  &lt;td&gt;(17, 26)&lt;/td&gt;
  &lt;td&gt;(16, 27)&lt;/td&gt;
  &lt;td&gt;(15, 28)&lt;/td&gt;
  &lt;td&gt;(12, 31)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B947&lt;/td&gt;
  &lt;td&gt;3356&lt;/td&gt;
  &lt;td&gt;62&lt;/td&gt;
  &lt;td&gt;(10, 11)&lt;/td&gt;
  &lt;td&gt;(9, 11)&lt;/td&gt;
  &lt;td&gt;(9, 12)&lt;/td&gt;
  &lt;td&gt;(8, 12)&lt;/td&gt;
  &lt;td&gt;(7, 13)&lt;/td&gt;
  &lt;td&gt;(7, 14)&lt;/td&gt;
  &lt;td&gt;(6, 14)&lt;/td&gt;
  &lt;td&gt;(5, 15)&lt;/td&gt;
  &lt;td&gt;(5, 17)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B948&lt;/td&gt;
  &lt;td&gt;2384&lt;/td&gt;
  &lt;td&gt;80&lt;/td&gt;
  &lt;td&gt;(16, 17)&lt;/td&gt;
  &lt;td&gt;(15, 18)&lt;/td&gt;
  &lt;td&gt;(14, 18)&lt;/td&gt;
  &lt;td&gt;(14, 19)&lt;/td&gt;
  &lt;td&gt;(13, 20)&lt;/td&gt;
  &lt;td&gt;(12, 21)&lt;/td&gt;
  &lt;td&gt;(11, 22)&lt;/td&gt;
  &lt;td&gt;(10, 23)&lt;/td&gt;
  &lt;td&gt;(9, 25)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B950&lt;/td&gt;
  &lt;td&gt;1560&lt;/td&gt;
  &lt;td&gt;63&lt;/td&gt;
  &lt;td&gt;(8, 9)&lt;/td&gt;
  &lt;td&gt;(8, 10)&lt;/td&gt;
  &lt;td&gt;(8, 10)&lt;/td&gt;
  &lt;td&gt;(7, 10)&lt;/td&gt;
  &lt;td&gt;(7, 11)&lt;/td&gt;
  &lt;td&gt;(6, 11)&lt;/td&gt;
  &lt;td&gt;(5, 12)&lt;/td&gt;
  &lt;td&gt;(5, 13)&lt;/td&gt;
  &lt;td&gt;(4, 15)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B952&lt;/td&gt;
  &lt;td&gt;1648&lt;/td&gt;
  &lt;td&gt;57&lt;/td&gt;
  &lt;td&gt;(7, 7)&lt;/td&gt;
  &lt;td&gt;(6, 8)&lt;/td&gt;
  &lt;td&gt;(6, 8)&lt;/td&gt;
  &lt;td&gt;(6, 8)&lt;/td&gt;
  &lt;td&gt;(5, 9)&lt;/td&gt;
  &lt;td&gt;(5, 9)&lt;/td&gt;
  &lt;td&gt;(4, 10)&lt;/td&gt;
  &lt;td&gt;(3, 11)&lt;/td&gt;
  &lt;td&gt;(3, 12)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B953&lt;/td&gt;
  &lt;td&gt;1452&lt;/td&gt;
  &lt;td&gt;32&lt;/td&gt;
  &lt;td&gt;(2, 2)&lt;/td&gt;
  &lt;td&gt;(1, 2)&lt;/td&gt;
  &lt;td&gt;(1, 2)&lt;/td&gt;
  &lt;td&gt;(1, 3)&lt;/td&gt;
  &lt;td&gt;(1, 3)&lt;/td&gt;
  &lt;td&gt;(1, 3)&lt;/td&gt;
  &lt;td&gt;(0, 3)&lt;/td&gt;
  &lt;td&gt;(0, 4)&lt;/td&gt;
  &lt;td&gt;(0, 5)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B954&lt;/td&gt;
  &lt;td&gt;1996&lt;/td&gt;
  &lt;td&gt;29&lt;/td&gt;
  &lt;td&gt;(2, 2)&lt;/td&gt;
  &lt;td&gt;(1, 2)&lt;/td&gt;
  &lt;td&gt;(1, 2)&lt;/td&gt;
  &lt;td&gt;(1, 2)&lt;/td&gt;
  &lt;td&gt;(1, 3)&lt;/td&gt;
  &lt;td&gt;(1, 3)&lt;/td&gt;
  &lt;td&gt;(0, 3)&lt;/td&gt;
  &lt;td&gt;(0, 4)&lt;/td&gt;
  &lt;td&gt;(0, 4)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B955&lt;/td&gt;
  &lt;td&gt;1474&lt;/td&gt;
  &lt;td&gt;65&lt;/td&gt;
  &lt;td&gt;(8, 9)&lt;/td&gt;
  &lt;td&gt;(8, 9)&lt;/td&gt;
  &lt;td&gt;(7, 9)&lt;/td&gt;
  &lt;td&gt;(7, 10)&lt;/td&gt;
  &lt;td&gt;(7, 10)&lt;/td&gt;
  &lt;td&gt;(6, 11)&lt;/td&gt;
  &lt;td&gt;(5, 12)&lt;/td&gt;
  &lt;td&gt;(5, 13)&lt;/td&gt;
  &lt;td&gt;(4, 14)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B956&lt;/td&gt;
  &lt;td&gt;1482&lt;/td&gt;
  &lt;td&gt;71&lt;/td&gt;
  &lt;td&gt;(10, 11)&lt;/td&gt;
  &lt;td&gt;(10, 12)&lt;/td&gt;
  &lt;td&gt;(10, 12)&lt;/td&gt;
  &lt;td&gt;(9, 13)&lt;/td&gt;
  &lt;td&gt;(8, 14)&lt;/td&gt;
  &lt;td&gt;(8, 14)&lt;/td&gt;
  &lt;td&gt;(7, 15)&lt;/td&gt;
  &lt;td&gt;(6, 16)&lt;/td&gt;
  &lt;td&gt;(5, 18)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B957&lt;/td&gt;
  &lt;td&gt;2604&lt;/td&gt;
  &lt;td&gt;36&lt;/td&gt;
  &lt;td&gt;(3, 3)&lt;/td&gt;
  &lt;td&gt;(3, 3)&lt;/td&gt;
  &lt;td&gt;(2, 4)&lt;/td&gt;
  &lt;td&gt;(2, 4)&lt;/td&gt;
  &lt;td&gt;(2, 5)&lt;/td&gt;
  &lt;td&gt;(1, 5)&lt;/td&gt;
  &lt;td&gt;(1, 6)&lt;/td&gt;
  &lt;td&gt;(1, 6)&lt;/td&gt;
  &lt;td&gt;(1, 7)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B958&lt;/td&gt;
  &lt;td&gt;2840&lt;/td&gt;
  &lt;td&gt;29&lt;/td&gt;
  &lt;td&gt;(2, 2)&lt;/td&gt;
  &lt;td&gt;(2, 2)&lt;/td&gt;
  &lt;td&gt;(1, 2)&lt;/td&gt;
  &lt;td&gt;(1, 3)&lt;/td&gt;
  &lt;td&gt;(1, 3)&lt;/td&gt;
  &lt;td&gt;(1, 3)&lt;/td&gt;
  &lt;td&gt;(1, 4)&lt;/td&gt;
  &lt;td&gt;(0, 4)&lt;/td&gt;
  &lt;td&gt;(0, 5)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B961&lt;/td&gt;
  &lt;td&gt;1214&lt;/td&gt;
  &lt;td&gt;36&lt;/td&gt;
  &lt;td&gt;(2, 3)&lt;/td&gt;
  &lt;td&gt;(2, 3)&lt;/td&gt;
  &lt;td&gt;(2, 3)&lt;/td&gt;
  &lt;td&gt;(2, 4)&lt;/td&gt;
  &lt;td&gt;(1, 4)&lt;/td&gt;
  &lt;td&gt;(1, 4)&lt;/td&gt;
  &lt;td&gt;(1, 5)&lt;/td&gt;
  &lt;td&gt;(1, 5)&lt;/td&gt;
  &lt;td&gt;(0, 6)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B962&lt;/td&gt;
  &lt;td&gt;1138&lt;/td&gt;
  &lt;td&gt;41&lt;/td&gt;
  &lt;td&gt;(3, 3)&lt;/td&gt;
  &lt;td&gt;(2, 3)&lt;/td&gt;
  &lt;td&gt;(2, 3)&lt;/td&gt;
  &lt;td&gt;(2, 4)&lt;/td&gt;
  &lt;td&gt;(2, 4)&lt;/td&gt;
  &lt;td&gt;(1, 4)&lt;/td&gt;
  &lt;td&gt;(1, 5)&lt;/td&gt;
  &lt;td&gt;(1, 6)&lt;/td&gt;
  &lt;td&gt;(0, 7)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B963&lt;/td&gt;
  &lt;td&gt;1600&lt;/td&gt;
  &lt;td&gt;71&lt;/td&gt;
  &lt;td&gt;(10, 12)&lt;/td&gt;
  &lt;td&gt;(10, 12)&lt;/td&gt;
  &lt;td&gt;(9, 12)&lt;/td&gt;
  &lt;td&gt;(9, 13)&lt;/td&gt;
  &lt;td&gt;(8, 14)&lt;/td&gt;
  &lt;td&gt;(8, 14)&lt;/td&gt;
  &lt;td&gt;(7, 15)&lt;/td&gt;
  &lt;td&gt;(5, 16)&lt;/td&gt;
  &lt;td&gt;(4, 19)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B966&lt;/td&gt;
  &lt;td&gt;1950&lt;/td&gt;
  &lt;td&gt;80&lt;/td&gt;
  &lt;td&gt;(15, 16)&lt;/td&gt;
  &lt;td&gt;(14, 16)&lt;/td&gt;
  &lt;td&gt;(14, 17)&lt;/td&gt;
  &lt;td&gt;(13, 17)&lt;/td&gt;
  &lt;td&gt;(12, 18)&lt;/td&gt;
  &lt;td&gt;(11, 18)&lt;/td&gt;
  &lt;td&gt;(11, 20)&lt;/td&gt;
  &lt;td&gt;(10, 22)&lt;/td&gt;
  &lt;td&gt;(9, 23)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B967&lt;/td&gt;
  &lt;td&gt;1108&lt;/td&gt;
  &lt;td&gt;47&lt;/td&gt;
  &lt;td&gt;(3, 4)&lt;/td&gt;
  &lt;td&gt;(3, 4)&lt;/td&gt;
  &lt;td&gt;(3, 4)&lt;/td&gt;
  &lt;td&gt;(2, 5)&lt;/td&gt;
  &lt;td&gt;(2, 5)&lt;/td&gt;
  &lt;td&gt;(2, 5)&lt;/td&gt;
  &lt;td&gt;(2, 6)&lt;/td&gt;
  &lt;td&gt;(1, 7)&lt;/td&gt;
  &lt;td&gt;(1, 7)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B968&lt;/td&gt;
  &lt;td&gt;2432&lt;/td&gt;
  &lt;td&gt;52&lt;/td&gt;
  &lt;td&gt;(6, 7)&lt;/td&gt;
  &lt;td&gt;(6, 7)&lt;/td&gt;
  &lt;td&gt;(6, 7)&lt;/td&gt;
  &lt;td&gt;(5, 8)&lt;/td&gt;
  &lt;td&gt;(5, 8)&lt;/td&gt;
  &lt;td&gt;(4, 9)&lt;/td&gt;
  &lt;td&gt;(3, 9)&lt;/td&gt;
  &lt;td&gt;(3, 10)&lt;/td&gt;
  &lt;td&gt;(2, 11)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B971&lt;/td&gt;
  &lt;td&gt;1462&lt;/td&gt;
  &lt;td&gt;49&lt;/td&gt;
  &lt;td&gt;(4, 5)&lt;/td&gt;
  &lt;td&gt;(4, 5)&lt;/td&gt;
  &lt;td&gt;(4, 5)&lt;/td&gt;
  &lt;td&gt;(3, 6)&lt;/td&gt;
  &lt;td&gt;(3, 6)&lt;/td&gt;
  &lt;td&gt;(3, 7)&lt;/td&gt;
  &lt;td&gt;(2, 7)&lt;/td&gt;
  &lt;td&gt;(2, 8)&lt;/td&gt;
  &lt;td&gt;(1, 9)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B972&lt;/td&gt;
  &lt;td&gt;1438&lt;/td&gt;
  &lt;td&gt;75&lt;/td&gt;
  &lt;td&gt;(11, 12)&lt;/td&gt;
  &lt;td&gt;(11, 13)&lt;/td&gt;
  &lt;td&gt;(10, 13)&lt;/td&gt;
  &lt;td&gt;(10, 14)&lt;/td&gt;
  &lt;td&gt;(9, 14)&lt;/td&gt;
  &lt;td&gt;(8, 15)&lt;/td&gt;
  &lt;td&gt;(8, 16)&lt;/td&gt;
  &lt;td&gt;(7, 17)&lt;/td&gt;
  &lt;td&gt;(6, 18)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B974&lt;/td&gt;
  &lt;td&gt;5072&lt;/td&gt;
  &lt;td&gt;54&lt;/td&gt;
  &lt;td&gt;(7, 8)&lt;/td&gt;
  &lt;td&gt;(7, 9)&lt;/td&gt;
  &lt;td&gt;(6, 9)&lt;/td&gt;
  &lt;td&gt;(6, 9)&lt;/td&gt;
  &lt;td&gt;(5, 10)&lt;/td&gt;
  &lt;td&gt;(5, 10)&lt;/td&gt;
  &lt;td&gt;(4, 11)&lt;/td&gt;
  &lt;td&gt;(4, 12)&lt;/td&gt;
  &lt;td&gt;(2, 14)&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;B975&lt;/td&gt;
  &lt;td&gt;1542&lt;/td&gt;
  &lt;td&gt;63&lt;/td&gt;
  &lt;td&gt;(7, 8)&lt;/td&gt;
  &lt;td&gt;(7, 9)&lt;/td&gt;
  &lt;td&gt;(6, 9)&lt;/td&gt;
  &lt;td&gt;(6, 9)&lt;/td&gt;
  &lt;td&gt;(6, 10)&lt;/td&gt;
  &lt;td&gt;(5, 11)&lt;/td&gt;
  &lt;td&gt;(5, 11)&lt;/td&gt;
  &lt;td&gt;(4, 12)&lt;/td&gt;
  &lt;td&gt;(3, 14)&lt;/td&gt;
 &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;span style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;
Reading the last row, subject B975 yielded 1542 reads; in a random subset of 400 reads, there were 63 different species (or more precisely, OTUs). &amp;nbsp;My algorithm predicts that if we look at all 1542 reads, the number of additional species we'll find is between 3 and 14, with 90% confidence.&lt;br /&gt;
&lt;br /&gt;
I have to say that this table fills me with dread. &amp;nbsp;The intervals seem quite small, which is to say that the algorithm is more confident than I am. &amp;nbsp;The 90% CIs seem especially narrow to me; it is hard for me to believe that 90% of them will contain the correct values. &amp;nbsp;Well, I guess that's why Karl Popper called them "&lt;a href="http://en.wikipedia.org/wiki/Bold_hypothesis"&gt;bold hypotheses&lt;/a&gt;". &amp;nbsp;We'll find out soon whether they are bold, or just reckless.&lt;br /&gt;
&lt;br /&gt;
I want to thank Rob Dunn at BBB2 for his help with this project. &amp;nbsp;The &lt;a href="http://code.google.com/p/thinkstats/source/browse/trunk/workspace.thinkstats/ThinkStats/species.py"&gt;code&lt;/a&gt; and data I used to generate these results are available from &lt;a href="http://code.google.com/p/thinkstats/"&gt;this SVN repository&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
EDIT 2-22-13: I ran the predictions again with more simulations. &amp;nbsp;The results are not substantially different. &amp;nbsp;I still haven't looked at the answers.&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/t3CQgGjeE0Q" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/5942125760585287432/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2013/02/belly-button-biodiversity-part-three.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/5942125760585287432?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/5942125760585287432?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/t3CQgGjeE0Q/belly-button-biodiversity-part-three.html" title="Belly Button Biodiversity: Part Three" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><thr:total>0</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2013/02/belly-button-biodiversity-part-three.html</feedburner:origLink></entry><entry gd:etag="W/&quot;D0MDR3o9eyp7ImA9WhBTE0k.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-962740035154830947</id><published>2013-02-08T09:44:00.001-08:00</published><updated>2013-02-08T09:44:36.463-08:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2013-02-08T09:44:36.463-08:00</app:edited><title>Belly Button Biodiversity: Part Two</title><content type="html">&lt;br /&gt;
This is part two of a series of articles about a Bayesian solution to the &lt;a href="http://en.wikipedia.org/wiki/Species_discovery_curve"&gt;Unseen Species&lt;/a&gt; problem, applied to data from the &lt;a href="http://bbdata.yourwildlife.org/"&gt;Belly Button Biodiversity&lt;/a&gt; project.&lt;br /&gt;
&lt;br /&gt;
In &lt;a href="http://allendowney.blogspot.com/2013/02/belly-button-biodiversity-part-one.html"&gt;Part One&lt;/a&gt; I presented the simplest version of the algorithm, which I think is easy to understand, but slow. &amp;nbsp;In &lt;i&gt;&lt;a href="http://www.greenteapress.com/thinkbayes/html/thinkbayes013.html"&gt;Think Bayes&lt;/a&gt;&lt;/i&gt; I present some ways to optimize it. &amp;nbsp;Now in Part Two I apply the algorithm to real data and generate predictive distributions. &amp;nbsp;In Part Three I will publish the predictions the algorithm generates, and in Part Four I will compare the predictions to actual data.&lt;br /&gt;
&lt;br /&gt;
Background: Belly Button Biodiversity 2.0 (BBB2) is a nation-wide citizen science project with the goal of identifying bacterial species that can be found in human navels (http://bbdata.yourwildlife.org).&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
The belly button data&lt;/h3&gt;
&lt;a href="" name="belly"&gt;&lt;/a&gt;&lt;br /&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
To get a sense of what the data look like, consider subject B1242, whose sample of 400 reads yielded 61 species with the following counts:&lt;/div&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;92, 53, 47, 38, 15, 14, 12, 10, 8, 7, 7, 5, 5, 
4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
&lt;/pre&gt;
There are a few dominant species that make up a substantial fraction of the whole, but many species that yielded only a single read. The number of these “singletons” suggests that there are likely to be at least a few unseen species.&lt;br /&gt;
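&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
As a quick check on these counts, the following Python sketch tallies the reads, the species, and the singletons. The variable names are mine and are not part of the project code; the counts are copied from the listing above.&lt;/div&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;
```python
# Read counts for subject B1242, copied from the listing above.
counts = ([92, 53, 47, 38, 15, 14, 12, 10, 8, 7, 7, 5, 5]
          + [4] * 7 + [3] * 7 + [2] * 4 + [1] * 30)

total_reads = sum(counts)                      # reads in the subsample
num_species = len(counts)                      # species (OTUs) observed
singletons = sum(1 for c in counts if c == 1)  # species seen exactly once
top_share = counts[0] / total_reads            # share of the dominant species

print(total_reads, num_species, singletons, top_share)
```
&lt;/pre&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
This prints 400 reads, 61 species, 30 singletons, and a share of 0.23 for the most common species, so about half of the observed species were seen only once.&lt;/div&gt;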
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
In the example with lions and tigers, we assume that each animal in the preserve is equally likely to be observed. Similarly, for the belly button data, we assume that each bacterium is equally likely to yield a read.&lt;/div&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
In reality, it is possible that each step in the data-collection process might introduce consistent biases. Some species might be more likely to be picked up by a swab, or to yield identifiable amplicons. So when we talk about the prevalence of each species, we should remember this source of error.&lt;/div&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
I should also acknowledge that I am using the term “species” loosely. First, bacterial species are not well-defined. Second, some reads identify a particular species, others only identify a genus. To be more precise, I should say “operational taxonomic unit”, or OTU.&lt;/div&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
Now let’s process some of the belly button data. I defined a class called&amp;nbsp;&lt;tt&gt;Subject&lt;/tt&gt;&amp;nbsp;to represent information about each subject in the study:&lt;/div&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;class Subject(object):

    def __init__(self, code):
        self.code = code
        self.species = []
&lt;/pre&gt;
Each subject has a string code, like “B1242”, and a list of (count, species name) pairs, sorted in increasing order by count.&amp;nbsp;&lt;tt&gt;Subject&lt;/tt&gt;&amp;nbsp;provides several methods to make it easy to access these counts and species names. You can see the details in&amp;nbsp;&lt;tt&gt;&lt;a href="http://thinkbayes.com/species.py"&gt;http://thinkbayes.com/species.py&lt;/a&gt;&lt;/tt&gt;.&lt;br /&gt;
&lt;blockquote class="figure" style="margin-left: 4ex; margin-right: 4ex;"&gt;
&lt;div class="center" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;
&lt;hr size="2" width="80%" /&gt;
&lt;/div&gt;
&lt;div class="center" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;
&lt;img src="http://www.greenteapress.com/thinkbayes/html/thinkbayes030.png" style="border: 0px;" /&gt;&lt;/div&gt;
&lt;div class="caption" style="margin-left: auto; margin-right: auto; padding-left: 2ex; padding-right: 2ex;"&gt;
&lt;table cellpadding="0" cellspacing="6" style="margin-left: inherit; margin-right: inherit;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align="left" valign="top"&gt;Figure 12.3: Distribution of&amp;nbsp;&lt;tt&gt;n&lt;/tt&gt;&amp;nbsp;for subject B1242.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;a href="" name="species-ndist"&gt;&lt;/a&gt;&lt;div class="center" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;
&lt;hr size="2" width="80%" /&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
In addition,&amp;nbsp;&lt;tt&gt;Subject.Process&lt;/tt&gt;&amp;nbsp;creates a suite, specifically a suite of type&amp;nbsp;&lt;tt&gt;Species5&lt;/tt&gt;, which represents the distribution of&amp;nbsp;&lt;tt&gt;n&lt;/tt&gt;&amp;nbsp;and the prevalences after processing the data.&lt;br /&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
It also provides&amp;nbsp;&lt;tt&gt;PlotDistOfN&lt;/tt&gt;, which plots the posterior distribution of&amp;nbsp;&lt;tt&gt;n&lt;/tt&gt;. Figure&amp;nbsp;&lt;a href="http://www.greenteapress.com/thinkbayes/html/thinkbayes013.html#species-ndist" style="color: black;"&gt;12.3&lt;/a&gt;&amp;nbsp;shows this distribution for subject B1242. The probability that there are exactly 61 species, and no unseen species, is nearly zero. The most likely value is 72, with 90% credible interval 66 to 79. At the high end, it is unlikely that there are as many as 87 species.&lt;/div&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
Next we compute the posterior distribution of prevalence for each species.&amp;nbsp;&lt;tt&gt;Species2&lt;/tt&gt;&amp;nbsp;provides&amp;nbsp;&lt;tt&gt;DistOfPrevalence&lt;/tt&gt;:&lt;/div&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;# class Species2

    def DistOfPrevalence(self, index):
        pmfs = thinkbayes.Pmf()

        for n, prob in zip(self.ns, self.probs):
            beta = self.MarginalBeta(n, index)
            pmf = beta.MakePmf()
            pmfs.Set(pmf, prob)

        mix = thinkbayes.MakeMixture(pmfs)
        return pmfs, mix
&lt;/pre&gt;
&lt;tt&gt;index&lt;/tt&gt;&amp;nbsp;indicates which species we want. For each value of&amp;nbsp;&lt;tt&gt;n&lt;/tt&gt;, we have a different posterior distribution of prevalence.&lt;br /&gt;
&lt;blockquote class="figure" style="margin-left: 4ex; margin-right: 4ex;"&gt;
&lt;div class="center" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;
&lt;hr size="2" width="80%" /&gt;
&lt;/div&gt;
&lt;div class="center" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;
&lt;img src="http://www.greenteapress.com/thinkbayes/html/thinkbayes031.png" style="border: 0px;" /&gt;&lt;/div&gt;
&lt;div class="caption" style="margin-left: auto; margin-right: auto; padding-left: 2ex; padding-right: 2ex;"&gt;
&lt;table cellpadding="0" cellspacing="6" style="margin-left: inherit; margin-right: inherit;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align="left" valign="top"&gt;Figure 12.4: Distribution of prevalences for subject B1242.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;a href="" name="species-prev"&gt;&lt;/a&gt;&lt;div class="center" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;
&lt;hr size="2" width="80%" /&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
So the loop iterates through the possible values of&amp;nbsp;&lt;tt&gt;n&lt;/tt&gt;&amp;nbsp;and their probabilities. For each value of&amp;nbsp;&lt;tt&gt;n&lt;/tt&gt;&amp;nbsp;it gets a Beta object representing the marginal distribution for the indicated species. Remember that Beta objects contain the parameters&amp;nbsp;&lt;tt&gt;alpha&lt;/tt&gt;&amp;nbsp;and&amp;nbsp;&lt;tt&gt;beta&lt;/tt&gt;; they don’t have values and probabilities like a Pmf, but they provide&amp;nbsp;&lt;tt&gt;MakePmf&lt;/tt&gt;&amp;nbsp;which generates a discrete approximation to the continuous beta distribution.&lt;br /&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
&lt;tt&gt;pmfs&lt;/tt&gt;&amp;nbsp;is a MetaPmf that contains the distributions of prevalence, conditioned on&amp;nbsp;&lt;tt&gt;n&lt;/tt&gt;.&amp;nbsp;&lt;tt&gt;MakeMixture&lt;/tt&gt;&amp;nbsp;combines the MetaPmf into&amp;nbsp;&lt;tt&gt;mix&lt;/tt&gt;, a single distribution of prevalence in which each conditional distribution is weighted by the probability of its&amp;nbsp;&lt;tt&gt;n&lt;/tt&gt;.&lt;/div&gt;
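&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
To make the mixture step concrete, here is a minimal sketch that uses plain dictionaries in place of Pmf objects. The function name&amp;nbsp;&lt;code&gt;make_mixture&lt;/code&gt;&amp;nbsp;and the toy numbers are assumptions for illustration; this is not the&amp;nbsp;&lt;tt&gt;thinkbayes&lt;/tt&gt;&amp;nbsp;implementation.&lt;/div&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;
```python
from collections import defaultdict

def make_mixture(weighted_pmfs):
    """Combine (pmf, weight) pairs into one distribution.

    Each pmf is a dict mapping value to probability; weights sum to 1.
    """
    mix = defaultdict(float)
    for pmf, weight in weighted_pmfs:
        for value, prob in pmf.items():
            mix[value] += weight * prob
    return dict(mix)

# Toy example: two conditional distributions of prevalence, one for each
# of two hypothetical values of n, weighted by the probability of each n.
pmf_a = {0.20: 0.5, 0.25: 0.5}
pmf_b = {0.15: 0.5, 0.20: 0.5}
mix = make_mixture([(pmf_a, 0.75), (pmf_b, 0.25)])
print(mix)
```
&lt;/pre&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
In this toy case the value 0.20 appears in both conditional distributions, so its probability in the mixture is 0.75 * 0.5 + 0.25 * 0.5 = 0.5.&lt;/div&gt;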
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
Figure&amp;nbsp;&lt;a href="http://www.greenteapress.com/thinkbayes/html/thinkbayes013.html#species-prev" style="color: black;"&gt;12.4&lt;/a&gt;&amp;nbsp;shows these distributions for the five species with the most reads. The most prevalent species accounts for 23% of the 400 reads, but since there are almost certainly unseen species, the most likely estimate for its prevalence is 20%, with 90% credible interval between 17% and 23%.&lt;/div&gt;
&lt;h3&gt;
Predictive distributions&lt;/h3&gt;
&lt;h2 class="section"&gt;
&lt;a href="" name="toc89"&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;blockquote class="figure" style="margin-left: 4ex; margin-right: 4ex;"&gt;
&lt;div class="center" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;
&lt;hr size="2" width="80%" /&gt;
&lt;/div&gt;
&lt;div class="center" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;
&lt;img src="http://www.greenteapress.com/thinkbayes/html/thinkbayes032.png" style="border: 0px;" /&gt;&lt;/div&gt;
&lt;div class="caption" style="margin-left: auto; margin-right: auto; padding-left: 2ex; padding-right: 2ex;"&gt;
&lt;table cellpadding="0" cellspacing="6" style="margin-left: inherit; margin-right: inherit;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align="left" valign="top"&gt;Figure 12.5: Simulated rarefaction curves for subject B1242.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;a href="" name="species-rare"&gt;&lt;/a&gt;&lt;div class="center" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;
&lt;hr size="2" width="80%" /&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
I introduced the unseen species problem in the form of four related questions. We have answered the first two by computing the posterior distribution for&amp;nbsp;&lt;tt&gt;n&lt;/tt&gt;&amp;nbsp;and the prevalence of each species.&lt;br /&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
The other two questions are:&lt;/div&gt;
&lt;ul class="itemize"&gt;
&lt;li class="li-itemize" style="margin: 1ex 0ex;"&gt;If we are planning to collect additional samples, can we predict how many new species we are likely to discover?&lt;/li&gt;
&lt;li class="li-itemize" style="margin: 1ex 0ex;"&gt;How many additional reads are needed to increase the fraction of observed species to a given threshold?&lt;/li&gt;
&lt;/ul&gt;
To answer predictive questions like these, we can use the posterior distributions to simulate possible future events and compute predictive distributions for the number of species, and the fraction of the total, that we are likely to see.&lt;br /&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
The kernel of these simulations looks like this:&lt;/div&gt;
&lt;ol class="enumerate" type="1"&gt;
&lt;li class="li-enumerate" style="margin: 1ex 0ex;"&gt;Choose&amp;nbsp;&lt;tt&gt;n&lt;/tt&gt;&amp;nbsp;from its posterior distribution.&lt;/li&gt;
&lt;li class="li-enumerate" style="margin: 1ex 0ex;"&gt;Choose a prevalence for each species, including possible unseen species, using the Dirichlet distribution.&lt;/li&gt;
&lt;li class="li-enumerate" style="margin: 1ex 0ex;"&gt;Generate a random sequence of future observations.&lt;/li&gt;
&lt;li class="li-enumerate" style="margin: 1ex 0ex;"&gt;Compute the number of new species,&amp;nbsp;&lt;code&gt;num_new&lt;/code&gt;, as a function of the number of additional samples,&amp;nbsp;&lt;tt&gt;k&lt;/tt&gt;.&lt;/li&gt;
&lt;li class="li-enumerate" style="margin: 1ex 0ex;"&gt;Repeat the previous steps and accumulate the joint distribution of&amp;nbsp;&lt;code&gt;num_new&lt;/code&gt;&amp;nbsp;and&amp;nbsp;&lt;tt&gt;k&lt;/tt&gt;.&lt;/li&gt;
&lt;/ol&gt;
And here’s the code.&amp;nbsp;&lt;tt&gt;RunSimulation&lt;/tt&gt;&amp;nbsp;runs a single simulation:&lt;br /&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;# class Subject

    def RunSimulation(self, num_samples):
        m, seen = self.GetSeenSpecies()
        n, observations = self.GenerateObservations(num_samples)

        curve = []
        for k, obs in enumerate(observations):
            seen.add(obs)

            num_new = len(seen) - m
            curve.append((k+1, num_new))

        return curve
&lt;/pre&gt;
&lt;code&gt;num_samples&lt;/code&gt;&amp;nbsp;is the number of additional samples to simulate.&amp;nbsp;&lt;tt&gt;m&lt;/tt&gt;&amp;nbsp;is the number of seen species, and&amp;nbsp;&lt;tt&gt;seen&lt;/tt&gt;&amp;nbsp;is a set of strings with a unique name for each species.&amp;nbsp;&lt;tt&gt;n&lt;/tt&gt;&amp;nbsp;is a random value from the posterior distribution, and&amp;nbsp;&lt;tt&gt;observations&lt;/tt&gt;&amp;nbsp;is a random sequence of species names.&lt;br /&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
The result of&amp;nbsp;&lt;tt&gt;RunSimulation&lt;/tt&gt;&amp;nbsp;is a “rarefaction curve”, represented as a list of pairs with the number of samples and the number of new species seen.&lt;/div&gt;
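&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
The bookkeeping behind a rarefaction curve can be tried in isolation. The sketch below is a standalone function, not a method of&amp;nbsp;&lt;tt&gt;Subject&lt;/tt&gt;; the name&amp;nbsp;&lt;code&gt;rarefaction_curve&lt;/code&gt;&amp;nbsp;and the single-letter species labels are made up for the example.&lt;/div&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;
```python
def rarefaction_curve(seen, observations):
    """Return a list of (k, num_new) pairs for a sequence of observations."""
    seen = set(seen)          # copy, so the caller's set is not modified
    m = len(seen)             # species seen before the new samples
    curve = []
    for k, obs in enumerate(observations, start=1):
        seen.add(obs)
        curve.append((k, len(seen) - m))
    return curve

# Two species already seen; five additional samples.
curve = rarefaction_curve({'a', 'b'}, ['a', 'c', 'c', 'd', 'b'])
print(curve)
```
&lt;/pre&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
The result is [(1, 0), (2, 1), (3, 1), (4, 2), (5, 2)]: the curve rises only when a sample belongs to a species that has not been seen before.&lt;/div&gt;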
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
Before we see the results, let’s look at&amp;nbsp;&lt;tt&gt;GetSeenSpecies&lt;/tt&gt;&amp;nbsp;and&amp;nbsp;&lt;tt&gt;GenerateObservations&lt;/tt&gt;.&lt;/div&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;# class Subject

    def GetSeenSpecies(self):
        names = self.GetNames()
        m = len(names)
        seen = set(SpeciesGenerator(names, m))
        return m, seen
&lt;/pre&gt;
&lt;tt&gt;GetNames&lt;/tt&gt;&amp;nbsp;returns the list of species names that appear in the data files, but for many subjects these names are not unique. So I use&amp;nbsp;&lt;tt&gt;SpeciesGenerator&lt;/tt&gt;&amp;nbsp;to extend each name with a serial number:&lt;br /&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;def SpeciesGenerator(names, num):
    i = 0
    for name in names:
        yield '%s-%d' % (name, i)
        i += 1

    while i &amp;lt; num:
        yield 'unseen-%d' % i
        i += 1
&lt;/pre&gt;
Given a name like&amp;nbsp;&lt;tt&gt;Corynebacterium&lt;/tt&gt;,&amp;nbsp;&lt;tt&gt;SpeciesGenerator&lt;/tt&gt;&amp;nbsp;yields&amp;nbsp;&lt;tt&gt;Corynebacterium-1&lt;/tt&gt;. When the list of names is exhausted, it yields names like&amp;nbsp;&lt;tt&gt;unseen-62&lt;/tt&gt;.&lt;br /&gt;
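&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
For a concrete look at the labels, here is an equivalent sketch of the generator written with&amp;nbsp;&lt;code&gt;enumerate&lt;/code&gt;&amp;nbsp;and&amp;nbsp;&lt;code&gt;range&lt;/code&gt;, run on a made-up list in which the same genus name appears twice. The lowercase function name is mine.&lt;/div&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;
```python
def species_generator(names, num):
    """Yield unique labels: observed names first, then placeholders."""
    for i, name in enumerate(names):
        yield '%s-%d' % (name, i)
    # Pad with placeholder names until there are num labels in total.
    for i in range(len(names), num):
        yield 'unseen-%d' % i

labels = list(species_generator(['Corynebacterium', 'Corynebacterium'], 4))
print(labels)
```
&lt;/pre&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
This prints ['Corynebacterium-0', 'Corynebacterium-1', 'unseen-2', 'unseen-3']. The serial numbers start at 0, and the duplicate names become distinct labels.&lt;/div&gt;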
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
Here is&amp;nbsp;&lt;tt&gt;GenerateObservations&lt;/tt&gt;:&lt;/div&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;# class Subject

    def GenerateObservations(self, num_samples):
        n, prevalences = self.suite.Sample()

        names = self.GetNames()
        name_iter = SpeciesGenerator(names, n)

        d = dict(zip(name_iter, prevalences))
        cdf = thinkbayes.MakeCdfFromDict(d)
        observations = cdf.Sample(num_samples)

        return n, observations
&lt;/pre&gt;
Again,&amp;nbsp;&lt;code&gt;num_samples&lt;/code&gt;&amp;nbsp;is the number of additional samples to generate.&amp;nbsp;&lt;tt&gt;n&lt;/tt&gt;&amp;nbsp;and&amp;nbsp;&lt;tt&gt;prevalences&lt;/tt&gt;&amp;nbsp;are samples from the posterior distribution.&lt;br /&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
&lt;tt&gt;cdf&lt;/tt&gt;&amp;nbsp;is a Cdf object that maps species names, including the unseen, to cumulative probabilities. Using a Cdf makes it efficient to generate a random sequence of species names.&lt;/div&gt;
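&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
The reason a Cdf is efficient is that each random draw reduces to a binary search over the cumulative probabilities. The sketch below shows the idea with the standard library; it is not the&amp;nbsp;&lt;tt&gt;thinkbayes&lt;/tt&gt;&amp;nbsp;Cdf class, and the function name, the species labels, and the prevalences are invented for illustration.&lt;/div&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;
```python
import bisect
import itertools
import random

def sample_names(prevalences, num_samples, rng=random):
    """Draw species names with probability proportional to prevalence."""
    names = list(prevalences)
    cumulative = list(itertools.accumulate(prevalences[name] for name in names))
    total = cumulative[-1]
    samples = []
    for _ in range(num_samples):
        # Binary search for the first cumulative value at or above the draw.
        index = bisect.bisect_left(cumulative, rng.random() * total)
        samples.append(names[min(index, len(names) - 1)])
    return samples

# Hypothetical prevalences for two seen species and one unseen species.
prevalences = {'Corynebacterium-0': 0.6, 'Staphylococcus-1': 0.3, 'unseen-2': 0.1}
rng = random.Random(17)
observations = sample_names(prevalences, 1000, rng)
print(observations[:5])
```
&lt;/pre&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
Building the cumulative list takes time proportional to the number of species, but it is done once; each of the draws after that costs only a logarithmic search.&lt;/div&gt;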
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
Finally, here is&amp;nbsp;&lt;tt&gt;Species2.Sample&lt;/tt&gt;:&lt;/div&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;    def Sample(self):
        pmf = self.DistOfN()
        n = pmf.Random()
        prevalences = self.SampleConditional(n)
        return n, prevalences
&lt;/pre&gt;
And&amp;nbsp;&lt;tt&gt;SampleConditional&lt;/tt&gt;, which generates a sample of prevalences conditioned on&amp;nbsp;&lt;tt&gt;n&lt;/tt&gt;:&lt;br /&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;# class Species2

    def SampleConditional(self, n):
        params = self.params[:n]
        gammas = numpy.random.gamma(params)
        gammas /= gammas.sum()
        return gammas
&lt;/pre&gt;
We saw this algorithm for generating prevalences previously in&amp;nbsp;&lt;tt&gt;Species2.SampleLikelihood&lt;/tt&gt;.&lt;br /&gt;
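&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
The idea in&amp;nbsp;&lt;tt&gt;SampleConditional&lt;/tt&gt;&amp;nbsp;is that independent gamma draws, divided by their sum, form one sample from a Dirichlet distribution. Here is a sketch of the same idea that uses only the standard library in place of&amp;nbsp;&lt;tt&gt;numpy&lt;/tt&gt;; the function name and the parameter values are assumptions chosen for the example.&lt;/div&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;
```python
import random

def sample_prevalences(params, rng=random):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    gammas = [rng.gammavariate(alpha, 1.0) for alpha in params]
    total = sum(gammas)
    return [g / total for g in gammas]

# Hypothetical Dirichlet parameters: three common species and two rare ones.
rng = random.Random(1)
prevalences = sample_prevalences([93, 54, 48, 1, 1], rng)
print(sum(prevalences))
```
&lt;/pre&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
Whatever the random draws turn out to be, the prevalences are positive and add up to 1 (within floating-point error), which is what we need to treat them as the probabilities of the species.&lt;/div&gt;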
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
Figure&amp;nbsp;&lt;a href="http://www.greenteapress.com/thinkbayes/html/thinkbayes013.html#species-rare" style="color: black;"&gt;12.5&lt;/a&gt;&amp;nbsp;shows 100 simulated rarefaction curves for subject B1242. I shifted each curve by a random offset so they would not all overlap. By inspection we can estimate that after 400 more samples we are likely to find 2–6 new species.&lt;/div&gt;
&lt;h3&gt;
Joint posterior&lt;/h3&gt;
&lt;h2 class="section"&gt;
&lt;a href="" name="toc90"&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;blockquote class="figure" style="margin-left: 4ex; margin-right: 4ex;"&gt;
&lt;div class="center" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;
&lt;hr size="2" width="80%" /&gt;
&lt;/div&gt;
&lt;div class="center" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;
&lt;img src="http://www.greenteapress.com/thinkbayes/html/thinkbayes033.png" style="border: 0px;" /&gt;&lt;/div&gt;
&lt;div class="caption" style="margin-left: auto; margin-right: auto; padding-left: 2ex; padding-right: 2ex;"&gt;
&lt;table cellpadding="0" cellspacing="6" style="margin-left: inherit; margin-right: inherit;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align="left" valign="top"&gt;Figure 12.6: Distributions of the number of new species conditioned on the number of additional samples.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;a href="" name="species-cond"&gt;&lt;/a&gt;&lt;div class="center" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;
&lt;hr size="2" width="80%" /&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
To be more precise, we can use the simulations to estimate the joint distribution of&amp;nbsp;&lt;code&gt;num_new&lt;/code&gt;&amp;nbsp;and&amp;nbsp;&lt;tt&gt;k&lt;/tt&gt;, and from that we can get the distribution of&amp;nbsp;&lt;code&gt;num_new&lt;/code&gt;&amp;nbsp;conditioned on any value of&amp;nbsp;&lt;tt&gt;k&lt;/tt&gt;.&lt;br /&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;# class Subject

    def MakeJointPredictive(self, curves):
        joint = thinkbayes.Joint()
        for curve in curves:
            for k, num_new in curve:
                joint.Incr((k, num_new))
        joint.Normalize()
        return joint
&lt;/pre&gt;
&lt;tt&gt;MakeJointPredictive&lt;/tt&gt;&amp;nbsp;makes a Joint object, which is a&amp;nbsp;&lt;tt&gt;Pmf&lt;/tt&gt;&amp;nbsp;whose values are tuples.&lt;br /&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
&lt;tt&gt;curves&lt;/tt&gt;&amp;nbsp;is a list of rarefaction curves created by&amp;nbsp;&lt;tt&gt;RunSimulation&lt;/tt&gt;. Each curve contains a list of pairs of&amp;nbsp;&lt;tt&gt;k&lt;/tt&gt;&amp;nbsp;and&amp;nbsp;&lt;code&gt;num_new&lt;/code&gt;.&lt;/div&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
The resulting joint distribution is a map from each pair to its probability of occurring. Given the joint distribution, we can get the distribution of&amp;nbsp;&lt;code&gt;num_new&lt;/code&gt;&amp;nbsp;conditioned on&amp;nbsp;&lt;tt&gt;k&lt;/tt&gt;:&lt;/div&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;# class Joint

    def Conditional(self, i, j, val):
        pmf = Pmf()
        for vs, prob in self.Items():
            if vs[j] != val: continue
            pmf.Incr(vs[i], prob)

        pmf.Normalize()
        return pmf
&lt;/pre&gt;
&lt;tt&gt;i&lt;/tt&gt;&amp;nbsp;is the index of the variable whose distribution we want;&amp;nbsp;&lt;tt&gt;j&lt;/tt&gt;&amp;nbsp;is the index of the conditioning variable, and&amp;nbsp;&lt;tt&gt;val&lt;/tt&gt;&amp;nbsp;is the value the&amp;nbsp;&lt;tt&gt;j&lt;/tt&gt;th variable has to have. You can think of this operation as taking vertical slices out of Figure&amp;nbsp;&lt;a href="http://www.greenteapress.com/thinkbayes/html/thinkbayes013.html#species-rare" style="color: black;"&gt;12.5&lt;/a&gt;.&lt;br /&gt;
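&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
To see the joint and conditional steps work on a small input, here is a sketch that uses&amp;nbsp;&lt;code&gt;collections.Counter&lt;/code&gt;&amp;nbsp;and dictionaries in place of Joint and Pmf objects. The function names and the three toy curves are assumptions for illustration.&lt;/div&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;
```python
from collections import Counter

def make_joint(curves):
    """Count how often each (k, num_new) pair occurs, then normalize."""
    counts = Counter(pair for curve in curves for pair in curve)
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}

def conditional(joint, i, j, val):
    """Distribution of variable i given that variable j equals val."""
    selected = Counter()
    for vs, prob in joint.items():
        if vs[j] == val:
            selected[vs[i]] += prob
    total = sum(selected.values())
    return {value: prob / total for value, prob in selected.items()}

# Three toy rarefaction curves, each two samples long.
curves = [[(1, 0), (2, 1)], [(1, 0), (2, 0)], [(1, 1), (2, 1)]]
joint = make_joint(curves)
pmf = conditional(joint, 1, 0, 2)   # num_new given k == 2
print(pmf)
```
&lt;/pre&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
Of the three curves, two show one new species after two samples and one shows none, so the conditional distribution puts about 2/3 on 1 and about 1/3 on 0.&lt;/div&gt;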
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
&lt;tt&gt;Subject.MakeConditionals&lt;/tt&gt;&amp;nbsp;takes a list of&amp;nbsp;&lt;tt&gt;ks&lt;/tt&gt;&amp;nbsp;and computes the conditional distribution of&amp;nbsp;&lt;code&gt;num_new&lt;/code&gt;&amp;nbsp;for each&amp;nbsp;&lt;tt&gt;k&lt;/tt&gt;. The result is a list of Cdf objects.&lt;/div&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;# class Subject

    def MakeConditionals(self, curves, ks):
        joint = self.MakeJointPredictive(curves)

        cdfs = []
        for k in ks:
            pmf = joint.Conditional(1, 0, k)
            pmf.name = 'k=%d' % k
            cdf = thinkbayes.MakeCdfFromPmf(pmf)
            cdfs.append(cdf)

        return cdfs
&lt;/pre&gt;
Figure&amp;nbsp;&lt;a href="http://www.greenteapress.com/thinkbayes/html/thinkbayes013.html#species-cond" style="color: black;"&gt;12.6&lt;/a&gt;&amp;nbsp;shows the results. After 100 samples, the median predicted number of new species is 2; the 90% credible interval is 0 to 5. After 800 samples, we expect to see 3 to 12 new species.&lt;br /&gt;
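&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
To show how a median and a 90% credible interval can be read off a distribution like these, here is a small standalone sketch in Python 3. The weights are invented for illustration and are not the values behind the figure, and&amp;nbsp;&lt;code&gt;percentile&lt;/code&gt;&amp;nbsp;is a hypothetical helper, not a&amp;nbsp;&lt;tt&gt;thinkbayes&lt;/tt&gt;&amp;nbsp;method.&lt;/div&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;
from bisect import bisect_left
from itertools import accumulate

def percentile(weights, p):
    """Smallest value whose cumulative weight reaches p percent."""
    values = sorted(weights)
    cum = list(accumulate(weights[v] for v in values))
    target = p * cum[-1] / 100
    index = bisect_left(cum, target)
    return values[min(index, len(values) - 1)]

# Hypothetical conditional distribution of num_new for one value of k,
# stored as integer weights so the cumulative sums are exact.
weights = {0: 10, 1: 25, 2: 30, 3: 20, 4: 10, 5: 5}

median = percentile(weights, 50)
interval = percentile(weights, 5), percentile(weights, 95)
print(median, interval)
&lt;/pre&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
For these made-up weights the median is 2 and the 90% interval runs from 0 to 4.&lt;/div&gt;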
&lt;h3&gt;
Coverage&lt;/h3&gt;
&lt;h2 class="section"&gt;
&lt;a href="" name="toc91"&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;blockquote class="figure" style="margin-left: 4ex; margin-right: 4ex;"&gt;
&lt;div class="center" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;
&lt;hr size="2" width="80%" /&gt;
&lt;/div&gt;
&lt;div class="center" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;
&lt;img src="http://www.greenteapress.com/thinkbayes/html/thinkbayes034.png" style="border: 0px;" /&gt;&lt;/div&gt;
&lt;div class="caption" style="margin-left: auto; margin-right: auto; padding-left: 2ex; padding-right: 2ex;"&gt;
&lt;table cellpadding="0" cellspacing="6" style="margin-left: inherit; margin-right: inherit;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align="left" valign="top"&gt;Figure 12.7: Complementary CDF of coverage for a range of additional samples.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;a href="" name="species-frac"&gt;&lt;/a&gt;&lt;div class="center" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;
&lt;hr size="2" width="80%" /&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
The last question we want to answer is, “How many additional reads are needed to increase the fraction of observed species to a given threshold?”&lt;br /&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
To answer this question, we’ll need a version of&amp;nbsp;&lt;tt&gt;RunSimulation&lt;/tt&gt;&amp;nbsp;that computes the fraction of observed species rather than the number of new species.&lt;/div&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;# class Subject

    def RunSimulation(self, num_samples):
        m, seen = self.GetSeenSpecies()
        n, observations = self.GenerateObservations(num_samples)

        curve = []
        for k, obs in enumerate(observations):
            seen.add(obs)

            frac_seen = len(seen) / float(n)
            curve.append((k+1, frac_seen))

        return curve
&lt;/pre&gt;
Next we loop through each curve and make a dictionary,&amp;nbsp;&lt;tt&gt;d&lt;/tt&gt;, that maps from the number of additional samples,&amp;nbsp;&lt;tt&gt;k&lt;/tt&gt;, to a list of&amp;nbsp;&lt;tt&gt;fracs&lt;/tt&gt;; that is, a list of values for the coverage achieved after&amp;nbsp;&lt;tt&gt;k&lt;/tt&gt;&amp;nbsp;samples.&lt;br /&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;    def MakeFracCdfs(self, curves):
        d = {}
        for curve in curves:
            for k, frac in curve:
                d.setdefault(k, []).append(frac)

        cdfs = {}
        for k, fracs in d.iteritems():
            cdf = thinkbayes.MakeCdfFromList(fracs)
            cdfs[k] = cdf

        return cdfs
&lt;/pre&gt;
Then for each value of&amp;nbsp;&lt;tt&gt;k&lt;/tt&gt;&amp;nbsp;we make a Cdf of&amp;nbsp;&lt;tt&gt;fracs&lt;/tt&gt;; this Cdf represents the distribution of coverage after&amp;nbsp;&lt;tt&gt;k&lt;/tt&gt;&amp;nbsp;samples.&lt;br /&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
Remember that the CDF tells you the probability of falling below a given threshold, so the&amp;nbsp;&lt;em&gt;complementary&lt;/em&gt;&amp;nbsp;CDF tells you the probability of exceeding it. Figure&amp;nbsp;&lt;a href="http://www.greenteapress.com/thinkbayes/html/thinkbayes013.html#species-frac" style="color: black;"&gt;12.7&lt;/a&gt;&amp;nbsp;shows complementary CDFs for a range of values of&amp;nbsp;&lt;tt&gt;k&lt;/tt&gt;.&lt;/div&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
To read this figure, select the level of coverage you want to achieve along the&amp;nbsp;&lt;i&gt;x&lt;/i&gt;-axis. As an example, choose 90%.&lt;/div&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
Now you can read up the chart to find the probability of achieving 90% coverage after&amp;nbsp;&lt;tt&gt;k&lt;/tt&gt;&amp;nbsp;samples. For example, with 300 samples, you have about a 60% chance of getting 90% coverage. With 700 samples, you have a 90% chance of getting 90% coverage.&lt;/div&gt;
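&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
Reading a value off the complementary CDF amounts to counting how many simulated curves reach the chosen coverage. Here is a minimal sketch of that count in Python 3; the coverage values are invented for illustration, and&amp;nbsp;&lt;code&gt;prob_reaching&lt;/code&gt;&amp;nbsp;is a hypothetical helper rather than part of&amp;nbsp;&lt;tt&gt;thinkbayes&lt;/tt&gt;.&lt;/div&gt;
&lt;pre class="verbatim" style="margin-left: 0ex; margin-right: auto;"&gt;
from bisect import bisect_left

def prob_reaching(fracs, threshold):
    """Fraction of simulated coverages that are at least threshold."""
    ordered = sorted(fracs)
    below = bisect_left(ordered, threshold)   # count of values under threshold
    return 1 - below / len(ordered)

# Hypothetical coverage values after k additional samples,
# one per simulated curve.
fracs_by_k = {
    300: [0.85, 0.88, 0.91, 0.92, 0.95],
    700: [0.90, 0.93, 0.94, 0.96, 0.97],
}
for k in sorted(fracs_by_k):
    print(k, prob_reaching(fracs_by_k[k], 0.90))
&lt;/pre&gt;
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
In this toy data, 3 of the 5 curves reach 90% coverage after 300 samples, so the estimate is 0.6; after 700 samples all 5 do.&lt;/div&gt;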
&lt;div style="margin-bottom: 1em; margin-top: 1em;"&gt;
With that, we have answered the four questions that make up the unseen species problem. Next time: validation!&lt;/div&gt;
&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/ASe8hQlWB3M" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/962740035154830947/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2013/02/belly-button-biodiversity-part-two.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/962740035154830947?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/962740035154830947?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/ASe8hQlWB3M/belly-button-biodiversity-part-two.html" title="Belly Button Biodiversity: Part Two" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><thr:total>0</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2013/02/belly-button-biodiversity-part-two.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DEEESHY4eCp7ImA9WhBTEEo.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-1463738396239349312</id><published>2013-02-05T07:03:00.000-08:00</published><updated>2013-02-05T07:03:29.830-08:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2013-02-05T07:03:29.830-08:00</app:edited><title>Belly Button Biodiversity: Part One</title><content type="html">This post is an excerpt from &lt;i&gt;Think Bayes: Bayesian Statistics Made Simple&lt;/i&gt;, the book I am working on now. 
&amp;nbsp;You can read the entire current draft at &lt;a href="http://thinkbayes.com/"&gt;http://thinkbayes.com&lt;/a&gt;.&lt;br /&gt;
&lt;h3&gt;
&lt;br /&gt;&lt;/h3&gt;
&lt;h3&gt;
&lt;span style="font-size: large;"&gt;Belly button bacteria&lt;/span&gt;&lt;/h3&gt;
&lt;h2 class="section"&gt;
&lt;span style="font-size: large;"&gt;&lt;a href="" name="toc80"&gt;&lt;/a&gt;&lt;/span&gt;&lt;/h2&gt;
Belly Button Biodiversity 2.0 (BBB2) is a nation-wide citizen
science project with the goal of identifying bacterial species that
can be found in human navels (&lt;tt&gt;&lt;a href="http://bbdata.yourwildlife.org/"&gt;http://bbdata.yourwildlife.org&lt;/a&gt;&lt;/tt&gt;).&lt;br /&gt;
&lt;br /&gt;
The project might seem whimsical, but it is part of an increasing
interest in the human microbiome, the set of microorganisms that live
on human skin and other surfaces that contact the environment.&lt;br /&gt;
&lt;br /&gt;
In their pilot study, BBB2 researchers collected swabs from the navels
of 60 volunteers, used multiplex pyrosequencing to extract and sequence
fragments of 16S rDNA, then identified the species or genus the
fragments came from. Each identified fragment is called a “read.”&lt;br /&gt;
&lt;br /&gt;
We can use these data to answer several related questions:&lt;br /&gt;
&lt;ul class="itemize"&gt;
&lt;li class="li-itemize"&gt;Based on the number of species observed, can we estimate
the total number of species in the environment?&lt;/li&gt;
&lt;li class="li-itemize"&gt;Can we estimate the prevalence of each species; that is, the
fraction of the total population belonging to each species?&lt;/li&gt;
&lt;li class="li-itemize"&gt;If we are planning to collect additional samples, can we predict
how many new species we are likely to discover?&lt;/li&gt;
&lt;li class="li-itemize"&gt;How many additional reads are needed to increase the
fraction of observed species to a given threshold?&lt;/li&gt;
&lt;/ul&gt;
These questions make up what is called the “unseen species problem.”&lt;br /&gt;
&lt;h3&gt;
&lt;br /&gt;&lt;/h3&gt;
&lt;h3&gt;
&lt;span style="font-size: large;"&gt;Lions and tigers and bears&lt;/span&gt;&lt;/h3&gt;
&lt;h2 class="section"&gt;
&lt;span style="font-size: large;"&gt;&lt;a href="" name="toc81"&gt;&lt;/a&gt;&lt;/span&gt;&lt;/h2&gt;
I’ll start with a simplified version of the problem where we know that
there are exactly three species. Let’s call them lions, tigers and
bears. Suppose we visit a wild animal preserve and see 3 lions, 2
tigers and one bear.&lt;br /&gt;
&lt;br /&gt;
If we have an equal chance of observing any animal in the preserve,
then the number of each species we see is governed by the multinomial
distribution. If the prevalences of lions, tigers and bears are
&lt;code&gt;p_lion&lt;/code&gt;, &lt;code&gt;p_tiger&lt;/code&gt; and &lt;code&gt;p_bear&lt;/code&gt;, the likelihood of
seeing 3 lions, 2 tigers and one bear is proportional to&lt;br /&gt;
&lt;pre class="verbatim"&gt;&lt;span style="color: #38761d;"&gt;p_lion**3 * p_tiger**2 * p_bear**1&lt;/span&gt;&lt;/pre&gt;
An approach that is tempting, but not correct, is to use beta
distributions, as in Section&amp;nbsp;&lt;a href="http://www.blogger.com/thinkbayes005.html#beta"&gt;4.6&lt;/a&gt;, to describe the prevalence of
each species separately. For example, we saw 3 lions and 3 non-lions;
if we think of that as 3 “heads” and 3 “tails,” then the posterior
distribution of &lt;code&gt;p_lion&lt;/code&gt; is:&lt;br /&gt;
&lt;pre class="verbatim"&gt;    &lt;span style="color: #38761d;"&gt;beta = thinkbayes.Beta()
    beta.Update((3, 3))
    print beta.MaximumLikelihood()
&lt;/span&gt;&lt;/pre&gt;
&lt;pre class="verbatim"&gt;
&lt;/pre&gt;
The maximum likelihood estimate for &lt;code&gt;p_lion&lt;/code&gt; is the observed
rate, 50%. Similarly the MLEs for &lt;code&gt;p_tiger&lt;/code&gt; and &lt;code&gt;p_bear&lt;/code&gt;
are 33% and 17%.&lt;br /&gt;
But there are two problems:&lt;br /&gt;
&lt;ul class="itemize"&gt;
&lt;li class="li-itemize"&gt;We have implicitly used a prior for each species that is uniform
from 0 to 1, but since we know that there are three species, that
prior is not correct. The right prior should have a mean of 1/3,
and there should be zero probability that any species has a
prevalence of 100%.&lt;/li&gt;
&lt;li class="li-itemize"&gt;The distributions for each species are not independent, because
the prevalences have to add up to 1. To capture this dependence, we
need a joint distribution for the three prevalences.&lt;/li&gt;
&lt;/ul&gt;
We can use a Dirichlet distribution to solve both of these problems
(see &lt;tt&gt;&lt;a href="http://en.wikipedia.org/wiki/Dirichlet_distribution"&gt;http://en.wikipedia.org/wiki/Dirichlet_distribution&lt;/a&gt;&lt;/tt&gt;). In
the same way we used the beta distribution to describe the
distribution of bias for a coin, we can use a Dirichlet
distribution to describe the joint distribution of &lt;code&gt;p_lion&lt;/code&gt;,
&lt;code&gt;p_tiger&lt;/code&gt; and &lt;code&gt;p_bear&lt;/code&gt;.&lt;br /&gt;
&lt;br /&gt;
The Dirichlet distribution is the multi-dimensional generalization
of the beta distribution. Instead of two possible outcomes, like
heads and tails, the Dirichlet distribution handles any number of
outcomes: in this example, three species.&lt;br /&gt;
&lt;br /&gt;
If there are &lt;tt&gt;n&lt;/tt&gt; outcomes, the Dirichlet distribution is
described by &lt;tt&gt;n&lt;/tt&gt; parameters, written α&lt;i&gt;&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt;.&lt;br /&gt;
Here’s the definition, from &lt;tt&gt;thinkbayes.py&lt;/tt&gt;, of a class that
represents a Dirichlet distribution:&lt;br /&gt;
&lt;pre class="verbatim"&gt;
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;&lt;span style="color: #38761d;"&gt;class Dirichlet(object):

    def __init__(self, n):
        self.n = n
        self.params = numpy.ones(n, dtype=int)
&lt;/span&gt;&lt;/pre&gt;
&lt;pre class="verbatim"&gt;
&lt;/pre&gt;
&lt;tt&gt;n&lt;/tt&gt; is the number of dimensions; initially the parameters
are all 1. I use a &lt;tt&gt;numpy&lt;/tt&gt; array to store the parameters
so I can take advantage of array operations.&lt;br /&gt;
Given a Dirichlet distribution, the marginal distribution
for each prevalence is a beta distribution, which we can
compute like this:&lt;br /&gt;
&lt;pre class="verbatim"&gt;    &lt;span style="color: #38761d;"&gt;def MarginalBeta(self, i):
        alpha0 = self.params.sum()
        alpha = self.params[i]
        return Beta(alpha, alpha0-alpha)
&lt;/span&gt;&lt;/pre&gt;
&lt;pre class="verbatim"&gt;
&lt;/pre&gt;
&lt;tt&gt;i&lt;/tt&gt; is the index of the marginal distribution we want.
&lt;tt&gt;alpha0&lt;/tt&gt; is the sum of the parameters; &lt;tt&gt;alpha&lt;/tt&gt; is the
parameter for the given species.&lt;br /&gt;
In the example, the prior marginal distribution for each species
is &lt;tt&gt;Beta(1, 2)&lt;/tt&gt;. We can compute the prior means like
this:&lt;br /&gt;
&lt;pre class="verbatim"&gt;    &lt;span style="color: #38761d;"&gt;dirichlet = thinkbayes.Dirichlet(3)
    for i in range(3):
        beta = dirichlet.MarginalBeta(i)
        print beta.Mean()&lt;/span&gt;
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;
&lt;/pre&gt;
As expected, the prior mean prevalence for each species is 1/3.&lt;br /&gt;
&lt;br /&gt;
To update the Dirichlet distribution, we add the number of
observations to each parameter, like this:&lt;br /&gt;
&lt;pre class="verbatim"&gt;    &lt;span style="color: #38761d;"&gt;def Update(self, data):
        m = len(data)
        self.params[:m] += data
&lt;/span&gt;&lt;/pre&gt;
&lt;pre class="verbatim"&gt;
&lt;/pre&gt;
Here &lt;tt&gt;data&lt;/tt&gt; is a sequence of counts in the same order as &lt;tt&gt;params&lt;/tt&gt;, so in this example, it should be the number of lions,
tigers and bears.&lt;br /&gt;
&lt;br /&gt;
But &lt;tt&gt;data&lt;/tt&gt; can be shorter than &lt;tt&gt;params&lt;/tt&gt;; in that
case there are some hypothetical species that have not been
observed.&lt;br /&gt;
&lt;br /&gt;
Here’s code that updates &lt;tt&gt;dirichlet&lt;/tt&gt; with the observed data and
computes the posterior marginal distributions.&lt;br /&gt;
&lt;pre class="verbatim"&gt;    &lt;span style="color: #38761d;"&gt;data = [3, 2, 1]
    dirichlet.Update(data)

    for i in range(3):
        beta = dirichlet.MarginalBeta(i)
        pmf = beta.MakePmf()
        print i, pmf.Mean()&lt;/span&gt;&lt;/pre&gt;
&lt;pre class="verbatim"&gt;
&lt;/pre&gt;
This figure&amp;nbsp;shows the results. The posterior
mean prevalences are 44%, 33% and 22%.&lt;br /&gt;
&lt;br /&gt;
&lt;a href="http://1.bp.blogspot.com/-ZNO_k3iME2M/UREcnMVuPHI/AAAAAAAABBA/2ElpjVh9QwE/s1600/species1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"&gt;&lt;img border="0" height="300" src="http://1.bp.blogspot.com/-ZNO_k3iME2M/UREcnMVuPHI/AAAAAAAABBA/2ElpjVh9QwE/s400/species1.png" width="400" /&gt;&lt;/a&gt;&lt;br /&gt;
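To check those numbers by hand, here is a small standalone sketch in Python 3 that does not use &lt;tt&gt;thinkbayes&lt;/tt&gt;. It relies only on the fact that each marginal is &lt;tt&gt;Beta(alpha, alpha0-alpha)&lt;/tt&gt;, whose mean is &lt;tt&gt;alpha / alpha0&lt;/tt&gt;; the helper name &lt;code&gt;marginal_means&lt;/code&gt; is an assumption made up for this example.&lt;br /&gt;
&lt;pre class="verbatim"&gt;
from fractions import Fraction

def marginal_means(params):
    """Mean of each marginal Beta(alpha, alpha0 - alpha) of a Dirichlet."""
    alpha0 = sum(params)
    return [Fraction(alpha, alpha0) for alpha in params]

params = [1, 1, 1]                 # uniform Dirichlet prior, three species
print(marginal_means(params))      # prior means

data = [3, 2, 1]                   # lions, tigers, bears
params = [a + x for a, x in zip(params, data)]
for mean in marginal_means(params):
    print(mean, round(float(mean), 2))
&lt;/pre&gt;
The prior means are all 1/3, and the posterior means are 4/9, 1/3 and 2/9, which round to the 44%, 33% and 22% quoted above.&lt;br /&gt;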
&lt;h3&gt;
&lt;br /&gt;&lt;/h3&gt;
&lt;h3&gt;
&lt;span style="font-size: large;"&gt;A hierarchical model&lt;/span&gt;&lt;/h3&gt;
&lt;h2 class="section"&gt;
&lt;span style="font-size: large;"&gt;&lt;a href="" name="toc82"&gt;&lt;/a&gt;&lt;/span&gt;&lt;/h2&gt;
We have solved a simplified version of the problem: if we
know how many species there are, we can estimate the prevalence
of each.&lt;br /&gt;
&lt;br /&gt;
Now let’s get back to the original problem, estimating the total
number of species. To solve this problem I’ll define a metasuite,
which is a Suite that contains other Suites as hypotheses. In this
case, the top-level Suite contains hypotheses about the number of
species; the bottom level contains hypotheses about prevalences.
A multi-level model like this is called “hierarchical.”&lt;br /&gt;
Here’s the class definition:&lt;br /&gt;
&lt;pre class="verbatim"&gt;&lt;span style="color: #38761d;"&gt;class Species(thinkbayes.Suite):

    def __init__(self, ns):
        hypos = [thinkbayes.Dirichlet(n) for n in ns]
        thinkbayes.Suite.__init__(self, hypos)&lt;/span&gt;
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;
&lt;/pre&gt;
&lt;code&gt;__init__&lt;/code&gt; takes a list of possible values for &lt;tt&gt;n&lt;/tt&gt; and
makes a list of Dirichlet objects.&lt;br /&gt;
&lt;br /&gt;
Here’s the code that creates the top-level suite:&lt;br /&gt;
&lt;pre class="verbatim"&gt;    &lt;span style="color: #38761d;"&gt;ns = range(3, 30)
    suite = Species(ns)
&lt;/span&gt;&lt;/pre&gt;
&lt;pre class="verbatim"&gt;
&lt;/pre&gt;
&lt;tt&gt;ns&lt;/tt&gt; is the list of possible values for &lt;tt&gt;n&lt;/tt&gt;. We have seen 3
species, so there have to be at least that many. I chose an upper
bound that seemed reasonable, but we will have to check later that the
probability of exceeding this bound is low. And at least initially
we assume that any value in this range is equally likely.&lt;br /&gt;
&lt;br /&gt;
To update a hierarchical model, you have to update all levels.
Sometimes it is necessary or more efficient to update the bottom
level first and work up. In this case it doesn’t matter, so
I update the top level first:&lt;br /&gt;
&lt;pre class="verbatim"&gt;&lt;span style="color: #38761d;"&gt;#class Species

    def Update(self, data):
        thinkbayes.Suite.Update(self, data)
        for hypo in self.Values():
            hypo.Update(data)
&lt;/span&gt;&lt;/pre&gt;
&lt;pre class="verbatim"&gt;
&lt;/pre&gt;
&lt;tt&gt;Species.Update&lt;/tt&gt; invokes &lt;tt&gt;Update&lt;/tt&gt; in the parent class,
then loops through the sub-hypotheses and updates them.&lt;br /&gt;
&lt;br /&gt;
Now all we need is a likelihood function. As usual,
&lt;tt&gt;Likelihood&lt;/tt&gt; gets a hypothesis and a dataset as arguments:&lt;br /&gt;
&lt;pre class="verbatim"&gt;&lt;span style="color: #38761d;"&gt;# class Species

    def Likelihood(self, hypo, data):
        dirichlet = hypo
        like = 0
        for i in range(1000):
            like += dirichlet.Likelihood(data)

        return like&lt;/span&gt;
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;
&lt;/pre&gt;
&lt;tt&gt;hypo&lt;/tt&gt; is a Dirichlet object; &lt;tt&gt;data&lt;/tt&gt; is a sequence of
observed counts. &lt;tt&gt;Species.Likelihood&lt;/tt&gt; calls
&lt;tt&gt;Dirichlet.Likelihood&lt;/tt&gt; 1000 times and returns the total.&lt;br /&gt;
&lt;br /&gt;
Why do we have to call it 1000 times? Because &lt;tt&gt;Dirichlet.Likelihood&lt;/tt&gt; doesn’t actually compute the likelihood of the
data under the whole Dirichlet distribution. Instead, it draws one
sample from the hypothetical distribution and computes the likelihood
of the data under the sampled set of prevalences.&lt;br /&gt;
Here’s what it looks like:&lt;br /&gt;
&lt;pre class="verbatim"&gt;&lt;span style="color: #38761d;"&gt;# class Dirichlet

    def Likelihood(self, data):
        m = len(data)
        if self.n &amp;lt; m:
            return 0

        x = data
        p = self.Random()
        q = p[:m]**x
        return q.prod()
&lt;/span&gt;&lt;/pre&gt;
&lt;pre class="verbatim"&gt;
&lt;/pre&gt;
The length of &lt;tt&gt;data&lt;/tt&gt; is the number of species observed. If
we see more species than we thought existed, the likelihood is 0.&lt;br /&gt;
&lt;br /&gt;
Otherwise we select a random set of prevalences, &lt;tt&gt;p&lt;/tt&gt;, and
compute the multinomial PMF, which is
&lt;br /&gt;
&lt;table class="display dcenter"&gt;&lt;tbody&gt;
&lt;tr valign="middle"&gt;&lt;td class="dcell"&gt;&lt;i&gt;c&lt;/i&gt;(&lt;i&gt;x&lt;/i&gt;)&amp;nbsp;&lt;/td&gt;&lt;td class="dcell"&gt;&lt;table class="display"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align="center" class="dcell"&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="center" class="dcell"&gt;&lt;span style="font-size: x-large;"&gt;∏&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align="center" class="dcell"&gt;&lt;i&gt;i&lt;/i&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/td&gt;&lt;td class="dcell"&gt;&amp;nbsp;&lt;i&gt;p&lt;sub&gt;i&lt;/sub&gt;&lt;sup&gt;x&lt;sub&gt;i&lt;/sub&gt;&lt;/sup&gt;&lt;/i&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;i&gt;p&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt; is the prevalence of the &lt;i&gt;i&lt;/i&gt;th species, and &lt;i&gt;x&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt; is the
observed number. The first term, &lt;i&gt;c&lt;/i&gt;(&lt;i&gt;x&lt;/i&gt;), is the multinomial
coefficient; I left it out of the computation because it is
a multiplicative factor that depends only
on the data, not the hypothesis, so it gets normalized away
(see &lt;tt&gt;&lt;a href="http://en.wikipedia.org/wiki/Multinomial_distribution"&gt;http://en.wikipedia.org/wiki/Multinomial_distribution&lt;/a&gt;&lt;/tt&gt;).&lt;br /&gt;
&lt;br /&gt;
Also, I truncated &lt;tt&gt;p&lt;/tt&gt; at &lt;tt&gt;m&lt;/tt&gt;, which is the number of
observed species. For the unseen species, &lt;i&gt;x&lt;sub&gt;i&lt;/sub&gt;&lt;/i&gt; is 0, so
&lt;i&gt;p&lt;sub&gt;i&lt;/sub&gt;&lt;sup&gt;x&lt;sub&gt;i&lt;/sub&gt;&lt;/sup&gt;&lt;/i&gt; is 1, so we can leave them out of the product.&lt;br /&gt;
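As a quick check that dropping the unseen species changes nothing, here is a standalone sketch in Python 3 (it assumes Python 3.8 or later for &lt;tt&gt;math.prod&lt;/tt&gt;). The prevalences are invented for illustration, and this &lt;code&gt;likelihood&lt;/code&gt; function is a simplified stand-in, not the method shown above.&lt;br /&gt;
&lt;pre class="verbatim"&gt;
import math

def likelihood(prevalences, counts):
    """Product of p ** x, pairing each count with its prevalence."""
    return math.prod(p ** x for p, x in zip(prevalences, counts))

p = [0.4, 0.3, 0.2, 0.1]      # hypothetical prevalences for n = 4 species
seen = [3, 2, 1]              # counts for the m = 3 observed species
padded = seen + [0]           # the unseen species has a count of 0

# zip stops at the shorter sequence, which has the same effect as p[:m].
print(likelihood(p, seen), likelihood(p, padded))
&lt;/pre&gt;
Both calls give the same value, because the extra factor for the unseen species is 0.1 raised to the power 0, which is 1.&lt;br /&gt;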
&lt;h3&gt;
&lt;br /&gt;&lt;/h3&gt;
&lt;h3&gt;
&lt;span style="font-size: large;"&gt;Random sampling&lt;/span&gt;&lt;/h3&gt;
&lt;h2 class="section"&gt;
&lt;span style="font-size: large;"&gt;&lt;a href="" name="toc83"&gt;&lt;/a&gt;&lt;/span&gt;&lt;/h2&gt;
There are two ways to generate a random sample from a Dirichlet
distribution. One is to use the marginal beta distributions, but in
that case you have to select one at a time and scale the rest so they
add up to 1 (see
&lt;tt&gt;&lt;a href="http://en.wikipedia.org/wiki/Dirichlet_distribution#Random_number_generation"&gt;http://en.wikipedia.org/wiki/Dirichlet_distribution#Random_number_generation&lt;/a&gt;&lt;/tt&gt;).&lt;br /&gt;
&lt;br /&gt;
A less obvious, but faster, way is to select values from &lt;tt&gt;n&lt;/tt&gt; gamma
distributions, then normalize by dividing through by the total. 
Here’s the code:&lt;br /&gt;
&lt;pre class="verbatim"&gt;&lt;span style="color: #38761d;"&gt;# class Dirichlet

    def Random(self):
        p = numpy.random.gamma(self.params)
        return p / p.sum()&lt;/span&gt;
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;
&lt;/pre&gt;
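For readers without &lt;tt&gt;numpy&lt;/tt&gt;, the same gamma-and-normalize idea can be sketched with the standard library, together with the repeated sampling that the likelihood step depends on. This is a minimal illustration in Python 3 (3.8 or later for &lt;tt&gt;math.prod&lt;/tt&gt;), not the &lt;tt&gt;thinkbayes&lt;/tt&gt; code: the function names and the seed are assumptions, and it averages the sampled likelihoods rather than summing them, which differs only by a constant factor that gets normalized away.&lt;br /&gt;
&lt;pre class="verbatim"&gt;
import math
import random

def dirichlet_sample(params, rng):
    """Draw one set of prevalences by normalizing gamma variates."""
    gammas = [rng.gammavariate(alpha, 1.0) for alpha in params]
    total = sum(gammas)
    return [g / total for g in gammas]

def average_likelihood(params, counts, rng, iters=1000):
    """Average of prod(p ** x) over random draws of the prevalences."""
    total = 0.0
    for _ in range(iters):
        p = dirichlet_sample(params, rng)
        total += math.prod(pi ** x for pi, x in zip(p, counts))
    return total / iters

rng = random.Random(17)              # fixed seed so the run is repeatable
p = dirichlet_sample([1, 1, 1], rng)
print(p, sum(p))                     # three prevalences that add up to 1
print(average_likelihood([1, 1, 1], [3, 2, 1], rng))
&lt;/pre&gt;
Passing an explicit &lt;tt&gt;random.Random&lt;/tt&gt; instance keeps the sketch repeatable without touching the global random state.&lt;br /&gt;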
Now we’re ready to look at some results. Here is the code that
updates the top-level suite and extracts the posterior PMF of &lt;tt&gt;n&lt;/tt&gt;:&lt;br /&gt;
&lt;pre class="verbatim"&gt;    &lt;span style="color: #38761d;"&gt;data = [3, 2, 1]
    suite.Update(data)
    pmf = suite.DistOfN()&lt;/span&gt;
&lt;/pre&gt;
&lt;pre class="verbatim"&gt;
&lt;/pre&gt;
To get the posterior distribution of &lt;tt&gt;n&lt;/tt&gt;, &lt;tt&gt;DistOfN&lt;/tt&gt; iterates
through the top-level hypotheses:&lt;br /&gt;
&lt;pre class="verbatim"&gt;    &lt;span style="color: #38761d;"&gt;def DistOfN(self):
        pmf = thinkbayes.Pmf()
        for hypo, prob in self.Items():
            pmf.Set(hypo.n, prob)
        return pmf&lt;/span&gt;
&lt;/pre&gt;
&lt;blockquote class="figure"&gt;
&lt;div class="center"&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
This figure&amp;nbsp;shows the result:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-CTslmeZZsyA/UREc5vcsIiI/AAAAAAAABBI/rtT_Lbc9hpE/s1600/species2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://3.bp.blogspot.com/-CTslmeZZsyA/UREc5vcsIiI/AAAAAAAABBI/rtT_Lbc9hpE/s400/species2.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
The most likely value is 5.
Values from 3 to 8 are all likely; after that the probabilities
drop off quickly. The probability that there are 29 species is
low enough to be negligible; if we chose a higher bound, 
we would get nearly the same result.&lt;br /&gt;
&lt;br /&gt;
But remember that we started with a uniform prior for &lt;tt&gt;n&lt;/tt&gt;. If we
have background information about the number of species in the
environment, we might choose a different prior.&lt;br /&gt;
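One way to try a different prior without rerunning the update: a posterior computed with a uniform prior is proportional to the likelihood, so multiplying it by a new prior and renormalizing gives the posterior under that prior (as long as the new prior covers the same range of &lt;tt&gt;n&lt;/tt&gt;). Here is a minimal sketch in Python 3; the posterior values, the 1/n prior and the helper name &lt;code&gt;reweight&lt;/code&gt; are all hypothetical.&lt;br /&gt;
&lt;pre class="verbatim"&gt;
def reweight(posterior, prior):
    """Posterior under a new prior, given the posterior under a uniform one."""
    unnorm = {n: posterior[n] * prior(n) for n in posterior}
    total = sum(unnorm.values())
    return {n: p / total for n, p in unnorm.items()}

# Hypothetical posterior over n obtained with a uniform prior.
uniform_posterior = {3: 0.2, 4: 0.3, 5: 0.3, 6: 0.2}

# Hypothetical prior belief that larger n is less plausible: P(n) ~ 1/n.
new_posterior = reweight(uniform_posterior, lambda n: 1 / n)
print(new_posterior)
&lt;/pre&gt;
With a prior that favors small &lt;tt&gt;n&lt;/tt&gt;, probability shifts toward the low end of the range.&lt;br /&gt;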
&lt;h3&gt;
&lt;br /&gt;&lt;/h3&gt;
&lt;h3&gt;
&lt;span style="font-size: large;"&gt;Optimization&lt;/span&gt;&lt;/h3&gt;
&lt;h2 class="section"&gt;
&lt;a href="" name="toc84"&gt;&lt;/a&gt;&lt;/h2&gt;
I have to admit that I am proud of this example. The unseen species
problem is not easy, and I think this solution is simple and clear,
and takes surprisingly few lines of code (about 50 so far).&lt;br /&gt;
&lt;br /&gt;
The only problem is that it is slow. It’s good enough for the example
with only 3 observed species, but not good enough for the belly button
data, with more than 100 species in some samples.&lt;br /&gt;
&lt;br /&gt;
In &lt;i&gt;&lt;a href="http://thinkbayes.com/"&gt;Think Bayes&lt;/a&gt;&lt;/i&gt; I present a series of optimizations we need to
make this solution scale. Here’s
a road map of the steps:&lt;br /&gt;
&lt;ul class="itemize"&gt;
&lt;li class="li-itemize"&gt;The first step is to recognize that if we update the Dirichlet
distributions with the same data, the first &lt;tt&gt;m&lt;/tt&gt; parameters are
the same for all of them. The only difference is the number of
hypothetical unseen species. So we don’t really need &lt;tt&gt;n&lt;/tt&gt;
Dirichlet objects; we can store the parameters in the top level of
the hierarchy. &lt;tt&gt;Species2&lt;/tt&gt; implements this optimization.&lt;/li&gt;
&lt;li class="li-itemize"&gt;&lt;tt&gt;Species2&lt;/tt&gt; also uses the same set of random values for all
of the hypotheses. This saves time generating random values, but it
has a second benefit that turns out to be more important: by giving
all hypotheses the same selection from the sample space, we make
the comparison between the hypotheses fairer, so it takes
fewer iterations to converge.&lt;/li&gt;
&lt;li class="li-itemize"&gt;But there is still a major performance problem. As the
number of observed species increases, the array of random
prevalences gets bigger, and the chance of choosing one that is
approximately right becomes small. So the vast majority
of iterations yield small likelihoods that don’t contribute
much to the total, and don’t discriminate between hypotheses. The solution is to do the updates one species at a time. &lt;tt&gt;Species4&lt;/tt&gt; is a simple implementation of this strategy using
Dirichlet objects to represent the sub-hypotheses.&lt;/li&gt;
&lt;li class="li-itemize"&gt;Finally, &lt;tt&gt;Species5&lt;/tt&gt; combines the sub-hypotheses into the top
level and uses &lt;tt&gt;numpy&lt;/tt&gt; array operations to speed things up.&lt;/li&gt;
&lt;/ul&gt;
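The idea behind the second step, sharing one set of random values across all hypotheses, can be sketched as follows. This is only an illustration in Python 3 under stated assumptions, not the code of &lt;tt&gt;Species2&lt;/tt&gt;: the parameter array, the seed and the helper &lt;code&gt;shared_prevalences&lt;/code&gt; are hypothetical. It draws one gamma value per possible species, then each hypothesis about &lt;tt&gt;n&lt;/tt&gt; normalizes the first &lt;tt&gt;n&lt;/tt&gt; of them.&lt;br /&gt;
&lt;pre class="verbatim"&gt;
import random

def shared_prevalences(gammas, n):
    """Prevalences for the hypothesis with n species, from shared draws."""
    head = gammas[:n]
    total = sum(head)
    return [g / total for g in head]

rng = random.Random(3)                    # arbitrary fixed seed
params = [4, 3, 2] + [1] * 3              # 3 seen species, up to 3 unseen
gammas = [rng.gammavariate(a, 1.0) for a in params]

for n in range(3, len(params) + 1):
    p = shared_prevalences(gammas, n)
    print(n, [round(x, 3) for x in p])
&lt;/pre&gt;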
I won't present the details here. &amp;nbsp;In the next part of this series, I will present results from the Belly Button Biodiversity project.&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/85Q0ysKZ8Ac" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/1463738396239349312/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2013/02/belly-button-biodiversity-part-one.html#comment-form" title="2 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/1463738396239349312?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/1463738396239349312?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/85Q0ysKZ8Ac/belly-button-biodiversity-part-one.html" title="Belly Button Biodiversity: Part One" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/-ZNO_k3iME2M/UREcnMVuPHI/AAAAAAAABBA/2ElpjVh9QwE/s72-c/species1.png" height="72" width="72" /><thr:total>2</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2013/02/belly-button-biodiversity-part-one.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CUABSX0_eip7ImA9WhNbGUo.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-226484891922511816</id><published>2013-01-23T12:40:00.003-08:00</published><updated>2013-01-23T12:42:38.342-08:00</updated><app:edited 
xmlns:app="http://www.w3.org/2007/app">2013-01-23T12:42:38.342-08:00</app:edited><title>Bayesian Statistics Made Simple</title><content type="html">I am happy to announce that I will offer an updated and revised version of my tutorial, &lt;i&gt;Bayesian Statistics Made Simple&lt;/i&gt;, at PyCon 2013.&lt;br /&gt;
&lt;br /&gt;
&lt;a href="https://us.pycon.org/2013/schedule/presentation/21/"&gt;Registration is open now&lt;/a&gt;. &amp;nbsp;Here are the details:&lt;br /&gt;
&lt;br /&gt;
PyCon 2013&lt;br /&gt;
Santa Clara, CA&lt;br /&gt;
&lt;br /&gt;
Wednesday 13 March, 1:20 p.m.–4:40 p.m.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Bayesian statistics made simple&lt;/b&gt;&lt;br /&gt;
Allen Downey&lt;br /&gt;
&lt;br /&gt;
Audience level:&amp;nbsp;Intermediate&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
DESCRIPTION&lt;br /&gt;
&lt;br /&gt;
An introduction to Bayesian statistics using Python. Bayesian statistics are usually presented mathematically, but many of the ideas are easier to understand computationally. People who know some Python have a head start.&lt;br /&gt;
&lt;br /&gt;
We will use material from &lt;i&gt;Think Stats: Probability and Statistics for Programmers&lt;/i&gt; (O’Reilly Media), and &lt;i&gt;Think Bayes&lt;/i&gt;, a work in progress at &lt;a href="http://thinkbayes.com/"&gt;http://thinkbayes.com&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
ABSTRACT&lt;br /&gt;
&lt;br /&gt;
Bayesian statistical methods are becoming more common and more important, but there are not many resources to help beginners get started. &amp;nbsp;People who know Python can use their programming skills to get a head start.&lt;br /&gt;
&lt;br /&gt;
I will present simple programs that demonstrate the concepts of Bayesian statistics, and apply them to a range of example problems. &amp;nbsp;Participants will work hands-on with example code and practice on example problems.&lt;br /&gt;
&lt;br /&gt;
Students should have at least basic Python and basic statistics. &amp;nbsp;If you learned about Bayes’s Theorem and probability distributions at some time, that’s enough, even if you don’t remember it!&lt;br /&gt;
&lt;br /&gt;
Students should bring a laptop with Python 2.x and matplotlib. &amp;nbsp;You can work in any environment; you just need to be able to download a Python program and run it.&lt;br /&gt;
&lt;br /&gt;&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/C3ZT0bLKzUw" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/226484891922511816/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2013/01/bayesian-statistics-made-simple.html#comment-form" title="2 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/226484891922511816?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/226484891922511816?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/C3ZT0bLKzUw/bayesian-statistics-made-simple.html" title="Bayesian Statistics Made Simple" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><thr:total>2</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2013/01/bayesian-statistics-made-simple.html</feedburner:origLink></entry><entry gd:etag="W/&quot;C0YBQXs9eSp7ImA9WhNUFks.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-1250036931403789226</id><published>2013-01-08T08:04:00.000-08:00</published><updated>2013-01-08T08:05:50.561-08:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2013-01-08T08:05:50.561-08:00</app:edited><title>Are first babies more likely to be late, revisited.</title><content type="html">Two years ago I wrote an article called &lt;a href="http://allendowney.blogspot.com/2011/02/are-first-babies-more-likely-to-be-late.html"&gt;Are first babies more likely to be late?&lt;/a&gt;, based on a 
question that came up when my wife and I were expecting our first child. &amp;nbsp;I compared the pool of first babies to the pool of all other babies, and found:&lt;br /&gt;
&lt;div&gt;
&lt;ul&gt;
&lt;li&gt;There is a small difference in the mean pregnancy length for the two groups, about 13 hours, but it is not practically or statistically significant.&lt;/li&gt;
&lt;li&gt;If we group babies into Early, On Time, or Late (where On Time is 38, 39 or 40 weeks), first babies are a little more likely to be Early or Late, and less likely to be On Time.&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;
Then yesterday I got the following question from an Unknown correspondent:&lt;/div&gt;
&lt;/div&gt;
&lt;blockquote class="tr_bq"&gt;
&lt;i&gt;While interesting, I can't help but think you need to compare the first and others for the same woman. While may be unlikely it could still be that a tendency exists for a woman's second, third, etc, child comes earlier.&lt;/i&gt;&lt;/blockquote&gt;
&lt;div&gt;
This is an excellent suggestion. &amp;nbsp;It is possible that the variability between people is masking some of the variability between first and later babies. &amp;nbsp;By pairing first and second babies with the same mother, we can control for variation between mothers.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
So I ran that experiment, selecting all mothers with at least two children and computing the difference in pregnancy length between the second and first child (so a positive value means the second child was later). &amp;nbsp;Here is the distribution of these values for 4387 women in the NSFG (National Survey of Family Growth):&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-FZIXmdLDeKM/UOxBik46hOI/AAAAAAAABAg/p4KS2VAYoe8/s1600/first_matched.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://1.bp.blogspot.com/-FZIXmdLDeKM/UOxBik46hOI/AAAAAAAABAg/p4KS2VAYoe8/s400/first_matched.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Visually the distribution looks symmetric, and the summary statistics support that conclusion. &amp;nbsp;The mean is -0.034 weeks, which means that (if anything) the second baby is born about 6 hours earlier, but this difference is not statistically significant.&lt;/div&gt;
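&lt;div&gt;
To make the pairing concrete, here is a minimal Python sketch of the computation described above. &amp;nbsp;It is an illustration, not the original analysis code: the function names and the four example pairs are made up, and it assumes pregnancy lengths are recorded in weeks, which is consistent with a mean of -0.034 corresponding to about 6 hours.&lt;/div&gt;
&lt;pre&gt;
```python
from statistics import mean

HOURS_PER_WEEK = 7 * 24


def paired_differences(pairs):
    """Return second-minus-first pregnancy length, in weeks, for each mother."""
    return [second - first for first, second in pairs]


def weeks_to_hours(weeks):
    """Convert a duration in weeks to hours."""
    return weeks * HOURS_PER_WEEK


# Made-up (first, second) pregnancy lengths in weeks; NOT NSFG data.
example_pairs = [(39, 40), (41, 39), (38, 38), (40, 39)]
diffs = paired_differences(example_pairs)
example_mean = mean(diffs)  # a positive value means the second child was later

# Convert the mean reported in the post (-0.034, assumed to be weeks) to hours.
reported_mean_hours = weeks_to_hours(-0.034)
print(diffs, example_mean, round(reported_mean_hours, 1))
```
&lt;/pre&gt;
&lt;div&gt;
With the made-up pairs the mean difference is -0.5 weeks. &amp;nbsp;The last line converts the reported mean of -0.034 weeks to about -5.7 hours.&lt;/div&gt;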
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Conclusion: good question, definitely worth running the experiment, but the primary result is the same as what we saw before: no significant difference in the means.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/__VWXoPEFok" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/1250036931403789226/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2013/01/are-first-babies-more-likely-to-be-late.html#comment-form" title="4 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/1250036931403789226?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/1250036931403789226?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/__VWXoPEFok/are-first-babies-more-likely-to-be-late.html" title="Are first babies more likely to be late, revisited." /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/-FZIXmdLDeKM/UOxBik46hOI/AAAAAAAABAg/p4KS2VAYoe8/s72-c/first_matched.png" height="72" width="72" /><thr:total>4</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2013/01/are-first-babies-more-likely-to-be-late.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DUMARX8-fyp7ImA9WhNUFUo.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-6129217578837875668</id><published>2013-01-07T08:50:00.000-08:00</published><updated>2013-01-07T08:50:44.157-08:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2013-01-07T08:50:44.157-08:00</app:edited><title>Call for Bayesian case studies</title><content 
type="html">It's been a while since the last post because I have been hard at work on &lt;i&gt;Think Bayes&lt;/i&gt;. &amp;nbsp;As always, I have been posting drafts as I go along, so you can read the current version at &lt;a href="http://thinkbayes.com/"&gt;thinkbayes.com&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
I am teaching Computational Bayesian Statistics in the spring, using the draft edition of the book. &amp;nbsp;The students will work on case studies, some of which will be included in the book. &amp;nbsp;And then I hope the book will be published as part of the &lt;i&gt;Think X&lt;/i&gt; series (for all &lt;i&gt;X&lt;/i&gt;). &amp;nbsp;At least, that's the plan.&lt;br /&gt;
&lt;br /&gt;
In the next couple of weeks, students will be looking for ideas for case studies. &amp;nbsp;An ideal project has at least some of these characteristics:&lt;br /&gt;
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;An interesting real-world application (preferably not a toy problem).&lt;/li&gt;
&lt;li&gt;Data that is either public or can be made available for use in the case study.&lt;/li&gt;
&lt;li&gt;Permission to publish the case study!&lt;/li&gt;
&lt;li&gt;A problem that lends itself to Bayesian analysis, in particular if there is a practical advantage to generating a posterior distribution rather than a point or interval estimate.&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;
Examples in the book include:&lt;/div&gt;
&lt;div&gt;
&lt;ul&gt;
&lt;li&gt;The hockey problem: estimating the rate of goals scored by two hockey teams in order to predict the outcome of a seven-game series.&lt;/li&gt;
&lt;li&gt;The paintball problem, a version of the lighthouse problem. &amp;nbsp;This one verges on being a toy problem, but recasting it in the context of paintball got it over the bar for me.&lt;/li&gt;
&lt;li&gt;The kidney problem. &amp;nbsp;This one is as real as it gets -- it was prompted by a question posted by a cancer patient who needed a statistical estimate of when a tumor formed.&lt;/li&gt;
&lt;li&gt;The unseen species problem: a nice Bayesian solution to a standard problem in ecology.&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;
So far I have a couple of ideas prompted by questions on Reddit:&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.reddit.com/r/statistics/comments/15rurz/question_about_continuous_bayesian_inference/"&gt;Estimating the trustworthiness of redditors&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.reddit.com/r/statistics/comments/1647yj/which_regression_technique/"&gt;Bayesian regression&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;
But I would love to get more ideas. &amp;nbsp;If you have a problem you would like to contribute, let me know!&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/kof7waZ-Etg" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/6129217578837875668/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2013/01/call-for-bayesian-case-studies.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/6129217578837875668?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/6129217578837875668?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/kof7waZ-Etg/call-for-bayesian-case-studies.html" title="Call for Bayesian case studies" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><thr:total>0</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2013/01/call-for-bayesian-case-studies.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DkYMQX05fSp7ImA9WhJREE8.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-6028977819806744241</id><published>2012-07-11T09:29:00.001-07:00</published><updated>2012-07-11T09:29:40.325-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-07-11T09:29:40.325-07:00</app:edited><title>Secularization in America: part seven</title><content type="html">&lt;br /&gt;
&lt;h3&gt;
&lt;span style="background-color: white;"&gt;Abstract&lt;/span&gt;&lt;/h3&gt;
&lt;div&gt;
&lt;span style="background-color: white;"&gt;Based on 2000-2010 data from the &lt;a href="http://www3.norc.org/gss+website/"&gt;General Social Survey&lt;/a&gt; (GSS), I present results of a logistic regression that measures the relationship between Internet use and religious affiliation, controlling for religious upbringing, income and socioeconomic index, year born (age), and education.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="background-color: white;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="background-color: white;"&gt;I find that moderate Internet use reduces the chance of religious affiliation by 2 percentage points (odds ratio 0.8); heavier Internet use reduces affiliation by an additional 5 percentage points (odds ratio 0.7).&lt;/span&gt;&lt;span style="background-color: white;"&gt; &amp;nbsp;&lt;/span&gt;&lt;span style="background-color: white;"&gt;Four years of college reduces affiliation by an additional 2 percentage points (odds ratio 0.8).&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="background-color: white;"&gt;All reported effects are statistically significant with N=8960 respondents.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="background-color: white;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="background-color: white;"&gt;Results of logistic regression can be difficult to interpret; it might help to imagine the following progression:&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;ol&gt;
&lt;li&gt;Start with a hypothetical baseline person raised in any religion, with moderate or high household income ($25,000 per year or more), born in 1960, with high school education but no college, and low Internet use (less than 2 hours per week): in the GSS survey, 91% of people in this category have a religious affiliation. &amp;nbsp;Now we change one variable at a time.&lt;/li&gt;
&lt;li&gt;If this person were born 10 years later (in 1970) the fraction would drop to 89%.&lt;/li&gt;
&lt;li&gt;If this person completed four years of college, the fraction would drop to 87%.&lt;/li&gt;
&lt;li&gt;If this person used the Internet 2 or more hours per week, the fraction would drop to 85%.&lt;/li&gt;
&lt;li&gt;If this person used the Internet 8 or more hours per week, the fraction would drop to 80%.&lt;span style="background-color: white;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
Taken together, college education and Internet use are associated with a decrease in religious affiliation of 9 percentage points.&lt;br /&gt;
&lt;h3&gt;
&lt;span style="background-color: white;"&gt;Introduction&lt;/span&gt;&lt;/h3&gt;
&lt;span style="background-color: white;"&gt;From 1990 to 2010 the fraction of Protestants in the U.S. population dropped from 62% to 51%; at the same time the fraction of people with no religious preference increased from 8% to 18%. &amp;nbsp;The following graph shows these trends:&lt;/span&gt;&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-61YLEVDZpiM/T_2ebhSP-MI/AAAAAAAAA70/TYuPz6us_bI/s1600/gss.1972-2010.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://2.bp.blogspot.com/-61YLEVDZpiM/T_2ebhSP-MI/AAAAAAAAA70/TYuPz6us_bI/s400/gss.1972-2010.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;span style="background-color: white;"&gt;In &lt;/span&gt;&lt;a href="http://allendowney.blogspot.com/2012/07/secularization-in-america-part-five.html" style="background-color: white;"&gt;a previous article&lt;/a&gt;&lt;span style="background-color: white;"&gt; I presented evidence that something happened in the 1990s, continuing through the 2000s, that is causing disaffiliation from religion across all generations, with the largest effect on the youngest generations in the survey, people born in the 1960s and 1970s.&lt;/span&gt;&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="background-color: white;"&gt;There are many possible explanations, but for me, the Internet pops to the top of this list. &amp;nbsp;First, the timing is at least approximately right. &amp;nbsp;Here is data from the &lt;/span&gt;&lt;a href="http://data.worldbank.org/indicator/IT.NET.USER.P2" style="background-color: white;"&gt;World Bank&lt;/a&gt;&lt;span style="background-color: white;"&gt;, showing number of Internet users per hundred people in the U.S.&lt;/span&gt;&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-Wb0EvfkIYrg/T_yThS-cJaI/AAAAAAAAA7o/m643k0MpNqI/s1600/gss.internet.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://4.bp.blogspot.com/-Wb0EvfkIYrg/T_yThS-cJaI/AAAAAAAAA7o/m643k0MpNqI/s400/gss.internet.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Internet use increased rapidly from 1995 to 2010, which is the interval of steepest change in religious affiliation.&lt;/div&gt;
&lt;h3&gt;
Regressions&lt;/h3&gt;
&lt;div&gt;
To identify factors that contribute to disaffiliation, I ran logistic regressions with the following dependent variable:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;has_relig&lt;/b&gt;: 1 if the respondent reported any religious affiliation when interviewed as an adult, or 0 if the respondent reported "None" (based on the GSS variable RELIG)&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
And these explanatory variables:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;had_relig&lt;/b&gt;: 1 if the respondent reported being raised in a religion, 0 otherwise (based on RELIG16)&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;born_from_1960&lt;/b&gt;: year the respondent was born minus 1960 (based on AGE and survey year). &amp;nbsp;Subtracting 1960 makes it easier to interpret the results of the regression.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;educ_from_12&lt;/b&gt;: number of years of school completed, minus 12 (based on EDUC).&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;somewww&lt;/b&gt;: 1 if the respondent reported using the Internet 2 or more hours per week, 0 otherwise (based on WWWHR, with the threshold chosen near the median)&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;heavywww&lt;/b&gt;: 1 if the respondent uses the Internet more than 8 hours per week, 0 otherwise (threshold chosen near the 75th percentile)&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;SEI&lt;/b&gt;: Socioeconomic index (a measure of occupational prestige developed by the GSS).&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;high_income&lt;/b&gt;: 1 if the respondent reports annual household income of $25,000 or more, 0 otherwise; this group includes 62% of respondents who answered the question.&lt;/div&gt;
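&lt;div&gt;
The variable definitions above can be sketched as a small Python function. &amp;nbsp;This is an illustration, not the original analysis code: the input dictionary and its key names are hypothetical and hold already-decoded answers rather than raw GSS codes. &amp;nbsp;SEI is left out because it is dropped from the final model.&lt;/div&gt;
&lt;pre&gt;
```python
def encode_respondent(raw):
    """Map one respondent's decoded answers to the model variables.

    `raw` is a hypothetical dict of already-decoded values, not raw GSS codes.
    """
    year_born = raw["survey_year"] - raw["age"]
    web_hours = raw["web_hours_per_week"]
    return {
        "has_relig": int(raw["religion"] != "None"),
        "had_relig": int(raw["religion_at_16"] != "None"),
        "born_from_1960": year_born - 1960,
        "educ_from_12": raw["years_of_school"] - 12,
        # Assumption: a heavy user also counts as a moderate user.
        "somewww": int(web_hours >= 2),
        "heavywww": int(web_hours > 8),
        "high_income": int(raw["household_income"] >= 25000),
    }


# A made-up respondent, for illustration only.
example = encode_respondent({
    "survey_year": 2010,
    "age": 40,
    "religion": "None",
    "religion_at_16": "Protestant",
    "years_of_school": 16,
    "web_hours_per_week": 10,
    "household_income": 50000,
})
print(example)
```
&lt;/pre&gt;
&lt;div&gt;
Coding a heavy user with both &lt;b&gt;somewww&lt;/b&gt; = 1 and &lt;b&gt;heavywww&lt;/b&gt; = 1 is an assumption; it is consistent with the Abstract, which describes the effect of heavier use as additional to the effect of moderate use.&lt;/div&gt;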
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
I used data from GSS survey years 2000, 2002, 2004, 2006, and 2010 (the relevant questions were not asked in 2008). &amp;nbsp;I excluded respondents who were not asked or did not answer one or more of the questions I used in my analysis.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
It turns out that SEI does not make a contribution that is either statistically or practically significant, so I omit it from the model.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Here are the results of the model as reported by R:&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Coefficients:&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Estimate Std. Error z value Pr(&amp;gt;|z|) &amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;(Intercept) &amp;nbsp; &amp;nbsp;-0.164434 &amp;nbsp; 0.094978 &amp;nbsp;-1.731 &amp;nbsp; 0.0834 . &amp;nbsp;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;had_relig &amp;nbsp; &amp;nbsp; &amp;nbsp; 2.318141 &amp;nbsp; 0.087372 &amp;nbsp;26.532 &amp;nbsp;&amp;lt; 2e-16 ***&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;high_income &amp;nbsp; &amp;nbsp; 0.166673 &amp;nbsp; 0.072345 &amp;nbsp; 2.304 &amp;nbsp; 0.0212 * &amp;nbsp;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;born_from_1960 -0.020161 &amp;nbsp; 0.002128 &amp;nbsp;-9.474 &amp;nbsp;&amp;lt; 2e-16 ***&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;educ_from_12 &amp;nbsp; -0.051850 &amp;nbsp; 0.012228 &amp;nbsp;-4.240 2.23e-05 ***&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;somewww &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;-0.178409 &amp;nbsp; 0.078490 &amp;nbsp;-2.273 &amp;nbsp; 0.0230 * &amp;nbsp;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;heavywww &amp;nbsp; &amp;nbsp; &amp;nbsp; -0.336658 &amp;nbsp; 0.080546 &amp;nbsp;-4.180 2.92e-05 ***&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;---&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;(Dispersion parameter for binomial family taken to be 1)&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; Null deviance: 7860.3 &amp;nbsp;on 8959 &amp;nbsp;degrees of freedom&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Residual deviance: 6872.5 &amp;nbsp;on 8953 &amp;nbsp;degrees of freedom&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;AIC: 6886.5&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Number of Fisher Scoring iterations: 5&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
All explanatory variables are statistically significant, although &lt;b&gt;high_income&lt;/b&gt; and &lt;b&gt;somewww&lt;/b&gt; are borderline, both at p=0.02.&lt;/div&gt;
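&lt;div&gt;
As a check on the borderline cases, the z values and p-values in the R output can be reproduced from the estimates and standard errors. &amp;nbsp;Here is a minimal sketch, assuming the usual Wald test with a two-sided normal p-value; the function name is made up for illustration.&lt;/div&gt;
&lt;pre&gt;
```python
import math


def wald_z_and_p(estimate, std_error):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| exceeds |z|) equals erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# (estimate, standard error) copied from the R output above.
borderline = {
    "high_income": (0.166673, 0.072345),
    "somewww": (-0.178409, 0.078490),
}
results = {name: wald_z_and_p(*coef) for name, coef in borderline.items()}
for name, (z, p) in results.items():
    print(f"{name}: z = {z:.3f}, p = {p:.4f}")
```
&lt;/pre&gt;
&lt;div&gt;
For both variables this gives p close to 0.02, in agreement with the R output.&lt;/div&gt;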
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
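The odds ratios and cumulative probabilities (in percent) are shown below. &amp;nbsp;The odds ratio for &lt;b&gt;born_from_1960&lt;/b&gt; is for a 10-year difference and the one for &lt;b&gt;educ_from_12&lt;/b&gt; is for 4 additional years of school, matching the progression in the Abstract:&lt;/div&gt;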
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; odds &amp;nbsp; &amp;nbsp;cumulative&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ratio &amp;nbsp; probability&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;(Intercept)&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.85&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;46&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; had_relig&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;10.16&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;90&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;high_income&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;1.18&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;91&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;born_from_1960&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.82&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;89&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;educ_from_12&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.81&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;87&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; somewww&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.84&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;85&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #38761d; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp;heavywww&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.71&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;80&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
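&lt;div&gt;
The table can be reproduced from the regression coefficients. &amp;nbsp;Here is a minimal sketch in Python, not the original code: each odds ratio is exp(coefficient times units), and each cumulative probability applies the logistic function to the running sum of log-odds. &amp;nbsp;The multipliers of 10 years for &lt;b&gt;born_from_1960&lt;/b&gt; and 4 years for &lt;b&gt;educ_from_12&lt;/b&gt; are assumptions inferred from the Abstract.&lt;/div&gt;
&lt;pre&gt;
```python
import math

# (name, coefficient from the R output, units applied in the progression).
# The 10-year and 4-year multipliers are assumptions inferred from the Abstract.
steps = [
    ("(Intercept)", -0.164434, 1),
    ("had_relig", 2.318141, 1),
    ("high_income", 0.166673, 1),
    ("born_from_1960", -0.020161, 10),
    ("educ_from_12", -0.051850, 4),
    ("somewww", -0.178409, 1),
    ("heavywww", -0.336658, 1),
]


def cumulative_table(steps):
    """Return (name, odds ratio, cumulative probability in percent) rows."""
    rows = []
    log_odds = 0.0
    for name, coef, units in steps:
        delta = coef * units
        log_odds += delta  # change one variable at a time, keeping earlier changes
        prob = 1 / (1 + math.exp(-log_odds))  # logistic function
        rows.append((name, round(math.exp(delta), 2), round(100 * prob)))
    return rows


table = cumulative_table(steps)
for name, odds_ratio, percent in table:
    print(f"{name:15s} {odds_ratio:6.2f} {percent:4d}")
```
&lt;/pre&gt;
&lt;div&gt;
Under those assumptions the rows match the table above, from 46 for the intercept alone down to 80 for a heavy Internet user.&lt;/div&gt;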
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
These results are summarized and interpreted in the Abstract, above.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h3&gt;
Discussion&lt;/h3&gt;
&lt;div&gt;
As always, statistical association does not prove causation, but in this case I think there are reasons to believe that Internet use causes disaffiliation from religion:&lt;/div&gt;
&lt;div&gt;
&lt;ol&gt;
&lt;li&gt;It is easy to imagine how Internet use could allow a person in a homogeneous community to find information about people of other religions (and none), and to interact with them personally. &amp;nbsp;And there is anecdotal evidence that those interactions contribute to religious disaffiliation (for example, numerous personal reports on &lt;a href="http://reddit.com/r/atheism"&gt;reddit.com/r/atheism&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Conversely, it is harder to imagine plausible reasons why disaffiliation might cause increased Internet use (except possibly on Sunday mornings).&lt;/li&gt;
&lt;li&gt;Although it is possible that a third factor causes both disaffiliation and Internet use, that factor would also have to be new, coincidentally rising in prevalence, like the Internet, during the 1990s and 2000s.&lt;/li&gt;
&lt;li&gt;Whatever causes disaffiliation has the strongest effect on the youngest generations, which is consistent with the hypothesis that Internet use during adolescence and young adulthood has the strongest effect on religious affiliation.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;span style="background-color: white;"&gt;So with appropriate caution, I think there is a strong case here for causation, and not just statistical association.&lt;/span&gt;&lt;br /&gt;
&lt;span style="background-color: white;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style="background-color: white;"&gt;Furthermore, the magnitude of the effect is large enough to explain a substantial part of the observed changes in religious affiliation. &amp;nbsp;In my next article I will incorporate this regression model into the generational model I presented in &lt;a href="http://allendowney.blogspot.com/2012/07/secularization-in-america-part-six.html"&gt;Part Six&lt;/a&gt;, in order to estimate the effect of Internet use on these trends.&lt;/span&gt;&lt;br /&gt;
&lt;h3&gt;
&lt;span style="background-color: white;"&gt;Summary of previous reports&lt;/span&gt;&lt;/h3&gt;
&lt;span style="background-color: white;"&gt;In&lt;/span&gt;&lt;span style="background-color: white;"&gt;&amp;nbsp;&lt;/span&gt;&lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-one.html" style="background-color: white;"&gt;Part One&lt;/a&gt;&lt;span style="background-color: white;"&gt;&amp;nbsp;&lt;/span&gt;&lt;span style="background-color: white;"&gt;I described trends in market share of major religions in the U.S.: since 1988, the fraction of Protestants dropped from 60% to 51%, and&amp;nbsp;&lt;/span&gt;&lt;span style="background-color: white;"&gt;the fraction of people with no religious affiliation increased from 8% to 18%.&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
In&amp;nbsp;&lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-two.html"&gt;Part Two&lt;/a&gt;&amp;nbsp;I used data from the 1988 General Social Survey (GSS) to model transmission of religion from parent to child, and found that the model failed to predict the decrease in Protestants and the increase in Nones that occurred between 1988 and 2010.&lt;br /&gt;
&lt;br /&gt;
In&amp;nbsp;&lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-three.html"&gt;Part Three&lt;/a&gt;&amp;nbsp;I looked at changes, between 1988 and 2008, in the spouse tables (which describe the tendencies of people to marry within their religions), the environment table (which describes parents' decisions about their children's religious upbringing), and the transmission table (which describes the likely outcomes for children raised within each religion). &amp;nbsp;I found that the transmission table has changed substantially since 1988, and accounts for a large part of the observed increase in Nones, but not the decrease in Protestants.&lt;br /&gt;
&lt;br /&gt;
In&amp;nbsp;&lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-four.html"&gt;Part Four&lt;/a&gt;&amp;nbsp;I looked at changes in religiosity over the lifetime of respondents. &amp;nbsp;I tentatively concluded that the differences between generations were larger than changes in affiliation, within generations, over time.&lt;br /&gt;
&lt;br /&gt;
But in&amp;nbsp;&lt;a href="http://allendowney.blogspot.com/2012/07/secularization-in-america-part-five.html"&gt;Part Five&lt;/a&gt;&amp;nbsp;I looked more closely and saw that all generations were becoming more religious, or staying the same, prior to 1990, and that all generations began to disaffiliate during the 1990s, continuing into the 2000s.&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
In&amp;nbsp;&lt;a href="http://allendowney.blogspot.com/2012/07/secularization-in-america-part-six.html"&gt;Part Six&lt;/a&gt;&amp;nbsp;I presented a generational model that retroactively "predicts" the changes we have seen since 1988, and used it to predict how those changes are likely to continue in the next 30 years. &amp;nbsp;I expect the fraction of Protestants to continue to decrease, and the fraction of Nones to increase and overtake Catholic as the second-largest affiliation by 2030.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/5X0RrAm39Kw" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/6028977819806744241/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2012/07/secularization-in-america-part-seven.html#comment-form" title="12 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/6028977819806744241?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/6028977819806744241?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/5X0RrAm39Kw/secularization-in-america-part-seven.html" title="Secularization in America: part seven" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/-61YLEVDZpiM/T_2ebhSP-MI/AAAAAAAAA70/TYuPz6us_bI/s72-c/gss.1972-2010.png" height="72" width="72" /><thr:total>12</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2012/07/secularization-in-america-part-seven.html</feedburner:origLink></entry><entry gd:etag="W/&quot;AkQFSH05eyp7ImA9WhJSGU8.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-4919761051919806222</id><published>2012-07-10T06:51:00.003-07:00</published><updated>2012-07-10T06:51:59.323-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-07-10T06:51:59.323-07:00</app:edited><title>Secularization in America: part six</title><content 
type="html">&lt;br /&gt;
&lt;h3&gt;
&lt;span style="background-color: white;"&gt;Summary so far&lt;/span&gt;&lt;/h3&gt;
&lt;span style="background-color: white;"&gt;In&lt;/span&gt;&lt;span style="background-color: white;"&gt;&amp;nbsp;&lt;/span&gt;&lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-one.html" style="background-color: white;"&gt;Part One&lt;/a&gt;&lt;span style="background-color: white;"&gt;&amp;nbsp;&lt;/span&gt;&lt;span style="background-color: white;"&gt;I described trends in market share of major religions in the U.S.: since 1988, the fraction of Protestants dropped from 60% to 51%, and&amp;nbsp;&lt;/span&gt;&lt;span style="background-color: white;"&gt;the fraction of people with no religious affiliation increased from 8% to 18%.&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
In&amp;nbsp;&lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-two.html"&gt;Part Two&lt;/a&gt;&amp;nbsp;I used data from the 1988 General Social Survey (GSS) to model transmission of religion from parent to child, and found that the model failed to predict the decrease in Protestants and the increase in Nones that occurred between 1988 and 2010.&lt;br /&gt;
&lt;br /&gt;
In&amp;nbsp;&lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-three.html"&gt;Part Three&lt;/a&gt;&amp;nbsp;I looked at changes, between 1988 and 2008, in the spouse tables (which describe the tendencies of people to marry within their religions), the environment table (which describes parents' decisions about their children's religious upbringing), and the transmission table (which describes the likely outcomes for children raised within each religion). &amp;nbsp;I found that the transmission table has changed substantially since 1988, and accounts for a large part of the observed increase in Nones, but not the decrease in Protestants.&lt;br /&gt;
&lt;br /&gt;
In&amp;nbsp;&lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-four.html"&gt;Part Four&lt;/a&gt;&amp;nbsp;I looked at changes in religiosity over the lifetime of respondents. &amp;nbsp;I tentatively concluded that the differences between generations were larger than changes in affiliation, within generations, over time.&lt;br /&gt;
&lt;br /&gt;
But in &lt;a href="http://allendowney.blogspot.com/2012/07/secularization-in-america-part-five.html"&gt;Part Five&lt;/a&gt; I looked more closely and saw that all generations were becoming more religious, or staying the same, prior to 1990, and that all generations began to disaffiliate during the 1990s, continuing into the 2000s.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Generational Model&lt;/h3&gt;
Now I am ready to get back to the generational model I have been working up to. &amp;nbsp;The goal of the generational model is to separate these three effects:&lt;br /&gt;
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Changes in religious preference from one generation to the next.&lt;/li&gt;
&lt;li&gt;Changes in religious affiliation over the lifetime of respondents.&lt;/li&gt;
&lt;li&gt;Changes in the composition of the GSS cohort over time.&lt;/li&gt;
&lt;/ol&gt;
&lt;div&gt;
The model works by simulation. &amp;nbsp;Assuming that we are starting in 1988, here are the steps:&lt;/div&gt;
&lt;div&gt;
&lt;ol&gt;
&lt;li&gt;Read the survey data from 1988 and resample it. &amp;nbsp;Compute and store the distribution of ages.&lt;/li&gt;
&lt;li&gt;For each respondent, generate a hypothetical child. &amp;nbsp;Use the BirthModel to determine year of birth, the UpbringingModel to determine what religion the child is raised in, and the TransmissionModel to determine what affiliation the child will have as an adult. &amp;nbsp;Details of these models follow.&lt;/li&gt;
&lt;li&gt;Form a combined cohort of parents and simulated children. &amp;nbsp;Since the cohort of parents is a representative sample of the US population, the cohort of simulated children is a representative sample of the population one generation later (based, for now, on the simplifying assumptions that all groups have the same number of children on average, and there is no immigration).&lt;/li&gt;
&lt;li&gt;In order to generate a cohort from a future survey year, draw a sample from the combined cohort, weighted so that the distribution of ages in the future year is the same as the original distribution of ages in 1988. &amp;nbsp;As the simulation goes forward in time, this generated cohort contains fewer of the parents and more of the simulated children. &amp;nbsp;After 20 years, about 25% of the "real" respondents have been replaced with "fake" respondents.&lt;/li&gt;
&lt;/ol&gt;
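&lt;div&gt;
The four steps above can be sketched in a few lines of Python. This is only a minimal sketch, not the code used for these results: the function names, the (birth_year, religion) tuples, and the three model callables are hypothetical, and dividing target counts by available counts is just one simple way to do the weighting in step 4.&lt;/div&gt;
&lt;pre&gt;
```python
import random
from collections import Counter

def make_child(parent, models, rng):
    """Generate one hypothetical child for a (birth_year, religion) parent."""
    birth_model, upbringing_model, transmission_model = models
    birth_year, religion = parent
    child_year = birth_year + birth_model(rng)   # parent's age at the birth
    raised = upbringing_model(religion, rng)     # religion the child is raised in
    adult = transmission_model(raised, rng)      # affiliation as an adult
    return (child_year, adult)

def simulate_cohort(respondents, survey_year, future_year, models, n, seed=0):
    """Sketch of steps 1-4: resample, add children, reweight by age."""
    rng = random.Random(seed)
    # Step 1: resample the survey and store the distribution of ages.
    parents = rng.choices(respondents, k=len(respondents))
    age_counts = Counter(survey_year - year for year, _ in parents)
    # Steps 2 and 3: one simulated child per parent, then combine.
    children = [make_child(p, models, rng) for p in parents]
    combined = parents + children
    # Step 4: weight each person so that ages in the future year follow
    # the original age distribution (target count / available count).
    future_ages = Counter(future_year - year for year, _ in combined)
    weights = [age_counts[future_year - year] / future_ages[future_year - year]
               for year, _ in combined]
    return rng.choices(combined, weights=weights, k=n)
```
&lt;/pre&gt;
&lt;div&gt;
Here each model is assumed to be a function that takes a random number generator (and, for the last two, a religion) and returns a sampled value. People whose age in the future year does not appear in the original age distribution get weight zero, so they are never drawn.&lt;/div&gt;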
&lt;div&gt;
Now, where do all these auxiliary models come from?&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;BirthModel&lt;/b&gt;: This is just the distribution of the parent's age when each child is born. &amp;nbsp;It is based on data from the 1994 GSS, which includes questions about children. &amp;nbsp;I had to do some work to correct for an obvious bias due to the ages of the respondents; I will skip the details here.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;UpbringingModel&lt;/b&gt;: This is a combination of the SpouseTable and the EnvironmentTable, described in &lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-three.html"&gt;Part Three&lt;/a&gt;. &amp;nbsp;It is a map from the parent's religion to the distribution of possible religions the child might be raised in.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;TransmissionModel&lt;/b&gt;: This is the TransmissionTable described in Part Three. &amp;nbsp;It is a map from the religious environment of the child to the distribution of religious affiliation reported by the child as an adult.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The Upbringing and Transmission models come in two flavors:&lt;/div&gt;
&lt;div&gt;
&amp;nbsp;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;Time invariant&lt;/b&gt;: We use all respondents to estimate the parameters of the model, and apply the same model to generate all simulated children.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;b&gt;Time variant&lt;/b&gt;: We estimate different parameters for each generation (partitioned by decade of birth) and use different models to generate simulated children, depending on the year they are born.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
For the time variant model, we have to extrapolate from observed data into the future. &amp;nbsp;To keep this simple we copy the latest reliable data (based on sample size) and apply it to people born in later decades.&lt;/div&gt;
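&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
One way to picture this extrapolation is the lookup below. It is a sketch under stated assumptions rather than the actual implementation: the function names, the dictionaries keyed by decade, and the min_size threshold of 100 respondents are all hypothetical stand-ins for "reliable (based on sample size)".&lt;/div&gt;
&lt;pre&gt;
```python
def decade_of(year):
    """Return the decade a birth year falls in, e.g. 1964 -> 1960."""
    return year // 10 * 10

def choose_table(tables, sample_sizes, birth_year, min_size=100):
    """Pick the table for a birth decade; decades without reliable data
    reuse the latest reliable decade that precedes them."""
    # Decades with enough respondents to trust the estimate (assumed rule).
    reliable = sorted(d for d in tables if sample_sizes[d] >= min_size)
    decade = decade_of(birth_year)
    # Reliable decades at or before the requested one, oldest first.
    earlier = [d for d in reliable if decade >= d]
    if earlier:
        return tables[earlier[-1]]
    # Born before any reliable decade: fall back to the earliest one.
    return tables[reliable[0]]
```
&lt;/pre&gt;
&lt;div&gt;
With reliable tables for the 1960s and 1970s and a small sample for the 1980s, for example, someone born in 1985 or in 2003 would get the 1970s table.&lt;/div&gt;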
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Ok, that's enough methodology for now. &amp;nbsp;Let's take a look at some...&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h3&gt;
Results&lt;/h3&gt;
&lt;div&gt;
The first step is to validate the model by showing that it can predict the observed changes using past data. &amp;nbsp;Here&amp;nbsp;&lt;span style="background-color: white;"&gt;I mean "predict" in a peculiar sense, which is that I will use the entire dataset (including data after 1988) to build the auxiliary models, then use the simulator to generate trends from 1988 to 2010.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Here is what the results look like&lt;span style="background-color: white;"&gt;:&lt;/span&gt;&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-UwOeVI5_Q4A/T_svNdMfmqI/AAAAAAAAA6g/L9RbwyfHZGY/s1600/gss.model.1.pcn.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://4.bp.blogspot.com/-UwOeVI5_Q4A/T_svNdMfmqI/AAAAAAAAA6g/L9RbwyfHZGY/s400/gss.model.1.pcn.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-lrkvdwtovCo/T_sv-0ZYwNI/AAAAAAAAA6o/0Au0cyfDjnU/s1600/gss.model.1.jo.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://1.bp.blogspot.com/-lrkvdwtovCo/T_sv-0ZYwNI/AAAAAAAAA6o/0Au0cyfDjnU/s400/gss.model.1.jo.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The thick lines are the observed data; the thin lines are simulations. &amp;nbsp;Here are my observations:&lt;/div&gt;
&lt;div&gt;
&lt;ol&gt;
&lt;li&gt;For Jews and Catholics, the observed data falls within the bounds of the simulations, so the model validates.&lt;/li&gt;
&lt;li&gt;For Other, the observed data sometimes exceeds the bounds of the simulations, which may be due to immigration (not included in this model).&lt;/li&gt;
&lt;li&gt;For None, the observed data is at the high end of the range, and for Prot it is at the low end of the range. &amp;nbsp;This is most likely due to the disaffiliation we saw in &lt;a href="http://allendowney.blogspot.com/2012/07/secularization-in-america-part-five.html"&gt;Part Five&lt;/a&gt;, which is only&amp;nbsp;partly captured in this model.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div&gt;
I conclude that the model is capturing a large part of the observed changes since 1988, but of course I am cheating by using data from after 1988. &amp;nbsp;So these results validate my modeling decisions (what to include and what to leave out) but they don't test the predictive power of the model.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h3&gt;
Predictive power&lt;/h3&gt;
&lt;div&gt;
To make an honest test, we have to restrict ourselves to data from before 1988. &amp;nbsp;That way we can tell what part of the observed changes would have been predictable in 1988.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Here's what the result looks like:&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-chkWYerm_Rc/T_s2GpvnFhI/AAAAAAAAA60/JI7spVX6auk/s1600/gss.model.1988.pcn.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://3.bp.blogspot.com/-chkWYerm_Rc/T_s2GpvnFhI/AAAAAAAAA60/JI7spVX6auk/s400/gss.model.1988.pcn.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-23VM7oEQ00w/T_s2OwvSb3I/AAAAAAAAA68/r_E3RNXYkvA/s1600/gss.model.1988.oj.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://1.bp.blogspot.com/-23VM7oEQ00w/T_s2OwvSb3I/AAAAAAAAA68/r_E3RNXYkvA/s400/gss.model.1988.oj.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div&gt;
So if we had used this model in 1988, we would have predicted a small decrease in the fraction of Protestants and a small increase in None, but we would have underestimated both trends.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
This supports my conclusion in &lt;a href="http://allendowney.blogspot.com/2012/07/secularization-in-america-part-five.html"&gt;Part Five&lt;/a&gt; that something happened in the 1990s that changed trends in religious affiliation, and suggests that these changes were unpredictable based on data observable before 1988.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h3&gt;
Predictions&lt;/h3&gt;
&lt;div&gt;
Finally, we can use all data to build the models, use 2010 as the starting place for the simulations, and make some predictions for the next 30 years:&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-z3Wx5FGuTk8/T_s-iQYeM3I/AAAAAAAAA7I/n5yEDEBFHMk/s1600/gss.model.2010.2040.pcn.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://4.bp.blogspot.com/-z3Wx5FGuTk8/T_s-iQYeM3I/AAAAAAAAA7I/n5yEDEBFHMk/s400/gss.model.2010.2040.pcn.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-2hdGuWgc_eQ/T_s-xefN88I/AAAAAAAAA7Q/MLoodW4wjCM/s1600/gss.model.2010.2040.oj.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://3.bp.blogspot.com/-2hdGuWgc_eQ/T_s-xefN88I/AAAAAAAAA7Q/MLoodW4wjCM/s400/gss.model.2010.2040.oj.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
So what should we expect?&lt;/div&gt;
&lt;div&gt;
&lt;ol&gt;
&lt;li&gt;The decline in fraction of Protestants will continue. &amp;nbsp;The fraction of Catholics will also decrease, but more slowly.&lt;/li&gt;
&lt;li&gt;The fraction of Nones will increase, overtaking Catholics as the second-largest religious affiliation around 2030.&lt;/li&gt;
&lt;li&gt;The fraction of Others will increase slowly, about 1 percentage point in 30 years. &amp;nbsp;If immigration from Asia continues at current rates, that would add another percentage point, bringing the total to 6%.&lt;/li&gt;
&lt;li&gt;The fraction of Jews will decrease, possibly by half by 2040.&lt;/li&gt;
&lt;/ol&gt;
&lt;div&gt;
These predictions are likely to be conservative; that is, the rate of secularization will almost certainly be faster. &amp;nbsp;Why?&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;ol&gt;
&lt;li&gt;Over the last several generations, the UpbringingModel and the TransmissionModel have changed substantially. &amp;nbsp;Parents are less likely to raise their children with religion, and those children are less likely to adopt the religion they are raised with. &amp;nbsp;The model captures these trends, but assumes that they will level off in 2010. &amp;nbsp;It would probably be more accurate to assume that they will continue.&lt;/li&gt;
&lt;li&gt;Rates of disaffiliation among adults are also increasing. &amp;nbsp;Again, the model includes trends that have already occurred, but it assumes that they will level off rather than continue.&lt;/li&gt;
&lt;/ol&gt;
&lt;div&gt;
So there are reasons to expect the growth in the fraction of Nones to accelerate.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Conversely, it is hard to imagine that the trends will be any slower than these predictions. &amp;nbsp;&lt;span style="background-color: white;"&gt;To a large extent, these results are not predictions about things that will happen in the future; rather, they are the future consequences of things that have already happened. &amp;nbsp;For example, in 2020, the GSS will include a cohort of people in their 40s. &amp;nbsp;What will they be like? &amp;nbsp;They will be a lot like the people in the 2010 survey who are in their 30s. &amp;nbsp;But they will be older. &amp;nbsp;Changes in the general population are slow because it takes a long time to replace each generation with the next; but as a result, they are also predictable.&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Next time: Was &lt;a href="http://www.youtube.com/watch?v=20pjeeQ611s"&gt;Rick Santorum&lt;/a&gt; right?&lt;span style="background-color: white;"&gt;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;span style="background-color: white;"&gt;Is college the #1 enemy of religious belief? &amp;nbsp;&lt;/span&gt;&lt;span style="background-color: white;"&gt;(Hint: no.) &amp;nbsp;&lt;/span&gt;&lt;span style="background-color: white;"&gt;I will look more closely at the TransmissionModel to see what factors make vertical transmission of religion more (or less) likely.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="background-color: white;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;br /&gt;&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/QgoA0lUNegE" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/4919761051919806222/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2012/07/secularization-in-america-part-six.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/4919761051919806222?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/4919761051919806222?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/QgoA0lUNegE/secularization-in-america-part-six.html" title="Secularization in America: part six" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://4.bp.blogspot.com/-UwOeVI5_Q4A/T_svNdMfmqI/AAAAAAAAA6g/L9RbwyfHZGY/s72-c/gss.model.1.pcn.png" height="72" width="72" /><thr:total>0</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2012/07/secularization-in-america-part-six.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CEQAQXgyfyp7ImA9WhJSGEg.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-2358524410634109286</id><published>2012-07-09T09:45:00.000-07:00</published><updated>2012-07-09T09:45:40.697-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-07-09T09:45:40.697-07:00</app:edited><title>Secularization in America: part five</title><content type="html">&lt;br /&gt;
&lt;h3&gt;
&lt;span style="background-color: white;"&gt;Summary so far&lt;/span&gt;&lt;/h3&gt;
&lt;span style="background-color: white;"&gt;In&lt;/span&gt;&lt;span style="background-color: white;"&gt;&amp;nbsp;&lt;/span&gt;&lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-one.html" style="background-color: white;"&gt;Part One&lt;/a&gt;&lt;span style="background-color: white;"&gt;&amp;nbsp;&lt;/span&gt;&lt;span style="background-color: white;"&gt;I described trends in market share of major religions in the U.S.: since 1988, the fraction of Protestants dropped from 60% to 51%, and&amp;nbsp;&lt;/span&gt;&lt;span style="background-color: white;"&gt;the fraction of people with no religious affiliation increased from 8% to 18%.&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
In&amp;nbsp;&lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-two.html"&gt;Part Two&lt;/a&gt;&amp;nbsp;I used data from the 1988 General Social Survey (GSS) to model transmission of religion from parent to child, and found that the model failed to predict the decrease in Protestants and the increase in Nones that occurred between 1988 and 2010.&lt;br /&gt;
&lt;br /&gt;
In&amp;nbsp;&lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-three.html"&gt;Part Three&lt;/a&gt;&amp;nbsp;I looked at changes, between 1988 and 2008, in the spouse tables (which describe the tendencies of people to marry within their religions), the environment table (which describes parents' decisions about their children's religious upbringing), and the transmission table (which describes the likely outcomes for children raised within each religion). &amp;nbsp;I found that the transmission table has changed substantially since 1988, and accounts for a large part of the observed increase in Nones, but not the decrease in Protestants.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Part Four revisited&lt;/h3&gt;
In &lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-four.html"&gt;Part Four&lt;/a&gt; I looked at changes in religiosity over the lifetime of respondents. &amp;nbsp;The GSS is not a longitudinal survey, so we can't follow individuals, but we can follow generations (which I partition by decade of birth) over time.&lt;br /&gt;
&lt;br /&gt;
Last time I presented this figure, which shows religiosity (the fraction of respondents with any religious preference) as a function of respondent's age, partitioned by decade of birth, for people who were raised Protestant:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-s7kXZhIHf2E/T_rvp994HxI/AAAAAAAAA5o/1-nD5zGH23c/s1600/gss.religiosity.prot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://4.bp.blogspot.com/-s7kXZhIHf2E/T_rvp994HxI/AAAAAAAAA5o/1-nD5zGH23c/s400/gss.religiosity.prot.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
Each line represents a different generation. &amp;nbsp;For example, the red line shows that people born in the 1920s were about 96% likely to report a religious preference when they were interviewed in their 40s, 50s, and 60s, and possibly less likely to be religious when they were in their 80s.&lt;br /&gt;
&lt;br /&gt;
The conclusion I drew from this figure is that the differences between generations are larger than the changes, over time, within each generation. &amp;nbsp;For purposes of modeling I concluded that religious disaffiliation accounts for only a small part of the observed changes in religious identity.&lt;br /&gt;
&lt;br /&gt;
But I was bothered by one feature of these curves: many of them are concave down, and the maximum point in the curves is apparently shifting toward younger ages. &amp;nbsp;I came to suspect that this picture of the data is "out of focus".&lt;br /&gt;
&lt;br /&gt;
We can refocus the image by plotting the date of the survey (rather than the respondent's age) on the x-axis. &amp;nbsp;Here's what that looks like:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-d8ocb_y7yis/T_r4obx6M8I/AAAAAAAAA50/L7py8Wv1-Pc/s1600/gss.religiosity.by.year.prot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://2.bp.blogspot.com/-d8ocb_y7yis/T_r4obx6M8I/AAAAAAAAA50/L7py8Wv1-Pc/s400/gss.religiosity.by.year.prot.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
In this figure, two trends are more apparent: before 1990, most generations were becoming more religious; after 1990, they all became less religious. &amp;nbsp;So it seems clear that the explanation is something that affected all generations at a particular interval in time, not something that affects all people as they age.&lt;br /&gt;
&lt;br /&gt;
We can see these changes more clearly by normalizing each curve with its 1990 value:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-HdKqjfEH8pk/T_r6Ru8mx-I/AAAAAAAAA58/IGCi9cJ13Jw/s1600/gss.religiosity.by.year.normalized.prot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://2.bp.blogspot.com/-HdKqjfEH8pk/T_r6Ru8mx-I/AAAAAAAAA58/IGCi9cJ13Jw/s400/gss.religiosity.by.year.normalized.prot.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
Again, most generations were becoming more religious before 1990; after 1990, all of them became less religious. &amp;nbsp;Among people born in the 1960s, more than 10% lost their religion between 1990 and 2010 (when they were in their 30s and 40s).&lt;br /&gt;
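&lt;br /&gt;
The normalization behind this figure can be sketched as follows. This is a minimal illustration, not the code that produced the plot: the function name and the dictionary layout are hypothetical, and it assumes each curve has a value for the base year.&lt;br /&gt;
&lt;pre&gt;
```python
def normalize_curve(curve, base_year=1990):
    """Divide every value in a {survey_year: religiosity} curve by its
    value in base_year, so the normalized curve equals 1.0 at base_year."""
    base = curve[base_year]
    return {year: value / base for year, value in curve.items()}
```
&lt;/pre&gt;
After normalizing, a value of 0.9 in 2010 means that religiosity in that generation is 10% lower than it was in 1990, which makes curves that start at different levels directly comparable.&lt;br /&gt;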
&lt;br /&gt;
Here's the same graph for people raised Catholic:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-c2zOg3kkAk4/T_sDkHwUv1I/AAAAAAAAA6I/Hqt9SvfUp78/s1600/gss.religiosity.by.year.normalized.cath.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://4.bp.blogspot.com/-c2zOg3kkAk4/T_sDkHwUv1I/AAAAAAAAA6I/Hqt9SvfUp78/s400/gss.religiosity.by.year.normalized.cath.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
The general shape is the same: religious affiliation was flat or increasing prior to 1990, and decreasing for almost all generations after 1990.&lt;br /&gt;
&lt;br /&gt;
Since the trends are similar for Catholics and Protestants, we can get a less noisy picture by combining them. &amp;nbsp;Here is the same graph for respondents raised with any religion.&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-cR2ABrtoXMw/T_sHIx7S4gI/AAAAAAAAA6U/fhaeNhHK5Gc/s1600/gss.religiosity.by.year.normalized.any.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://3.bp.blogspot.com/-cR2ABrtoXMw/T_sHIx7S4gI/AAAAAAAAA6U/fhaeNhHK5Gc/s400/gss.religiosity.by.year.normalized.any.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
This figure makes it easier to compare across generations. &amp;nbsp;It appears that more recent generations (born in the 1960s and 1970s) are disaffiliating at higher rates than earlier generations.&lt;br /&gt;
&lt;br /&gt;[As an aside, this result contradicts one of the primary (and widely reported) claims of this article: Schwadel,&amp;nbsp;&lt;i&gt;&lt;a href="http://onlinelibrary.wiley.com/doi/10.1111/j.1468-5906.2010.01511.x/abstract"&gt;Period and Cohort Effects on Religious Nonaffiliation and Religious Disaffiliation&lt;/a&gt;. &amp;nbsp;&lt;/i&gt;Schwadel reports that people born in the 1960s and 1970s were disaffiliating at a slower rate than the previous generations. &amp;nbsp;Some reasons my results might be different: Schwadel only had GSS data up to 2006, and he discards people under 30 years of age. &amp;nbsp;So very little data about the youngest generations is included. &amp;nbsp;Also, his results are based on statistical models that (if I understand correctly) don't include time as an explanatory variable, so they cannot account for an event that affects all generations during a particular interval.&lt;span style="background-color: white;"&gt;]&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
All right, it's audience participation time. &amp;nbsp;What happened in the 1990s that caused widespread religious disaffiliation? &amp;nbsp;Remember, idle speculations only. &amp;nbsp;No evidence, please!&lt;br /&gt;
&lt;br /&gt;&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/IGhawPgP8T0" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/2358524410634109286/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2012/07/secularization-in-america-part-five.html#comment-form" title="6 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/2358524410634109286?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/2358524410634109286?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/IGhawPgP8T0/secularization-in-america-part-five.html" title="Secularization in America: part five" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://4.bp.blogspot.com/-s7kXZhIHf2E/T_rvp994HxI/AAAAAAAAA5o/1-nD5zGH23c/s72-c/gss.religiosity.prot.png" height="72" width="72" /><thr:total>6</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2012/07/secularization-in-america-part-five.html</feedburner:origLink></entry><entry gd:etag="W/&quot;A0QAQXc6eCp7ImA9WhJTGEQ.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-1087905700450516133</id><published>2012-06-28T09:02:00.001-07:00</published><updated>2012-06-28T09:02:20.910-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-06-28T09:02:20.910-07:00</app:edited><title>Secularization in America: part four</title><content 
type="html">&lt;br /&gt;
&lt;h3&gt;
&lt;span style="background-color: white;"&gt;Summary so far&lt;/span&gt;&lt;/h3&gt;
&lt;span style="background-color: white;"&gt;In&lt;/span&gt;&lt;span style="background-color: white;"&gt;&amp;nbsp;&lt;/span&gt;&lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-one.html" style="background-color: white;"&gt;Part One&lt;/a&gt;&lt;span style="background-color: white;"&gt;&amp;nbsp;&lt;/span&gt;&lt;span style="background-color: white;"&gt;I described trends in market share of major religions in the U.S.: since 1988, the fraction of Protestants dropped from 60% to 51%, and&amp;nbsp;&lt;/span&gt;&lt;span style="background-color: white;"&gt;the fraction of people with no religious affiliation increased from 8% to 18%.&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
In&amp;nbsp;&lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-two.html"&gt;Part Two&lt;/a&gt;&amp;nbsp;I used data from the 1988 General Social Survey (GSS) to model transmission of religion from parent to child, and found that the model failed to predict the decrease in Protestants and the increase in Nones that occurred between 1988 and 2010.&lt;br /&gt;
&lt;br /&gt;
In &lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-three.html"&gt;Part Three&lt;/a&gt; I looked at changes, between 1988 and 2008, in the spouse tables (which describe the tendencies of people to marry within their religions), the environment table (which describes parents' decisions about their children's religious upbringing), and the transmission table (which describes the likely outcomes for children raised within each religion). &amp;nbsp;I found that the transmission table has changed substantially since 1988, and accounts for a large part of the observed increase in Nones, but not the decrease in Protestants.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Religiosity curves&lt;/h3&gt;
Respondents in the GSS are surveyed at different ages, so we can get a sense of when people lose their religion (or acquire one). &amp;nbsp;I collected all GSS respondents and partitioned them by the religion they were raised in and&amp;nbsp;&lt;span style="background-color: white;"&gt;the decade they were born. &amp;nbsp;For each of these subgroups, I plotted religiosity (the fraction with some religious preference) as a function of&amp;nbsp;&lt;/span&gt;&lt;span style="background-color: white;"&gt;age when surveyed.&lt;/span&gt;&lt;br /&gt;
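&lt;br /&gt;
Here is a minimal sketch of that partitioning, not the code behind the plots. &amp;nbsp;It assumes each respondent is a dictionary with hypothetical keys "year", "age", "raised" and "relig", and the sample records are made up purely to show the shape of the result.&lt;br /&gt;
&lt;pre&gt;
```python
from collections import defaultdict


def religiosity_curves(respondents, age_bin=10):
    """Map (raised, birth decade) to {age bin start: fraction religious}.

    The field names are hypothetical stand-ins, not GSS variable names.
    """
    counts = defaultdict(lambda: [0, 0])  # key maps to [religious, total]
    for r in respondents:
        birth_year = r["year"] - r["age"]
        decade = birth_year // 10 * 10
        age_start = r["age"] // age_bin * age_bin
        key = (r["raised"], decade, age_start)
        # "Religious" here means any preference other than none.
        counts[key][0] += int(r["relig"] != "none")
        counts[key][1] += 1

    curves = defaultdict(dict)
    for (raised, decade, age_start), (religious, total) in counts.items():
        curves[(raised, decade)][age_start] = religious / total
    return dict(curves)


# Made-up records, only to illustrate the input format.
sample = [
    {"year": 1990, "age": 35, "raised": "prot", "relig": "prot"},
    {"year": 1990, "age": 38, "raised": "prot", "relig": "none"},
    {"year": 2008, "age": 25, "raised": "prot", "relig": "none"},
    {"year": 2008, "age": 27, "raised": "cath", "relig": "cath"},
]
curves = religiosity_curves(sample)
print(curves)
```
&lt;/pre&gt;
Each entry of the result is one curve: the keys are the starting ages of the bins and the values are the fractions to plot.&lt;br /&gt;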
&lt;br /&gt;
Here are the curves for people raised Protestant:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-AsA8xW_WBe4/T-xyQ7I2XNI/AAAAAAAAA5E/pTRXIjqBRNM/s1600/gss.religiosity.prot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://1.bp.blogspot.com/-AsA8xW_WBe4/T-xyQ7I2XNI/AAAAAAAAA5E/pTRXIjqBRNM/s400/gss.religiosity.prot.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
In the top right, we see that people born between 1900 and 1910 and raised Protestant were likely to be religious when they were interviewed in their 70s and 80s. &amp;nbsp;In the lower left, we see that people born in the 1980s were less likely to be religious when they were interviewed in their 20s.&lt;br /&gt;
&lt;br /&gt;
For the middle generations, we have a better sense of changes in religiosity over a respondent's lifetime. &amp;nbsp; &amp;nbsp;Several of the curves have an apparent peak in middle age; if this apparent effect is real, the location of the peak may be shifting left.&lt;br /&gt;
&lt;br /&gt;
Overall, these curves are relatively flat, which suggests that respondents are not changing substantially after adulthood (everyone in the GSS is 18 or older).&lt;br /&gt;
&lt;br /&gt;
The curves for Catholics are similar:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-0IWdFAZcARE/T-xz7ADzoLI/AAAAAAAAA5M/ULA5ZPXpiwk/s1600/gss.religiosity.cath.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://3.bp.blogspot.com/-0IWdFAZcARE/T-xz7ADzoLI/AAAAAAAAA5M/ULA5ZPXpiwk/s400/gss.religiosity.cath.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
Again, there are substantial differences between generations, but within each generation, little change over the respondents' lifetimes. &amp;nbsp;People born in the 50s, 60s and 70s might be leaving the church as they age, but it is hard to tell in this plot whether these trends are statistically significant.&lt;br /&gt;
&lt;br /&gt;
Finally, here are the curves for people raised with no religion:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-FtO2Wog5iTs/T-x0pMYgWyI/AAAAAAAAA5U/cW8SxQPj02E/s1600/gss.religiosity.none.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://4.bp.blogspot.com/-FtO2Wog5iTs/T-x0pMYgWyI/AAAAAAAAA5U/cW8SxQPj02E/s400/gss.religiosity.none.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
There are only enough respondents in this category to plot curves for a few generations, and even then, the curves are noisy. &amp;nbsp;Not surprisingly, people raised without religion are less likely to be religious, and recent generations are less religious than their elders. &amp;nbsp;Again, the curves are generally flat, suggesting that people generally do not change religious affiliation as adults.&lt;br /&gt;
&lt;br /&gt;
A possible exception is that people born in the 1970s and raised without religion might be finding religion in their 30s. &amp;nbsp;But this data point is based on a small number of respondents, so it is probably too early to tell.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Why people switch&lt;/h3&gt;
In 1988 the GSS asked respondents questions about changes in religious affiliation and the reasons for the change. &amp;nbsp;Unfortunately, it looks like this data won't do me much good, because:&lt;br /&gt;
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;In many cases where a respondent switched from a religious preference to None, they were not asked why.&lt;/li&gt;
&lt;li&gt;There are so many inconsistencies in the data that I wonder whether it has been mangled.&lt;/li&gt;
&lt;li&gt;Because these questions were only asked once, we can't track trends.&lt;/li&gt;
&lt;/ol&gt;
&lt;div&gt;
So that's disappointing.&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Modeling a mixed-age cohort&lt;/h3&gt;
One of the challenges of working with GSS data is that the respondents each year are a mixture of people of all ages. &amp;nbsp;From year to year, the oldest generation drops out of the cohort and the youngest generation joins the mix.&lt;br /&gt;
&lt;br /&gt;
So when there is a trend from each generation to the next, as with religious behavior, there is a lag before the trend appears in a GSS time series, and the slope of the trend is much shallower.&lt;br /&gt;
&lt;br /&gt;
However, for purposes of prediction, this lag is actually useful. &amp;nbsp;&lt;span style="background-color: white;"&gt;For example, 18 years after a baby boom, there is likely to be a spike in college enrollment; that's not really a prediction about the future; it's just a consequence of something that has already happened.&lt;/span&gt;&lt;br /&gt;
&lt;span style="background-color: white;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style="background-color: white;"&gt;Similarly, we already know what most of the GSS cohort will look like next year. &amp;nbsp;It will look like the cohort this year, one year older. &amp;nbsp;The difference is that a few of the oldest respondents are replaced by the next group of 18-year-olds.&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
In the 2010 cohort, the age range is roughly 20-80. &amp;nbsp;To predict the 2020 cohort, we can:&lt;br /&gt;
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Remove respondents older than 80.&lt;/li&gt;
&lt;li&gt;Age the rest of the respondents by 10 years.&lt;/li&gt;
&lt;li&gt;Add a new batch of respondents in their 20s.&lt;/li&gt;
&lt;/ol&gt;
&lt;div&gt;
Step 2 might be hard if people were changing religious affiliation as they age, but as we saw above, they generally do not. &amp;nbsp;Step 3 is harder, but there are two reasonable options:&lt;/div&gt;
&lt;div&gt;
&lt;ol&gt;
&lt;li&gt;Conservatively, we can assume that the next generation will be like their immediate predecessors.&lt;/li&gt;
&lt;li&gt;Alternatively, we can extrapolate from current trends. &amp;nbsp;This option is probably better for prediction, but in some ways unsatisfying because it does not explain the cause of the trends, or why we should expect them to continue.&lt;/li&gt;
&lt;/ol&gt;
&lt;div&gt;
If we use this method to predict 20 years into the future, we replace about 25% of the cohort with simulated respondents. &amp;nbsp;But since 75% of the prediction is based on simple population aging, it is likely to be reliable.&lt;/div&gt;
&lt;/div&gt;
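&lt;div&gt;
The three steps above can be sketched roughly as follows. &amp;nbsp;This is one reading of the procedure, not the model itself: it ages everyone first and then drops anyone past the maximum age, it keeps the cohort size constant, and it takes the conservative option for Step 3 by drawing newcomers' affiliations from the youngest current respondents. &amp;nbsp;The function name, the "age" and "relig" keys, and the sample cohort are all hypothetical.&lt;/div&gt;
&lt;pre&gt;
```python
import random


def project_cohort(cohort, years=10, min_age=20, max_age=80, seed=0):
    """Project a mixed-age cohort forward by the given number of years.

    Assumes integer ages and that nobody changes affiliation while aging.
    """
    rng = random.Random(seed)

    # Age everyone, then keep those whose new age is at most max_age.
    aged = [dict(r, age=r["age"] + years) for r in cohort]
    kept = [r for r in aged if r["age"] in range(max_age + 1)]

    # Conservative option: newcomers resemble their immediate predecessors,
    # so draw affiliations from the youngest band of the current cohort.
    young = [r["relig"] for r in cohort
             if r["age"] in range(min_age, min_age + years)]
    n_new = len(cohort) - len(kept)  # assumption: constant cohort size
    if n_new and not young:
        raise ValueError("no young respondents to sample newcomers from")

    newcomers = [
        {"age": rng.randrange(min_age, min_age + years),
         "relig": rng.choice(young)}
        for _ in range(n_new)
    ]
    return kept + newcomers


# Made-up cohort, only to illustrate the mechanics.
cohort = [
    {"age": 25, "relig": "none"},
    {"age": 45, "relig": "prot"},
    {"age": 75, "relig": "prot"},
]
projected = project_cohort(cohort, years=10)
print(projected)
```
&lt;/pre&gt;
&lt;div&gt;
Extrapolating from current trends instead would only change how the newcomers' affiliations are drawn; the aging step stays the same.&lt;/div&gt;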
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;br /&gt;&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/ki_GXzcg9kI" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/1087905700450516133/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-four.html#comment-form" title="1 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/1087905700450516133?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/1087905700450516133?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/ki_GXzcg9kI/secularization-in-america-part-four.html" title="Secularization in America: part four" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/-AsA8xW_WBe4/T-xyQ7I2XNI/AAAAAAAAA5E/pTRXIjqBRNM/s72-c/gss.religiosity.prot.png" height="72" width="72" /><thr:total>1</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2012/06/secularization-in-america-part-four.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DEAHRHs-eSp7ImA9WhJTF04.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-8052625774468037067</id><published>2012-06-26T07:31:00.000-07:00</published><updated>2012-06-26T11:52:15.551-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-06-26T11:52:15.551-07:00</app:edited><title>The falling slinky problem</title><content type="html">&lt;div 
style="text-align: left;"&gt;
Let's take a break from statistics and do some physics!&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
My friend Ted Bunn recently wrote about the falling slinky problem &lt;a href="http://blog.richmond.edu/physicsbunn/2012/06/25/fun-for-a-girl-and-a-boy/"&gt;in his blog&lt;/a&gt;. &amp;nbsp;He points to &lt;a href="http://www.youtube.com/watch?list=UUHnyfMqiRRG1u-2MsSQLbXA&amp;amp;feature=player_embedded&amp;amp;v=uiyMuHuCFo4"&gt;this video&lt;/a&gt;, which shows a falling slinky in slow motion. &amp;nbsp;After the top of the slinky is released, the bottom seems to hover until the top reaches it. &amp;nbsp;The effect is particularly strange because if you look carefully, the top of the slinky does not accelerate as we expect for an object in free fall. &amp;nbsp;Rather, it falls at a constant rate.&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
Ted explains:&lt;/div&gt;
&lt;blockquote class="tr_bq" style="text-align: left;"&gt;
&lt;span style="background-color: white; line-height: 17px; text-align: justify;"&gt;&lt;span style="font-family: inherit;"&gt;&lt;i&gt;...the information that the top end has been dropped can’t propagate down the slinky any faster than the speed of sound in the slinky (i.e., the speed at which waves propagate down it), so there’s a delay before the bottom end “knows” it’s been dropped. But it’s surprising (at least to me) to see how long the delay is.&lt;/i&gt;&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="background-color: white; line-height: 17px; text-align: justify;"&gt;&lt;span style="font-family: inherit;"&gt;This explains why there is a delay, but to me it doesn't explain why the delay is the same as the time it takes for the top of the slinky to reach the bottom. &amp;nbsp;There are lots of models out there that explain parts of this behavior, but the ones I found are either &lt;a href="http://physics.umd.edu/lecdem/services/refs_scanned_WIP/3%20-%20Vinit's%20LECDEM/C462/3/GetPDFServlet.pdf"&gt;complicated&lt;/a&gt; or &lt;a href="http://tpt.aapt.org/resource/1/phteah/v39/i2/p90_s1"&gt;wrong&lt;/a&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="background-color: white; line-height: 17px; text-align: justify;"&gt;&lt;span style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;Here's my take on it. &amp;nbsp;First, let's assume that what we see in the video is correct: the slinky collapses from top to bottom, so that each coil doesn't move until the one above it comes down and (nearly) hits it.&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;Let's call the initial length L and the mass m. &amp;nbsp;After some time, a fraction of the slinky, x, has collapsed. &amp;nbsp;At that point, the collapsed part of the slinky has mass xm at height (1-x)L. &amp;nbsp;The rest of the slinky is spread uniformly [EDIT: this assumption is not right...see Ted's comment below] between height 0 and (1-x)L. &amp;nbsp;So the center of mass is&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;x(1-x)L + (1-x)(1-x)L/2&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="background-color: white; line-height: 17px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="background-color: white; line-height: 17px;"&gt;Since the slinky is in free fall, we know the center of mass as a function of time:&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;L/2 - (g/2) t^2&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;If we set those equal and type them into &lt;a href="http://www.wolframalpha.com/input/?i=x%281-x%29L+%2B+%281-x%29%281-x%29L%2F2+%3D+L%2F2+-+g%2F2+t%5E2"&gt;WolframAlpha&lt;/a&gt;, we get&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;x = sqrt(g/L) t&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;Which means that the top of the slinky is moving at constant speed. &amp;nbsp;&lt;/span&gt;&lt;span style="background-color: white; line-height: 17px;"&gt;Remember that x is the fraction of the slinky that collapsed; to get the distance traveled, we multiply by L:&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;d = xL = sqrt(gL) t&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;So the speed of the top of the slinky is sqrt(gL).&lt;/span&gt;&lt;/div&gt;
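&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;As a quick numerical check of that algebra, here is a small sketch. &amp;nbsp;It keeps the simplified uniform-density assumption from above, and the function names and the length L = 1 m are made up for illustration. &amp;nbsp;The left side of the equation simplifies to L(1 - x^2)/2, so setting it equal to the free-fall expression gives x^2 = g t^2 / L.&lt;/span&gt;&lt;/div&gt;
&lt;pre&gt;
```python
import math


def com_from_collapse(x, L):
    """Centre-of-mass height when a fraction x of the slinky has collapsed.

    Uses the simplified model: collapsed part at height (1 - x) L,
    the rest spread uniformly between 0 and (1 - x) L.
    """
    return x * (1 - x) * L + (1 - x) * (1 - x) * L / 2


def com_free_fall(t, L, g=9.8):
    """Centre-of-mass height of a body in free fall starting at L / 2."""
    return L / 2 - g / 2 * t ** 2


def collapsed_fraction(t, L, g=9.8):
    """The solution x = sqrt(g / L) t."""
    return math.sqrt(g / L) * t


L = 1.0   # hypothetical slinky length in metres
g = 9.8   # m/s^2
t_end = math.sqrt(L / g)  # time at which x reaches 1 in this model

# Compare both expressions for the centre of mass at a few times.
for i in range(5):
    t = t_end * i / 4
    x = collapsed_fraction(t, L, g)
    print(t, com_from_collapse(x, L), com_free_fall(t, L, g))

print("top speed sqrt(g L):", math.sqrt(g * L))
```
&lt;/pre&gt;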
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;We can get to the same result a different way by using the formula for wave speed in a &lt;a href="http://en.wikipedia.org/wiki/Vibrating_string"&gt;vibrating string&lt;/a&gt;:&amp;nbsp;sqrt(T/&lt;/span&gt;&lt;span style="line-height: 20px; text-align: -webkit-auto;"&gt;&lt;span style="font-family: inherit;"&gt;μ&lt;/span&gt;&lt;/span&gt;&lt;span style="line-height: 17px;"&gt;), where T is tension and&amp;nbsp;&lt;/span&gt;&lt;span style="line-height: 20px; text-align: -webkit-auto;"&gt;μ&lt;/span&gt;&lt;span style="line-height: 17px;"&gt;&amp;nbsp;is mass per unit length. &amp;nbsp;In this case T=mg and&amp;nbsp;&lt;/span&gt;&lt;span style="line-height: 20px; text-align: -webkit-auto;"&gt;μ&lt;/span&gt;&lt;span style="line-height: 17px;"&gt;=m/L. &amp;nbsp;Plug that in and get wave speed sqrt(gL).&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;I think this analysis is useful, but strictly speaking, I haven't really explained why the slinky behaves the way it does. &amp;nbsp;I have only shown that if the slinky collapses from top to bottom (as it appears to), then the top moves at a constant speed (as it appears to).&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="line-height: 17px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style="line-height: 17px;"&gt;[UPDATE: Provoked by my amateurish attempts at Physics, Ted Bunn wrote up &lt;a href="http://blog.richmond.edu/physicsbunn/2012/06/26/more-on-the-slinky/"&gt;a version of this model&lt;/a&gt; that deals correctly with the change in the density of the spring from top to bottom. &amp;nbsp;The result is that the speed of the top of the slinky is &lt;i&gt;almost&lt;/i&gt; constant -- it slows down a bit at the end. ]&lt;/span&gt;&lt;/div&gt;
&lt;div style="text-align: left;"&gt;
&lt;span style="background-color: white; font-family: 'Lucida Grande', Verdana, Arial, sans-serif; font-size: 13px; line-height: 17px; text-align: justify;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/ZheOwJUeBhE" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/8052625774468037067/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2012/06/falling-slinky-problem.html#comment-form" title="3 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/8052625774468037067?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/8052625774468037067?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/ZheOwJUeBhE/falling-slinky-problem.html" title="The falling slinky problem" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><thr:total>3</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2012/06/falling-slinky-problem.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DUYGRnk8fCp7ImA9WhJTGE8.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-4899731578432322082</id><published>2012-06-22T12:47:00.001-07:00</published><updated>2012-06-27T12:58:47.774-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-06-27T12:58:47.774-07:00</app:edited><title>Secularization in America: part three</title><content type="html">&lt;span style="background-color: white;"&gt;In&lt;/span&gt;&lt;span style="background-color: white;"&gt;&amp;nbsp;&lt;/span&gt;&lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-one.html" style="background-color: 
white;"&gt;Part One&lt;/a&gt;&lt;span style="background-color: white;"&gt;&amp;nbsp;&lt;/span&gt;&lt;span style="background-color: white;"&gt;I described trends in market share of major religions in the U.S.: since 1988, the fraction of Protestants dropped from 60% to 51%, and&amp;nbsp;&lt;/span&gt;&lt;span style="background-color: white;"&gt;the fraction of people with no religious affiliation increased from 8% to 18%.&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
In &lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-two.html"&gt;Part Two&lt;/a&gt; I used data from the 1988 General Social Survey (GSS) to model transmission of religion from parent to child, and found that the model failed to predict the decrease in Protestants and the increase in Nones that occurred between 1988 and 2010.&lt;br /&gt;
&lt;br /&gt;
I proposed several reasons the model might have failed:&lt;br /&gt;
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;The spouse tables are based on the parents of 1988 respondents. &amp;nbsp;People from later generations might be increasingly likely to marry outside their religion.&lt;/li&gt;
&lt;li&gt;The environment table is also based on the previous generation; again, later parents might be making different decisions about the religious environment of their children.&lt;/li&gt;
&lt;li&gt;The transmission table is based on 1988 respondents; it's possible that after 1988, children were less likely to adopt the religion they were raised in. &amp;nbsp;Anecdotally, the culprits most often blamed for this effect are college and the Internet.&lt;/li&gt;
&lt;li&gt;Finally, I have not considered adult conversions from one religious identity to another. &amp;nbsp;The GSS has data on these switches, so I could add them to the model.&lt;/li&gt;
&lt;/ol&gt;
&lt;div&gt;
I will investigate each possibility in turn, starting with the prevalence of mixed-religion marriages. &amp;nbsp;In &lt;a href="http://amzn.to/NHJBgb"&gt;Secularization&lt;/a&gt;, Steve Bruce presents results from a study of intermarriage in the UK that found that the rate of vertical transmission:&lt;/div&gt;
&lt;blockquote class="tr_bq"&gt;
&lt;i&gt;&amp;nbsp;"is halved if the parents are of different faiths (even when the differences are just Methodist-Anglican). &amp;nbsp;Even if the parents agree on which faith they wish to pass on, the fact of disagreement makes the child aware that there are good people in other churches and introduces the relativism that weakens conviction. [page 71]"&lt;/i&gt;&lt;/blockquote&gt;
&lt;div&gt;
So if the rate of mixed marriages is increasing, that could contribute to the increasing number of Nones.&amp;nbsp;&lt;/div&gt;
&lt;div&gt;
To measure this effect, I used these GSS variables:&lt;/div&gt;
&lt;div&gt;
&lt;ul&gt;
&lt;li&gt;RELIG: What is your religious preference?&lt;/li&gt;
&lt;li&gt;SPREL:&amp;nbsp;What is your husband's/wife's religious preference?&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;
&lt;span style="background-color: white;"&gt;In cases where one partner converts to the other's religion before marriage, that would count (for this model) as a same-religion marriage, since we are interested in the decision the couple makes about the religious environment they raise children in.&lt;/span&gt;&lt;br /&gt;
&lt;h3&gt;
&lt;span style="background-color: white;"&gt;The Spouse Tables&lt;/span&gt;&lt;/h3&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
The following graph shows the fraction of same-religion marriages over the history of the survey (data for SPREL were not collected every year):&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-coVNbwSUuIA/T-TJod4EohI/AAAAAAAAA4Y/97dnhYsAWQc/s1600/gss.spouse.series.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://4.bp.blogspot.com/-coVNbwSUuIA/T-TJod4EohI/AAAAAAAAA4Y/97dnhYsAWQc/s400/gss.spouse.series.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Before 1988, the fraction of same-religion marriages was around 84%; after 1988 it fell to 78%. &amp;nbsp;The abruptness of the change makes me worry that it may be an artifact; for example, a change in the wording of the question. &amp;nbsp;Also, t&lt;span style="background-color: white;"&gt;hese results only include respondents who are married, so they are biased toward older people and socio-economic groups that are more likely to be married.&lt;/span&gt;&lt;/div&gt;
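&lt;div&gt;
For reference, the same-religion fraction per survey year can be computed along these lines. &amp;nbsp;This is a minimal sketch rather than the code that produced the graph: RELIG and SPREL are the variables listed above, but the "YEAR" key, the string category codes, and the sample rows are assumptions made for illustration.&lt;/div&gt;
&lt;pre&gt;
```python
from collections import defaultdict


def same_religion_fraction(respondents):
    """Return {survey year: fraction of married respondents whose
    spouse's preference (SPREL) matches their own (RELIG)}.

    Rows with no SPREL value (unmarried, or not asked that year) are skipped.
    """
    tallies = defaultdict(lambda: [0, 0])  # year maps to [same, total]
    for r in respondents:
        if r.get("SPREL") is None:
            continue
        tallies[r["YEAR"]][0] += int(r["RELIG"] == r["SPREL"])
        tallies[r["YEAR"]][1] += 1
    return {year: same / total
            for year, (same, total) in sorted(tallies.items())}


# Made-up rows, only to show the expected input shape.
rows = [
    {"YEAR": 1988, "RELIG": "prot", "SPREL": "prot"},
    {"YEAR": 1988, "RELIG": "cath", "SPREL": "prot"},
    {"YEAR": 1988, "RELIG": "none", "SPREL": None},
    {"YEAR": 2008, "RELIG": "prot", "SPREL": "prot"},
]
fractions = same_religion_fraction(rows)
print(fractions)
```
&lt;/pre&gt;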
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
But as it turns out, even if we take the data at face value, it has a small effect on the model's predictions.&lt;/div&gt;
&lt;div&gt;
I used the respondents from 2004-2010 to build spouse tables for men and women (see &lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-two.html"&gt;Part Two&lt;/a&gt;), then ran the 1988 model again with the anachronistic data. &amp;nbsp;The results are almost identical to what we saw last time:&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-Xpi3TCOlNOo/T-TEnztIzvI/AAAAAAAAA4M/SVu-rkyP5tI/s1600/gss.pred.1988-2010.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://1.bp.blogspot.com/-Xpi3TCOlNOo/T-TEnztIzvI/AAAAAAAAA4M/SVu-rkyP5tI/s400/gss.pred.1988-2010.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The only noticeable effect is that the prediction for Other got worse. &amp;nbsp;&lt;span style="background-color: white;"&gt;I conclude:&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;ol&gt;
&lt;li&gt;It's possible that people are more likely now to marry outside their religion than in 1988, but the difference is small, and&lt;/li&gt;
&lt;li&gt;Even if we cheat by using the 2004-2010 data in 1988, this change does not explain the subsequent changes in the fractions of Protestants and Nones.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;

The Environment Table&lt;/h3&gt;
&lt;/div&gt;
&lt;div&gt;
It seems unlikely that parents now are making different decisions about what religious environment to raise their children in, but just to rule it out, I compared the environment tables for 1988 and 2008.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;span class="Apple-tab-span" style="background-color: white; font-family: 'Courier New', Courier, monospace; font-size: x-small; white-space: pre;"&gt;  &lt;/span&gt;&lt;span style="background-color: white; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;prot &lt;/span&gt;&lt;span class="Apple-tab-span" style="background-color: white; font-family: 'Courier New', Courier, monospace; font-size: x-small; white-space: pre;"&gt; &lt;/span&gt;&lt;span style="background-color: white; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;cath &lt;/span&gt;&lt;span class="Apple-tab-span" style="background-color: white; font-family: 'Courier New', Courier, monospace; font-size: x-small; white-space: pre;"&gt; &lt;/span&gt;&lt;span style="background-color: white; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;jew &lt;/span&gt;&lt;span class="Apple-tab-span" style="background-color: white; font-family: 'Courier New', Courier, monospace; font-size: x-small; white-space: pre;"&gt; &lt;/span&gt;&lt;span style="background-color: white; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;other &lt;/span&gt;&lt;span class="Apple-tab-span" style="background-color: white; font-family: 'Courier New', Courier, monospace; font-size: x-small; white-space: pre;"&gt; &lt;/span&gt;&lt;span style="background-color: white; font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;none &amp;nbsp; &amp;nbsp;change &amp;nbsp;N &amp;nbsp; &amp;nbsp; &amp;nbsp; excess&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;prot-prot&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;97&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;1&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;2&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+1&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;634&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;8.7&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;prot-cath&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;43&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;46&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;1&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;10&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+4&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;49&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;1.9&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;prot- jew&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;1&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.0&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;prot-othe&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;44&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;21&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;36&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;5&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.0&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;prot-none&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;85&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;4&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;11&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+5&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;90&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;4.2&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;cath-prot&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;30&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;56&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;14&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+14&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;37&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;5.1&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;cath-cath&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;1&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;98&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;1&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;294&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.8&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;cath- jew&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.0&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;cath-othe&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;2&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.0&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;cath-none&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;2&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;82&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;16&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+7&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;20&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;1.5&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;jew-prot&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;1&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.0&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;jew-cath&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;100&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.0&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;jew- jew&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;96&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;4&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;27&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.0&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;jew-othe&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.0&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;jew-none&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;67&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;33&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+33&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.0&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;othe-prot&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;27&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;54&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;18&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+18&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;4&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.7&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;othe-cath&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;43&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;57&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;3&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.0&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;othe- jew&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.0&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;othe-othe&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;7&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;84&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;9&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+3&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;28&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.7&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;othe-none&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;25&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;25&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;25&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;25&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+25&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;4&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;1.0&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;none-prot&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;83&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;17&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+17&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;6&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;1.0&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;none-cath&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;14&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;68&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;18&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;-82&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;2&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;-1.6&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;none- jew&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;100&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.0&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;none-othe&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;-100&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;1&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;-1.0&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;none-none&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;23&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;4&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;74&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+18&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;34&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;6.1&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The left column gives the parents' religions, mother's first and father's second. &amp;nbsp;The next five columns show the religious environments those parents chose, as reported by their children in 2008. &amp;nbsp;For example, the second row shows that if the mother is Protestant and the father Catholic, 43% of the children were raised Protestant, 46% Catholic, and 10% None.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The next column shows the change in the None column, in percentage points, since the 1988 survey. &amp;nbsp;N is the number of families in 1988 that fell into each category. &amp;nbsp;Finally, Nones is the product of change and N, an estimate of the number of additional Nones in the 1988 survey that could be explained by changes in the environment table. &amp;nbsp;The total of this column is 29, which is not nearly enough to explain the actual excess of 177.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Of course, most of the numbers in the change column are based on small samples, so we should not take them too seriously. &amp;nbsp;By running simulations with resampled survey data, we can take account of these sample sizes.&lt;/div&gt;
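&lt;div&gt;
As a rough illustration of the resampling idea, here is a minimal Python sketch: draw respondents with replacement, recompute the fraction of Nones each time, and look at the spread. The six-family sample and the function name resample_fraction are made up for the example; this is not necessarily the code behind the simulations described here.&lt;/div&gt;

```python
import random

def resample_fraction(sample, target, iters=1000, seed=17):
    """Bootstrap the fraction of `target` values in `sample`.

    Returns a sorted list of `iters` resampled fractions.
    """
    rng = random.Random(seed)
    n = len(sample)
    fractions = []
    for _ in range(iters):
        # Draw n respondents with replacement from the original sample.
        draw = [rng.choice(sample) for _ in range(n)]
        fractions.append(draw.count(target) / n)
    return sorted(fractions)

# Hypothetical row with only 6 families, one of which raised a None.
sample = ["prot"] * 5 + ["none"]
fractions = resample_fraction(sample, "none")

# Middle 90% of the resampled fractions: a rough interval for the estimate.
low = fractions[int(0.05 * len(fractions))]
high = fractions[int(0.95 * len(fractions))]
print(low, high)
```

&lt;div&gt;
With so few families in the row, the interval is wide, which is why a large change in a small row should not be taken at face value.&lt;/div&gt;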
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Using the tables from 1988 to predict the fraction of Nones in 2008, we expect only 8.0% (compared to the actual 16.8%). &amp;nbsp;If we use the environment table from 2008, the prediction goes up to 8.5%. &amp;nbsp;If we also use the spouse table, it goes up to 8.7%. &amp;nbsp;So clearly the changes in these tables were not enough to explain the observed changes.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;
&lt;h3&gt;
The transmission table&lt;/h3&gt;
The transmission table is a cross-tabulation of the religion the respondent was brought up in and the religion reported when surveyed. &amp;nbsp;It shows the outcome, after some years, of parents' decisions about their children's religious upbringing and the effect of the environment.&lt;br /&gt;
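&lt;br /&gt;
A cross-tabulation like this can be sketched in a few lines of Python. This is only an illustration: the handful of (raised, current) pairs is invented, and transmission_table is a hypothetical helper, not the code that produced the numbers in this post.&lt;br /&gt;

```python
from collections import Counter, defaultdict

def transmission_table(pairs):
    """Map each upbringing to the percentage of respondents in each outcome.

    `pairs` is an iterable of (raised, current) religion codes.
    """
    counts = defaultdict(Counter)
    for raised, current in pairs:
        counts[raised][current] += 1

    table = {}
    for raised, row in counts.items():
        total = sum(row.values())
        # Normalize each row so its entries are percents that sum to 100.
        table[raised] = {cur: 100 * n / total for cur, n in row.items()}
    return table

# Invented respondents: (religion raised in, religion reported when surveyed).
pairs = [
    ("prot", "prot"), ("prot", "prot"), ("prot", "prot"), ("prot", "none"),
    ("cath", "cath"), ("cath", "prot"),
    ("none", "none"), ("none", "prot"),
]
table = transmission_table(pairs)
print(table["prot"])
```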
&lt;br /&gt;
The following is the transmission table for 2008, with changes since 1988:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;prot &lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;cath &lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;jew &lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;other &lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;none &lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;change &amp;nbsp;N &amp;nbsp; &amp;nbsp; &amp;nbsp; excess&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;prot &lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;82&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;3&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;1&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;2&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;13&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+7&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;951&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;67.7&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;cath &lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;17&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;70&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;2&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;12&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+6&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;414&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;24.6&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;jew &lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;9&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;2&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;73&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;1&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;14&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+9&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;31&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;2.7&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;other &lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;12&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;2&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;75&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;10&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+1&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;31&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0.4&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;none &lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;31&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;5&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;1&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;2&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;62&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;+5&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;53&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;2.6&lt;/span&gt;&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Each row corresponds to a religious upbringing; each column shows a possible outcome. &amp;nbsp;For example, the first row shows that of children raised Protestant, 82% report that their religious preference is Protestant, and 13% report None. &amp;nbsp;The fraction of Nones has increased 7 percentage points since 1988. &amp;nbsp;Since there are 951 people in this row, this increase accounts for 68 excess Nones in the 2008 survey.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Overall, the changes in the transmission table account for 98 excess Nones, which is a little more than half of the observed increase.&lt;/div&gt;
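&lt;div&gt;
To make the arithmetic concrete, here is a small Python check using the values printed in the table above. Because the change column is rounded to whole percentage points, the products only approximate the excess column; the variable names are chosen just for this example.&lt;/div&gt;

```python
# (change in percentage points, N, published excess) from the table above.
rows = {
    "prot":  (7, 951, 67.7),
    "cath":  (6, 414, 24.6),
    "jew":   (9, 31, 2.7),
    "other": (1, 31, 0.4),
    "none":  (5, 53, 2.6),
}

# Excess Nones per row: change (as a fraction) times the number of people.
approx = {name: change / 100 * n for name, (change, n, _) in rows.items()}

# Total of the published excess column, and its share of the 177 observed.
total = sum(excess for _, _, excess in rows.values())
share = total / 177

print(round(approx["prot"], 1))  # rounded change gives about 66.6, not 67.7
print(round(total, 1), round(share, 2))
```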
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
If we run the simulations again, applying the 2008 transmission table to the 1988 data, we get the following predictions:&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-JYT2nPBz8_Y/T-thR-9xJ5I/AAAAAAAAA44/REb-WnTm2hY/s1600/gss.pred.1988-2010.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://4.bp.blogspot.com/-JYT2nPBz8_Y/T-thR-9xJ5I/AAAAAAAAA44/REb-WnTm2hY/s400/gss.pred.1988-2010.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div&gt;
The prediction for Nones is better, but it's clear that this model still misses the mark: it predicts that the fraction of Catholics should be going down, and fails to predict the decrease in the fraction of Protestants.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The problem is that I am treating everyone interviewed in 1988 as a single cohort, but they represent people of all ages, who were raised in different environments. &amp;nbsp;Also, I am using data from 2008 to predict what will happen in 2008, so I have drifted from the original goal, which was to see whether the changes that occurred between 1988 and 2008 could have been predicted in 1988.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
However, this model has given me some leads. &amp;nbsp;It looks like a large part of the increase in Nones is due to changes in the transmission table, possibly a small part due to the environment table, and little or none due to the spouse tables.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Next time I will present a different model that reorganizes respondents into cohorts by year of birth, which will make it possible to compare people raised over the same time span. &amp;nbsp;It will also allow me to look for trends that began prior to 1988.&lt;/div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/aNE6t1-shiY" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/4899731578432322082/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-three.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/4899731578432322082?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/4899731578432322082?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/aNE6t1-shiY/secularization-in-america-part-three.html" title="Secularization in America: part three" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://4.bp.blogspot.com/-coVNbwSUuIA/T-TJod4EohI/AAAAAAAAA4Y/97dnhYsAWQc/s72-c/gss.spouse.series.png" height="72" width="72" /><thr:total>0</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2012/06/secularization-in-america-part-three.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DEYGQXYyeyp7ImA9WhJTE00.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-171064178008558050</id><published>2012-06-21T12:14:00.001-07:00</published><updated>2012-06-21T12:15:20.893-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-06-21T12:15:20.893-07:00</app:edited><title>Secularization in America: part two</title><content 
type="html">In &lt;a href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-one.html"&gt;Part One&lt;/a&gt; I described some trends in market share of the major religions in the U.S.; in particular, since 1988, the fraction of Protestants dropped from 60% to 51%, and&amp;nbsp;&lt;span style="background-color: white;"&gt;the fraction of people with no religious affiliation increased from 8% to 18%.&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
I would like to know if something happened after 1988 to cause these changes, or if they could have been predicted based on patterns occurring before 1988. &amp;nbsp;As a first step, I will use data from 1988 to model vertical transmission (from parent to child) and see if it predicts the observed changes.&lt;br /&gt;
&lt;br /&gt;
My model of vertical transmission works like this:&lt;br /&gt;
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;Each respondent chooses a spouse,&lt;/li&gt;
&lt;li&gt;Each pair decides what religion to bring their children up in,&lt;/li&gt;
&lt;li&gt;Each child chooses a religion.&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;
I model each step of this process using data from&lt;span style="background-color: white;"&gt;&amp;nbsp;the General Social Survey (GSS); specifically, I use these variables:&lt;/span&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span style="background-color: white;"&gt;RELIG: What is your religious preference?&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style="background-color: white;"&gt;RELIG16: In what religion were you raised?&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;MARELIG:&amp;nbsp;&lt;span style="background-color: white;"&gt;What was your mother's religious preference when you were&amp;nbsp;&lt;/span&gt;&lt;span style="background-color: white;"&gt;growing up?&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;PARELIG: &lt;span style="background-color: white;"&gt;What was your father's religious preference when you were&amp;nbsp;&lt;/span&gt;&lt;span style="background-color: white;"&gt;growing up?&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;
The first two questions were asked every year, but questions about parents' religion were only asked in 1988 and 2008. &amp;nbsp;I will use the data from 1988 to build and validate models, then use the data from 2008 to make predictions.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;
I used MARELIG and PARELIG to build two "Spouse tables", one for men and one for women. &amp;nbsp;Here is the table for men:&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;h3&gt;

Spouse Table (men)&lt;/h3&gt;
&lt;table border="1" cellpadding="4" style="border-collapse: collapse; border: 1px solid #000000;"&gt;
 &lt;tbody&gt;
&lt;tr&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;prot&lt;/th&gt;
  &lt;th&gt;cath&lt;/th&gt;
  &lt;th&gt;jew&lt;/th&gt;
  &lt;th&gt;other&lt;/th&gt;
  &lt;th&gt;none&lt;/th&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;prot&lt;/td&gt;
  &lt;td&gt;93&lt;/td&gt;
  &lt;td&gt;6&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;1&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;cath&lt;/td&gt;
  &lt;td&gt;14&lt;/td&gt;
  &lt;td&gt;85&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;1&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;jew&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;96&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;other&lt;/td&gt;
  &lt;td&gt;6&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;90&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;none&lt;/td&gt;
  &lt;td&gt;59&lt;/td&gt;
  &lt;td&gt;13&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;3&lt;/td&gt;
  &lt;td&gt;24&lt;/td&gt;
 &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;h3&gt;

&lt;span style="font-size: small; font-weight: normal;"&gt;Each row indicates the religion of a male respondent; each column is the religion of a possible spouse; the numbers are percents. &amp;nbsp;For example, the first row indicates that 93% of male Protestants married other Protestants, and another 6% married Catholics.&lt;/span&gt;&lt;/h3&gt;
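&lt;div&gt;
Here is one way the spouse step could be simulated in Python, using the percentages from the table above as sampling weights. It is a sketch under stated assumptions, not necessarily the code used to generate the results: the names RELIGIONS, SPOUSE_MEN and choose_spouse are invented for the example.&lt;/div&gt;

```python
import random

RELIGIONS = ["prot", "cath", "jew", "other", "none"]

# Rows: religion of a male respondent; values: percent of spouses
# in each religion, copied from the table above.
SPOUSE_MEN = {
    "prot":  [93, 6, 0, 0, 1],
    "cath":  [14, 85, 0, 0, 1],
    "jew":   [4, 0, 96, 0, 0],
    "other": [6, 4, 0, 90, 0],
    "none":  [59, 13, 0, 3, 24],
}

def choose_spouse(religion, table, rng):
    """Draw a spouse's religion with probability proportional to the row."""
    # random.choices normalizes the weights, so rows that sum to 99
    # because of rounding need no special handling.
    return rng.choices(RELIGIONS, weights=table[religion], k=1)[0]

rng = random.Random(1)
spouses = [choose_spouse("none", SPOUSE_MEN, rng) for _ in range(1000)]
print(spouses.count("none") / len(spouses))  # should be near 0.24
```

&lt;div&gt;
The same function would work for the women's table by passing a different dictionary.&lt;/div&gt;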
&lt;div&gt;
&lt;span style="font-size: small; font-weight: normal;"&gt;Here is the spouse table for women:&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-size: small; font-weight: normal;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;h3&gt;

Spouse Table (women)&lt;/h3&gt;
&lt;table border="1" cellpadding="4" style="border-collapse: collapse; border: 1px solid #000000;"&gt;
 &lt;tbody&gt;
&lt;tr&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;prot&lt;/th&gt;
  &lt;th&gt;cath&lt;/th&gt;
  &lt;th&gt;jew&lt;/th&gt;
  &lt;th&gt;other&lt;/th&gt;
  &lt;th&gt;none&lt;/th&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;prot&lt;/td&gt;
  &lt;td&gt;82&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;12&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;cath&lt;/td&gt;
  &lt;td&gt;10&lt;/td&gt;
  &lt;td&gt;85&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;5&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;jew&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;96&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;other&lt;/td&gt;
  &lt;td&gt;8&lt;/td&gt;
  &lt;td&gt;3&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;74&lt;/td&gt;
  &lt;td&gt;15&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;none&lt;/td&gt;
  &lt;td&gt;13&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;80&lt;/td&gt;
 &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;h3&gt;

&lt;span style="font-size: small;"&gt;&lt;span style="font-weight: normal;"&gt;In general, women are more likely to marry out of their religion than men, but still the great majority marry a co-religionist. &amp;nbsp;One asymmetry is apparent: men with no religion seldom marry another None (24%), but women with no religion usually do (80%). &amp;nbsp;This effect is partly due to the gender gap: 11% of male respondents are Nones, but only 5% of the women are (there is a similar, possibly smaller, gender gap in the &lt;a href="http://www.secularhumanism.org/index.php?section=library&amp;amp;page=downey_27_5"&gt;CIRP data&lt;/a&gt;).&lt;/span&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;div&gt;
Once the respondents have paired up, they decide what religion to raise the children in. &amp;nbsp;The following table shows results from the 1988 data. &amp;nbsp;The rows enumerate all pairs of mother's and father's religion; the columns indicate the religious environment they chose. &amp;nbsp;For example, the second row indicates that if a Protestant woman marries a Catholic man, they raise the children Protestant 58% of the time, Catholic 36% of the time, and None 6%.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;h3&gt;

Environment table&lt;/h3&gt;
&lt;table border="1" cellpadding="4" style="border-collapse: collapse; border: 1px solid #000000;"&gt;
 &lt;tbody&gt;
&lt;tr&gt;
  &lt;th&gt;parents&lt;/th&gt;
  &lt;th&gt;prot&lt;/th&gt;
  &lt;th&gt;cath&lt;/th&gt;
  &lt;th&gt;jew&lt;/th&gt;
  &lt;th&gt;other&lt;/th&gt;
  &lt;th&gt;none&lt;/th&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;prot-prot&lt;/td&gt;
  &lt;td&gt;99&lt;/td&gt;
  &lt;td&gt;1&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;1&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;prot-cath&lt;/td&gt;
  &lt;td&gt;58&lt;/td&gt;
  &lt;td&gt;36&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;6&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;prot-jew&lt;/td&gt;
  &lt;td&gt;100&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;prot-other&lt;/td&gt;
  &lt;td&gt;100&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;prot-none&lt;/td&gt;
  &lt;td&gt;89&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;1&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;cath-prot&lt;/td&gt;
  &lt;td&gt;39&lt;/td&gt;
  &lt;td&gt;61&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;cath-cath&lt;/td&gt;
  &lt;td&gt;1&lt;/td&gt;
  &lt;td&gt;99&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;cath-jew&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;cath-other&lt;/td&gt;
  &lt;td&gt;100&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;cath-none&lt;/td&gt;
  &lt;td&gt;17&lt;/td&gt;
  &lt;td&gt;69&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;6&lt;/td&gt;
  &lt;td&gt;8&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;jew-prot&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;100&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;jew-cath&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;jew-jew&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;96&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;jew-other&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;jew-none&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;other-prot&lt;/td&gt;
  &lt;td&gt;60&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;40&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;other-cath&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;100&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;other-jew&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;other-other&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;2&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;89&lt;/td&gt;
  &lt;td&gt;2&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;other-none&lt;/td&gt;
  &lt;td&gt;33&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;67&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;none-prot&lt;/td&gt;
  &lt;td&gt;100&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;none-cath&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;100&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;none-jew&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;none-other&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;none-none&lt;/td&gt;
  &lt;td&gt;40&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;56&lt;/td&gt;
 &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;span style="font-size: small;"&gt;&lt;span style="font-weight: normal;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-size: small;"&gt;&lt;span style="font-weight: normal;"&gt;One surprise in this table is the last row: when two people with no religion marry, 40% of the time they apparently choose to raise their children Protestant. &amp;nbsp;This seems unlikely, but there are several possible explanations: (1) the parents might have chosen to raise their children in the prevalent religion of their community, (2) a respondent might not have been raised by his parents,&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: white;"&gt;(3) a respondent might not be reporting his parents' religion accurately&lt;/span&gt;&lt;span style="background-color: white; font-size: small;"&gt;. &amp;nbsp;F&lt;/span&gt;&lt;span style="background-color: white;"&gt;or purposes of modeling I take these responses at face value.&lt;/span&gt;&lt;br /&gt;
&lt;span style="background-color: white;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;
Children raised with a religion usually adopt that religion, but not always. &amp;nbsp;The following "transition table" shows possible outcomes for each religious environment. &amp;nbsp;For example, 89% of respondents who say they were raised Protestant also report that their religious preference is Protestant, but 3% are Catholic and 6% have no religious preference. &amp;nbsp;More people convert from Catholic to Protestant than the other way around.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;

Transition table&lt;/h3&gt;
&lt;table border="1" cellpadding="4" style="border-collapse: collapse; border: 1px solid #000000;"&gt;
 &lt;tbody&gt;
&lt;tr&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;prot&lt;/th&gt;
  &lt;th&gt;cath&lt;/th&gt;
  &lt;th&gt;jew&lt;/th&gt;
  &lt;th&gt;other&lt;/th&gt;
  &lt;th&gt;none&lt;/th&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;prot&lt;/td&gt;
  &lt;td&gt;89&lt;/td&gt;
  &lt;td&gt;3&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;1&lt;/td&gt;
  &lt;td&gt;6&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;cath&lt;/td&gt;
  &lt;td&gt;11&lt;/td&gt;
  &lt;td&gt;83&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;6&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;jew&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;95&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;5&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;other&lt;/td&gt;
  &lt;td&gt;5&lt;/td&gt;
  &lt;td&gt;3&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;83&lt;/td&gt;
  &lt;td&gt;9&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;none&lt;/td&gt;
  &lt;td&gt;32&lt;/td&gt;
  &lt;td&gt;11&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;57&lt;/td&gt;
 &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;
As expected, the majority of people raised with no religion report no religious preference, but 32% of them identify as Protestant and 11% identify as Catholic. &amp;nbsp;I found that surprising. &amp;nbsp;I will look more closely later, but for now, again, I will take it at face value.&lt;br /&gt;
&lt;br /&gt;
Finally, we can combine these results into a single "Generation table" that shows the transitions from one generation to the next. &amp;nbsp;I ran simulations with the following steps.&lt;br /&gt;
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;For each respondent, choose a spouse's religion from the Spouse Table.&lt;/li&gt;
&lt;li&gt;For each parent pair, choose a religious environment from the Environment Table.&lt;/li&gt;
&lt;li&gt;For each hypothetical child, choose a religious identity from the Transition Table.&lt;/li&gt;
&lt;li&gt;For each parent-child pair, make an entry in the Generation Table, below.&lt;span style="background-color: white;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
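&lt;br /&gt;
The steps above can be sketched in a few lines of Python. &amp;nbsp;This is a minimal illustration, not the code used for the results in this post: it copies only the table rows needed for a Catholic mother, and the names (draw, simulate_child, generation_row) are made up for the example.&lt;br /&gt;
&lt;pre&gt;
```python
import random
from collections import Counter

RELIGIONS = ["prot", "cath", "jew", "other", "none"]

# Percentages copied from the tables in this post.
# Only the rows needed for a Catholic mother are included.
SPOUSE = {"cath": [10, 85, 0, 0, 5]}   # weights for the husband's religion
ENVIRONMENT = {                         # key: (mother, father)
    ("cath", "prot"): [39, 61, 0, 0, 0],
    ("cath", "cath"): [1, 99, 0, 0, 0],
    ("cath", "none"): [17, 69, 0, 6, 8],
}
TRANSITION = {                          # key: religious environment
    "prot": [89, 3, 0, 1, 6],
    "cath": [11, 83, 0, 0, 6],
    "jew": [0, 0, 95, 0, 5],
    "other": [5, 3, 0, 83, 9],
    "none": [32, 11, 0, 0, 57],
}


def draw(weights, rng):
    """Pick one religion with probability proportional to weights."""
    return rng.choices(RELIGIONS, weights=weights, k=1)[0]


def simulate_child(mother, rng):
    """Spouse, then environment, then the child's adult identity."""
    father = draw(SPOUSE[mother], rng)
    env = draw(ENVIRONMENT[(mother, father)], rng)
    return draw(TRANSITION[env], rng)


def generation_row(mother, n, seed=None):
    """Percentage of simulated children in each group, for one mother's religion."""
    rng = random.Random(seed)
    counts = Counter(simulate_child(mother, rng) for _ in range(n))
    return {r: 100 * counts[r] / n for r in RELIGIONS}


row = generation_row("cath", n=10000, seed=1)
```
&lt;/pre&gt;
With these rows, the expected share of Catholic children of Catholic mothers works out to about 78%. &amp;nbsp;The Generation table in this post combines respondents of both sexes, so its numbers need not match this one-row sketch.&lt;br /&gt;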
&lt;br /&gt;
Since this computation is based on random simulations, it varies from run to run, but here is a typical outcome:&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;

&lt;span style="background-color: white;"&gt;Generation table&lt;/span&gt;&lt;/h3&gt;
&lt;table border="1" cellpadding="4" style="border-collapse: collapse; border: 1px solid #000000;"&gt;
 &lt;tbody&gt;
&lt;tr&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;prot&lt;/th&gt;
  &lt;th&gt;cath&lt;/th&gt;
  &lt;th&gt;jew&lt;/th&gt;
  &lt;th&gt;other&lt;/th&gt;
  &lt;th&gt;none&lt;/th&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;prot&lt;/td&gt;
  &lt;td&gt;86&lt;/td&gt;
  &lt;td&gt;6&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;1&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;cath&lt;/td&gt;
  &lt;td&gt;19&lt;/td&gt;
  &lt;td&gt;72&lt;/td&gt;
  &lt;td&gt;1&lt;/td&gt;
  &lt;td&gt;1&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;jew&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;95&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;5&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;other&lt;/td&gt;
  &lt;td&gt;29&lt;/td&gt;
  &lt;td&gt;10&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;55&lt;/td&gt;
  &lt;td&gt;6&lt;/td&gt;
 &lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;none&lt;/td&gt;
  &lt;td&gt;67&lt;/td&gt;
  &lt;td&gt;9&lt;/td&gt;
  &lt;td&gt;0&lt;/td&gt;
  &lt;td&gt;1&lt;/td&gt;
  &lt;td&gt;23&lt;/td&gt;
 &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
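&lt;div&gt;
One way to read the Generation table is as a transition matrix: each row says how the children of parents in one group are distributed across groups. &amp;nbsp;The sketch below applies it once to a set of starting shares. &amp;nbsp;The starting shares are placeholders chosen for illustration, not GSS figures, and the function name next_generation is an assumption of this example; it also assumes every group has the same number of children.&lt;/div&gt;
&lt;pre&gt;
```python
RELIGIONS = ["prot", "cath", "jew", "other", "none"]

# Generation table from this post: row = parent's religion,
# column = child's religion, in percent.
GENERATION = {
    "prot": [86, 6, 0, 1, 7],
    "cath": [19, 72, 1, 1, 7],
    "jew": [0, 0, 95, 0, 5],
    "other": [29, 10, 0, 55, 6],
    "none": [67, 9, 0, 1, 23],
}


def next_generation(shares):
    """Project percentage shares forward by one generation."""
    projected = dict.fromkeys(RELIGIONS, 0.0)
    for parent, share in shares.items():
        for child, pct in zip(RELIGIONS, GENERATION[parent]):
            projected[child] += share * pct / 100
    return projected


# Placeholder starting shares (NOT actual GSS figures), summing to 100.
start = {"prot": 60, "cath": 25, "jew": 2, "other": 3, "none": 10}
projected = next_generation(start)
```
&lt;/pre&gt;
&lt;div&gt;
Because every row of the table sums to 100, the projected shares also sum to 100.&lt;/div&gt;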
&lt;div&gt;
&lt;br /&gt;
Assuming that a generation time is about 22 years, we can use this model to predict the distribution of religions in 2010 (using only data from 1988). &amp;nbsp;This figure shows the actual time series and the model predictions for each group:&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-yxWR2MbzE6g/T-Nr3aDxhwI/AAAAAAAAA34/A5XXqBMG7Co/s1600/gss.pred.1988-2010.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://4.bp.blogspot.com/-yxWR2MbzE6g/T-Nr3aDxhwI/AAAAAAAAA34/A5XXqBMG7Co/s400/gss.pred.1988-2010.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
On the right side of the plot, the vertical lines show the 90% confidence interval; the boxes show the mean of 20 simulation runs. &amp;nbsp;[One technical note: each simulation is based on tables from resampled survey data, so the confidence intervals reflect both the sampling error of the survey and random variation of the simulations.]&lt;/div&gt;
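&lt;div&gt;
To illustrate the resampling idea, here is a minimal sketch using a made-up survey and a simple statistic (the percentage of Nones) in place of the full model. &amp;nbsp;It assumes the interval is taken from the 5th and 95th percentiles of the runs, which is one common choice and not necessarily how the figure was produced; the data and function names are hypothetical.&lt;/div&gt;
&lt;pre&gt;
```python
import random
import statistics


def resample(sample, rng):
    """Bootstrap: draw len(sample) items from sample with replacement."""
    return rng.choices(sample, k=len(sample))


def share(sample, group):
    """Percentage of the sample that belongs to group."""
    return 100 * sample.count(group) / len(sample)


def summarize(values):
    """Return (mean, low, high); low and high are the 5th and 95th percentiles."""
    cuts = statistics.quantiles(values, n=20)  # 19 cut points: 5%, 10%, ..., 95%
    return statistics.mean(values), cuts[0], cuts[-1]


# Hypothetical survey (NOT GSS data): 2000 respondents, 8% of them "none".
rng = random.Random(2)
survey = ["none"] * 160 + ["relig"] * 1840

# 20 runs, each computed from a freshly resampled survey.
runs = [share(resample(survey, rng), "none") for _ in range(20)]
mean, low, high = summarize(runs)
```
&lt;/pre&gt;
&lt;div&gt;
In the full model, each run would rebuild the tables from the resampled respondents and then simulate a generation, so the spread across runs would include both sources of variation.&lt;/div&gt;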
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The actual values for Catholics, Jews and Other fall within the prediction intervals, but the model fails to predict the decrease in Protestants or the increase in None.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
So, what's missing from this model that could account for the observed changes?&lt;/div&gt;
&lt;div&gt;
&lt;ol&gt;
&lt;li&gt;The spouse tables are based on the parents of 1988 respondents. &amp;nbsp;People from later generations are increasingly likely to marry outside their religion.&lt;/li&gt;
&lt;li&gt;The environment table is also based on the previous generation; again, later parents might be making different decisions about the religious environment of their children.&lt;/li&gt;
&lt;li&gt;The transition table is based on 1988 respondents; it's possible that after 1988, children were less likely to adopt the religion they were raised in. &amp;nbsp;Anecdotally, the culprits most often blamed for this effect are college and the Internet.&lt;/li&gt;
&lt;li&gt;Finally, I have not considered adult conversions from one religious identity to another. &amp;nbsp;The GSS has data on these switches, so I could add them to the model.&lt;/li&gt;
&lt;/ol&gt;
&lt;div&gt;
Over the next few installments, I will investigate each of these factors to see which, if any, account for the observed changes.&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/lA1bxJ2gQ00" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/171064178008558050/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-two.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/171064178008558050?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/171064178008558050?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/lA1bxJ2gQ00/secularization-in-america-part-two.html" title="Secularization in America: part two" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://4.bp.blogspot.com/-yxWR2MbzE6g/T-Nr3aDxhwI/AAAAAAAAA34/A5XXqBMG7Co/s72-c/gss.pred.1988-2010.png" height="72" width="72" /><thr:total>0</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2012/06/secularization-in-america-part-two.html</feedburner:origLink></entry><entry gd:etag="W/&quot;D0IAQHc6cCp7ImA9WhJTEk0.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-8754370668378154976</id><published>2012-06-19T07:43:00.000-07:00</published><updated>2012-06-20T08:19:01.918-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-06-20T08:19:01.918-07:00</app:edited><title>Secularization in America, part one.</title><content 
type="html">In the last year or so I have written several articles about trends in religion among college students:&lt;br /&gt;
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="http://are%20religious%20colleges%20getting%20more%20religious/?"&gt;Are religious colleges getting more religious?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://allendowney.blogspot.com/2011/03/freshman-hordes-more-godless-than-ever.html"&gt;Freshman hordes more godless than ever!&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://allendowney.blogspot.com/2012/01/freshman-hordes-even-more-godless.html"&gt;Freshman hordes even more godless!&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div&gt;
All of these are based on data from&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; background-color: white;"&gt;&amp;nbsp;the&amp;nbsp;&lt;/span&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; background-color: white;"&gt;&lt;a href="http://www.gseis.ucla.edu/heri/cirpoverview.php"&gt;Cooperative Institutional Research Program (CIRP)&lt;/a&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; background-color: white;"&gt;&amp;nbsp;which runs&lt;/span&gt;&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; background-color: white;"&gt;&amp;nbsp;the &lt;a href="http://www.heri.ucla.edu/cirpoverview.php"&gt;Freshman Survey&lt;/a&gt;, an annual survey of more than 200,000 incoming students at 270 colleges and universities in the U.S.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; background-color: white;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; background-color: white;"&gt;More recently, I read &lt;a href="http://amzn.to/NHJBgb"&gt;Secularization: In Defence of an Unfashionable Theory&lt;/a&gt;, by Steve Bruce. &amp;nbsp;Bruce presents the "unfashionable theory" that as societies modernize, they secularize. &amp;nbsp;In his formulation, modernization includes trends toward individualism, industrial capitalism, science and technology; and secularization means "decline in the social significance of religion."&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; background-color: white;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
The poster child for secularization is Western Europe, where the social influence of religion has been in decline for centuries, and where in every country the fraction of people with no religious affiliation has been increasing for decades.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
But skeptics have suggested that countries where people are still religious, like the United States and many countries in the Middle East, are exceptions that disprove the theory. &amp;nbsp;Bruce replies that religious countries in the Middle East are not exceptions because they are not modern, and the United States is not an exception because it is, in fact, secularizing.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
The data from the Freshman Survey are consistent with secularization. &amp;nbsp;The number of incoming college students with no religious affiliation has been climbing consistently since 1978, and the number of students reporting participation in religious service has fallen at about the same rate.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
Of course, college students are not a random sample of the population; for that, we can use data from the &lt;a href="http://www3.norc.org/gss+website/"&gt;General Social Survey&lt;/a&gt;&amp;nbsp;(GSS), which is (according to the GSS) "widely regarded as the single best source of data on societal trends." &amp;nbsp;It has run since 1972; each year (or every other year since 1994) it surveys a sample of about 2000 adults randomly sampled from the U.S. population. &amp;nbsp;Respondents answer hundreds of questions about their background, life history, and beliefs. &amp;nbsp;Many questions are repeated from year to year for trend analysis.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
I will use this dataset to answer several questions:&lt;/div&gt;
&lt;div&gt;
&lt;ol&gt;
&lt;li&gt;Is there evidence of secularization in the U.S.? (Hint: yes.)&lt;/li&gt;
&lt;li&gt;Can we explain the causes?&lt;/li&gt;
&lt;li&gt;Can we predict how these trends will continue over the next few decades?&lt;/li&gt;
&lt;/ol&gt;
&lt;div&gt;
To get started, I tracked responses to the question, "What is your current religious preference?" &amp;nbsp;The original set of options was Protestant, Catholic, Jewish, some other religion, or no religion. &amp;nbsp;After 1994, the set of options was expanded, but for my purposes the original options are enough to describe large-scale trends. &amp;nbsp;The following graph shows the fraction of the population in each group over time.&lt;/div&gt;
&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-sL8EeHjYemc/T-CIMzssUZI/AAAAAAAAA3c/SxwcfUglzNc/s1600/gss.1972-2010.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://2.bp.blogspot.com/-sL8EeHjYemc/T-CIMzssUZI/AAAAAAAAA3c/SxwcfUglzNc/s400/gss.1972-2010.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; background-color: white;"&gt;A few trends are apparent: the percentage of Protestants is declining; the percentages of Other and None are increasing. &amp;nbsp;&lt;/span&gt;&lt;span style="background-color: white;"&gt;These trends are clearer in the following figures, broken into two intervals:&lt;/span&gt;&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-oxUxEn7DeXc/T-CKAPqcqsI/AAAAAAAAA3k/l424cUaUOS4/s1600/gss.change.1972-1988.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://4.bp.blogspot.com/-oxUxEn7DeXc/T-CKAPqcqsI/AAAAAAAAA3k/l424cUaUOS4/s400/gss.change.1972-1988.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; background-color: white;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
From 1972 to 1988, the fraction of Protestants and Catholics was unchanged, but the fraction of Nones may have increased.&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-zJar6Mq0e28/T-CKWAhGErI/AAAAAAAAA3s/dusYXbTRwCA/s1600/gss.change.1988-2010.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://1.bp.blogspot.com/-zJar6Mq0e28/T-CKWAhGErI/AAAAAAAAA3s/dusYXbTRwCA/s400/gss.change.1988-2010.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div&gt;
From 1988 to 2010 (the most recent survey year), the fraction of Protestants and Jews declined, and the fraction of Nones increased by almost 250%. &amp;nbsp;The number of Others increased during both intervals, with more variability.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
This dataset shows signs of secularization in the U.S., at least since 1972. &amp;nbsp;But religious affiliation is just one aspect of religious identity; there is a lot more data in the GSS to look at.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
My particular interest is in explaining the trends we have seen so far, and predicting what's coming next. &amp;nbsp;It is tempting to think that something happened in 1988 to cause the inflections in these curves, but I think it is more likely that the origin of these changes goes back farther. &amp;nbsp;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
To test that idea, let's pretend that it's 1988. &amp;nbsp;We have seen some changes in the market share of different religions since 1972, but nothing bigger than a few percentage points, and no indication of acceleration. &amp;nbsp;Could we have predicted the much larger changes coming between 1988 and 2010?&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
In the next few articles, I develop several models intended to answer that question. &amp;nbsp;Then I turn to prediction: using the data up to 2010 (and 2012 when it is available), what can we expect in the next 20 years?&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/ihDKYfxPJX8" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/8754370668378154976/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2012/06/secularization-in-america-part-one.html#comment-form" title="8 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/8754370668378154976?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/8754370668378154976?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/ihDKYfxPJX8/secularization-in-america-part-one.html" title="Secularization in America, part one." /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/-sL8EeHjYemc/T-CIMzssUZI/AAAAAAAAA3c/SxwcfUglzNc/s72-c/gss.1972-2010.png" height="72" width="72" /><thr:total>8</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2012/06/secularization-in-america-part-one.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CEcCR3g9eCp7ImA9WhVVFE0.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-1291851513354252112</id><published>2012-05-07T07:47:00.001-07:00</published><updated>2012-05-07T07:47:46.660-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-05-07T07:47:46.660-07:00</app:edited><title>Are religious colleges getting more religious?</title><content 
type="html">In response to &lt;a href="http://allendowney.blogspot.com/2012/01/freshman-hordes-even-more-godless.html"&gt;my article about the increasing numbers of students&lt;/a&gt; entering college with no religious affiliation, a reader wrote:&lt;br /&gt;
&lt;blockquote class="tr_bq"&gt;
&lt;i&gt;&lt;span style="font-family: inherit;"&gt;&lt;span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222;"&gt;A potentially interesting trend to look at is the religious participation of students attending Catholic or other religious institutions. I wonder if the trend is toward religious students (those who report both religious affiliation and participation in students) being more likely to go to religiously-affiliated colleges. This would mean a reduction of religious students at state/non-sectarian schools and an increase in the religiosity of students at affiliated schools (this might even be skewed because the survey doesn't include some of the most religious schools such as Liberty University). This would be a reflection of the increasing polarization of our society.&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&amp;nbsp;&lt;/blockquote&gt;
&lt;blockquote class="tr_bq"&gt;
&lt;i&gt;&lt;span style="font-family: inherit;"&gt;&lt;span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family: inherit;"&gt;&lt;span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222;"&gt;On another subject, I tend to distrust steadily increasing social trends. From a complexity theory perspective, I would expect more of a cycle in religious (dis-)belief, so I wouldn't be surprised to see a crash in the number of non-believers sometime in the near future, although predicting exactly when that will happen is virtually impossible with the relatively small amount of data available (40 years is not enough).&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/i&gt;&lt;/blockquote&gt;
&lt;span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222;"&gt;&lt;span style="font-family: inherit;"&gt;There are several interesting questions here. &amp;nbsp;The first is whether religious colleges are getting more religious. &amp;nbsp;This one is relatively easy to investigate: the HERI survey provides data broken down by several types of colleges, including Nonsectarian, Catholic, and Other Religious Affiliation.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222;"&gt;&lt;span style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222;"&gt;&lt;span style="font-family: inherit;"&gt;The good news is that &lt;a href="http://www.heri.ucla.edu/tfsPublications.php"&gt;their reports are available in PDF now&lt;/a&gt;. &amp;nbsp;The bad news is that most of the older ones are scanned and not OCRed, so they are not easy to search. &amp;nbsp;I went through them by hand, but I only extracted the data at 5-year intervals. &amp;nbsp;Here is what the trends look like for students responding "None" for religious preference at Private nonsectarian 4-year colleges, Catholic colleges, and Other religious colleges:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-zw7daWJ8whI/T6QvUVSfEHI/AAAAAAAAA0M/cUa4eoQW8G8/s1600/heri.religious.0.3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://2.bp.blogspot.com/-zw7daWJ8whI/T6QvUVSfEHI/AAAAAAAAA0M/cUa4eoQW8G8/s400/heri.religious.0.3.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222;"&gt;&lt;span style="font-family: inherit;"&gt;The percentage of Nones is increasing in all categories, more slowly at religious colleges than at other private colleges. &amp;nbsp;Nevertheless, the fraction of students at religious colleges with no religious preference has nearly tripled in the last 35 years.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222;"&gt;&lt;span style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #222222;"&gt;As an aside, I also plotted data for historically black colleges and universities (HBCU). &amp;nbsp;Here's what that looks like:&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #222222;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-RxC8JdxnXZw/T6QwTW2PtrI/AAAAAAAAA0U/xX_bUrM5hDo/s1600/heri.religious.3.5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://1.bp.blogspot.com/-RxC8JdxnXZw/T6QwTW2PtrI/AAAAAAAAA0U/xX_bUrM5hDo/s400/heri.religious.3.5.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;span style="color: #222222;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222;"&gt;&lt;span style="font-family: inherit;"&gt;Clearly the trend is slower; in fact, it is not obvious that it is statistically significant. &amp;nbsp;And since about 1990, the percentage of Nones has been higher at religious colleges than at HBCUs, by almost a factor of two (13%, compared with 7%).&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style="background-color: rgba(255, 255, 255, 0.917969); color: #222222; font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #222222;"&gt;Getting back to the reader's question, it doesn't look like the religious schools are getting more religious. &amp;nbsp; &amp;nbsp;In fact, they are getting less religious at almost the same rate as other schools. &amp;nbsp;But maybe the fraction of students going to religious colleges is increasing?&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #222222;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #222222;"&gt;Here is &lt;a href="http://nces.ed.gov/programs/digest/d10/tables/dt10_205.asp"&gt;a table from the National Center For Educational Statistics, which publishes the Digest of Education Statistics&lt;/a&gt;. &amp;nbsp;It shows "Fall enrollment and number of degree-granting institutions, by control and affiliation of institution: Selected years, 1980 through 2009."&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #222222;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #222222;"&gt;Here's what that data looks like:&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #222222;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-cy0ZmO9JSNA/T6fbNW24_6I/AAAAAAAAA1A/em_UL8_XeVE/s1600/heri.religious2.raw.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://4.bp.blogspot.com/-cy0ZmO9JSNA/T6fbNW24_6I/AAAAAAAAA1A/em_UL8_XeVE/s400/heri.religious2.raw.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;span style="color: #222222;"&gt;Enrollments have been increasing for all college types, with religious colleges growing faster than private nonsectarian colleges until 1995. &amp;nbsp;Here's what these data look like expressed as a percent of the total:&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #222222;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-3yE-NOnm6OQ/T6fbwnPPlOI/AAAAAAAAA1I/AGy0TNphDNo/s1600/heri.religious2.percent.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://1.bp.blogspot.com/-3yE-NOnm6OQ/T6fbwnPPlOI/AAAAAAAAA1I/AGy0TNphDNo/s400/heri.religious2.percent.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;span style="color: #222222;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #222222;"&gt;Before 1995, religious colleges were gaining market share, at the expense of&amp;nbsp;nonsectarian colleges; other than that, there is not much going on.&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #222222;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #222222;"&gt;So, to address the reader's questions:&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;&lt;span style="color: #222222;"&gt;Are religious students more likely to attend religious schools? &amp;nbsp;Maybe. &amp;nbsp;It's hard to tell with the data I have.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style="color: #222222;"&gt;Are more students going to religious schools? &amp;nbsp;Not lately. &amp;nbsp;Since 1995, the fraction of students at religious colleges has been flat.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style="color: #222222;"&gt;Are students at religious schools increasingly religious? &amp;nbsp;No. &amp;nbsp;The percentage of Nones has increased in all college types, including religious colleges.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style="color: #222222;"&gt;Do these trends introduce a bias in the results I presented? &amp;nbsp;Not that I can see.&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div&gt;
&lt;span style="color: #222222;"&gt;As for distrusting&amp;nbsp;steadily increasing social trends, I agree that some caution is needed. &amp;nbsp;The percentage of Nones can't keep accelerating forever. &amp;nbsp;But it can keep growing forever.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #222222;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #222222;"&gt;Steven Pinker presents several one-way trends in his new book, &lt;i&gt;&lt;a href="http://amzn.to/JKTlxR"&gt;The Better Angels of Our Nature&lt;/a&gt;&lt;/i&gt;. &amp;nbsp;And Peter Singer, who &lt;a href="http://www.nytimes.com/2011/10/09/books/review/the-better-angels-of-our-nature-by-steven-pinker-book-review.html?pagewanted=all"&gt;reviewed Pinker's book in the &lt;i&gt;New York Times&lt;/i&gt;&lt;/a&gt;, discussed related ideas in &lt;i&gt;&lt;a href="http://amzn.to/JiWBFd"&gt;The Expanding Circle&lt;/a&gt;&lt;/i&gt;.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #222222;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #222222;"&gt;But who knows? &amp;nbsp;I guess we'll see what next year's data point looks like.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #222222;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #222222;"&gt;Many thanks to the reader who posted the comments that prompted this update.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #222222;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;span style="color: #222222;"&gt;Here is my &lt;a href="http://allendowney.blogspot.com/2011/03/freshman-hordes-more-godless-than-ever.html"&gt;original article from March 2011&lt;/a&gt;. &amp;nbsp;Here is the &lt;a href="http://allendowney.blogspot.com/2012/01/freshman-hordes-even-more-godless.html"&gt;update from January 2012&lt;/a&gt;.&lt;/span&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;span style="color: #222222;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style="color: #222222;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/uWn2dbVgFio" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/1291851513354252112/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2012/05/are-religious-colleges-getting-more.html#comment-form" title="0 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/1291851513354252112?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/1291851513354252112?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/uWn2dbVgFio/are-religious-colleges-getting-more.html" title="Are religious colleges getting more religious?" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/-zw7daWJ8whI/T6QvUVSfEHI/AAAAAAAAA0M/cUa4eoQW8G8/s72-c/heri.religious.0.3.png" height="72" width="72" /><thr:total>0</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2012/05/are-religious-colleges-getting-more.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CEENSXgyeyp7ImA9WhVWFEg.&quot;"><id>tag:blogger.com,1999:blog-6894866515532737257.post-1804350540193916200</id><published>2012-04-25T11:08:00.001-07:00</published><updated>2012-04-26T08:04:58.693-07:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2012-04-26T08:04:58.693-07:00</app:edited><title>Fog warning 
system: part three</title><content type="html">&lt;br /&gt;
&lt;b&gt;Background&lt;/b&gt;&lt;span style="font-family: inherit;"&gt;:&amp;nbsp;&lt;span style="background-color: white; line-height: 18px;"&gt;I am trying to evaluate the effect on traffic safety of a fog warning system deployed in California in November 1996. &amp;nbsp;The system was installed by CalTrans on a section of I-5 and SR-120 near Stockton where the accident rate is generally high, particularly during the morning commute when ground fog is common. &amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: white; font-family: inherit; line-height: 18px;"&gt;The warning system consists of (1) weather monitoring stations that detect fog and (2) changeable message signs that warn drivers to reduce speed.&lt;/span&gt;&lt;br /&gt;
&lt;span style="background-color: white; font-family: inherit; line-height: 18px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style="background-color: white; font-family: inherit; line-height: 18px;"&gt;I will post my findings as I go in order to solicit comments from professionals and demonstrate methods for students. &amp;nbsp;If I can get permission, I will also post my data and code so you can follow along at home.&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Previously&lt;/b&gt;: In the &lt;a href="http://allendowney.blogspot.com/2012/04/fog-warning-system-life-saver-or-road.html"&gt;first installment&lt;/a&gt; I reviewed the first batch of data I am working with, and ran some tests to confirm that Poisson regression is appropriate for modeling the number of accidents in a given day. &amp;nbsp;In &lt;a href="http://allendowney.blogspot.com/2012/04/fog-warning-system-part-two.html"&gt;part two&lt;/a&gt; I ran Poisson regressions to identify factors that influence the number of accidents per day.&lt;br /&gt;
&lt;br /&gt;
&lt;h4&gt;
Critical events&lt;/h4&gt;
I have been waiting to get more details about several events that affected traffic safety during the observation period. &amp;nbsp;I was able to get in touch with a&amp;nbsp;Transportation Engineer in the&amp;nbsp;Traffic Safety Branch of Caltrans District 10, which includes the study area. &amp;nbsp;According to Caltrans records, the speed limit on the relevant section of I-5 was increased from 55 to 70 mph on March 25, 1996. &amp;nbsp;The speed limit on SR-120 was increased from 55 to 65 mph about a month later, on April 22, 1996. &amp;nbsp;Many thanks to my correspondent for this information!&lt;br /&gt;
&lt;br /&gt;
The automated warning system was activated in November 1996. &amp;nbsp;My collaborator has collected data on the weather measurements made by the system and the warnings it displayed. &amp;nbsp;I hope to get this data processed soon.&lt;br /&gt;
&lt;h4&gt;
Accidents per million vehicles&lt;/h4&gt;
In the previous article, I ran models with raw accident counts as the dependent variable, and found that traffic volume is a significant explanatory variable. &amp;nbsp;Not surprisingly, more cars yield more accidents.&lt;br /&gt;
&lt;br /&gt;
Rather than use volume as an explanatory variable, an alternative is to express the dependent variable in terms of accidents per million vehicles. &amp;nbsp;As a reminder, here's what the traffic volume (in thousands of cars per day) looks like during the observation period:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-qwJBmQ0Z0r4/T5gf2l6bM9I/AAAAAAAAAyk/jHkwdwYfyac/s1600/caws.traffic.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://2.bp.blogspot.com/-qwJBmQ0Z0r4/T5gf2l6bM9I/AAAAAAAAAyk/jHkwdwYfyac/s400/caws.traffic.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
And here are the raw accident counts:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-V3tWBkbf3tM/T5gv77jrumI/AAAAAAAAAyw/KBGIXS0h_IA/s1600/caws.accident.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://2.bp.blogspot.com/-V3tWBkbf3tM/T5gv77jrumI/AAAAAAAAAyw/KBGIXS0h_IA/s400/caws.accident.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
I divided counts by volume and converted to accidents per million cars. &amp;nbsp;At the same time I smoothed the curves by aggregating quarterly. &amp;nbsp;Here's what that looks like:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-B8pkEphxR2g/T5gwjl6Uy4I/AAAAAAAAAy4/WHOq1496Cr4/s1600/caws.accident.0.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://3.bp.blogspot.com/-B8pkEphxR2g/T5gwjl6Uy4I/AAAAAAAAAy4/WHOq1496Cr4/s400/caws.accident.0.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
The vertical red lines show major events expected to affect traffic safety: increased speed limits in March and April 1996, and the activation of the warning system in November 1996.&lt;br /&gt;
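&lt;br /&gt;
For readers who want to try this at home, here is a minimal sketch of the rate calculation described above: divide accident counts by traffic volume, scale to accidents per million vehicles, and aggregate by quarter. &amp;nbsp;It is an illustration, not the code I used; the function name, the data layout and the numbers are all made up for the example.&lt;br /&gt;

```python
from collections import defaultdict
from datetime import date


def quarterly_rates(daily_accidents, daily_volume):
    """Return {(year, quarter): accidents per million vehicles}.

    daily_accidents: dict mapping a date to the accident count on that day
                     (days with no accidents must appear with a count of 0)
    daily_volume:    dict mapping a year to the estimated vehicles per day
    Both layouts are hypothetical stand-ins for the real data.
    """
    accidents = defaultdict(int)
    vehicles = defaultdict(float)
    for day, count in daily_accidents.items():
        key = (day.year, (day.month - 1) // 3 + 1)  # calendar quarter 1..4
        accidents[key] += count
        vehicles[key] += daily_volume[day.year]     # exposure for this day
    return {key: 1e6 * accidents[key] / vehicles[key]
            for key in sorted(accidents)}


# Made-up example: three observed days and a flat 50,000 vehicles per day.
daily_accidents = {
    date(1996, 1, 5): 2,
    date(1996, 2, 10): 1,
    date(1996, 4, 1): 3,
}
daily_volume = {1996: 50000}

rates = quarterly_rates(daily_accidents, daily_volume)
print(rates)
```

Aggregating by quarter before dividing means each quarter's rate is total accidents over total exposure, which is less noisy than averaging daily rates.&lt;br /&gt;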
&lt;br /&gt;
This graph suggests several observations:&lt;br /&gt;
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;In the control directions, the accident rate was flat from 1992 through 1994, increased quickly in 1995 (&lt;i&gt;before the speed limits were increased&lt;/i&gt;),&amp;nbsp;and has been flat ever since.&lt;/li&gt;
&lt;li&gt;In the treatment directions, the accident rate was trending down until late 1996, including three quarters after the speed limit was increased. &amp;nbsp;The accident rate increased sharply in 1997 and possibly again in 2000.&lt;/li&gt;
&lt;li&gt;The accident rate in both directions was unusually low during the third quarter of 1996, when the warning system was activated. &amp;nbsp;Other than that, there is no obvious relationship between accident rates and the events of 1996.&lt;/li&gt;
&lt;/ol&gt;
&lt;div&gt;
Since we don't expect the warning system to have much effect on the control directions (that's why they're called "control"), the speed limit changes are by far the most likely explanation for the accident rate changes. &amp;nbsp;But it is puzzling that a large part of the change occurred before the new speed limits went into effect. &amp;nbsp;One possibility is that as new speed limits were rolled out throughout California, drivers became accustomed to higher speeds and drove faster even on roads where the new limits were not in effect. &amp;nbsp;But if that's true, it doesn't explain the continuing decline in the treatment directions.&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
My collaborator has some data on actual driving speeds before and after 1996. &amp;nbsp;Once I process that data, I will be able to get back to this puzzle.&lt;/div&gt;
&lt;br /&gt;
&lt;h4&gt;
Injuries and fatal accidents&lt;/h4&gt;
In response to a previous post, a reader suggested that if the warning system causes drivers to slow down, it might affect the severity of accidents more than the raw number. &amp;nbsp;To investigate that possibility, I also plotted the rates for injury accidents (including fatalities) and fatal accidents.&lt;br /&gt;
&lt;br /&gt;
Here is the graph for injury accidents:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-YPZj_dcO_aE/T5g0XX1XR0I/AAAAAAAAAzE/wBBNWqT1GUg/s1600/caws.accident.1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://1.bp.blogspot.com/-YPZj_dcO_aE/T5g0XX1XR0I/AAAAAAAAAzE/wBBNWqT1GUg/s400/caws.accident.1.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
The patterns we saw in the previous graph appear here, too. &amp;nbsp;In addition, this graph suggests, more strongly, the possibility of a second changepoint in late 1999 or 2000.&lt;br /&gt;
&lt;br /&gt;
And here is the graph for fatal accidents:&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-h50DQ9b7VP4/T5g07qEOKOI/AAAAAAAAAzM/0TZGQG85PHQ/s1600/caws.accident.2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="300" src="http://2.bp.blogspot.com/-h50DQ9b7VP4/T5g07qEOKOI/AAAAAAAAAzM/0TZGQG85PHQ/s400/caws.accident.2.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
The number of fatal accidents is, fortunately, small. &amp;nbsp;During more than 10 years of observation, there were only 26 in the study area. &amp;nbsp;The trends in the other graphs are not apparent here, other than the general increase in the rate of fatal accidents in the second half of the observation period.&lt;br /&gt;
&lt;br /&gt;
&lt;h4&gt;
Summary&lt;/h4&gt;
&lt;div&gt;
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Accident rates in the control and treatment directions increased sharply around 1996, but neither effect is related in an obvious way to increased speed limits or deployment of the warning system.&lt;/li&gt;
&lt;li&gt;Accident rates were unusually low in the quarter the warning system was activated; other than that, no effect of the warning system is apparent.&lt;/li&gt;
&lt;li&gt;It looks like there was a second increase in accident rates in late 1999 or 2000. &amp;nbsp;I will ask my correspondent at Caltrans if he has an explanation.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
Next steps&lt;/h4&gt;
&lt;br /&gt;
There's not much more I want to do with this data. &amp;nbsp;Now I need more numbers! &amp;nbsp;In particular, I will be able to get data from the warning system itself, including:&lt;br /&gt;
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Conditions measured at roadside weather stations, which should be better than the data I have from the airport 8 miles away, and&lt;/li&gt;
&lt;li&gt;Messages displayed when the warning system was active.&lt;/li&gt;
&lt;/ol&gt;
If the warning system has an effect, it should be apparent on the days it is active. &amp;nbsp;By comparing the treatment and control directions, it should be possible to quantify the effect.&lt;br /&gt;
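&lt;br /&gt;
One possible way to set up that comparison is a difference in differences: compare the change in accident rate between active and inactive days in the treatment directions with the same change in the control directions. &amp;nbsp;The sketch below is only an assumption about how the analysis might go, not a method I have settled on; the function names and all of the numbers are made up.&lt;br /&gt;

```python
def rate_per_million(accidents, vehicles):
    """Accidents per million vehicles."""
    return 1e6 * accidents / vehicles


def difference_in_differences(treat_active, treat_inactive,
                              control_active, control_inactive):
    """Change in the treatment rate minus the change in the control rate."""
    return (treat_active - treat_inactive) - (control_active - control_inactive)


# Made-up counts and exposures, chosen only to make the arithmetic easy.
treat_inactive = rate_per_million(10, 10_000_000)    # warning system idle
treat_active = rate_per_million(3, 2_000_000)        # warning system active
control_inactive = rate_per_million(10, 10_000_000)
control_active = rate_per_million(4, 2_000_000)

effect = difference_in_differences(treat_active, treat_inactive,
                                   control_active, control_inactive)
print(effect)
```

A negative value would mean the accident rate rose less on active days in the treatment directions than in the control directions, which is what we would expect if the warnings help. &amp;nbsp;With so few accidents per day, a real analysis would also need an estimate of uncertainty.&lt;br /&gt;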
&lt;br /&gt;
Also, I have permission now to share the data; I will try to get it posted, along with my code, before the next update.&lt;br /&gt;
&lt;br /&gt;
&lt;h4&gt;
[UPDATE April 26, 2012]&lt;/h4&gt;
A reader asked:&lt;br /&gt;
&lt;blockquote class="tr_bq"&gt;
&lt;i&gt;I can think of two ways that overall traffic volume affects accident rates: (1) more cars = more accidents overall, which you control for by measuring accident rates, and now you're seeing rising accident rates per car. So this raises the next thought, (2) more cars = more traffic density, which raises accident rates per car for each car on the road.&lt;/i&gt;&amp;nbsp;&lt;/blockquote&gt;
&lt;blockquote class="tr_bq"&gt;
&lt;i&gt;What happens if you regress on traffic volume squared, or include traffic volume as an independent variable in the accident rate regression? The density effect is likely nonlinear but it's a thought.&lt;/i&gt;&lt;/blockquote&gt;
This is a great question. &amp;nbsp;If there is a non-linear relationship between traffic volume and the raw number of accidents, then even after we switch to accident rates, there might still be a positive relationship between traffic volume and accident rates.&lt;br /&gt;
&lt;br /&gt;
I ran these regressions, and in fact there is a relationship, but with the limitations of the data I have, I don't think it means much. &amp;nbsp;Specifically, I only have annual estimates for traffic volume, so there's no fluctuation over time; traffic volume increases at a nearly constant rate for the entire observation period (see the figure above).&lt;br /&gt;
&lt;br /&gt;
So traffic volume will have a positive relationship with anything else that's increasing, and a negative relationship with anything decreasing. &amp;nbsp;And that's what I see in the regressions:&lt;br /&gt;
&lt;br /&gt;
&lt;iframe frameborder="0" height="300" src="https://docs.google.com/spreadsheet/pub?key=0AnXTZBvB42kIdEFqaEh5V0xTeVRjSWg5a2VfRVdnQXc&amp;amp;single=true&amp;amp;gid=3&amp;amp;output=html&amp;amp;widget=true" width="500"&gt;&lt;/iframe&gt;
&lt;br /&gt;
&lt;br /&gt;
All of the relationships are statistically significant, but notice that in the treatment directions, before 1996 when the accident rate was declining, the relationship with traffic volume is negative!&lt;br /&gt;
&lt;br /&gt;
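Here is a small sketch of why a steadily increasing predictor correlates with anything that has a trend. &amp;nbsp;The series are made up for illustration (they are not the accident data), and the helper function is a plain Pearson correlation written out by hand.&lt;br /&gt;

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


years = list(range(1992, 2003))                    # 11 annual observations
volume = [40 + 2 * i for i in range(len(years))]   # made-up steady ramp

# Made-up accident rates: one series trending up, one trending down.
rising_rate = [10, 11, 11, 13, 14, 14, 16, 17, 17, 19, 20]
falling_rate = [30, 29, 27, 27, 25, 24, 22, 22, 20, 19, 18]

r_up = pearson(volume, rising_rate)      # positive
r_down = pearson(volume, falling_rate)   # negative
r_years = pearson(years, rising_rate)    # same as r_up, up to rounding

print(r_up, r_down, r_years)
```

Because the made-up volume is a linear function of the year, replacing it with the year itself gives the same correlation, up to floating-point rounding.&lt;br /&gt;
&lt;br /&gt;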
I don't think this variable has any explanatory content; any other ramp function would behave the same way. &amp;nbsp;If I can get finer-grain data on traffic volume, I might be able to look for a more meaningful effect.&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/ProbablyOverthinkingIt/~4/sUWUHte5w2M" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://allendowney.blogspot.com/feeds/1804350540193916200/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="http://allendowney.blogspot.com/2012/04/fog-warning-system-part-three.html#comment-form" title="10 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/1804350540193916200?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6894866515532737257/posts/default/1804350540193916200?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/ProbablyOverthinkingIt/~3/sUWUHte5w2M/fog-warning-system-part-three.html" title="Fog warning system: part three" /><author><name>Allen Downey</name><uri>https://plus.google.com/111942648516576371054</uri><email>noreply@blogger.com</email><gd:image rel="http://schemas.google.com/g/2005#thumbnail" width="32" height="32" src="//lh6.googleusercontent.com/-MMJ7uTh1QPA/AAAAAAAAAAI/AAAAAAAABAY/HBzGgWnGzQs/s512-c/photo.jpg" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/-qwJBmQ0Z0r4/T5gf2l6bM9I/AAAAAAAAAyk/jHkwdwYfyac/s72-c/caws.traffic.png" height="72" width="72" /><thr:total>10</thr:total><feedburner:origLink>http://allendowney.blogspot.com/2012/04/fog-warning-system-part-three.html</feedburner:origLink></entry></feed>
