Probably Overthinking It

This blog has moved

2018-09-26T06:48:00.000-07:00

As of September 2018, I am moving Probably Overthinking It to a new location.

Blogger has been an excellent host; so far, this blog has received more than two million page views.

But earlier this month, when I published a new article, Blogger prompted me to post it on Google+, and I did. A few hours later I discovered that my Google+ account had been suspended for violating terms of service, but I got no information about what terms I had violated.

While my Google+ account was suspended, I was unable to access Blogger and some other Google services. And since Probably Overthinking It is a substantial part of my professional web presence, that was unacceptable.

I appealed the suspension by pressing a button, with no opportunity to ask a question. Within 24 hours, my account was restored, but with no communication and still no information.

So for me, using Google+ has become a game of Russian Roulette. Every time I post something, there seems to be a random chance that I will lose control of my web presence. And maybe next time it will be permanent.

It is nice that using Blogger is free, but this episode has been a valuable reminder that "If you are not paying for it, you are not the customer". (Who said that?)

I have moved Probably Overthinking It to a site I control, hosted by a company I pay, a company that has provided consistently excellent customer service.

Lesson learned.

[When I published this article, Blogger asked if I wanted to post it on Google+. I did not.]

UPDATE: See the discussion of this post on Hacker News, with lots of good advice for migrating to services you have more control over.

Two hour marathon in 2031, maybe

2018-09-19T11:10:00.003-07:00

On Sunday (September 16, 2018) Eliud Kipchoge ran the Berlin Marathon in 2:01:39, smashing the previous world record by more than a minute and taking a substantial step in the progression toward a two hour marathon.

In a previous article, I noted that the marathon record pace since 1970 has been progressing linearly over time, and I proposed a model that explains why we might expect it to continue. Based on a linear extrapolation of the data so far, I predicted that someone would break the two hour barrier in 2041, plus or minus a few years.

Now it is time to update my predictions in light of the new record. The following figure shows the progression of world record pace since 1970 (orange line), a linear fit to the data (blue line) and a 90% predictive confidence interval (shaded area). The dashed lines show the two hour marathon pace (13.1 mph) and lower and upper bounds for the year we will reach it.
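To make the 13.1 mph threshold concrete, here is a small arithmetic sketch that converts a finishing time to average speed. The helper name pace_mph is made up for this illustration; the only inputs are the official marathon distance (26 miles 385 yards) and the times mentioned above.

```python
MARATHON_MILES = 26 + 385 / 1760  # official distance: 26 miles 385 yards


def pace_mph(hours, minutes, seconds):
    """Average speed in mph for a marathon finished in the given time."""
    total_hours = hours + minutes / 60 + seconds / 3600
    return MARATHON_MILES / total_hours


two_hour_pace = pace_mph(2, 0, 0)   # the two hour barrier
record_pace = pace_mph(2, 1, 39)    # Kipchoge's 2:01:39 in Berlin

print(f"{two_hour_pace:.2f} mph, {record_pace:.2f} mph")
# prints "13.11 mph, 12.93 mph"
```

So the new record is a little less than 0.2 mph short of two hour pace.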

Since the previous record was broken in 2014, we have been slightly behind the long-term trend. But the new record more than makes up for it, putting us at the upper edge of the predictive interval.

This model predicts that we might see a two hour marathon as early as 2031, and probably will before 2041.

Note that this model is based on data from races. It is possible that we will see a two hour marathon sooner under time trial conditions, as in the Nike Breaking2 project.

Tom Bayes and the case of the double dice

2018-09-13T05:41:00.000-07:00

The double dice problem

Suppose I have a box that contains one each of 4-sided, 6-sided, 8-sided, and 12-sided dice. I choose a die at random, and roll it twice without letting you see the die or the outcome. I report that I got the same outcome on both rolls.

1) What is the posterior probability that I rolled each of the dice?
2) If I roll the same die again, what is the probability that I get the same outcome a third time?

You can see the complete solution in this Jupyter notebook, or read the HTML version here.

Solution

Here's a BayesTable that represents the four hypothetical dice.

In [3]:

from fractions import Fraction
hypo = [Fraction(sides) for sides in [4, 6, 8, 12]]
table = BayesTable(hypo)  # BayesTable is defined in the notebook
table

Out[3]:

	hypo	prior	likelihood	unnorm	posterior
0	4	1	NaN	NaN	NaN
1	6	1	NaN	NaN	NaN
2	8	1	NaN	NaN	NaN
3	12	1	NaN	NaN	NaN

Since we didn't specify prior probabilities, the default value is equal priors for all hypotheses. They don't have to be normalized, because we have to normalize the posteriors anyway.
Now we can specify the likelihoods: if a die has n sides, the chance of getting the same outcome twice is 1/n.
So the likelihoods are:

In [4]:

table.likelihood = 1/table.hypo
table

Out[4]:

	hypo	prior	likelihood	unnorm	posterior
0	4	1	1/4	NaN	NaN
1	6	1	1/6	NaN	NaN
2	8	1	1/8	NaN	NaN
3	12	1	1/12	NaN	NaN

Now we can use update to compute the posterior probabilities:

In [5]:

table.update()
table

Out[5]:

	hypo	prior	likelihood	unnorm	posterior
0	4	1	1/4	1/4	2/5
1	6	1	1/6	1/6	4/15
2	8	1	1/8	1/8	1/5
3	12	1	1/12	1/12	2/15

In [6]:

table.posterior.astype(float)

Out[6]:

0    0.400000
1    0.266667
2    0.200000
3    0.133333
Name: posterior, dtype: float64

The 4-sided die is most likely because you are more likely to get doubles on a 4-sided die than on a 6-, 8-, or 12-sided die.
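If you don't have BayesTable handy, the same update can be sketched with plain lists and Fraction. This is not how BayesTable is implemented; bayes_update is a hypothetical helper that just multiplies priors by likelihoods and normalizes.

```python
from fractions import Fraction


def bayes_update(priors, likelihoods):
    """Return (posteriors, normalizing constant) for parallel lists."""
    unnorm = [p * like for p, like in zip(priors, likelihoods)]
    total = sum(unnorm)
    return [u / total for u in unnorm], total


sides = [4, 6, 8, 12]
priors = [Fraction(1)] * len(sides)            # equal, unnormalized priors
likelihoods = [Fraction(1, n) for n in sides]  # chance of doubles on an n-sided die

posteriors, _ = bayes_update(priors, likelihoods)
for n, p in zip(sides, posteriors):
    print(n, p)
```

The printed posteriors should match the table above: 2/5, 4/15, 1/5, and 2/15.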

Part two

The second part of the problem asks for the (posterior predictive) probability of getting the same outcome a third time, if we roll the same die again.
If the die has n sides, the probability of getting the same value again is 1/n, which should look familiar.
To get the total probability of getting the same outcome, we have to add up the conditional probabilities:

P(n | data) * P(same outcome | n)

The first term is the posterior probability; the second term is 1/n.

In [7]:

total = 0
for _, row in table.iterrows():
    total += row.posterior / row.hypo
    
total

Out[7]:

Fraction(13, 72)

This calculation is similar to the first step of the update, so we can also compute it by
1) Creating a new table with the posteriors from table.
2) Adding the likelihood of getting the same outcome a third time.
3) Computing the normalizing constant.

In [8]:

table2 = table.reset()
table2.likelihood = 1/table.hypo
table2

Out[8]:

	hypo	prior	likelihood	unnorm	posterior
0	4	2/5	1/4	NaN	NaN
1	6	4/15	1/6	NaN	NaN
2	8	1/5	1/8	NaN	NaN
3	12	2/15	1/12	NaN	NaN

In [9]:

table2.update()

Out[9]:

Fraction(13, 72)

In [10]:

table2

Out[10]:

	hypo	prior	likelihood	unnorm	posterior
0	4	2/5	1/4	1/10	36/65
1	6	4/15	1/6	2/45	16/65
2	8	1/5	1/8	1/40	9/65
3	12	2/15	1/12	1/90	4/65

This result is the same as the posterior after seeing the same outcome three times.

This example demonstrates a general truth: to compute the predictive probability of an event, you can pretend you saw the event, do a Bayesian update, and record the normalizing constant.
(With one caveat: this only works if your priors are normalized.)
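Here is a minimal sketch of that idea and its caveat, again using plain lists rather than BayesTable. The helper bayes_update is hypothetical. With normalized priors, the normalizing constant of each update is a genuine probability; with unnormalized priors the posteriors come out the same, but the constant does not.

```python
from fractions import Fraction


def bayes_update(priors, likelihoods):
    """Return (posteriors, normalizing constant) for parallel lists."""
    unnorm = [p * like for p, like in zip(priors, likelihoods)]
    total = sum(unnorm)
    return [u / total for u in unnorm], total


sides = [4, 6, 8, 12]
same = [Fraction(1, n) for n in sides]  # chance of repeating the previous outcome

# Normalized priors: each constant is the probability of the observed event.
uniform = [Fraction(1, len(sides))] * len(sides)
post2, p_doubles = bayes_update(uniform, same)  # after seeing doubles
post3, p_third = bayes_update(post2, same)      # pretend we saw a third match

# Unnormalized priors: same posteriors, but the constant is not a probability
# of the data.
post2_unnorm, const = bayes_update([Fraction(1)] * len(sides), same)

print(p_doubles, p_third, const)
```

Here p_third is 13/72, matching the loop above, while const is 5/8 rather than the 5/32 you get from normalized priors.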

The Physics of Bungee Jumping

2018-07-10T10:18:00.003-07:00

Bungee jumping turns out to be more complicated than I realized. I use bungee jumping as an example in Modeling and Simulation in Python, which I am revising this summer. The book is not done, but you can see the current draft here.

During the first phase of the jump, before the cord is fully extended, I treat the jumper as if they are in free fall, including the effect of gravity and air resistance, but ignoring the interaction between the jumper and the cord.

It turns out that this interaction is non-negligible. As the cord drops from its folded initial condition to its extended final condition, it loses potential energy. Where does that energy go? It is transferred to the jumper!

The following diagram shows the scenario, courtesy of this web page on the topic:

The acceleration of the jumper turns out to be

where a is the net acceleration of the jumper, g is acceleration due to gravity, v is the velocity of the jumper, y is the position of the jumper relative to the starting point, L is the length of the cord, and μ is the mass ratio of the cord and jumper.

For a bungee jumper with mass 75 kg, I've computed the trajectory of a jumper with and without the effect of the cord. The difference is more than two meters, which could be the difference between a successful jump and a bad day.

The details are in this Jupyter notebook.

Inference in three hours

2018-06-22T08:52:00.000-07:00

I am preparing a talk for the Joint Statistical Meetings (JSM 2018) in August. It's part of a session called "Bringing Intro Stats into a Multivariate and Data-Rich World"; my talk is called "Inference in Three Hours, and More Time for the Good Stuff".

Here's what I said I would talk about:

Teaching statistical inference using mathematical methods takes too much time, emphasizes the least important material, and leaves many students unprepared to apply statistics in the real world. Simple computer simulations can demonstrate the fundamental ideas of statistical inference quickly, clearly, and memorably. Computational methods are also robust and flexible, making it possible to work with a wider range of data and experiments. And by teaching statistical inference better and faster, we leave time for the most important goals of statistics education: preparing students to use data to answer questions and guide decision making under uncertainty. In this talk, I discuss problems with current approaches and present educational material I have developed based on computer simulations in Python.

I have slides for the talk now:

And here's the Jupyter notebook they are based on.

I have a few weeks until the conference, so comments and suggestions are welcome.

----

Coincidentally, I got a question on Twitter today that's related to my talk:

Very late to this post by @AllenDowney, but quite informative: http://allendowney.blogspot.com/2015/11/recidivism-and-single-case-probabilities.html …
Have one question though: seems a lot of the single case reasoning here is similar to what I was taught was a mistaken conclusion: “that there is a 95% prob that a parameter lies within the given 95% CI.” What is the difference? Seems I am missing some nuance?

The post @cutearguments asks about is "Recidivism and single-case probabilities", where I make an argument that single-case probabilities are not a special problem, even under the frequentist interpretation of probability; they only seem like a special problem because they make the reference class problem particularly salient.

So what does that have to do with confidence intervals? Let me start with the example in my talk: suppose you are trying to estimate the average height of men in the U.S. You collect a sample and generate an estimate, like 178 cm, and a 95% confidence interval, like (177, 179) cm.
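As a rough sketch of how a simulation can produce an estimate and interval like these, the following code resamples a made-up sample and reports the middle 95% of the resampled means. Everything here is an assumption for illustration: the heights are simulated, not survey data, and the mean (178 cm), standard deviation (7.7 cm), and sample size (1000) are invented.

```python
import random
import statistics

random.seed(17)

# Hypothetical sample: 1000 simulated heights in cm (not real survey data).
sample = [random.gauss(178, 7.7) for _ in range(1000)]
estimate = statistics.fmean(sample)


def resample_mean(data):
    """Mean of one resample drawn from data with replacement."""
    return statistics.fmean(random.choices(data, k=len(data)))


# Simulate the sampling process 1000 times and take the middle 95%.
means = sorted(resample_mean(sample) for _ in range(1000))
low, high = means[25], means[974]

print(f"estimate {estimate:.1f} cm, 95% CI ({low:.1f}, {high:.1f}) cm")
```

The interval comes entirely from resampling, so it reflects variability due to random sampling and nothing else.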

Naively, it is tempting to say that there is a 95% chance that the true value (the actual average height of every male resident in the population) falls in the 95% confidence interval. But that's not true.

There are two reasons you might hear for why it's not true:

1) The true value is unknown, but it is not a random quantity, so it is either in the interval or it's not. You can't assign a probability to it.

2) The 95% confidence interval does not have a 95% chance of containing the true value because that's just not what it means. A confidence interval quantifies variability due to random sampling; that's all.

The first argument is bogus; the second is valid.

If you are a Bayesian, the first argument is bogus because it is entirely unproblematic to make probability statements about unknown quantities, whether they are considered random or not.

If you are a frequentist, the first argument is still bogus because even if the true value is not a random quantity, the confidence interval is. And furthermore, it belongs to a natural reference class, the set of confidence intervals we would get by running the experiment many times. If we agree to treat it as a member of that reference class, we should have no problem giving it a probability of containing the true value.

But that probability is not 95%. If you want an interval with a 95% chance of containing the true value, you need a Bayesian credible interval.

Bayesian Zig Zag

2018-06-01T07:22:00.002-07:00

Almost two years ago, I had the pleasure of speaking at the inaugural meeting of the Boston Bayesians, where I presented "Bayesian Bandits from Scratch" (the notebook for that talk is here). Since then, the group has flourished, thanks to the organizers, Jordi Diaz and Colin Carroll.

Last night I made my triumphant return for the 21st meeting, where I presented a talk I called "Bayesian Zig Zag". Here's the abstract:

Tools like PyMC and Stan make it easy to implement probabilistic models, but getting started can be challenging. In this talk, I present a strategy for simultaneously developing and implementing probabilistic models by alternating between forward and inverse probabilities and between grid algorithms and MCMC. This process helps developers validate modeling decisions and verify their implementation.

As an example, I will use a version of the "Boston Bruins problem", which I presented in Think Bayes, updated for the 2017-18 season. I will also present and request comments on my plans for the second edition of Think Bayes.

When I wrote the abstract, I was confident that the Bruins would be in the Stanley Cup final, but that is not how it worked out. I adapted, using results from the first two games of the NHL final series to generate predictions for the next game.

Here are the slides from the talk:

And here is the Jupyter notebook I presented. If you want to follow along, you'll see that there is a slide that introduces each section of the notebook, and then you can read the details. If you have a Python installation with PyMC, you can download the notebook from the repository and try it out.

The talk starts with basic material that should be accessible for beginners, and ends with a hierarchical Bayesian model of a Poisson process, so it covers a lot of ground! I hope you find it useful.

For people who were there, thank you for coming (all the way from Australia!), and thanks for the questions, comments, and conversation. Thanks again to Jordi and Colin for organizing, to WeWork for hosting, and to QuantumBlack for sponsoring.

Some people hate custom libraries

2018-05-03T09:02:00.000-07:00

For most of my books, I provide a Python module that defines the functions and objects I use in the book. That makes some people angry.

The following Amazon review does a nice job of summarizing the objections, and it demonstrates the surprising passion this issue evokes:

1.0 out of 5 stars

Ruined by idiotic and unnecessary and MASSIVE complexity in stupidly designed custom code

March 29, 2018

Format: Paperback

Echoing another reviewer, the custom code requirement means you learn their custom code rather than, you know, the standard modules numpy and scipy. For example, at least four separate classes are required, representing hundreds of lines of code, are required just to execute the first six lines of code in the book. All those lines do is define two signals, a cosine and a sine, sums them, then plots them. This, infuriatingly, hides some basic steps. Here's how you can create a cosine wave with frequency 440Hz:

duration = 0.5
framerate = 11025
n = round(duration*framerate)
ts = np.arange(n)/framerate
amp = 1.0
freq = 440
offset = 0.0
cos_sig = amp * numpy.cos( 2*numpy.pi*ts*freq + offset)
freq = 880
sin_sig = amp * numpy.sin( 2*numpy.pi*ts*freq + offset)

Instead, these clowns have

cos_sig = thinkdsp.CosSignal(freq=440,amp=1.0,offset=0)
sin_sig = thinkdsp.SinSignal(freq=440,amp=1.0,offset=0)
mix = cos_sig + sin_sig

where CosSignal and SinSignal are custom classes, not functions, which inherits four separate classes, NONE of which are necessary, and all of which serve to make things more complex than necessary, on the pretense this makes things easier. The classes these class inherit are a generic Sinusoid and SumSignal classes, which inherits a Signal class, which depends on a Wave class, which performs plotting using pyplot in matplotlib. None of which make anything really any easier, but does serve to hide a lot of basic functionality, like hiding how to use numpy, matplotlib, and pyplot.

In short, just to get through the first two pages, you have to have access to github to import their ridiculous thinkdsp, thinkplot, and thinkstats, totalling around 5500 lines of code, or you are just screwed and can't use this book. All decent teaching books develops code you need as necessary and do NOT require half a dozen files with thousands of lines of custom code just to get to page 2. What kind of clown does this when trying to write a book to show how to do basic signal processing? Someone not interested in teaching you DSP, but trying to show off their subpar programming skills by adding unnecessary complexity (a sure sign of a basic programmer, not a good).

The authors openly admit their custom code is nothing more than wrappers in numpy and scipy, so the authors KNEW they were writing a crappy book and filling it with a LOT of unnecessary complexity. Bad code is bad code. Using bad code to teach makes bad teaching. It's obvious Allen B. Downey has spent his career in academia, where writing quality code doesn't matter.

Well, at least he spelled my name right.

Maybe I should explain why I think it's a good idea to provide a custom library along with a book like Think DSP. Importantly, the goal of the book is to help people learn the core ideas of signal processing; the software is a means to this end.

Here's what I said in the preface:

The premise of this book is that if you know how to program, you can use that skill to learn other things, and have fun doing it.

With a programming-based approach, I can present the most important ideas right away. By the end of the first chapter, you can analyze sound recordings and other signals, and generate new sounds. Each chapter introduces a new technique and an application you can apply to real signals. At each step you learn how to use a technique first, and then how it works.

For example, in the first chapter, I introduce two objects defined in thinkdsp.py: Wave and Spectrum. Wave provides a method called make_spectrum that creates a Spectrum object, and Spectrum provides make_wave, which creates a Wave.

When readers use these objects and methods, they are implicitly learning one of the fundamental ideas of signal processing: that a Wave and its Spectrum are equivalent representations of the same information -- given one, you can always compute the other.

This example demonstrates one reason I use custom libraries in my books: The API is the lesson. As you learn about these objects and how they interact, you are also learning the core ideas of the topic.

Another reason I think these libraries are a good idea is that they let me introduce ideas top-down: that is, I can show what a method does -- and why it is useful -- first; then I can present details when they are necessary or most useful.

For example, I introduce the Spectrum object in Chapter 1. I use it to apply a low pass filter, and the reader can hear what that sounds like. You can too, by running the Chapter 1 notebook on Binder.

In Chapter 2, I reveal that my make_spectrum function is a thin wrapper on two NumPy functions, and present the source code:

from numpy.fft import rfft, rfftfreq

# class Wave:
    def make_spectrum(self):
        n = len(self.ys)
        d = 1 / self.framerate

        hs = rfft(self.ys)
        fs = rfftfreq(n, d)

        return Spectrum(hs, fs, self.framerate)

At this point, anyone who prefers to use NumPy directly, rather than my wrappers, knows how.

In Chapter 7, I unwrap one more layer and show how the FFT algorithm works. Why Chapter 7? Because I introduce correlation in Chapter 5, which helps me explain the Discrete Cosine Transform in Chapter 6, which helps me explain the Discrete Fourier Transform.

Using custom libraries lets me organize the material in the way I think works best, based on my experience working with students and seeing how they learn.

This example demonstrates another benefit of defining my own objects: data encapsulation. When you use NumPy's rfft to compute a spectrum, you get an array of amplitudes, but not the frequencies they correspond to. You can call rfftfreq to get the frequencies, and that's fine, but now you have two arrays that represent one spectrum. Wouldn't it be nice to wrap them up in an object? That's what a Spectrum object is.
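To illustrate the encapsulation argument, here is a minimal sketch of an object that keeps the two arrays together. SimpleSpectrum and peak_frequency are names invented for this example; the real Spectrum class in thinkdsp.py does more and may be organized differently.

```python
from dataclasses import dataclass

import numpy as np
from numpy.fft import rfft, rfftfreq


@dataclass
class SimpleSpectrum:
    """Hypothetical stand-in for a spectrum: amplitudes plus their frequencies."""
    hs: np.ndarray  # complex amplitudes from rfft
    fs: np.ndarray  # frequency of each amplitude, in Hz

    def peak_frequency(self):
        """Return the frequency with the largest magnitude."""
        return self.fs[np.argmax(np.abs(self.hs))]


framerate = 11025                      # samples per second
ts = np.arange(framerate) / framerate  # one second of sample times
ys = np.cos(2 * np.pi * 440 * ts)      # 440 Hz cosine

spectrum = SimpleSpectrum(rfft(ys), rfftfreq(len(ys), 1 / framerate))
peak = spectrum.peak_frequency()
print(peak)
```

Because the amplitudes and frequencies travel together, a method like peak_frequency can use both without the caller keeping two arrays in sync.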

Finally, I think these examples demonstrate good software engineering practice, particularly bottom-up design. When you work with libraries like NumPy, it is common and generally considered a good idea to define functions and objects that encapsulate data, hide details, eliminate repeated code, and create new abstractions. Paul Graham wrote about this idea in one of his essays on software:

[...] you don't just write your program down toward the language, you also build the language up toward your program. [...] the boundary between language and program is drawn and redrawn, until eventually it comes to rest along [...] the natural frontiers of your problem. In the end your program will look as if the language had been designed for it.

That's why, in the example that makes my correspondent so angry, it takes just three lines to create and add the signals; and more importantly, those lines contain exactly the information relevant to the operations and no more. I think that's good quality code.

In summary, I provide custom libraries for my books because:

1) They demonstrate good software engineering practice, including bottom-up design and data encapsulation.

2) They let me present ideas top-down, showing how they are used before how they are implemented.

3) And as readers learn the APIs I defined, they are implicitly learning the key ideas.

I understand that not everyone agrees with this design decision, and maybe it doesn't work for everyone. But I am still surprised that it makes people so angry.

Computing at Olin Q&A

2018-04-18T11:15:00.000-07:00

I was recently interviewed by Sally Phelps, the Director of Postgraduate Planning at Olin. We talked about computer science at Olin, which is something we are often asked to explain to prospective students and their parents, employers, and other external audiences.

Afterward, I wrote the following approximation of our conversation, which I have edited to be much more coherent than what I actually said.

I should note: My answers to the following questions are my opinions. I believe that other Olin professors who teach software classes would say similar things, but I am sure we would not all say the same things.

Photo Credit: Sarah Deng

Q: What is the philosophy of Olin when it comes to training software engineers of the future?

To understand computer science at Olin, you have to understand that Olin really has one curriculum, and it's engineering.

We have degrees in Engineering, Mechanical Engineering, and Electrical and Computer Engineering. But everyone sees the same approach to engineering: it starts with people and it ends with people. That means you can't wait for someone to hand you a well-formulated problem from a textbook; you have to understand the people you are designing for, and the context of the problem. You have to know when an engineering solution can help and when it might not. And then when you have a solution, you have to be able to get it out of the lab and into the world.

Q: That sounds very different from a traditional computer science degree.

It is. Because we already have a lot of computer scientists who know how data structures work; we don't have as many who can identify opportunities, work on open-ended problems, work on teams with people from other disciplines, work on solutions that might involve electrical and mechanical systems as well as software.

And we don't have a lot of computer scientists who can communicate clearly about their work; to have impact, they have to be able to explain the value of what they are doing. Most computer science programs don't teach those things very well.

Also most CS programs don't do a great job of preparing students to work as software engineers. A lot of classes are too theoretical, too mathematical, and too focused on the computer itself, not the things you want to do with it, the applications.

At Olin, we've got some theory, some mathematical foundations, some focus on the design of software systems. But we've turned that dial down because the truth is that a lot of that material is not relevant to practice. I always get a fight when I say that, because you can never take anything out of the curriculum. There's always someone who says you have to know how to balance a red-black tree or you can't be a computer scientist; or you have to know about Belady's anomaly, or you have to know X, Y, and Z.

Well, you don't. For the vast majority of our students, for all the things they are going to do, a big chunk of the traditional curriculum is irrelevant. So we look at the traditional curriculum with some skepticism, and we make cuts.

We have to; there's only so much time. In four years, students take about 32 classes. We have to spend them wisely. We have to think about where they are going after graduation. Some will go to grad school, some will start companies, some will work in industry. Some of them will be software engineers, some will be product managers, some will work in other fields; they might develop software, or work with software developers.

Q: So how do you prepare people for all of that?

It depends what "prepare" means. If it means teach them everything they need to know, it's impossible. But you can identify the knowledge, skills and attitudes they are most likely to need.

It helps if you have faculty with industry experience. A lot of professors go straight to grad school and straight into academics, and then they have long arguments about what software engineers need to know. Sometimes they don't know what they are talking about.

If you're designing a curriculum, just like a good engineer, you have to understand the people you're designing for and the context of the solution. Who are your students, where are they going, and what are they going to need? Then you can decide what to teach.

Q: So if a student is interested in CS and they're deciding between Olin and another school, what do you tell them?

I usually tell them about the Olin curriculum, what I just explained. And I suggest they look at our graduation requirements. Students at Olin who do the Engineering major with a concentration in computing take a relatively small number of computer science classes, usually around seven. And they take a lot of other engineering classes.

In the first semester, everyone takes the same three engineering classes, so everyone does some mechanical design, some circuits and measurement, and some computational modeling.

Everyone takes a foundation class in humanities, and another in entrepreneurship. Everyone takes Principles of Engineering, where they design and build a mechatronic system.

In the fourth semester everyone takes user-centric design, and finally, in the senior year, everyone does a two-semester engineering capstone, which is usually interdisciplinary.

If a prospective student looks at those classes and they're excited about doing design and engineering -- and several kinds of engineering -- along with computer science, then Olin is probably a good choice for them.

If they look at the requirements and they dread them -- if the requirements are preventing them from doing what they really want -- then maybe Olin's not the right place.

Q: I understand there are student-taught software classes – can you tell us more about that?

We do, and a lot of them have been related to software, because that's an area where we have students doing internships, and sometimes starting companies, and they get a lot of industry experience.

And they come back with skills and knowledge they can share with their peers. Sometimes that happens in classes, especially on projects. But it can also be a student-led class where student instructors propose a class, and then they work with faculty advisors to develop and present the material. As an advisor, I can help with curriculum design and the pedagogy, and sometimes I have a good view of the context or the big picture. And then a lot of times the students have a better view of the details. They've spent the summer working in a particular domain, or with a particular technology, and they can help their peers get a jump start.

They also bring some of the skills and attitudes of software engineering. For example, we teach students about testing, and version control, and code quality. But in a class it can be artificial; a lot of times students want to get code working and they have to move on to the next thing. They don't want to hear from me about coding "style".

It can be more effective when it's coming from peers, and when it's based on industry experience. The student instructors might say they worked at Pivotal, and they had to do pair programming, or they worked at Google, and all of their code was reviewed for readability before they could check it in. Sometimes that's got more credibility than I do.

Q: What does the future look like for computing at Olin?

A big part of it is programming in context. For example, the first software class is Modeling and Simulation, which is about computational models in science, including physics, chemistry, medicine, ecology… So right from the beginning, we're not just learning to program, we're applying it to real world problems.

Programming isn't just a way of translating well-understood solutions into code. It's a way of communicating, teaching, learning, and thinking. Students with basic programming skills can use coding as a "pedagogic lever" to learn other topics in engineering, math, natural and social science, arts and humanities.

I think we are only starting to figure out what that looks like. We have some classes that use computation in these ways, but I think there are a lot more opportunities. A lot of ideas that we teach mathematically, we could be doing computationally, maybe in addition to, or maybe instead of the math.

One of my examples is signal processing, where probably the most important idea is the Fourier transform. If you do that mathematically, you have to start with complex numbers and work your way up. It takes a long time before you get to anything interesting.

With a computational approach, I can give you a program on the first day to compute the Fourier transform, and you can use it, and apply it to real problems, and see what it does, and run experiments and listen to the results, all on day one. And now that you know what it's good for, maybe you'll want to know how it works. So we can go top-down, start with applications, and then open the hood and look at the engine.

I'd like to see us apply this approach throughout the curriculum, especially engineering, math, and science, but also arts, humanities and social science.

Generational changes in support for gun laws

2018-03-22T08:06:00.000-07:00

This is the fourth article in a series about changes in support for gun control laws over the last 50 years.

In the first article I looked at data from the General Social Survey and found that young adults are less likely than previous generations to support gun control.

In the second article I looked at data from the CIRP Freshman Survey and found that even the youngest adults, who grew up with lockdown drills and graphic news coverage of school shootings, are LESS likely to support strict gun control laws.

In the third article, I ran graphical tests to distinguish age, period, and cohort effects. I found strong evidence for a period effect: support for gun control among all groups increased during the 1980s and 90s, and has been falling in all groups since 2000. I also saw some evidence of a cohort effect: people born in the 1980s and 90s are less likely to support strict gun control laws.

In this article, I dive deeper, using logistic regression to estimate the sizes of these effects separately, while controlling for demographic factors like sex, race, urban or rural residence, etc.

Variables

As in the previous articles, I am using data from the General Social Survey (GSS), and in particular the variable 'gunlaw', which records responses to the question:

Would you favor or oppose a law which would require a person to obtain a police permit before he or she could buy a gun?

I characterize respondents who answer "favor" as more likely to support strict gun control laws.

The explanatory factors I consider are:

'nineties', 'eighties', 'seventies', 'fifties', 'forties', 'thirties', 'twenties': These variables encode the respondent's decade of birth.

'female': indicates that the respondent is female.

'black': indicates that the respondent is black.

'otherrace': indicates that the respondent is neither white nor black (most people in this category are mixed race).

'hispanic': indicates that the respondent is Hispanic.

'conservative', 'liberal': indicate that the respondent reports being conservative or liberal (not moderate).

'lowrealinc', 'highrealinc': indicate that the respondent's household income is in the bottom or top 25%, based on self-report, converted to constant dollars.

'college': indicates whether the respondent attended any college.

'urban', 'rural': indicate whether the respondent lives in an urban or rural area (not suburban).

'gunhome': indicates whether the respondent reports that they "have in [their] home or garage any guns or revolvers".

'threatened': indicates whether the respondent reports that they have "ever been threatened with a gun, or shot at".

These factors are all binary. In addition, I estimate the period effect by including the following variables: 'yminus10', 'yminus20', 'yminus30', and 'yminus40', to indicate respondents surveyed 10, 20, 30, and 40 years prior to 2016.
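To make the encoding of the birth cohort concrete, here is a minimal sketch of how a birth year could be turned into the decade-of-birth indicators listed above. The helper `cohort_indicators` and the `COHORTS` table are hypothetical names, and the sketch assumes a respondent's decade is simply the birth year rounded down to a multiple of 10. There is no 'sixties' variable in the list, which is consistent with people born in the 1960s serving as the reference group.

```python
# Start year of each decade that has its own indicator variable.
# The 1960s are deliberately absent: they act as the reference group.
COHORTS = {
    "twenties": 1920,
    "thirties": 1930,
    "forties": 1940,
    "fifties": 1950,
    "seventies": 1970,
    "eighties": 1980,
    "nineties": 1990,
}


def cohort_indicators(birth_year):
    """Return a dict of 0/1 indicators, one per decade-of-birth variable."""
    decade = (birth_year // 10) * 10
    return {name: int(decade == start) for name, start in COHORTS.items()}


print(cohort_indicators(1995))  # only 'nineties' is set
print(cohort_indicators(1964))  # all zeros: the reference group
```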

Results

I used logistic regression to estimate the effect of each of these variables. The regression also includes a cubic model of time, intended to capture the period effect. You can see the period effect in the following figure, which shows actual changes in support for a gun permit law over the history of the GSS (in gray) and the retroactive predictions of the model (in red).

To present the results in an interpretable form, I define a collection of hypothetical people with different attributes and estimate the probability that each one would favor a gun permit law.

As a baseline, I start with a white, non-Hispanic male born in the 1960s who is politically moderate, in the middle 50% of the income range, who attended college, lives in a suburb, has never been threatened with a gun or shot at, and does not have a gun at home. People like that interviewed in 2016 have a 74% chance of favoring "a law which would require a person to obtain a police permit before he or she could buy a gun".

The following table shows results for people with different attributes: the first row, labeled 'baseline', is the baseline person from the previous paragraph; the second row, labeled 'nineties', is identical to the baseline in every way, except born in the 1990s rather than the 1960s. The row labeled 'female' is identical to the baseline, but female.

These results are generated by running 201 random samples from the GSS data and computing the median, 2.5th and 97.5th percentiles. The range from 'low2.5' to 'high97.5' forms a 95% confidence interval.

	low2.5	median	high97.5
baseline	71.5	73.6	75.1
nineties	60.1	63.8	68.5
eighties	64.9	67.7	69.8
seventies	67.8	70.1	72.0
fifties	70.7	72.8	74.8
forties	71.4	73.7	75.4
thirties	69.9	72.0	73.8
twenties	70.7	72.9	74.8
female	83.2	84.7	85.6
black	75.9	78.0	79.5
otherrace	78.2	80.4	82.9
hispanic	70.7	73.5	76.0
conservative	64.1	65.8	67.8
liberal	75.9	77.7	79.3
lowrealinc	69.1	71.1	72.9
highrealinc	73.2	75.2	76.7
college	73.2	75.1	76.3
urban	66.0	68.3	69.9
rural	59.5	62.0	64.0
threatened	68.8	71.1	73.1
gunhome	51.1	53.6	55.8
yminus10	84.1	85.3	86.2
yminus20	84.3	85.3	86.4
yminus30	80.7	81.5	82.8
yminus40	78.8	79.9	81.4
lowest combo	16.5	19.2	22.1
highest combo	91.3	92.3	93.4

Again, the hypothetical baseline person has a 74% chance of favoring a gun permit law. A nearly identical person born in the 1990s has only a 64% chance.

To see this and the other effects more clearly, I computed the difference between each hypothetical person and the baseline, then sorted by the magnitude of the apparent effect.

	low2.5	median	high97.5
lowest combo	-57.7	-54.5	-49.5
gunhome	-21.1	-20.0	-18.6
rural	-13.4	-11.5	-9.9
nineties	-13.5	-9.8	-4.5
conservative	-8.9	-7.6	-6.5
eighties	-7.9	-5.9	-2.7
urban	-6.6	-5.4	-4.1
seventies	-5.6	-3.2	-1.4
lowrealinc	-3.7	-2.5	-1.2
threatened	-3.6	-2.4	-1.1
thirties	-3.2	-1.5	-0.1
fifties	-2.2	-0.8	0.8
twenties	-2.8	-0.6	1.0
forties	-1.7	-0.0	1.5
baseline	0.0	0.0	0.0
hispanic	-1.5	0.2	2.0
college	0.6	1.6	2.6
highrealinc	0.7	1.7	2.6
liberal	2.9	4.2	5.2
black	3.1	4.5	5.8
yminus40	4.9	6.6	8.0
otherrace	4.8	6.9	9.6
yminus30	6.6	8.1	9.8
female	10.1	11.1	12.4
yminus20	10.2	11.7	14.0
yminus10	10.1	11.7	13.7
highest combo	16.9	18.8	21.1

All else being equal, someone who owns a gun is about 20 percentage points less likely to favor gun permit laws.

Compared to people born in the 1960s, people born in the 1990s are 10 points less likely. People born in the 1980s and 1970s are also less likely, by 6 and 3 points. People born in previous generations are not substantially different from people born in the 1960s (and the effect is not statistically significant).

Compared to suburbanites, people in rural and urban communities are less likely, by 12 and 5 points.

People in the lowest 25% of household income are less likely by 2.5 points; people in the highest 25% are more likely by 2 percentage points.

Blacks and other non-whites are more likely to favor gun permit laws, by 4.5 and 7 percentage points.

Compared to political moderates, conservatives are 8 points less likely and liberals are 4 points more likely to favor gun permit laws.

Compared to men, women are 11 points more likely to favor gun permit laws.

Controlling for all of these factors, the period effect persists: people with the same attributes surveyed 10, 20, 30, and 40 years ago would have been about 12, 12, 8, and 7 points more likely to favor gun permit laws.

In these results, Hispanics are not significantly different from non-Hispanic whites. But because of the way the GSS asked about Hispanic background, this variable is missing a lot of data; these results might not mean much.

Surprisingly, people who report that they have been "threatened with a gun or shot at" are 2 percentage points LESS likely to favor gun permit laws. This effect is small but statistically significant, and it is consistent in many versions of the model. A possible explanation is that this variable captures information about the respondent's relationship with guns that is not captured by other variables. For example, if a respondent does not have a gun at home, but spends time around people who do, they might be more likely to have been threatened and also more likely to share cultural values with gun owners. Alternatively, since this question was only asked until 1996, it's possible that it is capturing a period effect, at least in part.

"Lowest combo" represents a hypothetical person with all attributes associated with lower support for gun laws: a white conservative male born in the 1990s, living in a rural area, with household income in the lowest 25%, who has not attended college, who owns a gun, and has been threatened with a gun or shot at. Such a person has a 19% chance of favoring a gun permit law, 54 points lower than the baseline.

"Highest combo" represents a hypothetical person with all attributes associated with higher support for gun laws: a mixed race liberal woman born in the 1960s or before, living in a suburb, with household income in the highest 25%, who has attended college, does not own a gun, and has not been threatened with one. Such a person has a 92% chance of favoring a gun permit law, 19 points higher than the baseline.

[You might be surprised that these results are asymmetric: that is, that the lowest combo is farther from the baseline than the highest combo. The reason is that the "distance" between probabilities is not linear. For more about that, see my previous article on the challenges of interpreting probabilistic predictions.]
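The asymmetry can be illustrated with a short calculation. Logistic regression works in log odds, so a fixed shift in log odds translates into a different number of percentage points depending on where you start. The sketch below starts from the 74% baseline and applies an arbitrary shift of 1.5 in each direction; the shift size is a hypothetical value chosen for illustration, not a parameter from the fitted model.

```python
import math


def logit(p):
    """Convert a probability to log odds."""
    return math.log(p / (1 - p))


def expit(x):
    """Convert log odds back to a probability."""
    return 1 / (1 + math.exp(-x))


baseline = 0.74  # baseline probability reported in the article
shift = 1.5      # hypothetical shift in log odds, same size both ways

up = expit(logit(baseline) + shift) - baseline
down = expit(logit(baseline) - shift) - baseline

print(f"+{shift} log odds: {100 * up:+.1f} percentage points")
print(f"-{shift} log odds: {100 * down:+.1f} percentage points")
```

Starting from 74%, the upward shift gains roughly 19 points while the downward shift loses roughly 35, because probabilities are squeezed as they approach 100%.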

Methodology

The entire analysis is in this Jupyter notebook. The steps are:

1) Load the subset of GSS data I selected, which you can download here.

2) For each year of the survey, use weighted bootstrap to select a random sample that accounts for the stratified sampling in the GSS design.

3) Fill missing values in each column by drawing random samples from the valid responses.

4) Convert some numerical and categorical variables to boolean; for example 'conservative' and 'liberal' are based on the categorical variable 'polviews'; and 'lowrealinc' and 'highrealinc' are based on the numerical variable 'realinc'.

5) Use logistic regression to estimate model parameters, which are in terms of log odds.

6) Use the model to make predictions for each of the hypothetical people in the tables, in terms of probabilities.

7) Compute the predicted difference between each hypothetical person and the baseline.

By repeating steps (2) through (7) about 200 times, we get a distribution of estimates that accounts for uncertainty due to random sampling and missing values. From these distributions, we can select the median and a 95% confidence interval, as reported in the tables above.
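The resampling loop can be sketched in a few lines. This is not the code from the notebook: it is a simplified illustration that uses synthetic data, resamples all rows at once rather than year by year, and replaces the logistic regression in steps (5) through (7) with a simple percentage in favor. The field names `favor` and `weight` and the helper names are hypothetical.

```python
import random


def resample_and_estimate(rows, rng):
    """One iteration: weighted bootstrap, fill missing values, estimate."""
    # Weighted bootstrap: draw rows with replacement, in proportion
    # to their sampling weights.
    weights = [row["weight"] for row in rows]
    sample = rng.choices(rows, weights=weights, k=len(rows))

    # Fill missing responses by drawing at random from the valid ones.
    valid = [row["favor"] for row in sample if row["favor"] is not None]
    filled = [
        row["favor"] if row["favor"] is not None else rng.choice(valid)
        for row in sample
    ]

    # Stand-in for the modeling step: percentage who favor the law.
    return 100 * sum(filled) / len(filled)


def percentile(sorted_values, q):
    """Pick the value at quantile q from an already sorted list."""
    index = round(q * (len(sorted_values) - 1))
    return sorted_values[index]


# Synthetic respondents: about 75% of valid answers are "favor",
# and about one in five answers is missing.
rng = random.Random(17)
rows = [
    {"favor": rng.choice([1, 1, 1, 0, None]), "weight": rng.uniform(0.5, 2.0)}
    for _ in range(500)
]

estimates = sorted(resample_and_estimate(rows, rng) for _ in range(201))
low, median, high = (percentile(estimates, q) for q in (0.025, 0.5, 0.975))
print(f"low2.5={low:.1f}  median={median:.1f}  high97.5={high:.1f}")
```

With 201 iterations, the 2.5th percentile, median, and 97.5th percentile fall exactly on the 6th, 101st, and 196th sorted values, so no interpolation is needed.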

Support for gun control is decreasing in all age groups

2018-03-01T08:37:00.002-08:00

This is the third article in a series about changes in support for gun control over the last 50 years.

In the first article I looked at data from the General Social Survey and found that young adults are less likely than previous generations to support gun control.

In the second article I looked at data from the CIRP Freshman Survey and found that even the youngest adults, who grew up with lockdown drills and graphic news coverage of school shootings, are still LESS likely to support gun control.

Untangling age, period, and cohort effects

In this article, I do some age-period-cohort analysis to see if the changes over the last 50 years are due to age, period, or cohort effects:

Age effect: People's views might change over the course of their lives. For example, they might be more likely to support gun rights when they are young, and more likely to support gun control when they have children. (This turns out not to be true.)

Period effect: People's views might change due to an external factor that affects all age groups and cohorts over the same time period. For example, if gun crime rates increase, we might expect support for gun control to increase. (There is some evidence for this.)

Cohort effect: People's views might be different from one generation to the next, due to differences in the environment. For example, current teenagers might support gun control because of their experiences with school shootings. (This turns out not to be true.)

As the composition of the population changes over time, it can be hard to untangle these effects, but the design of the General Social Survey (GSS) makes it possible. From 1972 to 2016, the GSS asked respondents

Would you favor or oppose a law which would require a person to obtain a police permit before he or she could buy a gun?

The following figure shows the fraction of respondents who would favor this law:

In the 1970s and 80s, support for this policy was near 75%. It increased during the 1990s, peaking near 85% around 2000, and has been declining ever since. In the most recent survey year, it is at 71%.

Testing for age effects

To test for age effects, we can group respondents into cohorts by decade of birth and plot support for gun control as a function of age.

If there is an age effect we would expect all cohorts to follow a similar trajectory as they age. For example, if people are more likely to support gun control during their child-bearing years, we would expect these lines to generally increase from left to right.

Here are the results:

There are no obvious patterns here, which suggests that there is no age effect.
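The grouping behind this kind of test can be sketched as follows. The function `support_by_cohort`, the record fields, and the four example respondents are hypothetical; the sketch assumes each record has a survey year, an age, and a 0/1 response, and it infers birth year as survey year minus age. Passing "age" as the grouping key gives the age test; passing "year" gives the same breakdown over time.

```python
from collections import defaultdict


def support_by_cohort(respondents, x_key, bin_width=5):
    """Fraction in favor, keyed by (birth decade, binned value of x_key)."""
    counts = defaultdict(lambda: [0, 0])  # [number in favor, total]
    for r in respondents:
        cohort = (r["year"] - r["age"]) // 10 * 10
        x_bin = r[x_key] // bin_width * bin_width
        counts[cohort, x_bin][0] += r["favor"]
        counts[cohort, x_bin][1] += 1
    return {key: favor / total for key, (favor, total) in sorted(counts.items())}


# Tiny made-up sample, only to show the shape of the output.
respondents = [
    {"year": 1990, "age": 25, "favor": 1},
    {"year": 1990, "age": 27, "favor": 0},
    {"year": 2010, "age": 45, "favor": 1},
    {"year": 2010, "age": 22, "favor": 0},
]

by_age = support_by_cohort(respondents, "age")    # one line per cohort vs. age
by_year = support_by_cohort(respondents, "year")  # one line per cohort vs. time
print(by_age)
print(by_year)
```

Each cohort then becomes one line in the plot, with the second element of the key on the horizontal axis.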

Testing for period effects

To test for period effects, we group by decade of birth again, and plot the results over time. If there is a period effect, we expect all cohorts to follow a similar trajectory.

Here are the results:

This figure shows clear evidence for a period effect: all cohorts follow a similar trajectory over the same period. (Don't be distracted by the extreme first points in the green and purple curves; they are based on a small number of respondents.)

Looking at the last few points in each cohort, it looks like people born in the 1980s and 90s are less likely to support gun control than previous generations, but this figure does not show strong evidence for a cohort effect.

In summary, there is strong evidence for a period effect: support for gun control increased among all groups during the 1980s and 90s, and has been falling in all groups since 2000.

Violent crime rates

A possible explanation is that these trends are driven by changes in violent crime, especially gun violence, which increased during the 1980s, peaked in 1993, and has been falling ever since, according to this study from the Pew Research Center.

To investigate this more carefully, I would like to see a graph of people's perception of violent crime rates, which does not always track reality.

Breakdown by political views

In general, liberals are more likely to support gun control than conservatives; we might expect a period effect to have different impact on different groups. The following figure shows support for gun control over time, grouped by self-reported political identity:

Whatever external forces caused the increase and subsequent decrease in support for gun control, they affected all groups over the same period. The most recent decrease seems to be bigger among conservatives, so the gap may be growing.

Breakdown by race

Nonwhites are more likely to support gun control than whites by about 8 percentage points. The following figure shows how this difference has changed over time:

Both groups were affected similarly over the same period. Among nonwhites, support for gun control might have increased sooner, in the 1980s, and might be falling more slowly now.

Post-Columbine students do not support gun control

2018-02-28T07:20:00.003-08:00

In their coverage of the Parkland school shooting, The Economist writes:

Though polling suggests that young people are only slightly more in favour of gun-control measures than their elders, those surveys focus on those aged 18 and above. There may be a pre- and post-Columbine divide within that group.

Based on my analysis of data from the General Social Survey (GSS) and the CIRP Freshman Survey, I think the first sentence is false and the second is unlikely: young people are substantially less in favor of gun-control measures than their elders.

Here's the figure, from my previous article, showing these trends:

The blue line shows the fraction of respondents in the GSS who would favor "a law which would require a person to obtain a police permit before he or she could buy a gun".

Among people born before 1980, support for this form of gun control is strong: around 75% for people born between 1910 and 1940, and approaching 80% for people born between 1950 and 1980.

But among people born in the 1980s and 90s, support for gun control is below 70%.

The orange line shows the fraction of respondents to the CIRP Freshman Survey who "Agree" or "Strongly agree" that

The federal government should do more to control the sale of handguns.

This dataset does not go back as far, but shows the same pattern: a large majority of people born before 1980 supported gun control (when they were surveyed as college freshmen); among people born after 1980, far fewer support gun control.

However, these results are based on people who are currently young adults. Maybe, as The Economist speculates:

The pupils, in their late teens, started their education after a massacre at Columbine High School in Colorado in 1999, in which 13 were killed. That means they have been practising active-shooter drills in the classroom since kindergarten. Seeing a school shooting as an event to prepare for, rather than an awful aberration, seems to have fuelled the students’ anger.

They may be angry, but at least so far, their anger has not led them to support gun control. Data from the Freshman Survey makes this clear. The following figure shows, for survey respondents from 1989 to 2013, the fraction that agree or strongly agree that:

The federal government should do more to control the sale of handguns.

And for respondents in 2016, the fraction that agree or strongly agree that:

The federal government should have stricter gun control laws.

The change in wording makes it hard to compare the last data point with the previous trend, but it is clear at least that college freshmen in 2013 were substantially less likely than previous generations to support gun control: at 64%, they were 20 percentage points down from the peak, at 84%.

A large majority of the 2013 respondents were born in 1995. They were 3 when Columbine was in the news, 10 during the Red Lake shootings, 11 during the West Nickel Mines school shooting, 12 during the Virginia Tech shooting, and 13 during the Northern Illinois University shooting.

They were 17 during the Chardon High School shooting, the Oikos University shooting, and the Sandy Hook Elementary School shooting.

And when they were surveyed in 2013, less than a year after Sandy Hook, more than 33% of them did not agree that the federal government should do more to control the sale of handguns, more than in any previous year of the survey.

Seeing these horrific events in the news, during their entire conscious lives, with increasingly dramatic and graphic coverage, might have made these students angry, but it did not make more of them support gun control.

Practicing active-shooter drills since kindergarten might have made these students angry, but it did not make more of them support gun control.

Maybe, as The Economist suggests, these students see a school shooting as "an event to prepare for, rather than an awful aberration". But that does not make them more likely to support gun control.

Support for gun control is lower among young adults

2018-02-27T11:16:00.001-08:00

In current discussions of gun policies, many advocates of gun control talk as if time is on their side; that is, they assume that young people are more likely than old people to support gun control.

This letter to the editor of The Economist summarizes the argument:

It is unlikely that a generation raised on lockdown drills, with access to phone footage of gun rampages and a waning interest in hunting, will grow up parroting the National Rifle Association’s rhetoric as enthusiastically as today's political leaders. Change is coming.

And in a recent television interview, a survivor of the Parkland school shooting told opponents of gun control:

You might as well stop now, because we are going to outlive you.

But these assumptions turn out to be false. In fact, young adults are substantially less likely to support gun control than previous generations.

The following figure shows results I generated from the General Social Survey (GSS) and the CIRP Freshman Survey, plotting support for gun control by year of birth.

The blue line shows the fraction of respondents in the GSS who answered "Favor" to the following question:

Would you favor or oppose a law which would require a person to obtain a police permit before he or she could buy a gun?

Among people born before 1980, support for this form of gun control is strong: around 75% for people born between 1910 and 1940, and approaching 80% for people born between 1950 and 1980.

But among people born in the 1980s and 90s, support for gun control is below 70%.

The orange line shows the fraction of respondents to the CIRP Freshman Survey who "Agree" or "Strongly agree" that

The federal government should do more to control the sale of handguns.

The code I used to generate this figure is in this Jupyter notebook.

Other studies

I am not the only one to notice these patterns. This Vox article from last week reports on similar results from a 2015 Pew Survey and a 2015 Gallup Poll.

The Pew survey found that young adults are less likely than other age groups to support a ban on assault weapons (although they are also more likely to support a federal database of gun sales, and not substantially different from other age groups on some other policy proposals):

This page from the Pew Research Center shows responses to the question

What do you think is more important – to protect the right of Americans to own guns, OR to control gun ownership?

Here are the results:

Before 2007, young adults were the least likely group to choose gun rights over gun control (see the orange line). Since then, successive cohorts of young adults have shifted substantially away from gun control.

This Gallup poll shows that current young adults are more likely than previous generations to believe that more concealed weapons would make the U.S. safer:

Each of these sources is based on different questions asked of different groups, but they show remarkably consistent results.

The GSS is based on a representative sample of the adult U.S. population. It includes people of different ages, so it provides insight into the effect of birth year and age. The Freshman Survey includes only first-year college students, so it is not representative of the general population. But because all respondents are observed at the same age, it gives the clearest picture of generational changes.

The NRA regime

A possible explanation for these changes is that since the NRA created its lobbying branch in 1975 and its political action committee in 1976, it has succeeded in making gun rights (and opposition to gun control) part of the conservative identity.

We should expect their efforts to have the biggest effect on the generation raised in the 1980s and 90s, and we should expect them to have a stronger effect on conservatives than liberals.

The following figure shows the same data from the GSS, grouped by political self-identification:

As expected, support for gun control has dropped most among people who identify as conservative.

Among moderates, it might have dropped, but not by as much. The last data point, for people born around 1995, might be back up, but it is based on a small sample, and may not be reliable.

Support among liberals has been mostly unchanged, except for the last point in the series which, again, may not be reliable, as indicated by the wide error bars.

These results suggest that the decrease in support for gun control has been driven primarily by changing views among young conservatives.

UPDATE: NPR has a related story from a few days ago. They report that "Millennials are no more liberal on gun control than their parents or grandparents — despite diverging from their elders on the legalization of marijuana, same-sex marriage and other social issues."

The six stages of computational science

2018-02-23T05:27:00.002-08:00

This is the second in a series of articles related to computational science and education. The first article is here.

The Six Stages of Computational Science

When I was in grad school, I collaborated with a research group working on computational fluid dynamics. They had accumulated a large, complex code base, and it was starting to show signs of strain. Parts of the system, written by students who had graduated, had become black magic: no one knew how they worked, and everyone was afraid to touch them. When new students joined the group, it took longer and longer for them to get oriented. And everyone was spending more time debugging than developing new features or generating results.

When I inspected the code, I found what you might expect: low readability, missing documentation, large functions with complex interfaces, poor organization, minimal error checking, and no automated tests. In the absence of version control, they had many versions of every file, scattered across several machines.

I'm not sure if anyone could have helped them, but I am sure I didn't. To be honest, my own coding practices were not much better than theirs, at the time.

The problem, as I see it now, is that we were caught in a transitional form of evolution: the nature of scientific computing was changing quickly; professional practice, and the skills of the practitioners, weren't keeping up.

To explain what I mean, I propose a series of stages describing practices for scientific computing.

Stage 1, Calculating: Mostly plugging numbers into formulas, using a computer as a glorified calculator.

Stage 2, Scripting: Short programs using built-in functions, mostly straight-line code, few user-defined functions.

Stage 3, Hacking: Longer programs with poor code quality, usually lacking documentation.

Stage 4, Coding: Good quality code which is readable, demonstrably correct, and well documented.

Stage 5, Architecting: Code organized in functions, classes (maybe), and libraries with well-designed APIs.

Stage 6, Engineering: Code under version control, with automated tests, build automation, and configuration management.

These stages are, very roughly, historical. In the earliest days of computational science, most projects were at Stages 1 and 2. In the last 10 years, more projects are moving into Stages 4, 5, and 6. But that project I worked on in grad school was stuck at Stage 3.

The Valley of Unreliable Science

These stages trace a U-shaped curve of reliability:

By "reliable", I mean science that provides valid explanations, correct predictions, and designs that work.

At Stage 1, Calculating, the primary scientific result is usually analytic. The correctness of the result is demonstrated in the form of a proof, using math notation along with natural and technical language. Reviewers and future researchers are expected to review the proof, but no one checks the calculation. Fundamentally, Stage 1 is no different from pre-computational, analysis-based science; we should expect it to be as reliable as our ability to read and check proofs, and to press the right buttons on a calculator.

At Stage 2, Scripting, the primary result is still analytic, the supporting scripts are simple enough to be demonstrably correct, and the libraries they use are presumed to be correct.

But Stage 2 scripts are not always made available for review, making it hard to check their correctness or reproduce their results. Nevertheless, Stage 2 was considered acceptable practice for a long time; and in some fields, it still is.

Stage 3, Hacking, has the same hazards as Stage 2, but at a level that's no longer acceptable. Small, simple scripts tend to grow into large, complex programs. Often, they contain implementation details that are not documented anywhere, and there is no practical way to check their correctness.

Stage 3 is not reliable because it is not reproducible. Leek and Peng define reproducibility as "the ability to recompute data analytic results given an observed dataset and knowledge of the data analysis pipeline."

Reproducibility does not guarantee reliability, as Leek and Peng acknowledge in the title of their article, "Reproducible research can still be wrong". But without reproducibility as a requirement of published research, there is no way to be confident of its reliability.

Climbing out of the valley

Stages 4, 5, and 6 are the antidote to Stage 3. They describe what's needed to make computational science reproducible, and therefore more likely to be reliable.

At a minimum, reviewers of a publication and future researchers should be able to:

1) Download all data and software used to generate the results.

2) Run tests and review source code to verify correctness.

3) Run a build process to execute the computation.

To achieve these goals, we need the tools of software engineering:

1) Version control makes it possible to maintain an archived version of the code used to produce a particular result. Examples include Git and Subversion.

2) During development, automated tests make programs more likely to be correct; they also tend to improve code quality. During review, they provide evidence of correctness, and for future researchers they provide what is often the most useful form of documentation. Examples include unittest and nose for Python and JUnit for Java.

3) Automated build systems document the high-level structure of a computation: which programs process which data, what outputs they produce, etc. Examples include Make and Ant.

4) Configuration management tools document the details of the computational environment where the result was produced, including the programming languages, libraries, and system-level software the results depend on. Examples include package managers like Conda that document a set of packages, containers like Docker that also document system software, and virtual machines that actually contain the entire environment needed to run a computation.

These are the ropes and grappling hooks we need to climb out of the Valley of Unreliable Science.
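To give a sense of how small the first step can be, here is a minimal sketch of an automated test using Python's built-in unittest module. The function `mean` and the test class `TestMean` are hypothetical examples, not taken from any particular project. In a real project the tests would usually live in their own file and be run with `python -m unittest`; here the suite is run explicitly so the example is self-contained.

```python
import unittest


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class TestMean(unittest.TestCase):
    def test_typical_values(self):
        # A known answer documents what the function is supposed to do.
        self.assertAlmostEqual(mean([1.0, 2.0, 3.0]), 2.0)

    def test_empty_input_is_rejected(self):
        # Error handling is part of the contract, so it gets a test too.
        with self.assertRaises(ValueError):
            mean([])


# Load and run the tests, then report whether they all passed.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMean)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("all tests passed:", result.wasSuccessful())
```

The two test methods double as documentation: a future reader can see both a typical use and the behavior on bad input without reading the implementation.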

Unfortunately, most people working in computational science did not learn these tools in school, and they are not easy to learn. For example, Git, which has emerged as the dominant version control system, is notoriously hard to use. Even with GitHub and graphical clients, it's still hard. We have a lot of work to do to make these tools better.

Nevertheless, it is possible to learn basic use of these tools with a reasonable investment of time. Software Carpentry offers a three-hour workshop on Git and a 4.5-hour workshop on automated build systems. You could do both in a day (although I'm not sure I'd recommend it).

Implications for practitioners

There are two ways to avoid getting stuck in the Valley of Unreliable Science:

1) Navigate Through It: One common strategy is to start with simple scripts; if they grow and get too complex, you can improve code quality as needed, add tests and documentation, and put the code under version control when it is ready to be released.

2) Jump Over It: The alternative strategy is to maintain good quality code, write documentation and tests along with the code (or before), and keep all code under version control.

Naively, it seems like Navigating is better for agility: when you start a new project, you can avoid the costs of over-engineering and test ideas quickly. If they fail, they fail fast; and if they succeed, you can add elements of Stages 4, 5, and 6 on demand.

Based on that thinking, I used to be a Navigator, but now I am a Jumper. Here's what changed my mind:

1) The dangers of over-engineering during the early stages of a project are overstated. If you are in the habit of creating a new repository for each project (or creating a directory in an existing repository), and you start with a template project that includes a testing framework, the initial investment is pretty minimal. It's like starting every program with a copy of "Hello, World".

2) The dangers of engineering too late are much greater: if you don't have tests, it's hard to refactor code; if you can't refactor, it's hard to maintain code quality; when code quality degrades, debugging time goes up; and if you don't have version control, you can't revert to a previous working (?) version.

3) Writing documentation saves time you would otherwise spend trying to understand code.

4) Writing tests saves time you would otherwise spend debugging.

5) Writing documentation and tests as you go along also improves software architecture, which makes code more reusable, and that saves time you (and other researchers) would otherwise spend reimplementing the wheel.

6) Version control makes collaboration more efficient. It provides a record of who changed what and when, which facilitates code and data integrity. It provides mechanisms for developing new code without breaking the old. And it provides a better form of file backup, organized in coherent changes, rather than by date.

Maybe surprisingly, using software engineering tools early in a project doesn't hurt agility; it actually facilitates it.
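
To make point 1 concrete, here is a minimal sketch of what the "Hello, World" of a tested project might look like. The function `mean` and the class `TestMean` are hypothetical placeholders, not code from any of my projects; the only real API used is Python's standard `unittest` module.

```python
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class TestMean(unittest.TestCase):
    """Placeholder tests; replace them as real functions appear."""

    def test_typical(self):
        self.assertAlmostEqual(mean([1, 2, 3, 4]), 2.5)

    def test_empty(self):
        # An empty input should fail loudly, not return a silent NaN.
        with self.assertRaises(ValueError):
            mean([])
```

If this lives in a file in the project directory, running `python -m unittest` from that directory should discover and run the tests, assuming the file name matches the default `test*.py` pattern.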

Implications for education

For computational scientists, I think it's better to jump over the Valley of Unreliable Science than try to navigate through it. So what does that imply for education? Should we teach the tools and practices of software engineering right from the beginning? Or do students have to spend time navigating the Valley before they learn to jump over it?

I'll address these questions in the next article.

Learning to program is getting harder

2018-02-16T09:15:00.000-08:00

I have written several books that use Python to explain topics like Bayesian Statistics and Digital Signal Processing. Along with the books, I provide code that readers can download from GitHub. In order to work with this code, readers have to know some Python, but that's not enough. They also need a computer with Python and its supporting libraries, they have to know how to download code from GitHub, and then they have to know how to run the code they downloaded.

And that's where a lot of readers get into trouble.

Some of them send me email. They often express frustration, because they are trying to learn Python, or Bayesian Statistics, or Digital Signal Processing. They are not interested in installing software, cloning repositories, or setting the Python search path!

I am very sympathetic to these reactions. And in one sense, their frustration is completely justified: it should not be as hard as it is to download a program and run it.

But sometimes their frustration is misdirected. Sometimes they blame Python, and sometimes they blame me. And that's not entirely fair.

Let me explain what I think the problems are, and then I'll suggest some solutions (or maybe just workarounds).

The fundamental problem is that the barrier between using a computer and programming a computer is getting higher.

When I got a Commodore 64 (in 1982, I think) this barrier was non-existent. When you turned on the computer, it loaded and ran a software development environment (SDE). In order to do anything, you had to type at least one line of code, even if all it did was run another program (like Archon).

Since then, three changes have made it incrementally harder for users to become programmers:

1) Computer retailers stopped installing development environments by default. As a result, anyone learning to program has to start by installing an SDE -- and that's a bigger barrier than you might expect. Many users have never installed anything, don't know how to, or might not be allowed to. Installing software is easier now than it used to be, but it is still error prone and can be frustrating. If someone just wants to learn to program, they shouldn't have to learn system administration first.

2) User interfaces shifted from command-line interfaces (CLIs) to graphical user interfaces (GUIs). GUIs are generally easier to use, but they hide information from users about what's really happening. When users really don't need to know, hiding information can be a good thing. The problem is that GUIs hide a lot of information programmers need to know. So when a user decides to become a programmer, they are suddenly confronted with all the information that's been hidden from them. If someone just wants to learn to program, they shouldn't have to learn operating system concepts first.

3) Cloud computing has taken information hiding to a whole new level. People using web applications often have only a vague idea of where their data is stored and what applications they can use to access it. Many users, especially on mobile devices, don't distinguish between operating systems, applications, web browsers, and web applications. When they upload and download data, they are often confused about where it is coming from and where it is going. When they install something, they are often confused about what is being installed where.

For someone who grew up with a Commodore 64, learning to program was hard enough. For someone growing up with a cloud-connected mobile device, it is much harder.

Well, what can we do about that? Here are a few options (which I have given clever names):

1) Back to the future: One option is to create computers, like my Commodore 64, that break down the barrier between using and programming a computer. Part of the motivation for the Raspberry Pi, according to Eben Upton, is to re-create the kind of environment that turns users into programmers.

2) Face the pain: Another option is to teach students how to set up and use a software development environment before they start programming (or at the same time).

3) Delay the pain: A third option is to use cloud resources to let students start programming right away, and postpone creating their own environments.

In one of my classes, we face the pain; students learn to use the UNIX command line interface at the same time they are learning C. But the students in that class already know how to program, and they have live instructors to help out.

For beginners, and especially for people working on their own, I recommend delaying the pain. Here are some of the tools I have used:

1) Interactive tutorials that run code in a browser, like this adaptation of How To Think Like a Computer Scientist;

2) Entire development environments that run in a browser, like PythonAnywhere;

3) Virtual machines that contain complete development environments, which users can download and run (provided that they have, or can install, the software that runs the virtual machine); and

4) Services like Binder that run development environments on remote servers, allowing users to connect using browsers.

On various projects of mine, I have used all of these tools. In addition to the interactive version of "How To Think...", there is also this interactive version of Think Java, adapted and hosted by Trinket.

In Think Python, I encourage readers to use PythonAnywhere for at least the first four chapters, and then I provide instructions for making the transition to a local installation.

I have used virtual machines for some of my classes in the past, but recently I have used more online services, like this notebook from Think DSP, hosted by O'Reilly Media. And the repositories for all of my books are set up to run under Binder.

These options help people get started, but they have limitations. Sooner or later, students will want or need to install a development environment on their own computers. But if we separate learning to program from learning to install software, their chances of success are higher.

UPDATE: Nick Coghlan suggests a fourth option, which I might call Embrace the Future: Maybe beginners can start with cloud-based development environments, and stay there.

UPDATE: Thank you for all the great comments! My general policy is that I will publish a comment if it is on topic, coherent, and civil. I might not publish a comment if it seems too much like an ad for a product or service. If you submitted a comment and I did not publish it, please consider submitting a revision. I really appreciate the wide range of opinion in the comments so far.

Build your own SOTU

2018-02-08T13:15:00.005-08:00

In the New York Times on Tuesday, John McWhorter argues that Donald Trump's characteristic speech patterns are not, as some have suggested, evidence of mental decline. Rather, the quality of Trump's public speech has declined because, according to McWhorter:

1) "The younger Mr. Trump [...] had a businessman’s normal inclination to present himself in as polished a manner as possible in public settings", and

2) The older Trump has "settled into his normal" because as president, he "has no impetus to speak in a way unnatural to him in public".

It's an interesting article, and I encourage you to read it before I start getting silly about it.

I would like to suggest an alternative interpretation, which is that the older Trump's speech sounds as it does because it is being generated by a Markov chain.

A Markov chain is a random process that generates a sequence of tokens; in this case, the tokens are words. I explain the details below, but first I want to show some results. Compare these two paragraphs:

"You know, if you're a conservative Republican, if I were a liberal, if, like, okay, if I ran as a liberal Democrat, they would say I'm one of the smartest people anywhere in the world – it’s true! – but when you're a conservative Republican they try – oh, they do a number"

"I mean—think of this—I hate to say it but it’s the same wall that we’re always talking about. It’s—you know, wherever we need, we don’t make a good chance to make a deal on DACA, I really have gotten to like. And I know it’s a hoax."

One of those paragraphs was generated by a Markov chain I trained with the unedited transcript from this recent interview with the Wall Street Journal. The other was generated by Donald Trump. Can you tell which is which?

Ok, let's make it a little harder. Here are ten examples: some are from Trump, some are from Markov. See if you can tell which are which.

1) I would have said it’s all in the messenger; fellas, and it is fellas because, you know, they don’t, they haven’t figured that the women are smarter right now than the men, so, you know, it’s gonna take them about another 150 years — but the Persians are great negotiators, the Iranians are great negotiators, so, and they, they just killed, they just killed us.

2) And we have sort of interesting, but when people make misstatements somebody has some, you know, I went through some that weren’t so elegant. But all I’m asking is one thing, you know Obama felt—President Obama felt it was his biggest problem is going to be Dreamers also. But there’s a big difference—first of all, there’s a big problem, and they were only going to be solved.

3) One of the promises that you know is being very seriously negotiated right now is the wall and the wall will happen. And if you look—point, after point, after point—now we’ve had some turns. You always have to have flexibility.

4) Yeah, Rex and I think we’ll have something on that. We’ll find out. But people do leave. You guys may leave but I don’t know of one politician in Washington—if you’re a politician and somebody called up that they have phony sources, when the sources don’t exist, yeah I think would be frankly a positive for our country made wealthy.

5) They have an election coming up fairly shortly, and I understand that that makes it a little bit difficult for them, and I’m not looking to make the other side—so we’ll either make a deal or—there’s no rush, but I will say that if we don’t make a fair deal for this country, a Trump deal, then we’re not going to have—then we’re going to have a—I will terminate.

6) You’re here, you’ve got the wall is the same wall I’ve always talked about. I think we have companies pouring back into this country and you don’t know who’s there, you’ve got the wall will happen. We have a very old report. Business, generally, manufacturing the same wall that we’re talking about or whatever it may be.

7) And they endorsed us unanimously. I had meetings with them, they need see-through. So, we need a form of fence or window. I said why you need that—makes so much sense? They said because we have to see who’s on the other side.

8) Well they will make sure that no country including Russia can have anything to do with my win. Hope, just out of the most elegant debate—I thought it was a dead meeting. No, I never forget, when I fired, all these people, they all wanted him fired until I said, ‘We got to get worse'.

9) The governor of Wisconsin has been fantastic in their presentations and everything else. But I’m the one who got them to look at it. Now we need people because they’re going to have thousands of people working it’s going to be a—you know—that’s—that’s the company that makes the Apple iPhone.

10) So, they make up a television show. As you know, I went to the—I went to the—I went to the—I went to the employees—to millions and millions of employees. And AT&T started it, but I will terminate Nafta. OK? You know, we only have a thing called trade.

The first person to submit correct answers will be sequestered in a sensory deprivation tank until January 20, 2021.

Here's the Jupyter notebook I used to generate the examples. If you want to know more about how it works, see this section of Think Python, second edition.
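
For readers who want the idea without opening the notebook, here is a minimal sketch of a word-level Markov chain. It is not the notebook's code: the function names `build_chain` and `generate`, the prefix length of two words, and the tiny training text are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_chain(words, order=2):
    """Map each tuple of `order` consecutive words to the words that follow it."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        prefix = tuple(words[i:i + order])
        chain[prefix].append(words[i + order])
    return chain


def generate(chain, length=30, seed=None):
    """Random-walk the chain, starting from a randomly chosen prefix."""
    rng = random.Random(seed)
    prefix = rng.choice(list(chain))
    output = list(prefix)
    while len(output) < length:
        suffixes = chain.get(prefix)
        if not suffixes:  # dead end: this prefix only occurs at the end of the text
            break
        word = rng.choice(suffixes)
        output.append(word)
        prefix = prefix[1:] + (word,)  # slide the window forward by one word
    return " ".join(output)


# Placeholder training text (made up); substitute any transcript you like.
text = ("we will make a deal or we will not make a deal "
        "but we will make a great deal for the country")
chain = build_chain(text.split(), order=2)
print(generate(chain, length=20, seed=1))
```

Because each suffix is stored once per occurrence, choosing uniformly from the list picks words in proportion to how often they followed the prefix in the training text. Longer prefixes tend to produce more coherent output, but with a short transcript they mostly reproduce the original.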

Computation in STEM Workshop

2018-01-08T09:10:00.003-08:00

Last week I had the pleasure of visiting UC Davis, where I co-led (along with Jason Moore) a workshop on using computation in the STEM curriculum.

We had about 20 participants, including faculty, staff, and graduate students from engineering, math, natural sciences and social sciences. Classes at UC Davis start today, so we appreciate the time the participants took from a busy week!

We hope to run this workshop again at Olin College's Summer Institute 2018.

Abstract:

This workshop invites faculty to think about computation in the context of engineering education and to design classroom experiences that develop programming skills and apply them to engineering topics. Starting from examples in signal processing and mechanics, participants will identify topics that might benefit from a computational approach and design course materials to deploy in their classes. Although our examples come from engineering, this workshop may also be of interest to faculty in the natural and social sciences as well as mathematics.

Here are our slides:

Video from the workshop will be available soon.

Many thanks to Jason Moore in the MAE Department at UC Davis for inviting me and running the workshop with me, to Pamela Reynolds at the UC Davis Data Science Initiative for hosting us, and to the Collaboratory at Olin College for supporting my participation. This workshop was supported by funding from the Undergraduate Instructional Innovation Program, which is funded by the Association of American Universities (AAU) and Google, and administered by UC Davis's Center for Educational Effectiveness.

The retreat from religion is accelerating

2017-10-20T04:00:00.002-07:00

This is an extended version of my article in the Scientific American blog.

The data I used and all of my code are available in this Jupyter notebook.

Secularization in the United States

For more than a century religion in the United States has defied gravity. According to the Theory of Secularization, as societies become more modern, they become less religious. Aspects of secularization include decreasing participation in organized religion, loss of religious belief, and declining respect for religious authority.

Until recently the United States has been a nearly unique counterexample, so I would be a fool to join the line of researchers who have predicted the demise of religion in America. Nevertheless, I predict that secularization in the U.S. will accelerate in the next 20 years.

Using data from the General Social Survey (GSS), I quantify changes since the 1970s in religious affiliation, belief, and attitudes toward religious authority, and present a demographic model that generates predictions.

Summary of results

Religious affiliation is changing quickly:

The fraction of people with no religious affiliation has increased from less than 10% in the 1990s to more than 20% now. This increase will accelerate, overtaking Catholicism in the next few years, and probably replacing Protestantism as the largest religious affiliation within 20 years.
Protestantism has been in decline since the 1980s. Its population share dropped below 50% in 2012, and will fall below 40% within 20 years.
Catholicism peaked in the 1980s and will decline slowly over the next 20 years, from 24% to 20%.
The share of other religions increased from 4% in the 1970s to 6% now, but will be essentially unchanged in the next 20 years.

Religious belief is in decline, as well as confidence in religious institutions:

The fraction of people who say they “know God really exists and I have no doubts about it” has decreased from 64% in the 1990s to 58% now, and will approach 50% in the next 20 years.
At the same time the share of atheists and agnostics, based on self-reports, has increased from 6% to 10%, and will reach 14% around 2030.
Confidence in the people running organized religions is dropping rapidly: the fraction who report a “great deal” of confidence has dropped from 36% in the 1970s to 19% now, while the fraction with “hardly any” has increased from 17% to 26%. At 3-4 percentage points per decade, these are among the fastest changes we expect to see in this kind of data.
Interpretation of the Christian Bible has changed more slowly: the fraction of people who believe the Bible is “the actual word of God and is to be taken literally, word for word” has declined from 36% in the 1980s to 32% now, little more than 1 percentage point per decade.
At the same time the number of people who think the Bible is “an ancient book of fables, legends, history and moral precepts recorded by man” has nearly doubled, from 13% to 22%. This skepticism will approach 30%, and probably overtake the literal interpretation, within 20 years.

Predictive demography

Let me explain where these predictions come from. Since 1972 NORC at the University of Chicago has administered the General Social Survey (GSS), which surveys 1000-2000 adults in the U.S. per year. The survey includes questions related to religious affiliation, attitudes, and beliefs.

Regarding religious affiliation, the GSS asks “What is your religious preference: is it Protestant, Catholic, Jewish, some other religion, or no religion?” The following figure shows the results, with a 90% interval that quantifies uncertainty due to random sampling.

This figure provides an overview of trends in the population, but it is not easy to tell whether they are accelerating, and it does not provide a principled way to make predictions. Nevertheless, demographic changes like this are highly predictable (at least compared to other kinds of social change).
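
One common way to compute an interval that reflects random sampling is to resample the respondents with replacement. The sketch below shows that approach. It is an illustration under assumptions, not necessarily what the notebook does: the function name `bootstrap_interval` is hypothetical, and the sample counts are invented placeholders, not GSS figures.

```python
import random


def bootstrap_interval(responses, category, iters=1000, level=0.90, seed=None):
    """Percentile bootstrap interval for the fraction of `category` in `responses`."""
    rng = random.Random(seed)
    n = len(responses)
    fractions = []
    for _ in range(iters):
        # Draw a same-sized sample with replacement and recompute the fraction.
        resample = rng.choices(responses, k=n)
        fractions.append(resample.count(category) / n)
    fractions.sort()
    lo_index = int((1 - level) / 2 * iters)
    hi_index = int((1 + level) / 2 * iters) - 1
    return fractions[lo_index], fractions[hi_index]


# Invented sample of 1500 respondents (NOT real survey data).
sample = (["Protestant"] * 700 + ["Catholic"] * 350 +
          ["None"] * 350 + ["Other"] * 100)
low, high = bootstrap_interval(sample, "None", seed=0)
print(f"90% interval for None: {low:.3f} to {high:.3f}")
```

With samples of 1000-2000 respondents, intervals like this are typically a few percentage points wide, which is why year-to-year wiggles in the figure should not be over-interpreted.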

Religious beliefs and attitudes are primarily determined by the environment people grow up in, including their family life and wider societal influences. Although some people change religious affiliation later in life, most do not, so changes in the population are largely due to generational replacement.

We can get a better view of these changes if we group people by their year of birth, which captures information about the environment they grew up in, including the probability that they were raised in a religious tradition and their likely exposure to people of other religions. The following figure shows the results:

Among people born before 1940, a large majority are Protestant, only 20-25% are Catholic, and very few are Nones or Others. These numbers have changed radically in the last few generations: among people born since 1980, there are more Nones than Catholics, and among the youngest adults, there may already be more Nones than Protestants.

However, this view of the data can be misleading. Because these surveys were conducted between 1972 and the present, we observe different birth cohorts at different ages. People born in 1900 were surveyed in their 70s and 80s, whereas people born in 1998 have only been observed at age 18. If people tend to drift toward, or away from, religion as they age, we would have a biased view of the cohort effect.

Fortunately, with observations over more than 40 years, the design of the GSS makes it possible to estimate the effects of birth year and age simultaneously, using a regression model. Then we can simulate the results of future surveys. Here’s how:

Each year, the GSS recruits a sample intended to represent the adult U.S. population, so the age range of the respondents is nearly the same every year. We assume the set of ages will be the same for future surveys.
Given the ages of hypothetical future respondents, we infer their years of birth. For example, if we survey a 40-year-old in 2020, we know they were born in 1980.
Given ages and years of birth, we use the regression model to predict the probability that each respondent will report being Protestant, Catholic, Other, or None.
Then we use these probabilities to simulate survey results and predict the fraction of respondents in each group.
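
The four steps above can be sketched in a few lines of Python. This is a toy version under loud assumptions: `toy_model` stands in for the fitted regression model, its coefficients are invented for illustration and are not estimates from the GSS, and the ages are drawn uniformly, which is not how the real age distribution looks.

```python
import math
import random

GROUPS = ["Protestant", "Catholic", "Other", "None"]


def toy_model(age, birth_year):
    """Stand-in for a fitted regression model: one probability per group.

    All coefficients below are invented placeholders.
    """
    cohort = (birth_year - 1950) / 10  # decades since 1950
    scores = [
        1.0 - 0.20 * cohort + 0.005 * age,   # Protestant
        0.3 - 0.05 * cohort,                 # Catholic
        -1.5 + 0.02 * cohort,                # Other
        -0.8 + 0.25 * cohort - 0.005 * age,  # None
    ]
    exps = [math.exp(s) for s in scores]     # softmax turns scores into probabilities
    total = sum(exps)
    return [e / total for e in exps]


def simulate_survey(ages, survey_year, model, rng):
    """Simulate one future survey and return the fraction in each group."""
    counts = dict.fromkeys(GROUPS, 0)
    for age in ages:
        birth_year = survey_year - age                 # step 2: infer year of birth
        probs = model(age, birth_year)                 # step 3: predicted probabilities
        group = rng.choices(GROUPS, weights=probs)[0]  # step 4: simulate a response
        counts[group] += 1
    return {g: counts[g] / len(ages) for g in GROUPS}


rng = random.Random(42)
ages = [rng.randint(18, 89) for _ in range(1500)]      # step 1: assumed set of ages
for year in (2020, 2030):
    print(year, simulate_survey(ages, year, toy_model, rng))
```

Running the simulation many times for each year, and taking percentiles of the results, is one way to capture the random variation in the simulations; uncertainty from sampling in the original dataset would have to come from refitting the model on resampled data.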

The following figure shows the results, with 90% intervals that represent uncertainty due to random sampling in the dataset and random variation in the simulations.

Over the next 20 years, the fraction of Protestants (including non-Catholic Christians) will decline quickly, falling below 40% around 2030. The fraction of Catholics will decline more slowly, approaching 20%. The fraction of other religions might increase slightly.

The fraction of “Nones” will increase quickly, overtaking Catholics in the next few years, and possibly becoming the largest religious group in the U.S. by 2036.

Are these predictions credible?

To see how reliable these predictions are, we can use past data to predict the present. Supposing it’s 2006, and disregarding data from after 2006, the following figure shows the predictions we would make:

As it turns out, we would have been pretty much right, although we might have underpredicted the growth of the Nones.

Another reason to believe these predictions is that the events they predict have, in some sense, already happened. The people who will be 40 years old in 2036 are 20 now, and we already have data about them. The people who will be 20 in 2036 have already been born.

These predictions will be wrong if current teenagers are more religious than people in their 20s, or if current children are being raised in a more religious environment. But if those things were happening, we would probably know.

In fact, these predictions are likely to be conservative:

Survey results like these are notoriously subject to social desirability bias, which is the tendency of respondents to shade their answers in the direction they think is more socially acceptable. To the degree that disaffiliation is stigmatized, we expect these reports to underestimate the number of Nones.
The trend lines for Protestant and None have apparent points of inflection near 1990. If we use only data since 1990 to build the model, we expect the Nones to reach 40% within 20 years.

Changes in religious belief

As affiliation with organized religion has declined, religious belief has remained relatively unchanged, a pattern that has been summarized as “believing without belonging”. However, there is evidence that believing will catch up with belonging over the next 20 years.

The GSS asks respondents, “Which statement comes closest to expressing what you believe about God?”

1) I don't believe in God
2) I don't know whether there is a God and I don't believe there is any way to find out
3) I don't believe in a personal God, but I do believe in a Higher Power of some kind
4) I find myself believing in God some of the time, but not at others
5) While I have doubts, I feel that I do believe in God
6) I know God really exists and I have no doubts about it

To make the number of categories more manageable, I classify responses 1 and 2 as “no belief”, responses 3, 4, and 5 as “belief”, and response 6 as “strong belief”.
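
As a small sketch, the recoding can be written as a lookup table. It assumes the responses are coded 1 through 6 in the order the statements are listed above; the names `BELIEF_CATEGORY` and `recode` are hypothetical, and the example responses are made up.

```python
# Response codes 1-6 are assumed to follow the order of the statements above.
BELIEF_CATEGORY = {
    1: "no belief",
    2: "no belief",
    3: "belief",
    4: "belief",
    5: "belief",
    6: "strong belief",
}


def recode(response):
    """Return the category for a response code, or None if missing or invalid."""
    return BELIEF_CATEGORY.get(response)


responses = [6, 5, 1, 3, None, 6]  # made-up responses, including one missing value
print([recode(r) for r in responses])
```

Returning None for anything outside 1-6 keeps missing answers out of all three categories, so they can be dropped or handled explicitly later.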

The following figure shows how belief in God varies with year of birth.

Among people born before 1940, more than 70% profess strong belief in God, but this confidence is in decline; among young adults fewer than 40% are so certain, and nearly 20% are either atheist or agnostic.

Again, we can use these results to model the effect of birth year and age, and use the model to generate predictions. The following figure shows the results:

This question was added to the survey in 1988, and it has not been asked every year, so we have less data to work with. Nevertheless, it is clear that strong belief in God is declining and being replaced by weaker forms of belief and non-belief.

Due to social desirability bias we can’t be sure what part of these trends is due to actual changes in belief, and how much might be the result of weakening stigmas against apostasy and atheism. Regardless, these results indicate changes in what people say they believe.

Respect for religious authority

The GSS asks respondents, “As far as the people running [organized religion] are concerned, would you say you have a great deal of confidence, only some confidence, or hardly any confidence at all in them?”

The following figure shows how respect for religious authority varies with year of birth.

Among people born before 1940, 30 to 50% reported a “great deal” of confidence in the people running religious institutions. Among young adults, this has dropped to 20%, and more than 25% now report “hardly any confidence at all”.

These changes have been going on for decades, and seem to be unrelated to specific events. The following figure shows responses to the same question by year of survey. The Catholic Church sexual abuse cases, which received widespread media attention starting in 1992, have no clear effect on the trends; if anything, confidence in religious institutions increased during the 1990s.

Predictions based on generational replacement suggest that these trends will continue. Within 20 years, the fraction of people with hardly any confidence in religious institutions will approach 30%.

Interpretation of the Bible

The GSS asks, “Which one of these statements comes closest to describing your feelings about the Bible?”

The Bible is the actual word of God and is to be taken literally, word for word.
The Bible is the inspired word of God but not everything should be taken literally, word for word.
The Bible is an ancient book of fables, legends, history and moral precepts recorded by man.

Responses to this question depend strongly on the respondents’ year of birth:

Among people born before 1940, more than 40% say they believe in a literal interpretation of the Christian Bible, and fewer than 15% consider it a collection of fables and legends. Among young adults, these proportions have converged near 25%.

The number of people who believe that the Bible is the inspired word of God, but should not be interpreted literally, has been near 50% for several generations. But this apparent equilibrium might mask two underlying trends: an increase due to transitions from literal to figurative interpretation, and a decrease due to transitions from “inspired” to “legends”.

The following figure shows responses to the same question over time, with predictions.

In the next 20 years, people who consider the Bible the literal or inspired word of God will be replaced by people who consider it a collection of ordinary documents, but this transition will be slow.

Again, these responses are susceptible to social desirability bias, so they may not reflect true beliefs accurately. But they reflect changes in what people say they believe, which might cause a feedback effect: as more people express their non-belief, stigmas around atheism will decline, and these trends may accelerate.

Religion in the United States

2017-06-14T13:28:00.000-07:00

Last night I had the pleasure of presenting a talk for the PyData Boston Meetup. I presented a project I started earlier this summer, using data from the General Social Survey to measure and predict trends in religious affiliation and belief in the U.S.

The slides, which include the results so far and an overview of the methodology, are here:

And the code and data are all in this Jupyter notebook. I'll post additional results and discussion over the next few weeks.

Thanks to Milos Miljkovic, organizer of the PyData Boston Meetup, for inviting me, and to O'Reilly Media for hosting the meeting.

Spring 2017 Data Science reports

2017-06-01T11:51:00.001-07:00

In my Data Science class this semester, students worked on a series of reports where they explore a freely-available dataset, use data to answer questions, and present their findings. After each batch of reports, I will publish the abstracts here; you can follow the links below to see what they found.

How Do You Predict Who Will Vote?

Sean Carter

One topic that enters popular discussion every four years is "who votes?" Every presidential election we see many discussions on which groups are more likely to vote, and which important voter groups each candidate needs to capture. One theme that is often part of this discussion is whether or not a candidate's biggest support is among groups likely to turn out. This analysis of the General Social Survey uses a number of different demographic variables to try and answer that question. Report

Designing the Optimal Employee Experience... For Employers

Joey Maalouf

Using a dataset published by Medium on Kaggle, I explored the relationship between an employee's working conditions and the likelihood that they will quit their job. There were some expected trends, like lower salary leading to a higher attrition rate, but also some surprising ones, like having an accident at work leading to a lower likelihood of quitting! This observed information can be used by employers to determine the quitting probability of a specific individual, or to calculate the attrition rate of a larger group, like a department, and adjust their conditions accordingly.
Report

Does being married have an effect on your political views?

Apurva Raman and William Lu

Politics has often been a polarizing subject amongst Americans, and in today's increasingly partisan political environment, that has not changed. Using data from the General Social Survey (GSS), an annual study designed and conducted by the National Opinion Research Center (NORC) at the University of Chicago, we identify variables that are correlated with a person's political views. We find that while marital status has a statistically significant apparent effect on political views, that apparent effect is drastically reduced when including confounding variables, particularly religion. Report

Should you Follow the Food Groups for Dietary Advice?

Kaitlyn Keil and Kevin Zhang

In the 1990s, the USDA put out the image of a Food Guide Pyramid to help direct dietary choices. It grouped foods into six categories: grains, proteins (meats, fish, eggs, etc.), vegetables, fruits, dairy, and fats and oils. Since then, the pyramid has been revamped in 2005, and then pushed towards a plate with five categories (oils were dropped) in the 2010s. The general population has learned of these basic food groups since grade school, and over time either fully adopts them into their lifestyles, or abandons them to pursue their own balanced diet. In light of the controversy surrounding the Food Pyramid, we decided to ask whether the food categories found in the Food Pyramid truly represent the correct groupings for food, and if not, just how far off are they? Using K-Means clustering on an extensive food databank, we created 6 groupings of food based on their macronutrient composition, which was the primary criterion the original Food Pyramid used in its categorization. We found that the K-Means groups only overlapped with existing food groups from the Food Pyramid 50% of the time, potentially suggesting that the idea of the basic food groups could be outdated. Report

Are Terms of Home Mortgage Less Favorable Now Compared to Pre Mortgage Crisis?

Sungwoo Park

It is well known that excessive defaults on subprime mortgages, which are mortgages normally issued to borrowers with low credit, were a leading cause of the subprime mortgage crisis that led to a global financial meltdown in 2007. Because of this nightmarish experience, it seems plausible to assume that current home mortgages are much harder to get and much more conservative (in terms of the risk the lender is taking, shown mainly in the interest rate) than pre-2007 mortgages. Using a dataset containing all home mortgages purchased or guaranteed by the Federal Home Loan Mortgage Corporation, more commonly known as Freddie Mac, I investigate whether there is any noticeable difference between interest rates before and after the subprime mortgage crisis.
Report

Finding NBA Players with Similar Styles

Willem Thorbecke and David Papp

Players in the NBA are often compared to others, both active and retired, based on similar play styles. For example, it is common to hear statements such as “Russell Westbrook is the new Derrick Rose”. The purpose of our project is to apply machine learning in the form of clustering to see which players are actually similar based on 22 variables. We successfully generated clusters of players that are very similar quantitatively. It is up to the reader to decide whether this is qualitatively true. Report

Food Trinities and Recipe Completion

Matt Ruehle

We can tell where a food is from - at least, culturally - from just a few bites. There are palettes of ingredients and spices which are strongly associated with each other - giving cajun cooking its kick, and french cuisine its "je ne sais quoi." But, what exactly these palettes and pairings are varies - ask ten different chefs, and you'll get six different answers. We look for a statistical way to identify "trinities" like "onion, carrot, celery" or "garlic, sesame oil, soy sauce," in the process both finding several associations not typically reflected in culinary literature and creating a tool which extends recipes based on their already-known ingredients, in a manner akin to a food version of a cell phone's autocomplete. Report

All the News in 2010 and 2012

Radmer van der Heyde

I examined the Pew News Coverage Index dataset from the years 2010 and 2012 to see how the different topics and stories were covered across media sectors and sources. The combined dataset had over 70,000 stories from all media sectors: print, online, cable tv, network tv, and broadcast radio. From the data, topics have less variance in word count and duration than sources. Report

Python as a way of thinking

2017-04-26T12:12:00.002-07:00

This article contains supporting material for this blog post at Scientific American. The thesis of the post is that modern programming languages (like Python) are qualitatively different from the first generation (like FORTRAN and C), in ways that make them effective tools for teaching, learning, exploring, and thinking.

I gave a longer version of this argument in a talk at Olin College last fall. The slides are here:

Here are Jupyter notebooks with the code examples I mentioned in the talk:

Breadth-first search in Python
Using Counters, including the Bayesian update example.
Introduction to PMFs, including the anagram example.
Vectors, Frames, and Transforms.
Cacophony for the Whole Family, an example from Think DSP.

Here's my presentation at SciPy 2015, where I talked more about Python as a way of teaching and learning DSP:

Finally, here's the notebook "Using Counters", which uses Python's Counter object to implement a PMF (probability mass function) and perform Bayesian updates.

In [13]:

from __future__ import print_function, division

from collections import Counter
import numpy as np

A counter is a map from values to their frequencies. If you initialize a counter with a string, you get a map from each letter to the number of times it appears. If two words are anagrams, they yield equal Counters, so you can use Counters to test anagrams in linear time.

In [3]:

def is_anagram(word1, word2):
    """Checks whether the words are anagrams.

    word1: string
    word2: string

    returns: boolean
    """
    return Counter(word1) == Counter(word2)

In [4]:

is_anagram('tachymetric', 'mccarthyite')

Out[4]:

True

In [5]:

is_anagram('banana', 'peach')

Out[5]:

False

Multisets
A Counter is a natural representation of a multiset, which is a set where the elements can appear more than once. You can extend Counter with set operations like is_subset:

In [6]:

class Multiset(Counter):
    """A multiset is a set where elements can appear more than once."""

    def is_subset(self, other):
        """Checks whether self is a subset of other.

        other: Multiset

        returns: boolean
        """
        for char, count in self.items():
            if other[char] < count:
                return False
        return True
    
    # map the <= operator to is_subset
    __le__ = is_subset

You could use is_subset in a game like Scrabble to see if a given set of tiles can be used to spell a given word.

In [7]:

def can_spell(word, tiles):
    """Checks whether a set of tiles can spell a word.

    word: string
    tiles: string

    returns: boolean
    """
    return Multiset(word) <= Multiset(tiles)

In [8]:

can_spell('SYZYGY', 'AGSYYYZ')

Out[8]:

True
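Counter's own subtraction offers another way to write the same check. This is a sketch of an alternative, not the notebook's method: subtracting one Counter from another keeps only the positive counts, so an empty result means the tiles cover every letter of the word. The name can_spell_sub is made up for this illustration.

```python
from collections import Counter

def can_spell_sub(word, tiles):
    """Checks whether tiles can spell word, using Counter subtraction."""
    # letters still needed after using the tiles; subtraction drops
    # zero and negative counts, so nothing left means success
    missing = Counter(word) - Counter(tiles)
    return not missing

print(can_spell_sub('SYZYGY', 'AGSYYYZ'))   # three Ys available
print(can_spell_sub('SYZYGY', 'AGSYYZ'))    # only two Ys available
```

The subclass approach above has the advantage of giving the operation a name and an operator; the subtraction version needs no new class.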

Probability Mass Functions

You can also extend Counter to represent a probability mass function (PMF).
normalize computes the total of the frequencies and divides through, yielding probabilities that add to 1.
__add__ enumerates all pairs of values and returns a new Pmf that represents the distribution of the sum.
__hash__ and __eq__ make Pmfs hashable; this is not the best way to do it, because they are mutable. So this implementation comes with a warning that if you use a Pmf as a key, you should not modify it. A better alternative would be to define a frozen Pmf.
render returns the values and probabilities in a form ready for plotting.

In [9]:

class Pmf(Counter):
    """A Counter with probabilities."""

    def normalize(self):
        """Normalizes the PMF so the probabilities add to 1."""
        total = float(sum(self.values()))
        for key in self:
            self[key] /= total

    def __add__(self, other):
        """Adds two distributions.

        The result is the distribution of sums of values from the
        two distributions.

        other: Pmf

        returns: new Pmf
        """
        pmf = Pmf()
        for key1, prob1 in self.items():
            for key2, prob2 in other.items():
                pmf[key1 + key2] += prob1 * prob2
        return pmf

    def __hash__(self):
        """Returns an integer hash value."""
        return id(self)
    
    def __eq__(self, other):
        return self is other

    def render(self):
        """Returns values and their probabilities, suitable for plotting."""
        return zip(*sorted(self.items()))
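For comparison, here is one way a frozen Pmf might look. It is only a sketch under assumptions: the class name FrozenPmf and its internals are hypothetical, it is written for Python 3, and it is read-only by convention rather than enforced. Unlike the Pmf above, it hashes and compares by value, so two equal distributions act as the same dictionary key.

```python
from collections import Counter

class FrozenPmf:
    """Read-only PMF that hashes and compares by value.

    Hypothetical class for illustration; not part of the notebook.
    """

    def __init__(self, values):
        counts = Counter(values)
        total = sum(counts.values())
        self._probs = {x: n / total for x, n in counts.items()}
        # a frozenset of (value, prob) pairs is hashable and order-independent
        self._key = frozenset(self._probs.items())

    def __getitem__(self, x):
        # like Counter, report 0 for values that never occur
        return self._probs.get(x, 0)

    def __len__(self):
        return len(self._probs)

    def __eq__(self, other):
        return isinstance(other, FrozenPmf) and self._key == other._key

    def __hash__(self):
        return hash(self._key)

d4 = FrozenPmf(range(1, 5))
d6 = FrozenPmf(range(1, 7))
prior = {d4: 0.5, d6: 0.5}             # FrozenPmf objects work as dict keys
print(prior[FrozenPmf(range(1, 7))])   # an equal Pmf finds the same entry
```

The trade-off is that a frozen Pmf cannot be normalized or updated in place, so operations like a Bayesian update would have to return new objects.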

As an example, we can make a Pmf object that represents a 6-sided die.

In [10]:

d6 = Pmf([1,2,3,4,5,6])
d6.normalize()
d6.name = 'one die'
print(d6)

Pmf({1: 0.16666666666666666, 2: 0.16666666666666666, 3: 0.16666666666666666, 4: 0.16666666666666666, 5: 0.16666666666666666, 6: 0.16666666666666666})

Using the add operator, we can compute the distribution for the sum of two dice.

In [11]:

d6_twice = d6 + d6
d6_twice.name = 'two dice'

for key, prob in d6_twice.items():
    print(key, prob)

2 0.0277777777778
3 0.0555555555556
4 0.0833333333333
5 0.111111111111
6 0.138888888889
7 0.166666666667
8 0.138888888889
9 0.111111111111
10 0.0833333333333
11 0.0555555555556
12 0.0277777777778
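As a sanity check on these numbers, the same distribution can be built by brute force: enumerate all 36 equally likely pairs of rolls and count the sums. This sketch uses only the standard library and exact fractions; it is independent of the Pmf class, and the variable names are made up for the illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# count how many of the 36 (first roll, second roll) pairs give each total
outcomes = Counter(a + b for a, b in product(range(1, 7), repeat=2))
total = sum(outcomes.values())

two_dice = {s: Fraction(n, total) for s, n in sorted(outcomes.items())}
for s, prob in two_dice.items():
    print(s, prob, float(prob))
```

The fraction for a total of 7 is 1/6, which agrees with the value printed for key 7 above.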

Using numpy.sum, we can compute the distribution for the sum of three dice.

In [14]:

# if we use the built-in sum we have to provide a Pmf additive identity value
# pmf_ident = Pmf([0])
# d6_thrice = sum([d6]*3, pmf_ident)

# with np.sum, we don't need an identity
d6_thrice = np.sum([d6, d6, d6])
d6_thrice.name = 'three dice'

And then plot the results (using Pmf.render)

In [19]:

import matplotlib.pyplot as plt
%matplotlib inline

In [20]:

for die in [d6, d6_twice, d6_thrice]:
    xs, ys = die.render()
    plt.plot(xs, ys, label=die.name, linewidth=3, alpha=0.5)
    
plt.xlabel('Total')
plt.ylabel('Probability')
plt.legend()
plt.show()

Bayesian statistics

A Suite is a Pmf that represents a set of hypotheses and their probabilities; it provides bayesian_update, which updates the probability of the hypotheses based on new data.
Suite is an abstract parent class; child classes should provide a likelihood method that evaluates the likelihood of the data under a given hypothesis. bayesian_update loops through the hypotheses, evaluates the likelihood of the data under each hypothesis, and updates the probabilities accordingly. Then it re-normalizes the PMF.

In [21]:

class Suite(Pmf):
    """Map from hypothesis to probability."""

    def bayesian_update(self, data):
        """Performs a Bayesian update.
        
        Note: called bayesian_update to avoid overriding dict.update

        data: result of a die roll
        """
        for hypo in self:
            like = self.likelihood(data, hypo)
            self[hypo] *= like

        self.normalize()

As an example, I'll use Suite to solve the "Dice Problem," from Chapter 3 of Think Bayes.
"Suppose I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die. If you have ever played Dungeons & Dragons, you know what I am talking about. Suppose I select a die from the box at random, roll it, and get a 6. What is the probability that I rolled each die?"
I'll start by making a list of Pmfs to represent the dice:

In [31]:

def make_die(num_sides):
    die = Pmf(range(1, num_sides+1))
    die.name = 'd' + str(num_sides)
    die.normalize()
    return die

dice = [make_die(x) for x in [4, 6, 8, 12, 20]]
for die in dice:
    print(die)

Pmf({1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25})
Pmf({1: 0.16666666666666666, 2: 0.16666666666666666, 3: 0.16666666666666666, 4: 0.16666666666666666, 5: 0.16666666666666666, 6: 0.16666666666666666})
Pmf({1: 0.125, 2: 0.125, 3: 0.125, 4: 0.125, 5: 0.125, 6: 0.125, 7: 0.125, 8: 0.125})
Pmf({1: 0.08333333333333333, 2: 0.08333333333333333, 3: 0.08333333333333333, 4: 0.08333333333333333, 5: 0.08333333333333333, 6: 0.08333333333333333, 7: 0.08333333333333333, 8: 0.08333333333333333, 9: 0.08333333333333333, 10: 0.08333333333333333, 11: 0.08333333333333333, 12: 0.08333333333333333})
Pmf({1: 0.05, 2: 0.05, 3: 0.05, 4: 0.05, 5: 0.05, 6: 0.05, 7: 0.05, 8: 0.05, 9: 0.05, 10: 0.05, 11: 0.05, 12: 0.05, 13: 0.05, 14: 0.05, 15: 0.05, 16: 0.05, 17: 0.05, 18: 0.05, 19: 0.05, 20: 0.05})

Next I'll define DiceSuite, which inherits bayesian_update from Suite and provides likelihood.
data is the observed die roll, 6 in the example.
hypo is the hypothetical die I might have rolled; to get the likelihood of the data, I select, from the given die, the probability of the given value.

In [26]:

class DiceSuite(Suite):
    
    def likelihood(self, data, hypo):
        """Computes the likelihood of the data under the hypothesis.

        data: result of a die roll
        hypo: Pmf object representing a die
        """
        return hypo[data]

Finally, I use the list of dice to instantiate a Suite that maps from each die to its prior probability. By default, all dice have the same prior.
Then I update the distribution with the given value and print the results:

In [33]:

dice_suite = DiceSuite(dice)

dice_suite.bayesian_update(6)

# sort by number of sides; Pmf objects have no natural ordering
for die in sorted(dice_suite, key=len):
    print(len(die), dice_suite[die])

4 0.0
6 0.392156862745
8 0.294117647059
12 0.196078431373
20 0.117647058824

As expected, the 4-sided die has been eliminated; it now has 0 probability. The 6-sided die is the most likely, but the 8-sided die is still quite possible.
Now suppose I roll the die again and get an 8. We can update the Suite again with the new data:

In [30]:

dice_suite.bayesian_update(8)

for die, prob in sorted(dice_suite.items(), key=lambda item: len(item[0])):
    print(die.name, prob)

d4 0.0
d6 0.0
d8 0.623268698061
d12 0.277008310249
d20 0.0997229916898

Now the 6-sided die has been eliminated, the 8-sided die is most likely, and there is less than a 10% chance that I am rolling a 20-sided die.
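The posteriors above can also be checked with plain dictionaries and exact fractions. This is a minimal sketch that keys each hypothesis by its number of sides instead of by a Pmf object; the function name update is made up for the illustration, and the likelihood of a roll is assumed to be 1/sides when the roll is possible and 0 otherwise.

```python
from fractions import Fraction

def update(prior, roll):
    """Returns the posterior over die sizes after observing one roll."""
    # posterior is proportional to prior times likelihood
    unnorm = {sides: p * (Fraction(1, sides) if roll <= sides else 0)
              for sides, p in prior.items()}
    total = sum(unnorm.values())
    return {sides: p / total for sides, p in unnorm.items()}

prior = {sides: Fraction(1, 5) for sides in [4, 6, 8, 12, 20]}
after_6 = update(prior, 6)
after_8 = update(after_6, 8)

for sides in prior:
    print(sides, float(after_6[sides]), float(after_8[sides]))
```

After both rolls the 8-sided die has probability 225/361, which matches the 0.623 printed by the Suite.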
These examples demonstrate the versatility of the Counter class, one of Python's underused data structures.

Honey, money, weather, terror

2017-04-04T06:26:00.001-07:00

In my Data Science class this semester, students are working on a series of reports where they explore a freely-available dataset, use data to answer questions, and present their findings. After each batch of reports, I will publish the abstracts here; you can follow the links below to see what they found.

The Impact of Military Status on Income Bracket

Joey Maalouf

Using the National Health Interview Survey data, I was able to look for a potential link between military status and financial status. More specifically, I wanted to check whether having served in the United States military affects a person's current income bracket. It was apparent that people who served in the military were underrepresented in low income brackets and overrepresented in high income brackets compared to the rest of the population. This difference appears more clearly if we group the income data into even broader brackets for further analysis; being in the military increased one's chances of being in the upper half of respondents by 16.06 percentage points, and of being in the upper third by 14.64 percentage points. Further statistical analysis reported a Cohen effect size of 0.32, which is above the standard threshold to be considered more than a small effect.
Report

Getting Treatment

Kaitlyn Keil

In the so-called "War on Drugs", one of the primary tactics is teaching children to "Just Say No!" However, less attention is paid to treatment for those who are already addicted. Except for the occasional comment on how a celebrity disappeared off to rehab, there is a silence in our culture about the apparently shameful act of getting treatment. This silence made me begin to wonder: how many people who struggle with addictions actually get treated, and how long does it take before they receive this help? Using the National Survey on Drug Use and Health data from 2014, I found that very few people who use drugs report getting treatment or counseling, and the length of time they go without getting treatment isn't particularly correlated with other factors.
Report

What's the Chance You will Die Due to Terrorism?

Kevin Zhang

With Trump's recent travel ban and the escalation of controversial actions against Middle Eastern people, there has been a rise of paranoia towards the Middle East region for fear of the possibility of a terrorist attack. But is there a reason to be so afraid of the Middle East, or even terrorism in general? What is the chance that an American would be a victim of terrorism? This article looks into how likely it is that the average person in the US will be affected by a terrorist attack, should one happen. Results show that the chance of a person being affected by terrorism in the North American region is almost 0, especially when compared to the probability in the Middle East itself. The data suggests that people's fears are unfounded and that the controversial reactions towards Middle East citizens because of a 1 in 15 million chance are irrational.
Report

Sungwoo Park

Using a dataset containing over 4 million Facebook posts from 15 mainstream news outlets, I investigate the existence of seasonality in the number of likes a Facebook post from a news outlet gets. The dataset contains contents and attributes, such as number of likes and timestamp, of all Facebook posts posted by the top media sources from 2012 to 2016. The media outlets included in the data are ABC, BBC, CBS, CNN, Fox & Friends, Fox, LA Times, NBC, NPR, The Huffington Post, The New York Times, The Wall Street Journal, The Washington Post, Time, and USA Today.
Report

The Association Between Drug Usage and Depression

David Papp and Willem Thorbecke

The goal of this article is to explore the association between drug usage and depression. Intuitively, many would argue that those who use drugs are more likely to be depressed. To explore this relationship, we took data from the National Survey on Drug Use and Health from 2014. We conducted logistic regression on cocaine, marijuana, alcohol, and heroin while controlling for possible confounding variables such as sex, income, and health conditions. Surprisingly, there appears to be a negative correlation between drug usage and depression.
Report

US Apiculture and Honey Production

Matthew Ruehle and Sean Carter

We examine the historic honey-producing bee colony counts, yield, and honey production of states as collected by the USDA, finding statistical evidence for regional "clustering" of production and a negative correlation between per-hive yield and overall price, most strongly reflected in states with the greatest absolute production.
Report

Most Terrorism is Local

Radmer van der Heyde

I explored the Global Terrorism Database to see how terror has evolved over time, and whether international terrorism has any defining features. Over time terrorism has increased and gotten deadlier, but shifted regions. However, international terrorism was too small a percentage of the dataset to reach an appropriate conclusion.

Report

Does it get warmer before it rains?

Apurva Raman and William Lu

Speculating about the weather has been a staple of small talk and human curiosity for a long time, and as a result, many weather "myths" exist. One such myth we’ve heard is that it gets warmer before a precipitation event (e.g. rain, snow, hail, sleet, etc.) occurs. Using data from the US National Oceanic and Atmospheric Administration (NOAA), we find that change in temperature is a poor indicator for whether or not there will be a precipitation event.

Report

Money, Murder, the Midwest, and More

2017-03-08T07:11:00.001-08:00

How do Europeans feel about Jewish, Muslim, and Gypsy immigration?

Apurva Raman and Celina Bekins

As tensions over immigration increase with Europe dealing with a huge influx of refugees, some countries are more ready to accept immigrants while others close their borders to them. To understand the opinions of Europeans on immigration of particular groups, we investigated if respondents from different countries in Europe have consistent opinions toward Jews, Muslims, and Gypsies. We found that countries with a strong preference against Muslims or Gypsies will also not prefer the other group. This does not hold true for Jews; countries that are willing to allow Jews are not necessarily willing to allow Muslims or Gypsies. Countries that are not accepting of Jews are not accepting of any of these three groups. However, they all preferred Jews to Muslims and Muslims to Gypsies.
Report

Do Midwestern colleges have better ACT scores?

David Papp

It is often rumored that colleges in the Midwest prefer ACT scores while colleges in other regions prefer SAT Scores. The goal of this article is to explore the relationship between SAT and ACT scores in the Midwest and other regions. The data used was collected from the US Department of Education for the years 2014-15. Comparing just the means of ACT scores shows that the Midwest scores slightly higher on average: 23.48 vs 23.17. However, a better statistic might be to compare the ratio of ACT/SAT scores. The Midwest has a slightly higher ratio (0.969) than other regions (0.960). Although we cannot deduce any causation, we can draw inferences as to what causes these differences. One explanation might be the fact that students applying to Midwestern colleges spend more time studying for the ACT.
Report

Rich or Poor: To Whom does it Matter More?

Kaitlyn Keil

With issues like a growing wage gap, racism, and feminism at the front of our nation's attention, it can seem that the wealthy only care about getting more wealth, while equality only matters to those who are disadvantaged. However, the results of the European Social Survey of 2014 suggest that those with money do not value wealth significantly more than those in lower income brackets, and equality is not only valued at the same level across income brackets, but is consistently rated as more important than wealth.
Report

Do More Politically Informed People Identify as Liberal?

Kevin Zhang

In the political arena, liberals often call their conservative counterparts "ignorant" because they believe that the other party doesn't know the facts, that they just don't know what's going on in the world. This would suggest that being more informed on political news and current events would make one more liberal. But does it really matter though? Does being more informed about politics make a person more liberal? Does it matter at all in how people end up voting? This article will decide whether being an informed individual truly results in believing a more liberal platform, or whether this notion is just a misleading stereotype meant as a mudslinging tactic. Data analytics show that apparently a person has a high chance of holding the same opinion regardless of whether they are informed individuals or not. However, it seems that rather than leaning towards liberals, being more informed has the potential to make people more polarized towards either side and have stronger opinions on various political topics in general. While being more informed might not lead to an increase in liberal thoughts, it might very well make people better able to cast a more thoughtful and representative vote.
Report

Money might buy you some happiness

Sungwoo Park

Does money buy you happiness? It's a decades old question that people have been wondering about. Data from the General Social Survey on the respondent's income and happiness level seem to suggest that people with high income tend to be happier than people with low income. Also, the data show that people with high income value the feeling of accomplishment and the importance of the job in their work more than people with low income do.
Report

Higher paid NBA players are (probably) deserving

Willem Thorbecke

The motivating question was to find out whether or not there existed a connection between the salary of an NBA player and his performance in the league. Using the statistic Player Efficiency Rating (PER), an NBA statistic commonly used to measure a player's overall performance in the league, I compared player salaries and performances. With a correlation of 0.5 between salaries and PERs across the league, as well as a Spearman correlation of 0.4, I came to the conclusion that there was a slight correlation between the two variables, and thus higher paid NBA players may be deserving of their paychecks.
Report

Murder, Ink — A statistical analysis of tattoos in the Florida prison system

Joey Maalouf, Matthew Ruehle, Sean Carter

We examine the claims made in an Economist article on prison tattoos. Examining a publicly-available inmate database, we found that there are several noticeable trends between tattoos and types of criminal conviction. Our results are not necessarily causative, and may reflect either societal biases or demographic trends. Nonetheless, the data demonstrates a strong correlation between different categories of "ink" and criminal classifications.
Report

Are more selective or expensive colleges worth it?

William Lu

As costs to attend college increase, an increasing number of high school seniors are left wondering if they should or must select a more affordable college. Many Americans go to college not just to gain a higher education, but also to increase their earning potential later in life. Using US Department of Education College Scorecard data, I found that going to a more expensive college could potentially make you more money in the future, that more selective colleges don't necessarily cost more, and that more selective colleges don't necessarily make you more money in the future.
Report

Are Diseases of the Heart Seasonal?

Radmer van der Heyde

In this report, I sought to answer the question: does heart disease have seasonality like that of Influenza? To answer this, I explored the CDC's Wonder database on the underlying causes of death on the monthly data for the state of California. Based on my results, the majority of heart diseases show some seasonality as the dominant frequency component is at the frequency corresponding to a period of 1 year.
Report

A nice Bayes theorem problem: medical testing

2017-02-16T07:37:00.002-08:00

On this previous post about my favorite Bayes theorem problems, I got the following comment from a reader named Riya:

I have a question. Exactly 1/5th of the people in a town have Beaver Fever . There are two tests for Beaver Fever, TEST1 and TEST2. When a person goes to a doctor to test for Beaver Fever, with probability 2/3 the doctor conducts TEST1 on him and with probability 1/3 the doctor conducts TEST2 on him. When TEST1 is done on a person, the outcome is as follows: If the person has the disease, the result is positive with probability 3/4. If the person does not have the disease, the result is positive with probability 1/4. When TEST2 is done on a person, the outcome is as follows: If the person has the disease, the result is positive with probability 1. If the person does not have the disease, the result is positive with probability 1/2. A person is picked uniformly at random from the town and is sent to a doctor to test for Beaver Fever. The result comes out positive. What is the probability that the person has the disease?

I think this is an excellent question, so I am passing it along to the readers of this blog. One suggestion: you might want to use my world famous Bayesian update worksheet.

Hint: This question is similar to one I wrote about last year. In that article, I started with a problem that was underspecified; it took a while for me to realize that there were several ways to formulate the problem, with different answers.

Fortunately, the problem posed by Riya is completely specified; it is an example of what I called Scenario A, where there are two tests with different properties, and we don't know which test was used.

There are several ways to proceed, but I recommend writing four hypotheses that specify the test and the status of the patient:

TEST1 and sick
TEST1 and not sick
TEST2 and sick
TEST2 and not sick

For each of these hypotheses, it is straightforward to compute the prior probability and the likelihood of a positive test. From there, it's just arithmetic.

Here's what it looks like using my world famous Bayesian update worksheet:

(Now with more smudges because I had an arithmetic error the first time. Thanks, Ben Torvaney, for pointing it out.)

After the update, the total probability that the patient is sick is 10/26 or about 38%. That's up from the prior, which was 1/5 or 20%. So the positive test is evidence that the patient is sick, but it is not very strong evidence.

Interestingly, the total posterior probability of TEST2 is 12/26 or about 46%. That's up from the prior, which was 33%. So the positive test provides some evidence that TEST2 was used.
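The same update can be checked in a few lines of Python. This sketch follows the four-hypothesis setup described above, using exact fractions; the variable names are made up for the illustration, and the numbers come straight from Riya's problem statement.

```python
from fractions import Fraction

p_sick = Fraction(1, 5)
p_test = {'TEST1': Fraction(2, 3), 'TEST2': Fraction(1, 3)}

# probability of a positive result for each (test, sick?) hypothesis
p_pos = {
    ('TEST1', True): Fraction(3, 4),
    ('TEST1', False): Fraction(1, 4),
    ('TEST2', True): Fraction(1),
    ('TEST2', False): Fraction(1, 2),
}

# prior times likelihood for each hypothesis
unnorm = {}
for (test, sick), like in p_pos.items():
    prior = p_test[test] * (p_sick if sick else 1 - p_sick)
    unnorm[test, sick] = prior * like

total = sum(unnorm.values())
posterior = {hypo: p / total for hypo, p in unnorm.items()}

post_sick = sum(p for (test, sick), p in posterior.items() if sick)
post_test2 = sum(p for (test, sick), p in posterior.items() if test == 'TEST2')
print(post_sick, float(post_sick))
print(post_test2, float(post_test2))
```

Fraction reduces its results, so 10/26 prints as 5/13 and 12/26 as 6/13.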

Last batch of notebooks for Think Stats

2017-01-16T08:10:00.000-08:00

Getting ready to teach Data Science in the spring, I am going back through Think Stats and updating the Jupyter notebooks. Each chapter has a notebook that shows the examples from the book along with some small exercises, with more substantial exercises at the end.

If you are reading the book, you can get the notebooks by cloning this repository on GitHub, and running the notebooks on your computer.

Or you can read (but not run) the notebooks on GitHub:

Chapter 13 Notebook (Chapter 13 Solutions)
Chapter 14 Notebook (Chapter 14 Solutions)

I am done now, just in time for the semester to start, tomorrow! Here are some of the examples from Chapter 13, on survival analysis:

Survival analysis

If we have an unbiased sample of complete lifetimes, we can compute the survival function from the CDF and the hazard function from the survival function.
Here's the distribution of pregnancy length in the NSFG dataset.

In [2]:

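# modules used by the cells below
import numpy as np
import pandas as pd

import nsfg
import thinkplot
import thinkstats2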

preg = nsfg.ReadFemPreg()
complete = preg.query('outcome in [1, 3, 4]').prglngth
cdf = thinkstats2.Cdf(complete, label='cdf')

The survival function is just the complementary CDF.

In [3]:

import survival

def MakeSurvivalFromCdf(cdf, label=''):
    """Makes a survival function based on a CDF.

    cdf: Cdf
    
    returns: SurvivalFunction
    """
    ts = cdf.xs
    ss = 1 - cdf.ps
    return survival.SurvivalFunction(ts, ss, label)

In [4]:

sf = MakeSurvivalFromCdf(cdf, label='survival')

In [5]:

print(cdf[13])
print(sf[13])

0.13978014121
0.86021985879

Here's the CDF and SF.

In [6]:

thinkplot.Plot(sf)
thinkplot.Cdf(cdf, alpha=0.2)
thinkplot.Config(loc='center left')

And here's the hazard function.

In [7]:

hf = sf.MakeHazardFunction(label='hazard')
print(hf[39])

0.676706827309

In [8]:

thinkplot.Plot(hf)
thinkplot.Config(ylim=[0, 0.75], loc='upper left')
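To make the relationships concrete without the NSFG data, here is a small NumPy sketch that goes from a sample of complete lifetimes to the CDF, the survival function, and the hazard function. The data are made up, the function name is hypothetical, and the hazard is defined here as the fraction of lifetimes ending at t among those that reach t; the survival module may index the hazard differently.

```python
import numpy as np

def survival_and_hazard(lifetimes):
    """Returns (ts, cdf, sf, hazard) for a sample of complete lifetimes."""
    ts, counts = np.unique(lifetimes, return_counts=True)
    pmf = counts / counts.sum()   # P(T == t)
    cdf = np.cumsum(pmf)          # P(T <= t)
    sf = 1 - cdf                  # P(T > t), the complementary CDF
    at_risk = sf + pmf            # P(T >= t)
    hazard = pmf / at_risk        # P(T == t | T >= t)
    return ts, cdf, sf, hazard

ts, cdf, sf, hazard = survival_and_hazard([1, 2, 2, 3, 3, 3, 4])
for row in zip(ts, cdf, sf, hazard):
    print(*row)
```

At the largest observed lifetime the hazard is 1, since every remaining lifetime ends there.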

Age at first marriage

We'll use the NSFG respondent file to estimate the hazard function and survival function for age at first marriage.

In [9]:

resp6 = nsfg.ReadFemResp()

We have to clean up a few variables.

In [10]:

resp6.cmmarrhx.replace([9997, 9998, 9999], np.nan, inplace=True)
resp6['agemarry'] = (resp6.cmmarrhx - resp6.cmbirth) / 12.0
resp6['age'] = (resp6.cmintvw - resp6.cmbirth) / 12.0

And then extract the age at first marriage for people who are married, and the age at time of interview for people who are not.

In [11]:

complete = resp6[resp6.evrmarry==1].agemarry.dropna()
ongoing = resp6[resp6.evrmarry==0].age

The following function uses Kaplan-Meier to estimate the hazard function.

In [12]:

from collections import Counter

def EstimateHazardFunction(complete, ongoing, label='', verbose=False):
    """Estimates the hazard function by Kaplan-Meier.

    http://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator

    complete: list of complete lifetimes
    ongoing: list of ongoing lifetimes
    label: string
    verbose: whether to display intermediate results
    """
    if np.sum(np.isnan(complete)):
        raise ValueError("complete contains NaNs")
    if np.sum(np.isnan(ongoing)):
        raise ValueError("ongoing contains NaNs")

    hist_complete = Counter(complete)
    hist_ongoing = Counter(ongoing)

    ts = list(hist_complete | hist_ongoing)
    ts.sort()

    at_risk = len(complete) + len(ongoing)

    # explicit float dtype so the hazards are stored as numbers
    lams = pd.Series(index=ts, dtype=float)
    for t in ts:
        ended = hist_complete[t]
        censored = hist_ongoing[t]

        lams[t] = ended / at_risk
        if verbose:
            print(t, at_risk, ended, censored, lams[t])
        at_risk -= ended + censored

    return survival.HazardFunction(lams, label=label)

Here is the hazard function and corresponding survival function.

In [13]:

hf = EstimateHazardFunction(complete, ongoing)
thinkplot.Plot(hf)
thinkplot.Config(xlabel='Age (years)',
                 ylabel='Hazard')

In [14]:

sf = hf.MakeSurvival()
thinkplot.Plot(sf)
thinkplot.Config(xlabel='Age (years)',
                 ylabel='Prob unmarried',
                 ylim=[0, 1])
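A toy example may help show what the estimator does with censored cases. This sketch repeats the loop from EstimateHazardFunction on made-up data, then gets the survival curve as the running product of (1 - hazard), which is the standard Kaplan-Meier relation. The function name is hypothetical, and whether MakeSurvival computes the curve in exactly this way is an assumption.

```python
from collections import Counter

import numpy as np

def kaplan_meier(complete, ongoing):
    """Returns (ts, hazards, survival) for complete and censored lifetimes."""
    ended = Counter(complete)
    censored = Counter(ongoing)
    ts = sorted(set(ended) | set(censored))

    at_risk = len(complete) + len(ongoing)
    lams = []
    for t in ts:
        # fraction of cases still at risk that end at time t
        lams.append(ended[t] / at_risk)
        # both ended and censored cases leave the at-risk pool
        at_risk -= ended[t] + censored[t]

    lams = np.array(lams)
    sf = np.cumprod(1 - lams)   # probability of surviving past each t
    return ts, lams, sf

ts, lams, sf = kaplan_meier(complete=[1, 2, 2, 3], ongoing=[2, 3])
for row in zip(ts, lams, sf):
    print(*row)
```

The censored cases never count as events, but they do shrink the at-risk pool, which is why the hazard at t=3 is 1/2 rather than 1/6.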

Quantifying uncertainty

To see how much the results depend on random sampling, we'll use a resampling process again.

In [15]:

def EstimateMarriageSurvival(resp):
    """Estimates the survival curve.

    resp: DataFrame of respondents

    returns: pair of HazardFunction, SurvivalFunction
    """
    # NOTE: Filling missing values would be better than dropping them.
    complete = resp[resp.evrmarry == 1].agemarry.dropna()
    ongoing = resp[resp.evrmarry == 0].age

    hf = EstimateHazardFunction(complete, ongoing)
    sf = hf.MakeSurvival()

    return hf, sf

In [16]:

def ResampleSurvival(resp, iters=101):
    """Resamples respondents and estimates the survival function.

    resp: DataFrame of respondents
    iters: number of resamples
    """ 
    _, sf = EstimateMarriageSurvival(resp)
    thinkplot.Plot(sf)

    low, high = resp.agemarry.min(), resp.agemarry.max()
    ts = np.arange(low, high, 1/12.0)

    ss_seq = []
    for _ in range(iters):
        sample = thinkstats2.ResampleRowsWeighted(resp)
        _, sf = EstimateMarriageSurvival(sample)
        ss_seq.append(sf.Probs(ts))

    low, high = thinkstats2.PercentileRows(ss_seq, [5, 95])
    thinkplot.FillBetween(ts, low, high, color='gray', label='90% CI')
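
The shaded band comes from lining up the resampled curves and reading off percentiles at each age. The sketch below shows that step in plain Python with made-up curves. percentile_rows is a hypothetical stand-in for thinkstats2.PercentileRows, and I am assuming a simple index-based percentile, which might not match the library's rule exactly.

```python
def percentile_rows(curves, percents):
    """Picks, column by column, the values at the given percentiles.

    curves: list of equal-length lists, one per resample
    percents: percentiles in the range [0, 100)
    """
    n = len(curves)
    # sort the values observed at each age independently
    columns = [sorted(col) for col in zip(*curves)]
    rows = []
    for p in percents:
        index = int(n * p / 100)   # index-based percentile (an assumption)
        rows.append([col[index] for col in columns])
    return rows

# hypothetical survival curves from 5 resamples, evaluated at 3 ages
curves = [
    [0.90, 0.60, 0.30],
    [0.92, 0.58, 0.28],
    [0.88, 0.62, 0.33],
    [0.91, 0.61, 0.31],
    [0.89, 0.59, 0.29],
]
low, high = percentile_rows(curves, [5, 95])
```

With only five resamples, the 5th and 95th percentiles are just the smallest and largest value at each age; with 101 resamples the band is less sensitive to a single extreme curve.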

The following plot shows the survival function based on the raw data and a 90% CI based on resampling.

In [17]:

ResampleSurvival(resp6)
thinkplot.Config(xlabel='Age (years)',
                 ylabel='Prob unmarried',
                 xlim=[12, 46],
                 ylim=[0, 1],
                 loc='upper right')

The SF based on the raw data falls outside the 90% CI because the CI is based on weighted resampling, and the raw data is not. You can confirm that by replacing ResampleRowsWeighted with ResampleRows in ResampleSurvival.
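
Here is a small sketch of why weighting matters, using made-up respondents rather than NSFG data. The helper names are hypothetical, and I am assuming that weighted resampling draws each row with probability proportional to its sampling weight.

```python
import random

def resample_rows(rows, rng):
    """Draws len(rows) rows with replacement, ignoring weights."""
    return rng.choices(rows, k=len(rows))

def resample_rows_weighted(rows, rng):
    """Draws len(rows) rows with probability proportional to weight."""
    weights = [weight for _, weight in rows]
    return rng.choices(rows, weights=weights, k=len(rows))

def fraction_married(rows):
    """Fraction of rows whose first element is 1."""
    return sum(value for value, _ in rows) / len(rows)

# hypothetical respondents: (ever married?, sampling weight)
# the unmarried group is undersampled, so each row carries more weight
rows = [(1, 1.0)] * 1000 + [(0, 9.0)] * 1000

rng = random.Random(17)
unweighted = fraction_married(resample_rows(rows, rng))
weighted = fraction_married(resample_rows_weighted(rows, rng))
```

In this toy sample the unweighted fraction is near 0.5, but the weighted fraction is near 0.1, so an estimate based on the raw rows can land well outside an interval built from weighted resamples.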

More data¶

To generate survival curves for each birth cohort, we need more data, which we can get by combining data from several NSFG cycles.

In [18]:

resp5 = survival.ReadFemResp1995()
resp6 = survival.ReadFemResp2002()
resp7 = survival.ReadFemResp2010()

In [19]:

resps = [resp5, resp6, resp7]

The following is the code from survival.py that generates SFs broken down by decade of birth.

In [20]:

def AddLabelsByDecade(groups, **options):
    """Draws fake points in order to add labels to the legend.

    groups: GroupBy object
    """
    thinkplot.PrePlot(len(groups))
    for name, _ in groups:
        label = '%d0s' % name
        thinkplot.Plot([15], [1], label=label, **options)

def EstimateMarriageSurvivalByDecade(groups, **options):
    """Groups respondents by decade and plots survival curves.

    groups: GroupBy object
    """
    thinkplot.PrePlot(len(groups))
    for _, group in groups:
        _, sf = EstimateMarriageSurvival(group)
        thinkplot.Plot(sf, **options)

def PlotResampledByDecade(resps, iters=11, predict_flag=False, omit=None):
    """Plots survival curves for resampled data.

    resps: list of DataFrames
    iters: number of resamples to plot
    predict_flag: whether to also plot predictions
    """
    for i in range(iters):
        samples = [thinkstats2.ResampleRowsWeighted(resp) 
                   for resp in resps]
        sample = pd.concat(samples, ignore_index=True)
        groups = sample.groupby('decade')

        if omit:
            groups = [(name, group) for name, group in groups 
                      if name not in omit]

        # TODO: refactor this to collect resampled estimates and
        # plot shaded areas
        if i == 0:
            AddLabelsByDecade(groups, alpha=0.7)

        if predict_flag:
            PlotPredictionsByDecade(groups, alpha=0.1)
            EstimateMarriageSurvivalByDecade(groups, alpha=0.1)
        else:
            EstimateMarriageSurvivalByDecade(groups, alpha=0.2)

Here are the results for the combined data.

In [21]:

PlotResampledByDecade(resps)
thinkplot.Config(xlabel='Age (years)',
                 ylabel='Prob unmarried',
                 xlim=[13, 45],
                 ylim=[0, 1])

We can generate predictions by assuming that the hazard function of each generation will be the same as for the previous generation.

In [22]:

def PlotPredictionsByDecade(groups, **options):
    """Plots predicted survival curves, extending each cohort's hazard
    function with values from the previous cohort.

    groups: GroupBy object
    """
    hfs = []
    for _, group in groups:
        hf, sf = EstimateMarriageSurvival(group)
        hfs.append(hf)

    thinkplot.PrePlot(len(hfs))
    for i, hf in enumerate(hfs):
        if i > 0:
            hf.Extend(hfs[i-1])
        sf = hf.MakeSurvival()
        thinkplot.Plot(sf, **options)
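
The idea of borrowing the previous cohort's hazard can be sketched with plain dictionaries. The cohort data below are invented, and extend_hazard is a hypothetical stand-in for HazardFunction.Extend; I am assuming it appends the older cohort's hazard values for ages beyond the last age observed in the younger cohort.

```python
def extend_hazard(hazard, previous):
    """Returns hazard plus the entries of previous beyond its last age.

    hazard, previous: dicts that map age to hazard
    """
    last = max(hazard)
    extended = dict(hazard)
    for age, lam in previous.items():
        if age > last:
            extended[age] = lam
    return extended

def survival_from_hazard(hazard):
    """Survival curve as the running product of (1 - hazard)."""
    surv = {}
    prob = 1.0
    for age in sorted(hazard):
        prob *= 1 - hazard[age]
        surv[age] = prob
    return surv

older = {20: 0.10, 30: 0.20, 40: 0.50}   # hypothetical older cohort
younger = {20: 0.05, 30: 0.10}           # observed only up to age 30

predicted = extend_hazard(younger, older)
surv = survival_from_hazard(predicted)
```

The younger cohort keeps its own hazard at ages 20 and 30 and inherits 0.50 at age 40, so the predicted curve continues past the last age we have data for.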

And here's what that looks like.

In [23]:

PlotResampledByDecade(resps, predict_flag=True)
thinkplot.Config(xlabel='Age (years)',
                 ylabel='Prob unmarried',
                 xlim=[13, 45],
                 ylim=[0, 1])

Remaining lifetime¶

Distributions with different shapes yield different behavior for remaining lifetime as a function of age.

In [24]:

preg = nsfg.ReadFemPreg()

complete = preg.query('outcome in [1, 3, 4]').prglngth
print('Number of complete pregnancies', len(complete))
ongoing = preg[preg.outcome == 6].prglngth
print('Number of ongoing pregnancies', len(ongoing))

hf = EstimateHazardFunction(complete, ongoing)
sf1 = hf.MakeSurvival()

Number of complete pregnancies 11189
Number of ongoing pregnancies 352

Here's the expected remaining duration of a pregnancy as a function of the number of weeks elapsed. After week 36, the process becomes "memoryless".

In [25]:

rem_life1 = sf1.RemainingLifetime()
thinkplot.Plot(rem_life1)
thinkplot.Config(title='Remaining pregnancy length',
                 xlabel='Weeks',
                 ylabel='Mean remaining weeks')
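
To see what "remaining lifetime" means computationally, here is a sketch that works from a PMF of lifetimes (a toy one, not the pregnancy data). For each time t, it conditions on lifetimes longer than t, computes the mean, and subtracts t. The function name and the data are illustrative assumptions, not the book's implementation.

```python
def remaining_lifetime(pmf):
    """Mean remaining lifetime for each t, given survival past t.

    pmf: dict that maps lifetime to probability
    """
    result = {}
    # skip the largest lifetime: nothing survives past it
    for t in sorted(pmf)[:-1]:
        later = {x: p for x, p in pmf.items() if x > t}
        total = sum(later.values())
        mean = sum(x * p for x, p in later.items()) / total
        result[t] = mean - t
    return result

pmf = {1: 0.5, 2: 0.25, 3: 0.25}   # hypothetical lifetimes
rem = remaining_lifetime(pmf)
```

Replacing the mean with a percentile of the conditional distribution gives the median version used for age at first marriage.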

And here's the median remaining time until first marriage as a function of age.

In [26]:

hf, sf2 = EstimateMarriageSurvival(resp6)

In [27]:

func = lambda pmf: pmf.Percentile(50)
rem_life2 = sf2.RemainingLifetime(filler=np.inf, func=func)
    
thinkplot.Plot(rem_life2)
thinkplot.Config(title='Years until first marriage',
                 ylim=[0, 15],
                 xlim=[11, 31],
                 xlabel='Age (years)',
                 ylabel='Median remaining years')

Exercises¶

Exercise: In NSFG Cycles 6 and 7, the variable cmdivorcx contains the date of divorce for the respondent’s first marriage, if applicable, encoded in century-months.
Compute the duration of marriages that have ended in divorce, and the duration, so far, of marriages that are ongoing. Estimate the hazard and survival curve for the duration of marriage.
Use resampling to take into account sampling weights, and plot data from several resamples to visualize sampling error.
Consider dividing the respondents into groups by decade of birth, and possibly by age at first marriage.

In [28]:

def CleanData(resp):
    """Cleans respondent data.

    resp: DataFrame
    """
    resp.cmdivorcx.replace([9998, 9999], np.nan, inplace=True)

    resp['notdivorced'] = resp.cmdivorcx.isnull().astype(int)
    resp['duration'] = (resp.cmdivorcx - resp.cmmarrhx) / 12.0
    resp['durationsofar'] = (resp.cmintvw - resp.cmmarrhx) / 12.0

    month0 = pd.to_datetime('1899-12-15')
    dates = [month0 + pd.DateOffset(months=cm) 
             for cm in resp.cmbirth]
    resp['decade'] = (pd.DatetimeIndex(dates).year - 1900) // 10
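
If century-months are unfamiliar, the following sketch shows the arithmetic without pandas. It assumes that century-month 1 is January 1900, which is consistent with the month0 used in CleanData; the function names are mine.

```python
def century_month_to_year_month(cm):
    """Converts a century-month to (year, month).

    Assumes cm=1 is January 1900.
    """
    year = 1900 + (cm - 1) // 12
    month = (cm - 1) % 12 + 1
    return year, month

def decade_of_birth(cm):
    """Decade index relative to 1900, for example 7 for the 1970s."""
    year, _ = century_month_to_year_month(cm)
    return (year - 1900) // 10

def duration_years(cm_start, cm_end):
    """Elapsed time between two century-months, in years."""
    return (cm_end - cm_start) / 12.0

birth = century_month_to_year_month(901)
decade = decade_of_birth(901)
years_married = duration_years(1000, 1030)
```

Because century-months are plain integers, durations are just differences divided by 12, which is what the duration and durationsofar columns compute.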

In [29]:

CleanData(resp6)
married6 = resp6[resp6.evrmarry==1]

CleanData(resp7)
married7 = resp7[resp7.evrmarry==1]

In [30]:

# Solution

def ResampleDivorceCurve(resps):
    """Plots divorce curves based on resampled data.

    resps: list of respondent DataFrames
    """
    for _ in range(11):
        samples = [thinkstats2.ResampleRowsWeighted(resp) 
                   for resp in resps]
        sample = pd.concat(samples, ignore_index=True)
        PlotDivorceCurveByDecade(sample, color='#225EA8', alpha=0.1)

    thinkplot.Show(xlabel='years',
                   axis=[0, 28, 0, 1])

In [31]:

# Solution

def ResampleDivorceCurveByDecade(resps):
    """Plots divorce curves for each birth cohort.

    resps: list of respondent DataFrames    
    """
    for i in range(41):
        samples = [thinkstats2.ResampleRowsWeighted(resp) 
                   for resp in resps]
        sample = pd.concat(samples, ignore_index=True)
        groups = sample.groupby('decade')
        if i == 0:
            survival.AddLabelsByDecade(groups, alpha=0.7)

        EstimateSurvivalByDecade(groups, alpha=0.1)

    thinkplot.Config(xlabel='Years',
                     ylabel='Fraction undivorced',
                     axis=[0, 28, 0, 1])

In [32]:

# Solution

def EstimateSurvivalByDecade(groups, **options):
    """Groups respondents by decade and plots survival curves.

    groups: GroupBy object
    """
    thinkplot.PrePlot(len(groups))
    for name, group in groups:
        _, sf = EstimateSurvival(group)
        thinkplot.Plot(sf, **options)

In [33]:

# Solution

def EstimateSurvival(resp):
    """Estimates the survival curve.

    resp: DataFrame of respondents

    returns: pair of HazardFunction, SurvivalFunction
    """
    complete = resp[resp.notdivorced == 0].duration.dropna()
    ongoing = resp[resp.notdivorced == 1].durationsofar.dropna()

    hf = survival.EstimateHazardFunction(complete, ongoing)
    sf = hf.MakeSurvival()

    return hf, sf

In [34]:

# Solution

ResampleDivorceCurveByDecade([married6, married7])

Another batch of Think Stats notebooks

2017-01-13T12:38:00.000-08:00

Getting ready to teach Data Science in the spring, I am going back through Think Stats and updating the Jupyter notebooks. When I am done, each chapter will have a notebook that shows the examples from the book along with some small exercises, with more substantial exercises at the end.

If you are reading the book, you can get the notebooks by cloning this repository on GitHub, and running the notebooks on your computer.

Or you can read (but not run) the notebooks on GitHub:

Chapter 10 Notebook (Chapter 10 Solutions)
Chapter 11 Notebook (Chapter 11 Solutions)
Chapter 12 Notebook (Chapter 12 Solutions)

I'll post the last two soon, but in the meantime you can see some of the more interesting exercises, and solutions, below.

Time series analysis¶

Load the data from "Price of Weed".

In [2]:

transactions = pd.read_csv('mj-clean.csv', parse_dates=[5])
transactions.head()

Out[2]:

	city	state	price	amount	quality	date	ppg	state.name	lat	lon
0	Annandale	VA	100	7.075	high	2010-09-02	14.13	Virginia	38.830345	-77.213870
1	Auburn	AL	60	28.300	high	2010-09-02	2.12	Alabama	32.578185	-85.472820
2	Austin	TX	60	28.300	medium	2010-09-02	2.12	Texas	30.326374	-97.771258
3	Belleville	IL	400	28.300	high	2010-09-02	14.13	Illinois	38.532311	-89.983521
4	Boone	NC	55	3.540	high	2010-09-02	15.54	North Carolina	36.217052	-81.687983

The following function takes a DataFrame of transactions and computes daily averages.

In [3]:

def GroupByDay(transactions, func=np.mean):
    """Groups transactions by day and computes the daily mean ppg.

    transactions: DataFrame of transactions

    returns: DataFrame of daily prices
    """
    grouped = transactions[['date', 'ppg']].groupby('date')
    daily = grouped.aggregate(func)

    daily['date'] = daily.index
    start = daily.date[0]
    one_year = np.timedelta64(1, 'Y')
    daily['years'] = (daily.date - start) / one_year

    return daily

The following function returns a map from quality name to a DataFrame of daily averages.

In [4]:

def GroupByQualityAndDay(transactions):
    """Divides transactions by quality and computes mean daily price.

    transactions: DataFrame of transactions
    
    returns: map from quality to time series of ppg
    """
    groups = transactions.groupby('quality')
    dailies = {}
    for name, group in groups:
        dailies[name] = GroupByDay(group)        

    return dailies

dailies is the map from quality name to DataFrame.

In [5]:

dailies = GroupByQualityAndDay(transactions)

The following plots the daily average price for each quality.

In [6]:

import matplotlib.pyplot as plt

thinkplot.PrePlot(rows=3)
for i, (name, daily) in enumerate(dailies.items()):
    thinkplot.SubPlot(i+1)
    title = 'Price per gram ($)' if i == 0 else ''
    thinkplot.Config(ylim=[0, 20], title=title)
    thinkplot.Scatter(daily.ppg, s=10, label=name)
    if i == 2: 
        plt.xticks(rotation=30)
        thinkplot.Config()
    else:
        thinkplot.Config(xticks=[])

We can use statsmodels to run a linear model of price as a function of time.

In [7]:

import statsmodels.formula.api as smf

def RunLinearModel(daily):
    model = smf.ols('ppg ~ years', data=daily)
    results = model.fit()
    return model, results
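
For a single explanatory variable, the least-squares fit reduces to two closed-form expressions. Here is a sketch in plain Python with invented prices; fit_line is a hypothetical helper, not part of StatsModels.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for ys as a function of xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

years = [0.0, 0.5, 1.0, 1.5, 2.0]
ppg = [13.4, 13.1, 12.7, 12.4, 12.0]   # hypothetical daily mean prices
intercept, slope = fit_line(years, ppg)
```

The slope is in dollars per gram per year, so a value near -0.7 means the fitted price drops about 70 cents per year.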

Here's what the results look like.

In [8]:

from IPython.display import display

for name, daily in dailies.items():
    model, results = RunLinearModel(daily)
    print(name)
    display(results.summary())

high

OLS Regression Results
Dep. Variable:	ppg	R-squared:	0.444
Model:	OLS	Adj. R-squared:	0.444
Method:	Least Squares	F-statistic:	989.7
Date:	Wed, 04 Jan 2017	Prob (F-statistic):	3.69e-160
Time:	11:44:14	Log-Likelihood:	-1510.1
No. Observations:	1241	AIC:	3024.
Df Residuals:	1239	BIC:	3035.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[95.0% Conf. Int.]
Intercept	13.4496	0.045	296.080	0.000	13.361 13.539
years	-0.7082	0.023	-31.460	0.000	-0.752 -0.664

Omnibus:	56.254	Durbin-Watson:	1.847
Prob(Omnibus):	0.000	Jarque-Bera (JB):	128.992
Skew:	0.252	Prob(JB):	9.76e-29
Kurtosis:	4.497	Cond. No.	4.71

medium

OLS Regression Results
Dep. Variable:	ppg	R-squared:	0.050
Model:	OLS	Adj. R-squared:	0.049
Method:	Least Squares	F-statistic:	64.92
Date:	Wed, 04 Jan 2017	Prob (F-statistic):	1.82e-15
Time:	11:44:14	Log-Likelihood:	-2053.9
No. Observations:	1238	AIC:	4112.
Df Residuals:	1236	BIC:	4122.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[95.0% Conf. Int.]
Intercept	8.8791	0.071	125.043	0.000	8.740 9.018
years	0.2832	0.035	8.057	0.000	0.214 0.352

Omnibus:	133.025	Durbin-Watson:	1.767
Prob(Omnibus):	0.000	Jarque-Bera (JB):	630.863
Skew:	0.385	Prob(JB):	1.02e-137
Kurtosis:	6.411	Cond. No.	4.73

low

OLS Regression Results
Dep. Variable:	ppg	R-squared:	0.030
Model:	OLS	Adj. R-squared:	0.029
Method:	Least Squares	F-statistic:	35.90
Date:	Wed, 04 Jan 2017	Prob (F-statistic):	2.76e-09
Time:	11:44:14	Log-Likelihood:	-3091.3
No. Observations:	1179	AIC:	6187.
Df Residuals:	1177	BIC:	6197.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[95.0% Conf. Int.]
Intercept	5.3616	0.194	27.671	0.000	4.981 5.742
years	0.5683	0.095	5.991	0.000	0.382 0.754

Omnibus:	649.338	Durbin-Watson:	1.820
Prob(Omnibus):	0.000	Jarque-Bera (JB):	6347.614
Skew:	2.373	Prob(JB):	0.00
Kurtosis:	13.329	Cond. No.	4.85

Now let's plot the fitted model with the data.

In [9]:

def PlotFittedValues(model, results, label=''):
    """Plots original data and fitted values.

    model: StatsModel model object
    results: StatsModel results object
    """
    years = model.exog[:,1]
    values = model.endog
    thinkplot.Scatter(years, values, s=15, label=label)
    thinkplot.Plot(years, results.fittedvalues, label='model', color='#ff7f00')

The following function plots the original data and the fitted curve.

In [10]:

def PlotLinearModel(daily, name):
    """Plots a linear fit to a sequence of prices, and the residuals.
    
    daily: DataFrame of daily prices
    name: string
    """
    model, results = RunLinearModel(daily)
    PlotFittedValues(model, results, label=name)
    thinkplot.Config(title='Fitted values',
                     xlabel='Years',
                     xlim=[-0.1, 3.8],
                     ylabel='Price per gram ($)')

Here are results for the high quality category:

In [11]:

name = 'high'
daily = dailies[name]

PlotLinearModel(daily, name)

Moving averages¶

As a simple example, I'll show the rolling average of the numbers from 0 to 9.

In [12]:

series = np.arange(10)

With a "window" of size 3, we get the average of the previous 3 elements, or nan when there are fewer than 3.

In [13]:

pd.rolling_mean(series, 3)

Out[13]:

array([ nan,  nan,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.])
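
Here is what the rolling mean does, written out in plain Python. rolling_mean here is a hypothetical helper, not the pandas function. Note that recent versions of pandas no longer provide pd.rolling_mean; the equivalent there is pd.Series(series).rolling(3).mean().

```python
import math

def rolling_mean(xs, window):
    """Mean of the most recent `window` elements, or nan if too few."""
    result = []
    for i in range(len(xs)):
        if i + 1 < window:
            # not enough history yet
            result.append(math.nan)
        else:
            chunk = xs[i + 1 - window : i + 1]
            result.append(sum(chunk) / window)
    return result

series = list(range(10))
means = rolling_mean(series, 3)
```

The first two entries are nan because the window is not full, and the third is the mean of 0, 1, and 2, which matches the output above.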

The following function plots the rolling mean.

In [14]:

def PlotRollingMean(daily, name):
    """Plots rolling mean.

    daily: DataFrame of daily prices
    """
    dates = pd.date_range(daily.index.min(), daily.index.max())
    reindexed = daily.reindex(dates)

    thinkplot.Scatter(reindexed.ppg, s=15, alpha=0.2, label=name)
    roll_mean = pd.rolling_mean(reindexed.ppg, 30)
    thinkplot.Plot(roll_mean, label='rolling mean', color='#ff7f00')
    plt.xticks(rotation=30)
    thinkplot.Config(ylabel='price per gram ($)')

Here's what it looks like for the high quality category.

In [15]:

PlotRollingMean(daily, name)

The exponentially-weighted moving average gives more weight to more recent points.

In [16]:

def PlotEWMA(daily, name):
    """Plots the exponentially-weighted moving average.

    daily: DataFrame of daily prices
    """
    dates = pd.date_range(daily.index.min(), daily.index.max())
    reindexed = daily.reindex(dates)

    thinkplot.Scatter(reindexed.ppg, s=15, alpha=0.2, label=name)
    roll_mean = pd.ewma(reindexed.ppg, 30)
    thinkplot.Plot(roll_mean, label='EWMA', color='#ff7f00')
    plt.xticks(rotation=30)
    thinkplot.Config(ylabel='price per gram ($)')

In [17]:

PlotEWMA(daily, name)
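
To show how the weights work, here is a sketch of an exponentially-weighted moving average in plain Python. It assumes the weights decay by a factor of 1 - alpha per step, with alpha = 2 / (span + 1), and that they are renormalized at each step, which is my understanding of the pandas default. In recent versions of pandas, pd.ewma is gone; the method form is Series.ewm(span=30).mean().

```python
def ewma(xs, span):
    """Exponentially-weighted moving average (toy version).

    Assumes alpha = 2 / (span + 1) and renormalized weights.
    """
    alpha = 2 / (span + 1)
    result = []
    num = 0.0   # weighted sum of the values seen so far
    den = 0.0   # sum of the weights seen so far
    for x in xs:
        # every older term decays by (1 - alpha) at each step
        num = x + (1 - alpha) * num
        den = 1 + (1 - alpha) * den
        result.append(num / den)
    return result

smooth = ewma([0, 0, 0, 1], span=3)
flat = ewma([5.0, 5.0, 5.0], span=10)
```

With span=3, the most recent point gets weight 1 and the three before it get 0.5, 0.25, and 0.125, so a jump from 0 to 1 moves the average a little more than halfway.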

We can use resampling to generate missing values with the right amount of noise.

In [18]:

def FillMissing(daily, span=30):
    """Fills missing values with an exponentially weighted moving average.

    Resulting DataFrame has new columns 'ewma' and 'resid'.

    daily: DataFrame of daily prices
    span: window size (sort of) passed to ewma

    returns: new DataFrame of daily prices
    """
    dates = pd.date_range(daily.index.min(), daily.index.max())
    reindexed = daily.reindex(dates)

    ewma = pd.ewma(reindexed.ppg, span=span)

    resid = (reindexed.ppg - ewma).dropna()
    fake_data = ewma + thinkstats2.Resample(resid, len(reindexed))
    reindexed.ppg.fillna(fake_data, inplace=True)

    reindexed['ewma'] = ewma
    reindexed['resid'] = reindexed.ppg - ewma
    return reindexed

In [19]:

def PlotFilled(daily, name):
    """Plots the EWMA and filled data.

    daily: DataFrame of daily prices
    """
    filled = FillMissing(daily, span=30)
    thinkplot.Scatter(filled.ppg, s=15, alpha=0.2, label=name)
    thinkplot.Plot(filled.ewma, label='EWMA', color='#ff7f00')
    plt.xticks(rotation=30)
    thinkplot.Config(ylabel='Price per gram ($)')

Here's what the EWMA model looks like with missing values filled.

In [20]:

PlotFilled(daily, name)

Serial correlation¶

The following function computes serial correlation with the given lag.

In [21]:

def SerialCorr(series, lag=1):
    xs = series[lag:]
    ys = series.shift(lag)[lag:]
    corr = thinkstats2.Corr(xs, ys)
    return corr
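
The same computation can be sketched without pandas, which makes it clear that serial correlation is just Pearson's correlation between a series and a shifted copy of itself. The helper names and the test series are illustrative.

```python
def pearson(xs, ys):
    """Pearson's correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def serial_corr(xs, lag=1):
    """Correlation between xs and a copy of itself shifted by lag."""
    return pearson(xs[lag:], xs[:-lag])

trend = [1, 2, 3, 4, 5, 6]         # steadily increasing
zigzag = [1, -1, 1, -1, 1, -1]     # alternates every step

corr_trend = serial_corr(trend, lag=1)
corr_zigzag1 = serial_corr(zigzag, lag=1)
corr_zigzag2 = serial_corr(zigzag, lag=2)
```

A trend produces a lag-1 correlation near 1, while a series that alternates has correlation near -1 at lag 1 and near 1 at lag 2, which is the signature of a cycle with period 2.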

Before computing correlations, we'll fill missing values.

In [22]:

filled_dailies = {}
for name, daily in dailies.items():
    filled_dailies[name] = FillMissing(daily, span=30)

Here are the serial correlations for raw price data.

In [23]:

for name, filled in filled_dailies.items():            
    corr = thinkstats2.SerialCorr(filled.ppg, lag=1)
    print(name, corr)

high 0.480157057617
medium 0.1736217448
low 0.119991439308

It's not surprising that there are correlations between consecutive days, because there are obvious trends in the data.

It is more interesting to see whether there are still correlations after we subtract away the trends.

In [24]:

for name, filled in filled_dailies.items():            
    corr = thinkstats2.SerialCorr(filled.resid, lag=1)
    print(name, corr)

high -0.0164485711933
medium -0.0181750319436
low 0.0469291300611

Even if the correlations between consecutive days are weak, there might be correlations across intervals of one week, one month, or one year.

In [25]:

rows = []
for lag in [1, 7, 30, 365]:
    print(lag, end='\t')
    for name, filled in filled_dailies.items():            
        corr = SerialCorr(filled.resid, lag)
        print('%.2g' % corr, end='\t')
    print()

1 -0.016 -0.018 0.047 
7 0.0032 -0.032 -0.019 
30 0.011 -0.0014 -0.017 
365 0.051 0.013 0.026

None of these correlations is large. In this table, the largest in magnitude is 0.051, at a lag of 365 days in the first column (the high quality category).

Autocorrelation¶

The autocorrelation function is the serial correlation computed for all lags.

We can use it to replicate the results from the previous section.

In [26]:

import statsmodels.tsa.stattools as smtsa

filled = filled_dailies['high']
acf = smtsa.acf(filled.resid, nlags=365, unbiased=True)
print('%0.2g, %.2g, %0.2g, %0.2g, %0.2g' % 
      (acf[0], acf[1], acf[7], acf[30], acf[365]))

1, -0.016, 0.0031, 0.011, 0.049

To get a sense of how much autocorrelation we should expect by chance, we can resample the data (which eliminates any actual autocorrelation) and compute the ACF.

In [27]:

def SimulateAutocorrelation(daily, iters=1001, nlags=40):
    """Resample residuals, compute autocorrelation, and plot percentiles.

    daily: DataFrame
    iters: number of simulations to run
    nlags: maximum lags to compute autocorrelation
    """
    # run simulations
    t = []
    for _ in range(iters):
        filled = FillMissing(daily, span=30)
        resid = thinkstats2.Resample(filled.resid)
        acf = smtsa.acf(resid, nlags=nlags, unbiased=True)[1:]
        t.append(np.abs(acf))

    high = thinkstats2.PercentileRows(t, [97.5])[0]
    low = -high
    lags = range(1, nlags+1)
    thinkplot.FillBetween(lags, low, high, alpha=0.2, color='gray')

The following function plots the actual autocorrelation for lags up to 40 days.

The flag add_weekly indicates whether we should add a simulated weekly cycle.

In [28]:

def PlotAutoCorrelation(dailies, nlags=40, add_weekly=False):
    """Plots autocorrelation functions.

    dailies: map from category name to DataFrame of daily prices
    nlags: number of lags to compute
    add_weekly: boolean, whether to add a simulated weekly pattern
    """
    thinkplot.PrePlot(3)
    daily = dailies['high']
    SimulateAutocorrelation(daily)

    for name, daily in dailies.items():

        if add_weekly:
            daily = AddWeeklySeasonality(daily)

        filled = FillMissing(daily, span=30)

        acf = smtsa.acf(filled.resid, nlags=nlags, unbiased=True)
        lags = np.arange(len(acf))
        thinkplot.Plot(lags[1:], acf[1:], label=name)

To show what a strong weekly cycle would look like, we have the option of adding a price increase of up to 2 dollars on Fridays and Saturdays.

In [29]:

def AddWeeklySeasonality(daily):
    """Adds a weekly pattern.

    daily: DataFrame of daily prices

    returns: new DataFrame of daily prices
    """
    fri_or_sat = (daily.index.dayofweek==4) | (daily.index.dayofweek==5)
    fake = daily.copy()
    fake.loc[fri_or_sat, 'ppg'] += np.random.uniform(0, 2, fri_or_sat.sum())
    return fake

Here's what the real ACFs look like. The gray regions indicate the levels we expect by chance.

In [30]:

axis = [0, 41, -0.2, 0.2]

PlotAutoCorrelation(dailies, add_weekly=False)
thinkplot.Config(axis=axis,
                 loc='lower right',
                 ylabel='correlation',
                 xlabel='lag (days)')

Here's what it would look like if there were a weekly cycle.

In [31]:

PlotAutoCorrelation(dailies, add_weekly=True)
thinkplot.Config(axis=axis,
                 loc='lower right',
                 xlabel='lag (days)')

Prediction¶

The simplest way to generate predictions is to use statsmodels to fit a model to the data, then use the predict method from the results.

In [32]:

def GenerateSimplePrediction(results, years):
    """Generates a simple prediction.

    results: results object
    years: sequence of times (in years) to make predictions for

    returns: sequence of predicted values
    """
    n = len(years)
    inter = np.ones(n)
    d = dict(Intercept=inter, years=years, years2=years**2)
    predict_df = pd.DataFrame(d)
    predict = results.predict(predict_df)
    return predict

In [33]:

def PlotSimplePrediction(results, years):
    predict = GenerateSimplePrediction(results, years)

    thinkplot.Scatter(daily.years, daily.ppg, alpha=0.2, label=name)
    thinkplot.plot(years, predict, color='#ff7f00')
    xlim = years[0]-0.1, years[-1]+0.1
    thinkplot.Config(title='Predictions',
                     xlabel='Years',
                     xlim=xlim,
                     ylabel='Price per gram ($)',
                     loc='upper right')

Here's what the prediction looks like for the high quality category, using the linear model.

In [34]:

name = 'high'
daily = dailies[name]

_, results = RunLinearModel(daily)
years = np.linspace(0, 5, 101)
PlotSimplePrediction(results, years)

When we generate predictions, we want to quantify the uncertainty in the prediction. We can do that by resampling. The following function fits a model to the data, computes residuals, then resamples from the residuals to generate fake datasets. It fits the same model to each fake dataset and returns a list of results.

In [35]:

def SimulateResults(daily, iters=101, func=RunLinearModel):
    """Run simulations based on resampling residuals.

    daily: DataFrame of daily prices
    iters: number of simulations
    func: function that fits a model to the data

    returns: list of result objects
    """
    _, results = func(daily)
    fake = daily.copy()
    
    result_seq = []
    for _ in range(iters):
        fake.ppg = results.fittedvalues + thinkstats2.Resample(results.resid)
        _, fake_results = func(fake)
        result_seq.append(fake_results)

    return result_seq
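
The resampling step is easier to see with a one-variable model and made-up data. The sketch below fits a line, resamples the residuals with replacement, adds them back to the fitted values, and refits; the spread of the refitted slopes approximates the sampling error. All names and numbers here are assumptions for illustration.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for ys as a function of xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def simulate_slopes(xs, ys, iters, rng):
    """Refits the line to datasets built from resampled residuals."""
    intercept, slope = fit_line(xs, ys)
    fitted = [intercept + slope * x for x in xs]
    resid = [y - f for y, f in zip(ys, fitted)]

    slopes = []
    for _ in range(iters):
        noise = rng.choices(resid, k=len(resid))   # with replacement
        fake = [f + e for f, e in zip(fitted, noise)]
        slopes.append(fit_line(xs, fake)[1])
    return slopes

rng = random.Random(1)
xs = [i / 10 for i in range(40)]
ys = [13.4 - 0.7 * x + rng.gauss(0, 0.5) for x in xs]   # hypothetical prices

slopes = sorted(simulate_slopes(xs, ys, 201, rng))
low, high = slopes[10], slopes[190]   # roughly the 5th and 95th percentiles
```

The interval from low to high plays the same role as the darker region in the prediction plots: it reflects uncertainty about the fitted parameters, not the noise around the line.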

To generate predictions, we take the list of results fitted to resampled data. For each model, we use the predict method to generate predictions, and return a sequence of predictions.

If add_resid is true, we add resampled residuals to the predicted values, which generates predictions that include predictive uncertainty (due to random noise) as well as modeling uncertainty (due to random sampling).

In [36]:

def GeneratePredictions(result_seq, years, add_resid=False):
    """Generates an array of predicted values from a list of model results.

    When add_resid is False, predictions represent sampling error only.

    When add_resid is True, they also include residual error (which is
    more relevant to prediction).
    
    result_seq: list of model results
    years: sequence of times (in years) to make predictions for
    add_resid: boolean, whether to add in resampled residuals

    returns: sequence of predictions
    """
    n = len(years)
    d = dict(Intercept=np.ones(n), years=years, years2=years**2)
    predict_df = pd.DataFrame(d)
    
    predict_seq = []
    for fake_results in result_seq:
        predict = fake_results.predict(predict_df)
        if add_resid:
            predict += thinkstats2.Resample(fake_results.resid, n)
        predict_seq.append(predict)

    return predict_seq

To visualize predictions, I show a darker region that quantifies modeling uncertainty and a lighter region that quantifies predictive uncertainty.

In [37]:

def PlotPredictions(daily, years, iters=101, percent=90, func=RunLinearModel):
    """Plots predictions.

    daily: DataFrame of daily prices
    years: sequence of times (in years) to make predictions for
    iters: number of simulations
    percent: what percentile range to show
    func: function that fits a model to the data
    """
    result_seq = SimulateResults(daily, iters=iters, func=func)
    p = (100 - percent) / 2
    percents = p, 100-p

    predict_seq = GeneratePredictions(result_seq, years, add_resid=True)
    low, high = thinkstats2.PercentileRows(predict_seq, percents)
    thinkplot.FillBetween(years, low, high, alpha=0.3, color='gray')

    predict_seq = GeneratePredictions(result_seq, years, add_resid=False)
    low, high = thinkstats2.PercentileRows(predict_seq, percents)
    thinkplot.FillBetween(years, low, high, alpha=0.5, color='gray')

Here are the results for the high quality category.

In [38]:

years = np.linspace(0, 5, 101)
thinkplot.Scatter(daily.years, daily.ppg, alpha=0.1, label=name)
PlotPredictions(daily, years)
xlim = years[0]-0.1, years[-1]+0.1
thinkplot.Config(title='Predictions',
                 xlabel='Years',
                 xlim=xlim,
                 ylabel='Price per gram ($)')

But there is one more source of uncertainty: how much past data should we use to build the model?

The following function generates a sequence of models based on different amounts of past data.

In [39]:

def SimulateIntervals(daily, iters=101, func=RunLinearModel):
    """Run simulations based on different subsets of the data.

    daily: DataFrame of daily prices
    iters: number of simulations
    func: function that fits a model to the data

    returns: list of result objects
    """
    result_seq = []
    starts = np.linspace(0, len(daily), iters).astype(int)

    for start in starts[:-2]:
        subset = daily[start:]
        _, results = func(subset)
        fake = subset.copy()

        for _ in range(iters):
            fake.ppg = (results.fittedvalues + 
                        thinkstats2.Resample(results.resid))
            _, fake_results = func(fake)
            result_seq.append(fake_results)

    return result_seq

And this function plots the results.

In [40]:

def PlotIntervals(daily, years, iters=101, percent=90, func=RunLinearModel):
    """Plots predictions based on different intervals.

    daily: DataFrame of daily prices
    years: sequence of times (in years) to make predictions for
    iters: number of simulations
    percent: what percentile range to show
    func: function that fits a model to the data
    """
    result_seq = SimulateIntervals(daily, iters=iters, func=func)
    p = (100 - percent) / 2
    percents = p, 100-p

    predict_seq = GeneratePredictions(result_seq, years, add_resid=True)
    low, high = thinkstats2.PercentileRows(predict_seq, percents)
    thinkplot.FillBetween(years, low, high, alpha=0.2, color='gray')

Here's what the high quality category looks like if we take into account uncertainty about how much past data to use.

In [41]:

name = 'high'
daily = dailies[name]

thinkplot.Scatter(daily.years, daily.ppg, alpha=0.1, label=name)
PlotIntervals(daily, years)
PlotPredictions(daily, years)
xlim = years[0]-0.1, years[-1]+0.1
thinkplot.Config(title='Predictions',
                 xlabel='Years',
                 xlim=xlim,
                 ylabel='Price per gram ($)')

Exercises¶

Exercise: The linear model I used in this chapter has the obvious drawback that it is linear, and there is no reason to expect prices to change linearly over time. We can add flexibility to the model by adding a quadratic term, as we did in Section 11.3.

Use a quadratic model to fit the time series of daily prices, and use the model to generate predictions. You will have to write a version of RunLinearModel that runs that quadratic model, but after that you should be able to reuse code from the chapter to generate predictions.

In [42]:

# Solution

def RunQuadraticModel(daily):
    """Runs a quadratic model of prices versus years.

    daily: DataFrame of daily prices

    returns: model, results
    """
    daily['years2'] = daily.years**2
    model = smf.ols('ppg ~ years + years2', data=daily)
    results = model.fit()
    return model, results

In [43]:

# Solution

name = 'high'
daily = dailies[name]

model, results = RunQuadraticModel(daily)
results.summary()

Out[43]:

OLS Regression Results
Dep. Variable:	ppg	R-squared:	0.455
Model:	OLS	Adj. R-squared:	0.454
Method:	Least Squares	F-statistic:	517.5
Date:	Wed, 04 Jan 2017	Prob (F-statistic):	4.57e-164
Time:	11:45:26	Log-Likelihood:	-1497.4
No. Observations:	1241	AIC:	3001.
Df Residuals:	1238	BIC:	3016.
Df Model:	2
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[95.0% Conf. Int.]
Intercept	13.6980	0.067	205.757	0.000	13.567 13.829
years	-1.1171	0.084	-13.326	0.000	-1.282 -0.953
years2	0.1132	0.022	5.060	0.000	0.069 0.157

Omnibus:	49.112	Durbin-Watson:	1.885
Prob(Omnibus):	0.000	Jarque-Bera (JB):	113.885
Skew:	0.199	Prob(JB):	1.86e-25
Kurtosis:	4.430	Cond. No.	27.5
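One way to read the coefficients in this table: because the coefficient on years2 is positive, the fitted parabola has a minimum, and past that point the model predicts rising prices. The sketch below does that arithmetic with the rounded coefficients copied from the table; the variable names are mine.

```python
# Rounded coefficients copied from the regression table above
inter = 13.6980
lin = -1.1171    # coefficient on years
quad = 0.1132    # coefficient on years2

# inter + lin*t + quad*t**2 with quad > 0 is smallest at t = -lin / (2*quad)
t_min = -lin / (2 * quad)
ppg_min = inter + lin * t_min + quad * t_min**2

print(round(t_min, 2), round(ppg_min, 2))
```

The minimum falls at about 4.9 years, at a price of about $10.94 per gram. The fitted-values plot below only covers the first 3.8 years or so, so the turnaround happens outside the plotted data, which is a reason to be cautious about long extrapolations from this model.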

In [44]:

# Solution

PlotFittedValues(model, results, label=name)
thinkplot.Config(title='Fitted values',
                 xlabel='Years',
                 xlim=[-0.1, 3.8],
                 ylabel='price per gram ($)')

In [45]:

# Solution

years = np.linspace(0, 5, 101)
thinkplot.Scatter(daily.years, daily.ppg, alpha=0.1, label=name)
PlotPredictions(daily, years, func=RunQuadraticModel)
thinkplot.Config(title='predictions',
                 xlabel='Years',
                 xlim=[years[0]-0.1, years[-1]+0.1],
                 ylabel='Price per gram ($)')

Exercise: Write a definition for a class named SerialCorrelationTest that extends HypothesisTest from Section 9.2. It should take a series and a lag as data, compute the serial correlation of the series with the given lag, and then compute the p-value of the observed correlation.

Use this class to test whether the serial correlation in raw price data is statistically significant. Also test the residuals of the linear model and (if you did the previous exercise) the quadratic model.

In [46]:

# Solution

class SerialCorrelationTest(thinkstats2.HypothesisTest):
    """Tests serial correlations by permutation."""

    def TestStatistic(self, data):
        """Computes the test statistic.

        data: tuple of (series, lag)
        """
        series, lag = data
        test_stat = abs(SerialCorr(series, lag))
        return test_stat

    def RunModel(self):
        """Run the model of the null hypothesis.

        returns: simulated data
        """
        series, lag = self.data
        permutation = series.reindex(np.random.permutation(series.index))
        return permutation, lag

In [47]:

# Solution

# test the correlation between consecutive prices

name = 'high'
daily = dailies[name]

series = daily.ppg
test = SerialCorrelationTest((series, 1))
pvalue = test.PValue()
print(test.actual, pvalue)

0.485229376195 0.0

In [48]:

# Solution

# test for serial correlation in residuals of the linear model

_, results = RunLinearModel(daily)
series = results.resid
test = SerialCorrelationTest((series, 1))
pvalue = test.PValue()
print(test.actual, pvalue)

0.0757047376751 0.011

In [49]:

# Solution

# test for serial correlation in residuals of the quadratic model

_, results = RunQuadraticModel(daily)
series = results.resid
test = SerialCorrelationTest((series, 1))
pvalue = test.PValue()
print(test.actual, pvalue)

0.0560730816129 0.041
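To make the logic of these tests concrete without thinkstats2, here is a minimal sketch of a permutation test for serial correlation in plain Python. The function names, the synthetic series, and the number of iterations are assumptions for illustration; it is not the implementation of HypothesisTest or SerialCorr.

```python
import random

def serial_corr(xs, lag=1):
    """Pearson correlation between the series and itself shifted by lag."""
    a = xs[lag:]
    b = xs[:-lag]
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def permutation_pvalue(xs, lag=1, iters=1000, seed=17):
    """Fraction of shuffled series whose |serial correlation| is at least
    as large as the observed one."""
    rng = random.Random(seed)
    actual = abs(serial_corr(xs, lag))
    shuffled = list(xs)
    count = 0
    for _ in range(iters):
        rng.shuffle(shuffled)   # shuffling destroys any ordering in time
        if abs(serial_corr(shuffled, lag)) >= actual:
            count += 1
    return actual, count / iters

# Synthetic series with a downward trend plus noise (not the price data)
rng = random.Random(1)
trend = [10 - 0.01 * i + rng.gauss(0, 0.5) for i in range(300)]

actual, pvalue = permutation_pvalue(trend)
print(actual, pvalue)
```

A series with a trend has high serial correlation just because nearby values are similar, which is why the raw prices show a much larger correlation than the residuals of the linear and quadratic models.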

Worked example: There are several ways to extend the EWMA model to generate predictions. One of the simplest is something like this:

1. Compute the EWMA of the time series and use the last point as an intercept, inter.
2. Compute the EWMA of differences between successive elements in the time series and use the last point as a slope, slope.
3. To predict values at future times, compute inter + slope * dt, where dt is the difference between the time of the prediction and the time of the last observation.
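The steps above can be sketched in plain Python. This is a simplified version under stated assumptions: it uses the recursive form of the EWMA with alpha = 2 / (span + 1), which is not identical to the default weighting in pandas, and the function names and the toy series are mine.

```python
def ewma(values, span):
    """Recursive EWMA with alpha = 2 / (span + 1)."""
    alpha = 2 / (span + 1)
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

def predict_ewma(values, span, dt):
    """Predict the value dt steps after the last observation."""
    # Step 1: the last EWMA value is the intercept
    inter = ewma(values, span)[-1]
    # Step 2: the last EWMA of successive differences is the slope
    diffs = [b - a for a, b in zip(values, values[1:])]
    slope = ewma(diffs, span)[-1]
    # Step 3: extrapolate linearly
    return inter + slope * dt

# Toy series that falls by 0.5 per step (not the price data)
series = [10.0 - 0.5 * t for t in range(20)]

inter = predict_ewma(series, span=5, dt=0)
two_steps = predict_ewma(series, span=5, dt=2)
print(inter, two_steps)
```

Note that the EWMA lags behind a trending series, so for falling prices the intercept sits above the last observed value. That is a limitation of this simple method, not a bug in the sketch.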

In [50]:

name = 'high'
daily = dailies[name]

filled = FillMissing(daily)
diffs = filled.ppg.diff()

thinkplot.plot(diffs)
plt.xticks(rotation=30)
thinkplot.Config(ylabel='Daily change in price per gram ($)')

In [51]:

# pd.ewma has been removed from pandas; Series.ewm(...).mean() replaces it
filled['slope'] = diffs.ewm(span=365).mean()
thinkplot.plot(filled.slope[-365:])
plt.xticks(rotation=30)
thinkplot.Config(ylabel='EWMA of diff ($)')

In [52]:

# extract the last inter and the mean of the last 30 slopes
start = filled.index[-1]
inter = filled.ewma.iloc[-1]
slope = filled.slope[-30:].mean()

start, inter, slope

Out[52]:

(Timestamp('2014-05-13 00:00:00', offset='D'),
 10.929518765455491,
 -0.0025727727289879565)
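To get a feel for what these numbers imply, here is the prediction formula inter + slope * dt applied to the values in this output. The function name is mine, and the horizons are arbitrary choices for illustration.

```python
# Values copied from the output above
inter = 10.929518765455491       # last EWMA value, dollars per gram
slope = -0.0025727727289879565   # mean of the last 30 EWMA slopes, dollars per gram per day

def predict_price(dt):
    """Predicted price per gram dt days after the last observation."""
    return inter + slope * dt

for dt in [30, 180, 365]:
    print(dt, round(predict_price(dt), 2))
```

A year after the last observation, the predicted price is about $9.99 per gram, a drop of a little less than a dollar.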

In [54]:

# reindex the DataFrame, adding a year to the end
dates = pd.date_range(filled.index.min(), 
                      filled.index.max() + np.timedelta64(365, 'D'))
predicted = filled.reindex(dates)

In [55]:

# generate predicted values and add them to the end
predicted['date'] = predicted.index
one_day = np.timedelta64(1, 'D')
predicted['days'] = (predicted.date - start) / one_day
predict = inter + slope * predicted.days
predicted['ewma'] = predicted.ewma.fillna(predict)

In [56]:

# plot the actual values and predictions
thinkplot.Scatter(daily.ppg, alpha=0.1, label=name)
thinkplot.Plot(predicted.ewma, color='#ff7f00')

As an exercise, run this analysis again for the other quality categories.

