A.C. Thomas, Scientist

An NHL Shot Plotting System Using Hexagonal Binning, Which Must Of Course Be Named "Hextally"

2014-06-19T01:11:46Z

Earlier implementations of hexagonal bin plotting for NHL shots on goal were very productive, so thanks to those who gave feedback and helped to improve it.

The new version has many upgrades:

A permanent name: Hextally, to represent the hexagonal binning process and to follow the tradition of PECOTA and other player-named methods. The name of course came second, but I do find it funny that this applet and method are used to judge shooting skill, and are named for the only goaltender to score two goals on two shots.
Player charts! We can now look not only at the shots taken by each player, but also at the differing performances of the team when that player is on and off the ice.
Man situations: full strength, power-play/shorthanded and four-on-four are all available.
Rink adjustments. The number of shots on goal is maintained to be the same, but to correct the overall balance of shots by zone, I randomly select a number of shots to move to a neighboring zone, such that the proportion of shots of each type is the same at home and away. (Snap shots and wrist shots were pooled due to the systemic confusion between these types by the official scorers.)
Adjustment for small sample sizes. Since the method estimates shooting rates per 60 minutes, players who have low time-on-ice in a particular scenario, like a penalty killer with minimal power-play time, will have rates with high variances. To compensate, we add "fake" shots on goal to each scoring zone at the same rate as the team without that player, with enough extra time to bring the total up to 300 minutes of time on ice. (This was chosen on eye inspection and has not been peer-reviewed, but it suits the goal of evaluating players compared to their teammates.)
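A minimal sketch of the small-sample adjustment, written in Python for illustration. The function and argument names are hypothetical, and the reading that a player's time is padded exactly up to the 300-minute threshold is an assumption; the applet's own code may differ.

```python
def padded_rate_per_60(player_shots, player_minutes,
                       team_rate_per_60_without, threshold_minutes=300.0):
    """Shots per 60 minutes in one zone after padding with 'fake' shots.

    Assumption: when the player has fewer than `threshold_minutes` on ice,
    the shortfall is filled with time played at the team's rate without him.
    """
    extra_minutes = max(0.0, threshold_minutes - player_minutes)
    fake_shots = team_rate_per_60_without * extra_minutes / 60.0
    total_minutes = player_minutes + extra_minutes
    return 60.0 * (player_shots + fake_shots) / total_minutes


# A penalty killer with 10 shots in 30 power-play minutes has a raw rate of
# 20 per 60; padding with 270 minutes at a team rate of 6 pulls it toward 6.
low_toi = padded_rate_per_60(10, 30, 6.0)
# A player already past the threshold is left untouched.
high_toi = padded_rate_per_60(50, 600, 6.0)
```

The padding acts like a prior centered on the team's rate: the fewer minutes a player has, the more his estimate is pulled toward his teammates.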

Comments are appreciated to improve both the presentation and the methods at hand!

Whither Location? Shot-Based Statistics Don't Just Measure Possession in Hockey

2014-05-14T00:30:08Z

Hockey writing these days is peppered with references to Corsi and Fenwick, which are fancier names for the differential of shot attempts made by and against a team, or with respect to when a particular player is on the ice. These are fairly predictive of future success (or failure) because they indicate the balance to which a team has possession of the puck in their offensive zone -- the two requisite conditions for scoring a goal under most circumstances.

And yet, the shorthand for this in the media is that these are proxies for puck possession, leaving location out in the cold, with some exceptions that are regrettably in the minority. Leave aside the notion that a blocked shot or a shot from the point is less valuable than a bona fide scoring chance; what's the actual consequence of this preference for possession over location?

If these measures had been in widespread use 10 years ago, I'd have said that prioritizing possession over location would have had a negative impact for one reason: the era of the neutral zone trap and the explicit importance that successful teams placed on pinning the opposing team in their own zone meant that of those two elements, location really was supreme. In my very first published paper, I used a very limited but fun-to-collect data set to measure zone time and puck possession; for these games, it was clear that being in the offensive zone, but not having possession of the puck, was on average better for that team in terms of net goals scored in the ensuing seconds. And this was in a league that already had two-line passes.

The real danger isn't so much for people who are in the know. With the increasing acceptance of #fancystats into the public sphere, it would be far too easy to assume that playing keep-away is necessarily better than playing dump-and-chase when it's a word, not actual number-crunching, that's pushing that point.

Pucksberry: Adapting Hexagonal Bin Plots for NHL Display

2014-04-30T03:20:22Z

One of my favorite classes of statistical graphics in sports media is the hexagonal bin plot, used by Grantland's Kirk Goldsberry to illuminate the shooting patterns and successes of shooters in the NBA. Combined with his access to the luxuriously rich SportVU data, Goldsberry has made a second career using a single graphic to tell stories (he's also a geography professor).

As of this NBA season, SportVU gives the x-y locations of all shots taken along with their success or failure in scoring, so Goldsberry has two variables to plot: the relative location of shots, and the proportion of shots that go in. These make for glorious comparisons to make a point, like how the Spurs dominated during their winning streak:

So of course, as a statistics professor who teaches graphics and does research on hockey, my first instinct is to ~~steal it for a massive profit~~ see how I can adopt, adapt and improve this method for the hockey community at large, particularly since x-y data on shot attempts has been available for the NHL since 2008.

So what are the big differences between NBA and NHL data that we have to bear in mind? And when do we get to see some pretty pictures? (The answer to both after the jump.)

Beyond that, the locations a player shoots from are often far more rigidly defined by their position -- defensemen will shoot from the extremities and typically from their standard side.

We're going to get the most leverage, then, out of applying these plots not to individual shooters but to full teams.

2) Among other things, the presence of a goaltender and the relative difference in puck travel time means that there's way more variability in shot success by location -- between 2 and 20 percent on average, rather than a spread of 30 to 70 percent in the NBA. Combined with the low success rate, it will be more difficult to establish any meaningful differences.

3) The NBA has two resolutions of its shot attempts: a miss and a basket. In contrast, the NHL has four -- shots that are blocked by players on the way to the net (BLOCK), shots that make it the distance but miss the net (MISS), shots that make contact with the goaltender (SHOT -- though a "save" is credited even if the shot would have missed the net), and goals.

Here's where the NHL data starts to show its seams:

GOAL and SHOT have both distance to the net and x-y coordinates.
MISS has distance, but no coordinates.
BLOCK has neither distance nor coordinates.

For the sake of these plots, we can impute x-y coordinates for MISS based on the SHOT distribution, but blocked shots have to be omitted from this plot.
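One way the imputation for MISS events could work is sketched below in Python. This is an illustration under stated assumptions, not the method behind the plots: it borrows the angle of a randomly chosen SHOT at a similar distance and places the MISS at its recorded distance along that angle. The function name, the 3-foot tolerance and the net position at (89, 0) are all assumptions.

```python
import math
import random


def impute_miss_xy(miss_distance, shot_xy, net_xy=(89.0, 0.0), tol=3.0, rng=random):
    """Impute (x, y) for a MISS from its distance and the observed SHOT locations.

    Assumptions: coordinates are in feet, the net sits at `net_xy`, and shots
    within `tol` feet of the same distance share a common angle distribution.
    """
    nx, ny = net_xy
    # Convert every SHOT to (distance, angle) relative to the net.
    polar = [(math.hypot(x - nx, y - ny), math.atan2(y - ny, x - nx))
             for x, y in shot_xy]
    angles = [a for d, a in polar if abs(d - miss_distance) <= tol]
    if not angles:
        # No SHOT at a similar distance: fall back to the closest one.
        angles = [min(polar, key=lambda da: abs(da[0] - miss_distance))[1]]
    angle = rng.choice(angles)
    return (nx + miss_distance * math.cos(angle),
            ny + miss_distance * math.sin(angle))


shots = [(79.0, 0.0), (89.0, -20.0), (59.0, 0.0)]
x, y = impute_miss_xy(10.0, shots, rng=random.Random(1))
```

By construction the imputed point keeps the recorded distance exactly; only the angle is borrowed, so the MISS points inherit the angular spread of the SHOT data at each range.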

4) Smoothing. We can color our bins by relative success, but the small number of successes will mean that strong colors will appear sporadically. We're better off either smoothing over the surface continuously, or assembling a secondary binning system. Goldsberry does the latter, which makes the most sense here because...

5) We'd like to know if any deviations in either count or success rate have any statistical significance, and it's way easier to establish this for a discrete set of bins -- the number of shots in a bin is best modelled as a Poisson distribution, and conditional on the total number of shots, the number of successes is Binomial. So we can then pick a series of regions on the ice that correspond both to known roles and to big changes in success.

6) Finally, NBA shooting assumes that possession between scoring attempts alternates, which means that overall shooting rates are roughly equal. A lot of emphasis has been placed on statistics that measure relative shooting rates, which means we'd be remiss in ignoring how these rates differ by zone. We can still compare relative successes, but comparing both of these will indicate what part of a team's success is true signal and which is noise.

With these priorities in mind, let's test out some examples on this past season at even strength. The unit of interest is the z-score of the rate or success probability, so that brighter colors mean stronger signals -- red is high, blue is low and grey is average.
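For a single region, the two z-scores can be sketched as follows in Python. This is a hedged illustration using the usual normal approximations to the Poisson and Binomial distributions; whether the applet computes its z-scores exactly this way is an assumption, as are the function names.

```python
import math


def rate_z(count, expected_count):
    """z-score of a shot count, treating it as Poisson with the league-based mean.

    `expected_count` would be the league rate for the region multiplied by the
    team's time on ice; a Poisson variance equals its mean.
    """
    return (count - expected_count) / math.sqrt(expected_count)


def success_z(goals, shots, league_pct):
    """z-score of goals, treating them as Binomial(shots, league_pct)."""
    expected = shots * league_pct
    sd = math.sqrt(shots * league_pct * (1.0 - league_pct))
    return (goals - expected) / sd


# 120 shots where the league rate predicts 100, and 12 goals on 100 shots
# in a region where the league scores on 10% of shots.
z_rate = rate_z(120, 100)
z_success = success_z(12, 100, 0.10)
```

Because goals are so much rarer than shots, the success z-score moves far less for the same relative difference, which is why the shot-rate plots tend to show stronger colors than the success plots.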

Toronto Maple Leafs -- Shot Rates By Region

The Leafs were the second-most shot-upon team in the NHL this year. Much of that damage was on the perimeter, where low-probability shots against were far above average, but in no region did the Leafs allow substantially fewer shots per minute than the league average. And they attempt a very low number of shots compared to league average in the low slot, where success probabilities are the highest.

Toronto Maple Leafs -- Success Probabilities By Region

In no individual region do we have a shooting percentage that's significantly different from the league average. But the right plot is a rough proxy for goaltender ability, and Bernier/Reimer did mildly better than league average at stopping close-up shots.

Ottawa Senators -- Shot Rates and Success Probabilities By Region

Ottawa both attempted and allowed a large number of shots, though Ottawa's own shots were on the perimeter; the clear difference is in scoring chances allowed, where they were dominated. The relative flatness of the shooting percentages suggests that this isn't a scorer's bias in awarding extra shot attempts.

In purely trivial matters, Ottawa allowed very few goals from their left point, but the threshold for "significant" would be crossed with roughly 3 more in total.

Los Angeles Kings -- Shot Rates and Success Probabilities By Region

LA's strength this season has been a uniform and relative decrease in shots allowed in all regions. If there is a slight overcount of shot attempts at the Staples Center (as suggested by the low success rate for either team) then the bias isn't a major factor. I now kind of wish I had picked them in my pool to beat the Sharks in Round 1.

New Jersey Devils -- Shot Rates and Success Probabilities By Region

The Devils were also excellent at preventing shot attempts, especially in the vulnerable slot area, and slightly above average at getting shot attempts in the slot themselves. They showed a surprisingly poor shooting percentage in the mid-slot region, as they also did in the previous season. Goaltending was at just about the league average for slot shooting.

Further Examples

The applet used to create these plots was built in R Shiny, and is currently running here for anyone who would like to try the demo. The data extends back to the 2008-2009 season and has both even strength and 5-on-4 power-play and short-handed situations.

Update 4-30-14 10:05 AM: There have (of course) been other plots out there for this data. Brian Macdonald points out a few of these:

http://www.sportingcharts.com/nhl/icetrack/ -- produces heat and point maps for both players and teams, with two of these side by side. It's pretty but it doesn't provide the kind of meaningful context that I want.
http://somekindofninja.com/nhl/ -- point maps with successful goals marked.
Snapshot -- both a point map and a radial heat map for success at various distances. One of the authors is a former student of mine at CMU, but I can claim no credit for her work on this!

Update 4-30-14 1:42 PM: Schuckers has links to his spatially smoothed versions of save percentages, which he presented at Sloan in a previous year.

nhlscrapr: An R package whose purpose is right there in the name

2014-04-29T16:31:36Z

In putting together the game data from the NHL for the games we needed, my students and I (namely Sam Ventura) have been digging through the NHL Real-Time Scoring System database, or at least the part that faces the user, so that we could get data down to our desired resolution. The data available from the NHL online extends back to the 2002-2003 season, though new elements appeared in 2007 (a better play-by-play data set) and again in 2008 (x-y coordinates of some events).

We decided that it would be in everyone's best interest to have this data available and sharable to everyone, but not where we would have to host this data ourselves, since the NHL is doing that already.

As a result, we created the R package nhlscrapr, which has been available on CRAN for some time but has been updated to be much more usable, particularly as new games are played and added to the NHL website.

Ultimately, everything we need for longer-term studies is contained in these tables:

1) A unique roster of all players, with de-duplication of cases when players change numbers, or scorekeeper-preferred spellings. Jean-Sebastian Giguere and J.S. Giguere might be the same person, but many of these spellings change over time on a game-by-game basis, and even if the NHL has unique player identifiers, we don't.

2) A table of all games played in the regular season and playoffs. (We cared less about preseason games for our tasks, mainly because of the excess of players who would not play in the NHL in future.)

3) A full, annotated and augmented play-by-play table for each recorded event, including (and especially!) player substitutions. This was most important for us as the unit of interest was the "shift" -- the contiguous space of time between events with the same players on the ice that ends with a noteworthy event. In the beginning this was simply goals and changes; we have since extended this to all events including shots, hits and penalties.
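The "shift" unit in the third table can be illustrated with a short Python sketch. The event layout here (a list of dictionaries with `seconds`, `type` and `on_ice` keys) is hypothetical and is not the nhlscrapr table format; the sketch only shows the idea of cutting a time-ordered event list wherever the set of players on the ice changes.

```python
from itertools import groupby


def split_into_shifts(events):
    """Group time-ordered events into runs with an identical set of players on ice."""
    shifts = []
    for players, run in groupby(events, key=lambda e: frozenset(e["on_ice"])):
        run = list(run)
        shifts.append({
            "players": players,
            "start": run[0]["seconds"],
            "end": run[-1]["seconds"],
            "events": [e["type"] for e in run],
        })
    # A shift really ends when the next one begins, not at its last event.
    for current, following in zip(shifts, shifts[1:]):
        current["end"] = following["start"]
    return shifts


events = [
    {"seconds": 0, "type": "CHANGE", "on_ice": {"A", "B"}},
    {"seconds": 12, "type": "SHOT", "on_ice": {"A", "B"}},
    {"seconds": 30, "type": "CHANGE", "on_ice": {"A", "C"}},
    {"seconds": 41, "type": "GOAL", "on_ice": {"A", "C"}},
]
shifts = split_into_shifts(events)
```

Each resulting shift carries its players, its duration and the events that happened during it, which is the shape needed for on-ice and off-ice comparisons.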

What It Downloads

Each game has several files that we needed for our database:

Play-by-play files (Old example, new example)
Team rosters, best obtained from the "event summary" file (old example, new example)
Shift charts. For newer files, the players on the ice are included in the play-by-play. For older ones, only graphics like this were available.
x-y location data, available from 2008-2009 onwards. This was harder to track down because it's not in the usual packet of events, but it is contained in a JSON file connected to the "ice tracker" applet.

These are downloaded for all games available. There are a large number of errors in games before 2005 that were not publicly corrected or updated, and so our records for those games are incomplete.

How To Use The Scraper

If you install and load the package and run the single function "compile.all.games()", it will do its thing. It may take a long time to download everything -- about 6 seconds per download for over 12,000 games. Let it run overnight if you want to do it this way.

Once this is done, you can repeat this command in future to obtain games that had yet to be played or to re-obtain games still in progress.

How It Does Its Thing

There are two pieces to the routine. First, it processes and combines a single game's pieces into a coherent whole -- replacing player numbers with their full names, adding x-y coordinates to those events for which they're available, indicating zone starts for each shift event, getting the number of players on the ice for each team, game score, separating the goaltenders from the skaters, and so forth. In the case of players on the ice in the earlier seasons, we literally dissect the pixels from the GIF file to determine the player on the ice during that second (the pixel depth is accurate to within 6 seconds, which is adequate for our purposes.)

Second, it goes back and fixes as many of those annoying problems as we can identify:

Player identities. With few exceptions, we know that players with the same first and last names, but different numbers, are the same person. There are some cases where we need to make a manual correction -- every Alexander Ovechkin is really an Alex -- and we hunted them all down by hand.
Distance discrepancies. In the early years, for example, the average distances of shots from the net at Madison Square Garden were far smaller than anywhere else. For each year, we adjust the stadium bias by taking the mean distance of a team's shots at home and away -- and making them identical through simple multiplication. (Average home and away distances in each stadium are remarkably similar, suggesting that there is little team-specific bias if any.)
Shot types. "Tip-in" and "Tip-In" are the same to us but not to a computer; same with "Wrap" and "Wrap-around". Establishing a standard set of names for these events, and directly indicating "unspecified" was necessary.
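The distance correction in the list above can be sketched in Python as a single multiplicative factor. This is an illustration only: the function name is hypothetical, and the choice to rescale the home-rink distances to match the away mean (rather than the reverse, or meeting in the middle) is an assumption.

```python
from statistics import mean


def rescale_home_distances(home_distances, away_distances):
    """Scale one season's home-rink shot distances so their mean matches the away mean.

    Assumption: away games, spread over many rinks, give the less biased mean.
    """
    factor = mean(away_distances) / mean(home_distances)
    return [d * factor for d in home_distances], factor


# A rink whose scorers record shots as too close to the net.
home = [10.0, 20.0, 30.0]
away = [20.0, 30.0, 40.0]
adjusted, factor = rescale_home_distances(home, away)
```

A multiplicative correction keeps every distance positive and preserves the ordering of shots within the rink, which an additive shift would not guarantee for shots near the net.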

There are other quantities that are known to be biased by rink scorers -- no one assigns giveaways, takeaways or hits the same way, and so there can be massive differences in counts for those across stadiums. We have yet to use those in our analyses, aside from noting puck location and players on ice, so we leave them in the data to give us that insight and ignore them otherwise.

What's To Come?

Users can now create their own event summaries for each game with each player involved, including "advanced" measures like Fenwick/Corsi and zone starts. These can be compiled for each game as well as each player or team. We use this data for various purposes including our own player metrics, or worrying about empty-net situations, and we have more in the pipeline, but we suspect that many out there would like to have this sort of data available to them at a moment's notice. We admittedly didn't build it for anything other than our own research purposes, and we like to keep the functionality of this limited to downloading and processing data, not to introduce new methods; we prefer to develop other R packages and routines for that purpose. So stay tuned for more.

What's Pulling The Goalie Actually Worth To A Team?

2014-04-03T16:58:42Z

Summary: It's an innovation with no monetary cost, so let's figure out the actual gain to pulling the goalie earlier, and show that there's little harm to changing.

So apparently I can complain about progress a little bit. Pulling the goalie is one of my favorite topics in analytics for many reasons, but the biggest is that it feels like the easiest sell to make to teams as to why they should trust data-driven analysis: a change in strategy that costs no money to implement, no new assets to acquire and no new technology to trust.

When I deconstructed the pulled-goalie timing data even further, it became clear that the driving force was not earlier pulls in one-goal games but in two-goal games. Here are all the teams' average times divided by season and score differential when the goalie was pulled at even strength:


This tells us at least two important things: first, that said increase in the goalie pulling time can't really be attributed to close one-goal games; and second, that teams aren't being all that risky when it comes to pulling their goalie when down by 2; the average time is still well below 2 minutes for most teams.

Now, I can't speak to the reluctance of coaches to be riskier in this situation -- or to the aggressiveness of Patrick Roy, who's clearly gutsier both in his timing and his decision making. But I can say that when we break up the data like this, it's clear that pulling the goalie earlier doesn't hurt these teams on average -- and it probably helps.

According to the data collected from nhl.com using my R package nhlscrapr, teams pulled their goaltender when down by 1 goal 4149 times in the last 11 seasons, and 2689 times when down by 2 goals. The trailing team scored 396 and 299 times respectively in those circumstances, which means that the pulled-goalie team scored a tying goal 9.5% of the time, and the cut-the-deficit-to-1 goal in 11% of attempts. Now it's certainly possible that a team leading by 2 goals might play more conservatively at even strength, but in empty net situations it's hard to believe they would go for the empty net any less aggressively.
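The two percentages follow directly from the counts, as this short Python check shows. The standard-error helper is an addition for context, using the ordinary Binomial formula; the function names are illustrative.

```python
import math


def scoring_rate(goals, pulls):
    """Share of pulled-goalie situations in which the trailing team scored."""
    return goals / pulls


def standard_error(goals, pulls):
    """Binomial standard error of the scoring rate."""
    p = goals / pulls
    return math.sqrt(p * (1.0 - p) / pulls)


down_by_1 = scoring_rate(396, 4149)   # about 9.5%
down_by_2 = scoring_rate(299, 2689)   # about 11.1%

for label, goals, pulls in [("down by 1", 396, 4149), ("down by 2", 299, 2689)]:
    rate = scoring_rate(goals, pulls)
    se = standard_error(goals, pulls)
    print(f"{label}: {100 * rate:.1f}% (standard error {100 * se:.2f} points)")
```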

And in the long run, worse teams are more likely to be down by 2 goals with time running out, which means this number is probably understating the effect -- teams that pull their goalies earlier score more often than those who wait until they are comfortable. Since the average team faces a 1-goal end game deficit in roughly 10 games a season, empirically, pulling 30 seconds earlier means an average gain of 0.15 more overtime games over a season; if every team aims for 90 points with a $60 million payroll, this means that adding 30 seconds to the empty net clock is a strategy worth $150,000.

If you believe in pulling the goalie earlier, then this change is magnified; the Poisson model for goal scoring suggests that pulling at the 3-minute mark, when down by 1, is most effective. According to the model, the probability that the team can tie the game when pulling at 3 minutes, compared to 105 seconds at even strength and 75 seconds with an empty net, is 20% to 17%, so that 0.3 more overtime games are expected over a season; other estimates, including earlier pulls in 2 goal deficit cases, push that number upwards of 0.5 games; conservatively, that's a strategy worth $400,000.
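The dollar figures can be reproduced with a short Python sketch. One conversion is not stated in the text and is assumed here: that an extra overtime game is worth 1.5 standings points (one guaranteed point plus an even chance at the second). That assumption happens to reproduce the $150,000 figure for 0.15 games; the function names are illustrative.

```python
def extra_overtime_games(deficit_games, p_tie_new, p_tie_old):
    """Expected extra overtime games per season from a better tying probability."""
    return deficit_games * (p_tie_new - p_tie_old)


def strategy_value(extra_ot_games, payroll=60e6, target_points=90,
                   points_per_ot_game=1.5):
    """Dollar value of extra overtime games at the payroll's price per point.

    Assumption: points_per_ot_game = 1.5 is not from the text.
    """
    dollars_per_point = payroll / target_points
    return extra_ot_games * points_per_ot_game * dollars_per_point


# Roughly 10 one-goal end-game deficits a season, 20% versus 17% to tie.
gain = extra_overtime_games(10, 0.20, 0.17)   # about 0.3 games
value_30s = strategy_value(0.15)              # about $150,000
value_upper = strategy_value(0.5)             # about $500,000
```

Under this conversion, 0.3 games is worth about $300,000 and 0.5 games about $500,000, so the $400,000 quoted above sits between the two.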

Sure, that ratio of games won to dollars spent isn't necessarily accurate -- even if I personally spent 0 dollars for 0 NHL wins -- but it's certainly in the ballpark for the values of contracts handed out for additional goals and wins respectively. And no organization would ever admit they weren't doing everything they could to win in a season. So why not get a little more aggressive with this strategy now? It won't be Patrick Roy's secret forever.

"Data Science" Is A Useful Label, Even If It's Usually 5% Science

2014-02-12T16:35:52Z

(Summary: I embrace the term Data Science because it lets us nurture a number of underappreciated talents in our students.)

One of the least developed skills that you'll find in the profession of statistics is how to name something appealingly. The discipline itself is a victim of that; not only is it less sexy than most of its competitors, the word has both plural and singular meanings. The plural is how the discipline is seen from the outside: a dry collection of summaries and figures boiled down to fit on the back of a baseball card, rather than its "singular" meaning: how to deal with uncertainty in data collection in a principled manner, which is a skill that, frankly, everyone we know can use.

The meaning comes out a bit better when you call it "the discipline of statistics", or "probability and statistics", which is connected but not identical, or the sleep-inducing "theoretical statistics" or seemingly redundant (but far cooler) "applied statistics". The buzz 10 years ago was to call it "statistical science", as if our whole process was governed by the scientific method, when math is developed by proof and construction and rarely by experiment or clinical observation.

We're seeing the whole thing cook up now with the emergence of the term Data Science, which again seems to have multiple meanings, depending on who you ask:

1) "Data Science" is a catch-all term for probabilistic inference and prediction, emerging as a kind of compromise to the statistics and machine learning communities. An expert in this kind of data science should be familiar with both inference and prediction as the end goal. This seems to be the term favored by academics, particularly in how they market these tools as the curriculum for Master's programs.

2) A "data scientist" is a professional who can manage the flow of data from its collection and initial processing into a form usable for standard inference and prediction routines, then report the results of these routines to a decision maker. This definition of "data science" as the process by which this happens is favored by people in industry. The idea that the source of this data should be "Big" is often assumed but not necessary.

It also doesn't help that the term has been coined at least 3 times in the past 10 years by 4 different people, each with a stake in making their definition stick; and as I will hammer in, isn't really science, but is so essential *to* good science that I'm willing to give it a mulligan.

So why would I step into what looks like a silly semantic debate? Partly because I'm paid to. I'm teaching these skills to multiple audiences, and over the course of the past year, two books by colleagues of mine have been published by O'Reilly: "Data Science for Business" by NYU professor Foster Provost and quasi-academic Tom Fawcett, and "Doing Data Science" by industry authorities Rachel Schutt and Cathy O'Neil. Both came about because of courses with the words "Data Science" in the title, at NYU and Columbia respectively; both make excellent reading for people who want to work with data in any meaningful capacity but like me prefer an informal style; and both will be on the recommended list when I teach R for Data Science again in the spring of 2014. It is also no accident that the content of Data Science for Business hews closer to the academic definition, and Doing Data Science, with its multiple contributions from industry specialists, lines right up with the industry definition.

The fact that I teach such a broad range of students, many of whom are very smart but technically inexperienced, is what's motivated me to think more deeply about process and less about particular skills. I'd have to guess that at best, the work I can do that I would call "science" is no more than a quarter of my total output. Yes, I build models, make inferences and predictions and design experiments, but the actual engineering I do is the clear dominating factor; I write code according to design principles as much as scientific thinking -- if I know a quick routine will take one-tenth the time but be 95% as accurate as a slower but more correct routine, I'll weigh which method to use in the long run by some other function.

For all these reasons, we should probably call it Data Engineering (or Data Flow Management) but we're stuck with Data Science as a popular, job defining label. Far from an embarrassment of language (says the man who has effectively admitted that his blog's name is exaggerated by a factor of four), my preferred interpretation of a Data Scientist takes the best part of the previous two:

3) Someone who is *trained* to examine unprocessed data, learn something about its underlying structural properties, construct the appropriate structured data set(s), use those to fit inferential or predictive models (possibly of their own design) and effectively report on the consequences is someone who has earned the title of Data Scientist.

What I've seen in all my time in academia is the assumption that these ancillary skills are necessary but can -- if not should -- be self-taught, particularly for PhD students but even for MS students and undergraduates. Cosma's got it exactly right that any self-respecting graduate of our department should have those skills, but we never explicitly test them on it or venerate those students who prove it. And if the problem is getting rid of the posers, we need to do a lot better when it comes to emphasizing this in our culture. To add another term to the stew, do we need to emphasize Data Literacy as an explicit skill? Or would it not be easier to appropriate Data Science as a term that gets down to brass tacks?

Skating Toward Progress, 2.5 seconds Per Year

2014-02-04T20:21:25Z

I tuned in during third period action to watch the Avalanche play the Devils last night, while the Avs trailed 1-0 and realized I might see something special: Avs coach Patrick Roy pulling his goaltender earlier than other coaches would do. And of course, I looked away too early to see it actually happen, but there it was: Roy pulled J.S. Giguere with two and a half minutes to go in regulation, the Avs tied the game and won it in overtime. As someone convinced that NHL teams are far too conservative when it comes to pulling the goalie, that's one data point of vindication for pulling the goaltender earlier in the game! Right?

Well, sort of. While Roy's been known to pull the trigger far earlier than most, in his postgame comments he credited it to his instincts rather than his calculations: "sometimes you go with your feeling when to pull the goalie and fortunately it worked for us."

Still, Roy's Avalanche easily have the earliest empty-net trigger of any team in the last decade ~~when trailing by a single goal~~ in any end game situation:

The mean pull-time has also increased over the decade, from 61 seconds in 2002-2003 to 86 seconds through this season (not including last night's game), but no team has yet approached the 3-minute mark in its average empty-net time -- the amount of time that most average Poisson-type models suggest is the minimum for this situation -- and only two are over the 2-minute mark at all. Still, I can't complain about progress!

Reflections on Teaching: Fall 2013

2013-11-11T22:15:31Z

I last wrote a teaching statement 3 years ago, and the number of things that have changed in the meantime is considerable. I've now taught lecture classes for undergrads, master's students and doctoral candidates, supervised individual projects, served on dissertation committees in several departments and co-authored multiple papers with students. As I think across all those experiences, there are things I've taken to heart and others I've considered and discarded; times I've taken chances and times I've played it safe.

Beyond that, technology has come a long way since then in terms of its immediate applicability in the classroom, and when to take advantage of that has also become a key question. What follows are my experiences in that time and how they've affected my perspectives, with examples from the classes I've taught - particularly the two courses I recently concluded teaching, in Statistical Graphics and Programming in R.

Lecture Time Is Too Valuable For Lectures

Lecture time used to be valuable -- back when its prime purpose was for a lecturer to read off a text for the audience to copy into their own notes. Today, the need to have facts communicated is taken care of by many different threads -- textbooks, online writeups, demos, even Wikipedia in some cases -- yet the lecture's place in teaching remains at the forefront, no matter how vestigial. It remains the main conduit through which students absorb their information, no matter how passively, even though there are better alternatives. More than that, given how much students pay for their education at an institution with the prestige of Carnegie Mellon, it's tantamount to fraud to not use that time efficiently.

It's impossible to break from it cold-turkey, though. Since laptop computers are now standard issue for nearly every student at university, I went with a model of half-lecture, half-lab for each of my two classes this past quarter; since both were essentially programming classes, the first half dealt with the theory, application and actual routines, where the second was an exercise left to the students to carry out to practice what they'd learned, ideally in randomized groups so that they could also meet each other. The idea was that instructor time, for me and my TAs, was better spent dealing with immediate coding issues, catching common mistakes, and ensuring that key points had been emphasized before they left.

What worked was obvious: we could immediately give meaningful feedback and see whether they got a concept, since we could go around the room and see their progress (or lack thereof). Technical issues became known to the whole class quickly, particularly when something that worked in Windows didn't on a Mac or Linux machine. And while this wouldn't work for a theorem-proving class, it still suggests that some exercise would work better in that time -- reverse-classroom style questions or Socratic interrogation -- than having me at the chalkboard. And the times I was talking were better spent motivating the lesson from my perspective because I knew the students would actually try what I was suggesting.

What didn't work was trickier to spot. It's clear in retrospect that I should have provided more pre-class reading and exercises to both groups so that the lessons would be fresher and they'd be better warmed up for what was to come. (The fact that both classes were in the late afternoon couldn't be helped, but isn't an excuse either -- that should be adapted to as well.)

It's also clear that a difference in starting skills made it tougher to teach everyone. When there's a clear schism of two groups of ability -- the beginners and the advanced -- one can either favor one group over the other or aim for the middle. The next time I teach Programming in R for Data Science (currently scheduled for Spring 2014) I expect I'll favor the beginners, and provide extra challenges to those who came in with more experience but still insist on taking the class. When I teach Graphics? I can't yet say, but I'm inclined to skew toward the advanced students and either assume knowledge of the R programming language from everyone or set up some kind of pre-class boot camp for those who'd like a boost. Either way, it became clear that learning who was already experienced would have been best right off the bat rather than after a week or two of lectures.

I feel more strongly that meaningful individual feedback is impossible in a lecture setting once we have more than 15 students. In that case, we ought to embrace the MOOC spirit and make lectures and homeworks for consumption by (potentially) millions of people. Spending an hour or more restating what a student can find online is a pure waste of opportunity, and the more I teach, the less I'm inclined to be wasteful.

A Common Narrative Theme Can Bind It All Together

This was the first time I'd taught either version of the class, and it took until week 3 for me to figure out what the central story was for one of them. Teaching programming in a lecture style is challenging under normal circumstances, but an audience that's taking it for employment reasons, not enjoyment, adds another level of difficulty.

For the programming class, that vision came about once I had more discussions with students and other faculty. As much as I don't like the term "data scientist" to describe a professional -- after all, what scientist doesn't work with data? -- I found that the process of data science was an enormously helpful way of putting it all together. Each tool the class would learn would fit into the process -- figuring out a question, gathering and processing the data into a convenient form, conducting an exploratory analysis to learn more and a confirmatory analysis to establish whatever was under analysis, and finally reporting on the analysis to whoever was the final recipient.

Once this was established, it was far clearer to me how I'd fit everything together, particularly for public policy students, the largest contingent in the class. Here's a set of skills that's more relevant in the real world than ever, and the course is about getting a handle on what can be done now and where to look for improvements in the future.

How I wish I'd been able to do this for my Graphics students! While there were a number of common mathematical themes, the course became more about how to find and use a particular set of tools rather than a common narrative. We could be a little bit looser because the programming expertise was so much higher, and I could be more of an advance scout: find a tool that matched the interests and needs of the students, and help them produce graphical materials that would help them with their employment. Still, it didn't work as well as I would have preferred, particularly since I had to adjust early in the course when the students turned out to be mostly professional master's students rather than the PhD students I'd anticipated.

Now that I know what I could expect the next time around, I'll have a better sense of the narratives that I could tell for either audience. That doesn't mean I don't wish I could have done better for that group.

Improving Feedback

It's apparent to me now that the overprecision of grades and marks made it really difficult to communicate expectations with students. This came to a head with me when I realized, during a previous course for undergraduates, that they cared more about the number than the material even if it was a matter of two points either way. (It brought back some disturbing high school memories for me as well.)

I had the luxury with these smaller classes to try something different with the grading: that if they got something wrong, they should try as best as possible to revise and correct their error. The more I thought about this, the more sense it made to accept only two grade levels for an assignment: one that was perfect, and one that represented the best possible effort to that point. And so I instituted the policy in both classes that a homework submitted on time and "complete" would be worth 1 point, and that each assignment could be resubmitted up to 3 times until it was judged perfect for a second point. An initial assignment could of course be perfect and not require a resubmission; any student with a perfect score on all assignments for their initial submissions could earn an A+ if it was allowed (which Heinz allowed for the R class but Dietrich did not for Graphics).

Theoretically, it was great for the instructors and the students -- all the responsibility for a grade would be put on the students because they would know exactly where they stood. It would remove the nature of the tug-of-war from the learning experience and bring it back to where it belonged. I'd be willing to bear any risks of grade inflation because this was a graduate program.

What happened in the end? This system worked well at the low end for people to make improvements, but did little to challenge the best students to excel beyond their own expectations. We should have taken the "perfect" scores and challenged them further on every point -- what else can they improve about a plot? How would they do it in other languages or interfaces? How might they design a randomized experiment to discover the most effective methods of communicating this kind of data? I wanted my students to continually apply themselves to the tasks at hand, but didn't push them to the kind of excellence I know they can be capable of.

That's not to say I have to stop with them; this is a small department, and so I can check in with them after the course is done, and help them to develop their own applications. But I do wonder how to make that work better the next time around.

Elsevier Bought Mendeley; Internet Freaks Out; I'm Barely Surprised

2013-04-09T23:07:42Z

I love it when my nerdiest pastime and professional interest -- bibliometrics and academic paper management -- makes the news in a big way. I like it more when it's direct evidence of all the issues that academia faces as a public good.

Mendeley is a "freemium" service for managing collections of academic papers, offering a cloud-based storage service for personal libraries. Its users have considerable affection for the service, whose management team has proclaimed their dedication to the Open Access movement. In the process, and in contrast, the company has built an impressively large database on user activity, one that was kept to itself rather than being available to its users.

Which is why the backlash to its purchase by Elsevier, a company that takes advantage of our public good for its private enrichment, strikes me as extremely naive. Mendeley's supposed commitment to an open access movement was already betrayed by their Facebook-like business model.

I'm less shocked since this is only the latest in a series of "betrayals" by companies supposedly behind principles of openness:

Combine this with the recent rise of "predatory" journals, and you can see why my worry has less to do with any individual companies and much more about the need to solidify the process of scientific communication as a public good.

Resigned To Change

2013-02-04T19:41:42Z

What follows: I resign from two editorial boards on principle. I don't feel heroic about it, but it had to be done.

Last year, I signed the Elsevier boycott as soon as it was announced. I firmly believed at the time that the principles of the boycott were sound: this was a company that had historically charged obscene prices, and made extreme profits, by selling other people's work with cartel-like levels of market control. I knew how this made sense in the past -- as both a filter and a distribution source, academics had little choice but to work with for-profit publishing companies. But now, the situation borders on the absurd. To make an example out of one of the biggest publishers seemed almost automatic, and I joined the official boycott without hesitation, in addition to years of avoiding Elsevier journals to publish my own work.

All that's needed for the system to work without big publishing companies is an environment of open publication, and so I've enthusiastically submitted my work to society journals and others with principles of openness. One of these was the Berkeley Electronic Press (bepress), which as a non-profit electronic publisher, committed to open access, promised a way forward: with the Internet as the ultimate distribution venue, all that would be needed is an editorial structure, handled as it has been by academics, the vast majority of whom work pro bono.

And so I joined two such efforts; first, the nascent journal Statistics, Politics and Policy, still in its infancy, in 2010; and second, the slightly more venerable Journal of Quantitative Analysis in Sports, which (to my delight, as a long time author and reader) I was asked to join roughly a year ago. Both have sterling editorial boards (aside from me) and I've enjoyed my time and efforts with both groups. But things got complicated in September 2011, when for-profit publisher De Gruyter announced that it was buying many bepress journals, including both SPP and JQAS. Originally it seemed as though little would change; my back-channel inquiries suggested that the new bosses wanted to change very little from the original bepress setup, which is why I was comfortable joining JQAS after the transition.
The sticking point for me came in July when we discussed the company's policy on preprints at the editorial board meetings. If the main contribution of a journal today is to improve the appearance and readability of an article, then it makes sense that a journal should allow preprints of work to be held by sites like arxiv.org, since that increases the incentive to get a fresh, improved copy from a journal website. (In fact, I rather like the idea that copyright in the content should be separate from copyright of the presentation -- it makes it clear who did what work.)

I was surprised to learn the De Gruyter policy on this: that they would prefer no traces of the article as preprint to be in the wild, and that they would permit a personal copy of the final proof to be on an individual's website -- but not on any public archive sites. In today's academic climate, restricting the preprint market like this is detrimental to new science. Individual websites are rough and poorly indexed; public archive sites, claiming only the barest of rights, can handle much of this burden at a low cost. The rise in importance of arxiv.org as a home for department tech reports (including those of my current employer) accentuates this.

Helping De Gruyter make money off my volunteer work might have advantages for my career, but it's bad for science to keep this kind of power in the hands of for-profit companies when the alternatives are so compelling, and when so much of the funding for our work comes from public sources. I know that an open alternative has to be a lot cheaper than the current system; we just need to figure out how to make it work on a grander scale.

In the end, as an early-career scientist, there's little I can do to change the course of the journals from within. I'd rather focus my efforts on matters that can push academic publishing towards more open publication -- keeping more rights for authors and the people who pay for the research. And so I decided to resign from the boards of both SPP and JQAS as of the end of last year, not because of any of the people involved -- I have warm feelings for both editorial boards, and I personally like the De Gruyter reps we've worked with -- but because it's counterproductive for me to continue in that capacity.

Update: To my astonishment, I received notice this week that the senior editors of SPP are all resigning from the journal and are trying to found a new, similar journal under the auspices of the American Statistical Association. I'm fairly certain that my resignation didn't push anyone on this, but it was very comforting to know that they had similar misgivings about the arrangement as I did.

The Statistical Properties of the Electoral College Are Perfectly Bearable

2013-01-28T17:17:08Z

What follows: I give a not-so-ringing endorsement of the Electoral College, by showing that the current mode has reasonable partisan symmetry. I'd still prefer a scheme with the national popular vote, but what we've got ain't so broke.

Andrew Gelman, Gary King, Jonathan Katz and I published an article on the Electoral College just in time to miss the 2012 US Presidential election (here from SSRN and here from the journal website) but apparently just in time to catch the reactions of people complaining about how the election went. Last week, news broke that a group of Virginia politicians wanted to reapportion their state's electoral votes by congressional district, echoing similar attempts in Pennsylvania in 2012 and California in 2008, making it clear that the issue isn't going away any time soon.

In brief, we quantified how much partisan bias there has been in the Electoral College system as it stands today (essentially none), if certain states reapportioned in this manner (it depends on the state), and if all states did so (it would have been substantially biased towards the Republicans). In extending the analysis for this post, we find that the Electoral College had no meaningful partisan bias in the 2012 election either.
of course it is) but what its effect would be on the entire system if more states did this. More to the point, we wanted to check the state of the system as it was at each election, according to the simple question of partisan symmetry:

If one party in a two-party system receives X% of the vote and Y% of the seats, then in the hypothetical situation that the other candidate receives X% of the vote, they should also receive Y% of the seats.

Replace "party" with "presidential candidate" and "seats" with "electoral votes" and you get to the heart of it; see the paper for details on how we estimate partisan bias. The easiest application of this is that if the overall popular vote is tied, then the two candidates should expect to receive an equal number of electoral votes. Is this condition present now? Would it be if California or other states split their votes by district? What if every state did it that way?

The paper contains an analysis for each election between 1956 and 2008; for this post, I re-ran the analysis adding preliminary data from 2012 (with a little imputation for as-yet unreported districts) and calculated the effective partisan bias for the election. Zero indicates a symmetric system; a bias of 1 (or -1) would indicate that if the vote were split evenly, the Democratic (or Republican) candidate would win all of the electoral votes available.
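To make the scale concrete, here is a rough sketch of how a bias number on this scale could be computed. It is not the estimator from the paper: it assumes winner-take-all states and a simple uniform swing (every state shifts by the same amount until the national two-party vote is tied), and the function name and the three toy states are hypothetical.

```python
def partisan_bias(states):
    """Bias on the scale described above: 0 is symmetric, +1 means the
    Democrat wins every electoral vote when the national vote is tied,
    and -1 means the Republican does.

    `states` is a list of (dem_share, votes_cast, electoral_votes) tuples,
    where dem_share is the Democratic share of the two-party vote.
    """
    total_votes = sum(votes for _, votes, _ in states)
    total_ev = sum(ev for _, _, ev in states)
    national_share = sum(share * votes for share, votes, _ in states) / total_votes

    # Uniform swing (an assumption made for this sketch): shift every state
    # by the same amount so the national two-party vote is exactly tied.
    swing = 0.5 - national_share

    # Winner-take-all (also an assumption): count the electoral votes the
    # Democrat would win in the hypothetical tied election.
    dem_ev = sum(ev for share, _, ev in states if share + swing > 0.5)
    return 2.0 * dem_ev / total_ev - 1.0


# Hypothetical three-state example; these are not real election results.
toy_states = [(0.60, 100, 10), (0.45, 100, 10), (0.48, 100, 10)]
toy_bias = partisan_bias(toy_states)
print(round(toy_bias, 3))
```

In the toy example the Democrat piles up votes in one state and narrowly trails in the other two, so a tied national vote hands most of the electoral votes to the Republican and the bias comes out negative.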

As the system stands right now, things are basically fair, and have been from 1980 onwards -- closest to zero is fairest:

If California's electoral votes had been split by congressional district, there would have been some interesting consequences -- not nearly enough of a bump in 1980 to re-elect Jimmy Carter, but a consistent Republican edge ever since.

Suppose we counterweighed this by changing historically Republican Texas to Congressional district apportionment. It would have helped, but not nearly enough to balance the scale:

Now, suppose every state split their electoral votes by Congressional district. The edge is consistently Republican, even today:

In the end, even if other states counter-balanced each other to try and even things out, it would probably make things worse. As things stand, the status quo of the Electoral College is adequate without any kind of large-scale modification, so far as we can predict.

Digital Publishing Isn't Harming Science, It's Liberating It

2012-11-27T04:20:48Z

It's somewhat appropriate that a complaint from a scientific authority on the decay of scientific publishing should be circulated on the Huffington Post, whose legions of unpaid bloggers gain only exposure for their efforts; how closely it parallels the history of scientists, working without pay, as both content producers and vetters, and what it means for the future. Douglas Fields' comment on scientific publishing (thanks, Simply Statistics!) has the facts right, but the conclusions he draws are contradicted by the very nature of the system he's trying to assault.

The key to it all is the nature of peer review:

A scientific discovery is useless if it is not communicated with authority to the scientific community. For centuries scientists submitted their research findings for publication in scientific journals that were run by the leading scientists with expertise in a specialized field who served as journal editors. The editors evaluated the submission, and if the findings appeared to be important and technically sound, they sought out other scientists around the world with recognized expertise in the area to read the manuscript critically and advise the editor and authors (anonymously) on its suitability for publication.
This process is essential to root out poor science and pseudoscience, and to prevent bogging down the advancement of science by cluttering the literature with contradictory and erroneous findings. The expert peer reviewers evaluated the potential strengths, weaknesses, technical flaws, significance and novelty of the finding, and they suggested the need for further experiments. If the study failed to be accepted for publication by the editor, the authors benefited from the editorial review process, and they revised their work for submission to another journal.

I'm with you, sir! This is the beauty of the peer review system, and the source of it isn't the paper it's printed on; it's the stamp of approval of the editorial board that matters. A quality board is a collection of distinguished members with noteworthy professional experience, combined with their past record of approving meaningful publications.

Recent government-mandated changes in scientific publishing are undermining this critical process of validation in scientific publication.

And now he's lost me. Scientific validation is carried out by the editorial board and its referees -- the vast majority of whom are unpaid volunteers -- and abetted by a publisher, not controlled by one. The sharp drop in publishing costs from online publishing will only put more control in the hands of the academics who decide what's truly important.

The first change of which he speaks -- a mandate that all papers should be openly accessible for all readers, if their research was funded by federal grants -- affects publishers, not editorial boards. While Fields defends the necessity of the publisher as the producer, editor and disseminator of research, he seems to underplay his own role as the editor-in-chief of a journal, one with the responsibility of seeking out the editorial board, ensuring the quality of the process, and so forth. It's true he doesn't copy-edit or type-set, but these tasks are getting cheaper all the time, and arguably, current publishers don't do that great a job of it.

Fields is also conflating the two major models of Open Access publishing. "Green OA", in which authors archive their preprints on public sites (at little cost), is the PubMed approach, and doesn't take away whatever value copy-editing, type-setting and large-scale printing add. "Gold OA", in which authors pay the publishers for the dissemination of their work, is the model pursued both by top-quality outfits (including CUP) and spamming bottom-feeders. That's why his second point -- that electronic publishing decreases the cost barriers to entry -- is on the mark. But I'm baffled by what follows in his personal testimonial:

Neuron Glia Biology was a scientific journal that was launched in 2004 by me and like-minded scientists to advance scientific research on neuron-glia interactions, and it was published by Cambridge University Press until this year. Neuron Glia Biology provided the opportunity for 1,400 authors to introduce their new research on neuron-glia interactions into the scientific literature, and it helped advance a new field of science, but no longer.

Again, I say: this commentary was published on the Huffington Post. For free. Whether or not it was more visible because of this service, the real stamp of approval comes not from being on this website, but from your peers in the community who judge your work. And those 1,400 authors will not stop writing, the editorial board of Neuron Glia Biology will still believe in their mission, and if it comes to it, finding an online-only home for a format won't change that -- I know it's easier in the mathematical sciences, but biology isn't far behind. The success of the enterprise comes down to the acceptance of the community first.

Vanity journals might be going for a money grab, but so are Elsevier and Nature, both of which are hideously profitable thanks to their monopolistic tactics and reliance on free labour -- not to mention that CUP seems to be doing all right for itself. The pressure from the community is exactly why I doubt that most scholars will fall for bad articles in true vanity journals -- and thanks to the exact peer forces that propel academia, if they do get any attention, the end result will be a humiliated, slightly poorer academic, not the end of the discipline as we know it.

I sympathize with Dr. Fields' anxieties about the state of academic publication today, but I'm far more excited by the promise of technology to keep things fresh than I am worried about a corporate/government takeover of science. We just have to remember that we're still the ones in charge.

538's Uncertainty Estimates Are As Good As They Get

2012-11-07T13:01:35Z

(or, in which I finally do an analysis of some 2012 election data)

Many are celebrating the success of the poll aggregators who forecasted the states won by each candidate -- many called all 50 right, including FiveThirtyEight. No doubt Nate Silver will continue to be the world's most famous meta-analyst given this accomplishment -- even though several of his peers, such as the Princeton Election Consortium, Votamatic and Simon Jackman's projections for the Huffington Post, seemed to do equally well. The strength and depth of the number of polls in swing states no doubt had a lot to do with all their successes.

How much of an accomplishment this is, of course, depends on context; the winner in most states was easily predicted ahead of time with the barest minimum of polling. Consider instead a related question: how close were the vote shares in each state to the prediction, as a function of the margin of error?

The simplest way to check this is to calculate a p-value for each prediction: for each prediction and its associated uncertainty, calculate the probability that the observed value (vote share) is greater than a simulated draw from this distribution. The key is that for a large number of independent prediction-uncertainty pairs, we should see a uniform distribution of p-values between 0 and 1.
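A minimal sketch of that check, assuming each forecast is summarized by the mean and standard deviation of a normal distribution. The function names and the four sets of numbers are made up for illustration; they are not the actual forecasts or results.

```python
from statistics import NormalDist


def prediction_p_values(predicted, sd, observed):
    """For each forecast, the probability that a draw from
    Normal(predicted, sd) falls below the observed vote share."""
    return [
        NormalDist(mu=m, sigma=s).cdf(y)
        for m, s, y in zip(predicted, sd, observed)
    ]


def uniform_qq_points(p_values):
    """Pair each sorted p-value with the uniform quantile it should sit
    near, (i + 0.5) / n, ready to plot against the diagonal."""
    n = len(p_values)
    return [((i + 0.5) / n, p) for i, p in enumerate(sorted(p_values))]


# Hypothetical forecasts (percent vote share) and outcomes, illustration only.
predicted = [52.0, 47.5, 50.3, 61.0]
sd = [2.0, 2.5, 1.5, 3.0]
observed = [53.1, 46.0, 50.3, 58.0]

p_values = prediction_p_values(predicted, sd, observed)
for expected, actual in uniform_qq_points(p_values):
    print(f"{expected:.3f}  {actual:.3f}")
```

If the stated uncertainties are well calibrated, the second column should track the first. Intervals that are too wide pull the p-values toward 0.5, and intervals that are too narrow push them toward 0 and 1.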

I grabbed the estimates from FiveThirtyEight and Votamatic (at this time, I have only estimates, not uncertainties, for PEC or HuffPost) and calculated the respective p-values assuming a normal distribution in each case. Media coverage suggested that Nate Silver's intervals were too conservative; if this were the case, we would expect a higher concentration of p-values around 50%. (If too anti-conservative, the p-values would be more extreme, towards 0 or 1.)

On the contrary, the 538 distribution is nearly uniform. The closer the points are to the diagonal, the better the fit to the uniform:

Repeating the process for Votamatic:

The values are pushed towards zero and one, so the confidence intervals are far too tight: the Votamatic predictions turned out to be overly precise.

The data I used are here. (I read the Votamatic intervals directly off the graphs; if I can get a more precise value, I'll repeat the analysis.) I'm very curious to know how the other meta-pollsters did, so if anyone has put together that data, please send it my way.

The Journal System and Statistical Publishing

2012-10-11T05:35:55Z

David Banks has some notions about how to evolve the peer review system, specifically for publishing in statistics. Not surprisingly, I agree with him about most things, especially that the rise of the Internet opens up many more creative options for outlets.

One of the trickier things to figure out is whether or not article quality would be upheld under a new system. Quoth Banks:

Article quality can be signaled in multiple ways, either by conventional review or by ungameable rating systems, similar to page-ranking algorithms.

Conventional review has its benefits, but I'm not sure we have a good way of instituting this yet. And no system is ungameable, even PageRank (think "miserable failure"), but as long as there's effort put into it by the community, there's hope.

Scrabble Cheating

2012-08-16T05:53:51Z

News of a cheating scandal in Scrabble has rippled through the community, after a competitor (proverbially) hid the blanks up his sleeve during matches, leading to his subsequent disqualification. As he is a minor, his name is not being shared, thereby preventing us from asking why, if he was going to cheat, he couldn't have done a better job of it.

Let me take the opportunity to remind tournament organizers everywhere that the latent tile order design mechanism could have prevented this travesty from happening. And all they would have had to do was spend tens of thousands of dollars to design and build the physical apparatus to make it happen, and tens of thousands more to outfit the entire tournament with them. But in the long run, shouldn't we do everything we can for the children?