<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0"><channel><title>Sunlight Labs blog</title><link>http://sunlightlabs.com/blog/</link><description>Latest blog updates from the nerds at Sunlight Labs</description><language>en-us</language><lastBuildDate>Wed, 01 Sep 2010 12:41:17 -0400</lastBuildDate><ttl>120</ttl><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/sunlightlabs/blog" /><feedburner:info uri="sunlightlabs/blog" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item><title>Introducing the Open State Project API</title><link>http://feedproxy.google.com/~r/sunlightlabs/blog/~3/AcafrtO6feI/</link><description>&lt;p&gt;Over a year ago we &lt;a href="http://sunlightlabs.com/blog/2009/fifty-state-project/"&gt;announced&lt;/a&gt; our intention to build scrapers that would collect and sanitize legislative information from all fifty states, an initiative that is now known as the &lt;a href="http://openstates.sunlightlabs.com"&gt;Open State Project&lt;/a&gt; (formerly the Fifty State Project).&lt;/p&gt;
&lt;p&gt;Since we put out the proposal we've had more than 25 developers contribute code, and we now have scrapers in various states of completion for approximately 30 states.  Soon after beginning the project we learned that collecting the data isn't enough: after the scrapers run there is still work to be done, including name standardization, adapting for different naming conventions across states, and attempting to match legislators to their IDs on websites such as &lt;a href="http://followthemoney.org"&gt;FollowTheMoney.org&lt;/a&gt; and &lt;a href="http://votesmart.org"&gt;Project Vote Smart&lt;/a&gt;.&lt;br /&gt;
&lt;/p&gt;
&lt;p&gt;As of today we're proud to announce a new milestone for the project, version 1 of the &lt;a href="http://openstates.sunlightlabs.com/api.html"&gt;Open State Project API&lt;/a&gt;.   You can start using our API today to get access to information on more than 37,000 bills and 1,600 legislators from the most recent sessions of 10 state legislatures.&lt;/p&gt;
&lt;h4&gt;About the API&lt;/h4&gt;
&lt;p&gt;Our API makes it possible to get at all of the data that we currently collect.  For most states this means legislators, committees, and bills including their actions, votes, and links to full text versions.  We've spent time building a flexible infrastructure that allows us to collect extra data where it is available and pass it on, while still providing a common subset across all states we cover.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://openstates.sunlightlabs.com/api.status.html"&gt;
&lt;img src="http://assets.sunlightlabs.com.s3.amazonaws.com/fiftystates/apimap.png" class="detailimage"&gt;
&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you visit our &lt;a href="http://openstates.sunlightlabs.com/api.status.html"&gt;API status page&lt;/a&gt; you'll see that we are launching with five states that we consider "ready" (shown in green).  These five states (Maryland, Texas, Wisconsin, Louisiana, and California) are our trial states. While there's no such thing as perfect data, we have made a commitment to keeping these five as up to date as possible and will give the highest priority to any data quality issues we encounter in them.&lt;/p&gt;
&lt;p&gt;We are also making data for five additional states available in an experimental state (shown in orange).  We believe that the data for these states (Vermont, North Carolina, South Dakota, Pennsylvania, and Nevada) is of high quality and ready for public use, but we haven't had the time to vet it quite as thoroughly as data from states deemed "ready".  If you decide to use data from these experimental states, don't be surprised if you notice a few more rough edges.&lt;/p&gt;
&lt;p&gt;If you're ready to get started with the API all you'll need is a &lt;a href="http://services.sunlightlabs.com"&gt;Sunlight API Key&lt;/a&gt; (if you already have a key for our &lt;a href="http://services.sunlightlabs.com/docs/Sunlight_Congress_API/"&gt;Congress API&lt;/a&gt;, &lt;a href="http://transparencydata.com/api/"&gt;Transparency Data&lt;/a&gt;, or &lt;a href="http://services.sunlightlabs.com/docs/Drumbone_API/"&gt;Drumbone&lt;/a&gt; you can use that).&lt;/p&gt;
&lt;p&gt;After that you may find the following links to be helpful:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://openstates.sunlightlabs.com/api.html"&gt;Open State API Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://github.com/sunlightlabs/python-openstates/"&gt;Python Client Library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://code.google.com/p/openstates/issues/list"&gt;Open State Project Issue Tracker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://groups.google.com/group/fifty-state-project"&gt;Open State Project Google Group&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Finally, we'd like to acknowledge that, given the size of this undertaking, this project wouldn't be possible without all of the volunteers who have helped by contributing code.  A special thank you to contributors Michael Stephens, Mark Olson, 2010 Summer of Code student &lt;a href="http://sunlightlabs.com/blog/2010/gsoc-2010-openstates/"&gt;Gabriel Joel Pérez&lt;/a&gt;, and former Sunlight Labs intern &lt;a href="http://schneidy.com/blog/2010/08/09/summer-in-sunlight/"&gt;Dan Schneiderman&lt;/a&gt; as well as everyone else in &lt;a href="http://github.com/sunlightlabs/fiftystates/blob/master/AUTHORS"&gt;the AUTHORS file&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you'd like to help us get the next 40 states ready, join the &lt;a href="http://groups.google.com/group/fifty-state-project"&gt;Open State Project Google Group&lt;/a&gt; and introduce yourself. We're always looking for help and will be happy to help you find a place where your skills will benefit the project.&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/sunlightlabs/blog/~4/AcafrtO6feI" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">James</dc:creator><pubDate>Wed, 01 Sep 2010 12:41:17 -0400</pubDate><guid isPermaLink="false">http://sunlightlabs.com/blog/2010/introducing-open-state-project-api/</guid><feedburner:origLink>http://sunlightlabs.com/blog/2010/introducing-open-state-project-api/</feedburner:origLink></item><item><title>Better Living Through Transparency: The Importance of Models</title><link>http://feedproxy.google.com/~r/sunlightlabs/blog/~3/dJwO7oszxog/</link><description>&lt;img alt="Craniometry: A human skull and measurement device from 1902." title="Craniometry: A human skull and measurement device from 1902." src="http://upload.wikimedia.org/wikipedia/commons/thumb/f/fb/Craniometry_skull_1902.jpg/200px-Craniometry_skull_1902.jpg" class="detailimage_right"&gt;

&lt;p&gt;At Sunlight we spend a lot of time exploring ways to open up data sets and make them more accessible. The idea is that data enables us to act collectively, making better informed decisions and building a more effective public sector. When we talk about transparency the focus is often on the possibilities that data offers. But this discussion sometimes ignores the fact that translating data into action is hard. &lt;/p&gt;
&lt;p&gt;There's a reason for this: data alone doesn't provide answers. &lt;/p&gt;
&lt;p&gt;Coming up with solutions to real life problems -- like designing an effective and fair tax code or improving health care -- requires an understanding of how real life works. Unfortunately, more often than not real life is messy and complicated. In order to make sense of this complexity we need models -- approximations of the world that define fundamental mechanics of a given process and reduce it to understandable and meaningful terms. &lt;/p&gt;
&lt;p&gt;As Joshua Epstein writes in a &lt;a href="http://www.brookings.edu/articles/2008/1031_model_epstein.aspx"&gt;clever essay on scientific inquiry&lt;/a&gt;, every time we use data to draw a conclusion we also use a model. Sometimes explicitly: when a meteorologist makes a prediction about the weather they use a rigorously designed framework for translating observational data into a forecast. Sometimes not: when I look at the sky and make a prediction I'm using an implicit model based on a mix of past experience and a rather poor understanding of atmospheric processes. Both of us are using models to interpret data and both are based on assumptions about how weather works. I'm just not sure I could explain how mine functions, nor do I have any sense of how well it works.&lt;/p&gt;
&lt;p&gt;Having access to good observational data is incredibly important to arriving at useful answers. But well designed and transparent models are equally important. In fact, having a good model is often a prerequisite to determining what to observe and how. If I want to predict the weather should I measure the temperature? Pressure? Wind direction? Where and how frequently? Without a solid theoretical framework it's often impossible to know where to begin and it's even harder to know when I've made a wrong turn. &lt;/p&gt;
&lt;p&gt;When we use a model we embed its assumptions into the results. If key assumptions are incorrect, good data turns into supporting evidence for a potentially misguided answer. Or a bad model might drive the collection of useless data.&lt;/p&gt;
&lt;p&gt;It's rare to find a problem where the proper set of assumptions is obvious and agreed upon by everyone involved. But the higher the stakes, the more important it becomes that we get those assumptions right.&lt;/p&gt;
&lt;p&gt;A case in point: the LA Times' release of an &lt;a href="http://www.latimes.com/news/local/teachers-investigation/"&gt;analysis of the performance of individual teachers&lt;/a&gt; in the Los Angeles Unified School District (LAUSD). The paper's database names over six thousand elementary school teachers and provides a model, known as value-added measurement (VAM), for evaluating their impact in the classroom. The model compares a student's test scores against the previous year's performance, and from the difference calculates the "value" that a specific teacher added over the year.&lt;/p&gt;
&lt;p&gt;On one hand this project is a tremendous validation of the transparency movement and an enterprising piece of journalism. The Times was able to build its database thanks to the disclosure of seven years' worth of student testing data -- just the kind of granular, high-value data that we hope to see released throughout government. And there's little question about the importance of such performance reviews. By many accounts, the LAUSD has serious problems. Evaluating teacher performance is considered a key component of reform, and the school district's own review has been bogged down by a debate with its union. The availability of testing data allowed the Times to analyze teacher performance directly, bypassing gridlock within the school system and empowering the public to hold individual teachers accountable for their performance.&lt;/p&gt;
&lt;p&gt;On the other hand, there's a reason why the school system has been slow to develop its own evaluation system: developing an objective, robust model for evaluating teacher performance is incredibly hard. Developing an objective, robust model that everyone agrees on is nearly impossible. &lt;/p&gt;
&lt;p&gt;The value-added metric employed by the Times is probably helpful in assessing performance. A substantial body of research has indicated that this sort of modeling is a potentially useful indicator of a teacher's impact in the classroom. However, it is based on several assumptions, including: &lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Standardized test scores in math and English are useful indicators of student achievement;&lt;/li&gt;
&lt;li&gt;Year-over-year changes in student scores are an accurate measure of a teacher's impact;&lt;/li&gt;
&lt;li&gt;In aggregate, examining changes in an individual student's scores reduces the need to control for the unique educational challenges students might face; &lt;/li&gt;
&lt;li&gt;Value-added measurements allow comparisons between teachers and schools with vastly different student populations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The implications of these assumptions are by no means agreed on within the educational community. As a result there are a number of different ways to do value-added measurement and it's not a settled matter which is the best. Even in the ideal case these methods offer a spectrum of certainty based on the quality of the inputs.&lt;/p&gt;
&lt;p&gt;More significantly, there are open questions about how to use these techniques in high-stakes situations. Very few school districts currently employ value-added measurement in their performance reviews. Where it is used, it is only a part of a larger evaluation process -- in many cases it contributes about one third of a teacher's final score.&lt;sup&gt;1&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;The Times mentions these concerns in its reporting but takes the stance that releasing the results from its value-added model is a useful step forward, even if it is an imperfect or incomplete measure. Because the district was not proactive in developing these methodologies itself, some kind of public intervention is now required. &lt;/p&gt;
&lt;p&gt;It seems clear &lt;em&gt;something&lt;/em&gt; needs to be done, but it's important to recognize exactly what the Times is contributing to this debate. They're not just releasing data, they're taking a stand on how to evaluate teacher performance. This distinction is important and perhaps not immediately obvious.&lt;/p&gt;
&lt;p&gt;For example, Kevin Drum at Mother Jones &lt;a href="http://motherjones.com/kevin-drum/2010/08/rating-las-teachers"&gt;defended the paper's actions&lt;/a&gt; by saying, "The data is public, and either you believe that the press should disseminate public data or you don't." He later &lt;a href="http://motherjones.com/kevin-drum/2010/08/testing-kids-testing-teachers"&gt;revised that comment&lt;/a&gt;, saying he meant that it's the paper's right and responsibility to "disseminate &lt;em&gt;meaningful&lt;/em&gt; public data."&lt;/p&gt;
&lt;p&gt;I don't disagree with those assertions. But, at least in this instance, I think the premise is flawed -- if this were only about dissemination of public data there would be no debate.  What's at issue is the definition of "meaningful." Because the district failed to develop a definition itself it is now up to the paper and its readers to decide if value-added measurement offers meaningful results. Drum thinks this is ok:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If the tests really are poor indicators of short-term student performance, perhaps this project will make that clear. Parents, principals, and fellow teachers probably have a pretty good sense already of who the good and bad teachers are, and if the value-added testing metric used by the Times turns out to be wildly at variance with this sense, it should provoke a serious rethink.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This seems to be a likely outcome but an unfortunate one as well. We're arguing that we need objective, quantitatively driven metrics for analyzing performance and, at the same time, that in the absence of agreement on how those metrics work we will use the informal opinions of parents and colleagues (an implicit and arguably flawed model) to calibrate our approach. &lt;/p&gt;
&lt;p&gt;We can and should expect better.&lt;/p&gt;
&lt;p&gt;The same quantitative tools that allow us to build a value-added measurement system also allow us to interrogate and refine our methodologies. Unfortunately this sort of process is often seen as inconveniently technical and at odds with the desire for concrete answers. Jay Mathews at the Washington Post &lt;a href="http://voices.washingtonpost.com/class-struggle/2010/08/la_times_testing_series_raises.html"&gt;offers an interesting discussion&lt;/a&gt; about how this tension can play out in journalistic settings and the importance of getting it right in cases like the Times' database.&lt;/p&gt;
&lt;p&gt;Those of us in the transparency movement have a responsibility to illuminate not only the data, but the ways in which data are collected and analyzed. After all, our goal is about more than providing an answer; it is about making transparent the processes through which answers are derived. &lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;For an overview of the debate regarding value-added measurement, see "&lt;a href="http://216.78.200.159/Documents/RandD/Other/Getting%20Value%20out%20of%20Value-Added.pdf"&gt;Getting Value Out of Value-Added&lt;/a&gt;" from the National Academies Press. Or see &lt;a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2010/08/more_on_those_l.html"&gt;this&lt;/a&gt; for an interesting discussion of some of the methodological implications involved.&lt;/li&gt;
&lt;/ol&gt;&lt;img src="http://feeds.feedburner.com/~r/sunlightlabs/blog/~4/dJwO7oszxog" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Kevin Webb</dc:creator><pubDate>Mon, 30 Aug 2010 18:12:29 -0400</pubDate><guid isPermaLink="false">http://sunlightlabs.com/blog/2010/better-living-through-transparency/</guid><feedburner:origLink>http://sunlightlabs.com/blog/2010/better-living-through-transparency/</feedburner:origLink></item><item><title>Google Summer of Code: Open State Project</title><link>http://feedproxy.google.com/~r/sunlightlabs/blog/~3/Vxu3zAirRT8/</link><description>&lt;p&gt;&lt;em&gt;This post was contributed by one of Sunlight Labs' Google Summer of Code Students, Gabriel Joel Pérez. Gabriel's work is currently being integrated into the core project and the states he has been working on should be available via the &lt;a href="http://openstates.sunlightlabs.com/api.html"&gt;Open State Project API&lt;/a&gt; later this year.  His code is available on &lt;a href="http://github.com/climatewarrior/fiftystates"&gt;github&lt;/a&gt; as we work on integration.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Hello! I’m Gabriel, a 4th year student of Computer Engineering from the University of Puerto Rico in Mayagüez. This summer I worked as a GSoC student on developing new scrapers for the &lt;a href="http://openstates.sunlightlabs.com"&gt;Open State Project&lt;/a&gt;. The states I worked on were Colorado, Hawaii, Washington, Oregon and the territory of Puerto Rico. I really enjoyed the whole experience. The work is very fulfilling as coding in Python is always delightful and fun. &lt;/p&gt;
&lt;p&gt;Writing scrapers can pose a series of problems. The Internet is full of inconsistent, unstructured, and badly written html. Thankfully the &lt;a href="http://codespeak.net/lxml/lxmlhtml.html"&gt;lxml.html&lt;/a&gt; library is very good at handling all kinds of html and it does so quite fast. Also it has a powerful and well documented API that makes the scraping work a whole lot easier. But still one can be hurt by the woes of inconsistencies. For example sometimes different styles of html are used for different years. Also sometimes the way the html is structured doesn’t help at all with the scraping and one has to resort to regular expressions and other techniques. &lt;/p&gt;
&lt;p&gt;At first it was more difficult for me to write the scrapers. Something that helped me out a lot early on was looking at the other available states to see how they dealt with some recurring problems.  Also I constantly checked the lxml.html, Python and Open States documentation. All of them have great documentation and I never had much of a problem on that front. Later I got more accustomed to the process and things went quite smoothly. &lt;/p&gt;
&lt;p&gt;I want to definitely continue contributing to the Open State Project however I can. Sunlight is a great organization that has a very important mission and the Open State Project is definitely one of its most important projects. I would like to thank my mentor James Turk for all of his help and for being so cool and accommodating. Also I would like to thank Google for helping the FOSS community and students through the &lt;a href="http://code.google.com/soc/"&gt;Google Summer of Code program&lt;/a&gt;.&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/sunlightlabs/blog/~4/Vxu3zAirRT8" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">James</dc:creator><pubDate>Mon, 30 Aug 2010 15:10:18 -0400</pubDate><guid isPermaLink="false">http://sunlightlabs.com/blog/2010/gsoc-2010-openstates/</guid><feedburner:origLink>http://sunlightlabs.com/blog/2010/gsoc-2010-openstates/</feedburner:origLink></item><item><title>Preparing for the Worst</title><link>http://feedproxy.google.com/~r/sunlightlabs/blog/~3/0pkjkmYKluw/</link><description>&lt;p&gt;I should say up front that Google's been a great friend to Sunlight: &lt;a href="http://sunlightfoundation.com/funding/"&gt;they've helped support our contests&lt;/a&gt;, they've sent us phones and Summer of Code students to help our Android development efforts, and when I visited their DC offices a couple of weeks ago they let me eat as much candy as I wanted.&lt;/p&gt;
&lt;p&gt;Still, I'd be lying if I said the incredible scope of their success didn't make me a little uneasy.  We use Google Apps for our work email, for instance, and YouTube is essential to our video production efforts. We're as dependent as anyone else on Google for search, both as a tool and a source of traffic.  I know we're not the only ones to be a bit unnerved at being so reliant on the goodwill of private enterprise -- and of course over the past few weeks, other voices expressing those concerns have become significantly louder.&lt;/p&gt;
&lt;p&gt;So, while we're looking forward to continuing to work with Google, it would be irresponsible for us not to prepare for the unthinkable.  I'm happy to say that we've taken the necessary precautions, and today the future seems a bit less uncertain:&lt;/p&gt;
&lt;p&gt;&lt;object width="640" height="385"&gt;&lt;param name="movie" value="http://www.youtube.com/v/K7C9U4GwUb0?fs=1&amp;amp;hl=en_US"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/K7C9U4GwUb0?fs=1&amp;amp;hl=en_US" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="640" height="385"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/p&gt;
&lt;p&gt;Of course, what happens after we run through our 1000 free hours is anyone's guess.&lt;/p&gt;
&lt;p&gt;(Many thanks to Pierre Huggins of &lt;a href="http://roxchox.com"&gt;Rox Chox &amp;amp; Blox Woodworking&lt;/a&gt; for lending his awesome fabrication capabilities to this ridiculous project, and to our own sysadmin extraordinaire, Tim, for finding Pierre via &lt;a href="http://hacdc.org"&gt;HacDC&lt;/a&gt;.)&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/sunlightlabs/blog/~4/0pkjkmYKluw" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Lee</dc:creator><pubDate>Fri, 27 Aug 2010 00:23:23 -0400</pubDate><guid isPermaLink="false">http://sunlightlabs.com/blog/2010/preparing-for-the-worst/</guid><feedburner:origLink>http://sunlightlabs.com/blog/2010/preparing-for-the-worst/</feedburner:origLink></item><item><title>Google Summer Of Code Adds New Goodies To Congress (Android App)</title><link>http://feedproxy.google.com/~r/sunlightlabs/blog/~3/RhpyXQOAnp0/</link><description>&lt;p&gt;&lt;em&gt;Over the past few months we've had the pleasure of working with several developers through Google's &lt;a href="http://code.google.com/soc/"&gt;Summer of Code&lt;/a&gt; program.  One of them is Evelina Vrabie, who has contributed her talents to our &lt;a href="http://sunlightfoundation.com/android/congress/"&gt;Android app&lt;/a&gt; (and has done so from across an ocean -- Evelina's based in Romania).  She was nice enough to write about the experience, and to tease a few of the features she's been working on for the app.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;My name is Evelina Vrabie, and for the last four months I've had a great experience collaborating with Eric Mill on the Congress project for Android, as part of the Google Summer of Code 2010 program.&lt;/p&gt;
&lt;p&gt;Working for the Sunlight Foundation has been an excellent opportunity for me to learn and grow as a capable Android developer. Although at first I was a little inexperienced in the world of mobile applications and a bit overwhelmed by the challenge, I can positively say that four months later I've gradually overcome all the obstacles. I had a great mentor, that's for sure! :) I've learned a lot from Eric and I'm really grateful he was patient enough to help me throughout the project.&lt;/p&gt;
&lt;p&gt;At the beginning, I wasn't actually sure of my chances of being selected this year on GSoC. I started talking to Eric at the beginning of the program, trying to understand the motivations of the project, and to get some additional insight about the community. I can honestly say I had my stomach clenched for a few days in anticipation before the final announcements about selected students, but everything turned out to be even better than expected.&lt;/p&gt;
&lt;p&gt;What helped me the most with getting the hang of the code was talking to Eric as often as it was necessary, because he's the live documentation of the Congress project; kidding, but we should continue our efforts to write good project documentation and best practices for future developers. It wasn't difficult at all to familiarize myself with the application; the code is clear and easy to understand, and I tried my best to keep it that way while adding my own contribution. &lt;/p&gt;
&lt;p&gt;I've implemented some nice and useful features so far, like improving the way legislators are fetched according to the user's GPS location, a "favorites" system for bills and legislators, and a notification system that should be finished by the end of the project. I'm planning to add a nice timeline to better visualize the history and votes of a bill, and I'll also be glad to help Eric with some more UI improvements in the near future.&lt;/p&gt;
&lt;p&gt;GSoC may be over soon, but I'm really hoping my contribution to the Congress application will go beyond that. I'm having a great time doing what I like (Android programming) and even though I'll probably need to get a full time job after the end of the summer, I'll be sure to stay in touch with Eric and the Sunlight Foundation.&lt;/p&gt;
&lt;p&gt;Thank you all for having me around!&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/sunlightlabs/blog/~4/RhpyXQOAnp0" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Evelina Vrabie</dc:creator><pubDate>Mon, 16 Aug 2010 18:10:29 -0400</pubDate><guid isPermaLink="false">http://sunlightlabs.com/blog/2010/gsoc-adds-new-goodies-to-congress/</guid><feedburner:origLink>http://sunlightlabs.com/blog/2010/gsoc-adds-new-goodies-to-congress/</feedburner:origLink></item><item><title>The National Data Catalog Is Hungry</title><link>http://feedproxy.google.com/~r/sunlightlabs/blog/~3/Vttid7VJBdc/</link><description>&lt;p&gt;So you've found some government data on the web. Naturally, you are eager to share your findings with the world. Perfect! Sunlight Labs can help. Our &lt;a href="http://nationaldatacatalog.com"&gt;National Data Catalog (NatDatCat)&lt;/a&gt; is hungry for government data, and we have to feed it regularly. Otherwise, it gets grumpy.&lt;/p&gt;
&lt;p&gt;The first step is to assess what you've found. If it is just a few bits of scattered files, just &lt;a href="http://nationaldatacatalog.com/suggest"&gt;fill out a quick form and tell us about it&lt;/a&gt;. On the other hand, if it is a collection of data sets, you might consider writing an importer...&lt;/p&gt;
&lt;h3&gt;Writing an Importer for NatDatCat&lt;/h3&gt;
&lt;p&gt;Have Ruby and Git skills and a hankering for some Web spelunking? Then writing an importer for NatDatCat might be a perfect &lt;a href="http://thechangelog.com/post/382418778/episode-0-1-3-civic-hacking-with-luigi-montanez-and-jere"&gt;civic hacking project&lt;/a&gt; for you!&lt;/p&gt;
&lt;p&gt;Since the NatDatCat system is centered around a RESTful API, it is easy to write small standalone programs to work with the data. (Even the Web app is, more-or-less, a presentation layer that communicates through the API.) So, to write your importer, you could integrate with the NDC API directly. We have &lt;a href="http://api.nationaldatacatalog.com/docs/"&gt;API documentation&lt;/a&gt; at your service to get you started.&lt;/p&gt;
&lt;p&gt;But not so fast. There is a better way. We recommend using the &lt;a href="http://github.com/sunlightlabs/datacatalog-importer"&gt;NDC importer framework&lt;/a&gt;. The framework serves two major purposes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;It simplifies the task of writing an importer. In particular, the importer framework handles the API communication, so all an importer has to do is handle the external translation step (such as scraping of a Web site or integration with an API). It also provides utility functions that come in handy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It standardizes importers. This encourages the sharing of best practices
and it also makes coordination easier. The various importers are automated
through the use of the &lt;a href="http://github.com/sunlightlabs/datacatalog-imp-system"&gt;National Data Catalog Importer System&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The importer framework is good at doing a few things and delegating the rest. This document will help you get started. Before long, you'll have an importer ready to liberate government data.&lt;/p&gt;
&lt;p&gt;As a prerequisite, you'll need to install the &lt;a href="http://github.com/sunlightlabs/datacatalog-api"&gt;NatDatCat API&lt;/a&gt; on your system. Doing so lets you test your importer locally in a controlled environment. (Once you get your importer working, let us know and we'll add it to our collection of importers that run against our production API.)&lt;/p&gt;
&lt;h3&gt;Importer Walkthrough&lt;/h3&gt;
&lt;p&gt;Let's take a look at some example code in the &lt;a href="http://github.com/sunlightlabs/datacatalog-importer/tree/master/example/"&gt;example&lt;/a&gt; folder.&lt;/p&gt;
&lt;h4&gt;1. Set Up the Rakefile&lt;/h4&gt;
&lt;p&gt;Begin by looking at &lt;a href="http://github.com/sunlightlabs/datacatalog-importer/blob/master/example/rakefile.rb"&gt;example/rakefile.rb&lt;/a&gt;. In this file, you set some configuration information and call out to the importer framework. It will define some rake tasks for you. &lt;/p&gt;
&lt;p&gt;The importer framework handles quite a few things for you provided that you follow its design correctly. Your importer is responsible for providing a Puller class (as defined with &lt;code&gt;:puller =&amp;gt; Puller&lt;/code&gt; in rakefile.rb).&lt;/p&gt;
&lt;h4&gt;2. Make Some Keys and Hide Them&lt;/h4&gt;
&lt;p&gt;Use the API to generate a key for your importer. Remember, API keys are private, so please don't store them in code. Actually, don't even store them in source control at all. Separate them out and store them in &lt;code&gt;config.yml&lt;/code&gt;. Make sure that your &lt;code&gt;.gitignore&lt;/code&gt; file is set up to ignore &lt;code&gt;config.yml&lt;/code&gt;.  It is a good idea to include a &lt;code&gt;config.example.yml&lt;/code&gt; that demonstrates the format of the file.&lt;/p&gt;
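&lt;p&gt;As a rough illustration of keeping the key out of code, here is a minimal sketch of reading it from &lt;code&gt;config.yml&lt;/code&gt; at startup. The &lt;code&gt;load_config&lt;/code&gt; helper and the &lt;code&gt;api_key&lt;/code&gt; and &lt;code&gt;base_uri&lt;/code&gt; key names are assumptions made for this example, not something defined by the importer framework; check &lt;code&gt;config.example.yml&lt;/code&gt; in the example folder for the real format.&lt;/p&gt;

```ruby
require 'yaml'

# Hypothetical layout for config.yml (key names are assumptions made
# for this sketch, not taken from the importer framework):
#
#   api_key: your-private-key
#   base_uri: http://localhost:4567

# Load settings from a YAML file that is kept out of source control.
# Raises a readable error if the file or the key is missing.
def load_config(path = 'config.yml')
  unless File.exist?(path)
    raise "Missing #{path}; copy config.example.yml and fill in your key"
  end
  # An empty file parses to nil or false, so fall back to an empty hash.
  config = YAML.load_file(path) || {}
  if config['api_key'].to_s.empty?
    raise "No api_key found in #{path}"
  end
  config
end
```

&lt;p&gt;Failing loudly when the file is missing makes it obvious to a new contributor that they need to create their own &lt;code&gt;config.yml&lt;/code&gt; rather than look for one in the repository.&lt;/p&gt;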
&lt;h4&gt;3. Make the Puller&lt;/h4&gt;
&lt;p&gt;Next, let's look at the &lt;a href="http://github.com/sunlightlabs/datacatalog-importer/blob/master/example/lib/puller.rb"&gt;Puller class&lt;/a&gt;. It is responsible for defining two methods: &lt;code&gt;initialize&lt;/code&gt; and &lt;code&gt;run&lt;/code&gt;. (The rake tasks constructed above rely on these methods.)&lt;/p&gt;
&lt;p&gt;Please note that the example provided here is oversimplified. It is intended to demonstrate how to use the importer framework, not as a practical example to copy verbatim. If you want to steal some importer code, please visit the &lt;a href="http://github.com/sunlightlabs"&gt;Sunlight Labs projects page&lt;/a&gt; and filter the projects by 'datacatalog-imp-'.&lt;/p&gt;
&lt;p&gt;As you would probably expect, &lt;code&gt;initialize&lt;/code&gt; is called once. Its main purpose is to set up the callback handler (&lt;code&gt;@handler&lt;/code&gt;) to refer back to the importer framework.&lt;/p&gt;
&lt;p&gt;Put the main logic / algorithm / secret recipe / voodoo of your importer in the &lt;code&gt;run&lt;/code&gt; method. The key responsibility of your importer is to call &lt;code&gt;@handler.source&lt;/code&gt; or &lt;code&gt;@handler.organization&lt;/code&gt; each time your importer finds a data source or organization, respectively. (Historical note: the 0.1.x version of the importer framework worked a little bit differently. This is a more flexible style.)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;source parameter&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;@handler.source()&lt;/code&gt; expects a hash parameter of this shape:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&lt;span class="n"&gt;Budget&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&lt;span class="n"&gt;Congressional&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;source_type&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&lt;span class="n"&gt;dataset&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;documentation_url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;license&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&lt;span class="p"&gt;...&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;license_url&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;released&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Kronos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;...&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;to_hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;frequency&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&lt;span class="n"&gt;daily&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;period_start&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Kronos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;...&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;to_hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;period_end&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Kronos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;...&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;to_hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;organization&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt; # &lt;span class="n"&gt;organization&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt; &lt;span class="n"&gt;provides&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;downloads&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&lt;span class="n"&gt;xml&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}],&lt;/span&gt; # &lt;span class="n"&gt;include&lt;/span&gt; &lt;span class="n"&gt;as&lt;/span&gt; &lt;span class="n"&gt;many&lt;/span&gt; &lt;span class="n"&gt;download&lt;/span&gt; &lt;span class="n"&gt;formats&lt;/span&gt; &lt;span class="n"&gt;as&lt;/span&gt; &lt;span class="n"&gt;appropriate&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;custom&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;catalog_name&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&lt;span class="p"&gt;...&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;catalog_url&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Note that most of these parameters match up with the properties defined for a &lt;a href="http://github.com/sunlightlabs/datacatalog-api/blob/master/resources/sources.rb"&gt;Source in the National Data Catalog API&lt;/a&gt;. These parameters are just passed along to the API, which will validate the values.&lt;/p&gt;
&lt;p&gt;The remaining parameters (&lt;code&gt;organization&lt;/code&gt; and &lt;code&gt;downloads&lt;/code&gt;) are handled by the importer framework:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The organization sub-hash is used to look up or create the associated organization for the source. Then an &lt;code&gt;organization_id&lt;/code&gt; key/value pair is sent to the API.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The downloads array is used to look up or create the associated download formats for a data source.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You may have noticed the use of &lt;code&gt;Kronos.parse&lt;/code&gt; above. We highly recommend the use of the &lt;a href="http://github.com/djsun/kronos"&gt;kronos library&lt;/a&gt; for the parsing of dates.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;organization parameter&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;@handler.organization()&lt;/code&gt; expects a hash parameter of this shape:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;acronym&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;org_type&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&lt;span class="n"&gt;governmental&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;organization&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt; # &lt;span class="n"&gt;parent&lt;/span&gt; &lt;span class="n"&gt;organization&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;any&lt;/span&gt;
    &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;catalog_name&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&lt;span class="p"&gt;...&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;catalog_url&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &amp;quot;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&amp;quot;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Note that most of these parameters match up with the properties defined for an &lt;a href="http://github.com/sunlightlabs/datacatalog-api/blob/master/resources/organizations.rb"&gt;Organization in the National Data Catalog API&lt;/a&gt;. These parameters are just passed along to the API, which will validate the values.&lt;/p&gt;
&lt;p&gt;The remaining parameter, &lt;code&gt;organization&lt;/code&gt;, is handled by the importer framework. The framework just looks up the parent organization using the name or url. It then sends &lt;code&gt;parent_id&lt;/code&gt; with the associated parent organization id to the API.&lt;/p&gt;
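&lt;p&gt;To tie the walkthrough together, here is a minimal sketch of a puller that reports one organization and one source. Treat it as an illustration under stated assumptions: &lt;code&gt;RecordingHandler&lt;/code&gt; is a hypothetical stand-in for the framework's handler, the constructor signature is assumed, and the example values are made up.&lt;/p&gt;

```ruby
# Hypothetical stand-in for the framework's callback handler. The real
# handler talks to the API; this one only records what the puller reports.
class RecordingHandler
  attr_reader :sources, :organizations

  def initialize
    @sources = []
    @organizations = []
  end

  def source(data)
    @sources.push(data)
  end

  def organization(data)
    @organizations.push(data)
  end
end

class Puller
  # Assumption: the framework hands its handler to the constructor.
  def initialize(handler)
    @handler = handler
  end

  def run
    # A real importer would scrape or download these records.
    @handler.organization(
      :name     => "Example Agency",
      :url      => "http://agency.example/",
      :org_type => "governmental"
    )
    @handler.source(
      :title        => "Example Budget Data",
      :source_type  => "dataset",
      :url          => "http://agency.example/budget",
      :organization => { :name => "Example Agency" },
      :downloads    => [{ :url => "http://agency.example/budget.xml", :format => "xml" }]
    )
  end
end

handler = RecordingHandler.new
Puller.new(handler).run
puts "#{handler.organizations.size} organization(s), #{handler.sources.size} source(s)"
```

&lt;p&gt;Swapping in a recording handler like this is also a cheap way to check what your puller emits before pointing it at a local copy of the API.&lt;/p&gt;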
&lt;h3&gt;You're Done / Best Practices&lt;/h3&gt;
&lt;p&gt;That's it. But before you go hacking away, let me say a few words about best practices:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;If you are scraping a web site, we highly recommend caching the raw HTML files in your importer. Our production importers are queued up using the NDC Importer System, which integrates nicely with git. It keeps a record of the raw HTML files that correspond to each run. This makes it easier to debug if and when things go wrong.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Take advantage of the utility functions in &lt;a href="http://github.com/sunlightlabs/datacatalog-importer/blob/master/lib/utility.rb"&gt;/lib/utility.rb&lt;/a&gt;. If you have suggestions about useful utility functions, please let us know.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It goes without saying, but please follow best Ruby practices and make a good faith effort at writing clean code. Follow the conventions of the community and strive to make your code readable by other people.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
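&lt;p&gt;For the caching suggestion above, one possible shape is a small helper that saves each page to disk the first time it is fetched. This is only a sketch: &lt;code&gt;cached_page&lt;/code&gt; is a hypothetical name, it is not one of the framework's utility functions, and the choice to name files by a SHA1 digest of the URL is an assumption.&lt;/p&gt;

```ruby
require 'digest/sha1'
require 'fileutils'

# Hypothetical helper: return the cached copy of a page if one exists;
# otherwise run the block to fetch it and save the result to disk.
def cached_page(url, cache_dir)
  FileUtils.mkdir_p(cache_dir)
  path = File.join(cache_dir, Digest::SHA1.hexdigest(url) + ".html")
  return File.read(path) if File.exist?(path)
  html = yield(url)
  File.write(path, html)
  html
end
```

&lt;p&gt;The block is where the actual download happens (for example, with Ruby's &lt;code&gt;Net::HTTP&lt;/code&gt;), which keeps the caching logic separate from the network code.&lt;/p&gt;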
&lt;p&gt;And thanks for helping us feed the National Data Catalog!&lt;/p&gt;
&lt;h3&gt;Talk to Us / Stay Up To Date&lt;/h3&gt;
&lt;p&gt;Please reach out to us on our &lt;a href="http://groups.google.com/group/datacatalog"&gt;National Data Catalog Google Group&lt;/a&gt;. We can help you with your importer. Once it works reliably, we will want to add it to our importer system. The more up-to-date, relevant government data we bring in, the more useful our data catalog becomes.&lt;/p&gt;
&lt;p&gt;This document is adapted from the README in the &lt;a href="http://github.com/sunlightlabs/datacatalog-importer"&gt;datacatalog-importer source code repository&lt;/a&gt;. You can find the latest version there.&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/sunlightlabs/blog/~4/Vttid7VJBdc" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">David James</dc:creator><pubDate>Fri, 13 Aug 2010 13:04:36 -0400</pubDate><guid isPermaLink="false">http://sunlightlabs.com/blog/2010/natdatcat-is-hungry/</guid><feedburner:origLink>http://sunlightlabs.com/blog/2010/natdatcat-is-hungry/</feedburner:origLink></item><item><title>I&amp;#39;m Kind of a Sucker for Transit Data</title><link>http://feedproxy.google.com/~r/sunlightlabs/blog/~3/mPtmmLuKO5I/</link><description>&lt;p&gt;This may admittedly be of limited interest to those outside the DC area, but it's &lt;em&gt;extremely&lt;/em&gt; interesting to me, so I'm afraid you'll just have to humor me for a paragraph or two. WMATA, our regional transit agency, has just launched a &lt;a href="http://developer.wmata.com/"&gt;developer portal and API&lt;/a&gt;, and they've done a really nice job of it.  People seem to love transit data -- after crime data it seems to be the municipal information people get most excited about (and I'd argue that it's much, &lt;em&gt;much&lt;/em&gt; more useful than crime data) -- and I'm no exception.  Playing with this stuff is a bit of a &lt;a href="http://dcist.com/2008/04/10/transit_on_thur_27.php"&gt;hobby&lt;/a&gt; of mine, and I've been following WMATA's gradual move toward openness for years.  This is a big step forward for both the agency and its customers.&lt;/p&gt;
&lt;p&gt;Bus data is still forthcoming, and I suspect that's where the real possibilities lie: the rail system is pretty easy to use; tech can pay bigger dividends when applied to the relative mysteries of the bus.  Still, it's already clear that WMATA has made some smart decisions about implementation, defined reasonable terms of service, and generally seems to be moving in the right direction.  When the API is considered alongside the already-released &lt;a href="http://www.wmata.com/rider_tools/developer_resources.cfm"&gt;GTFS&lt;/a&gt; dataset, Metro's offerings match up fairly well (though not perfectly) with the &lt;a href="http://sunlightfoundation.com/policy/documents/ten-open-data-principles/"&gt;ten open data principles&lt;/a&gt; that Sunlight has just published.&lt;/p&gt;
&lt;p&gt;Now to see if I can't get a &lt;a href="http://bmander.github.com/graphserver/"&gt;Graphserver&lt;/a&gt; instance running...&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/sunlightlabs/blog/~4/mPtmmLuKO5I" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Lee</dc:creator><pubDate>Thu, 12 Aug 2010 16:31:48 -0400</pubDate><guid isPermaLink="false">http://sunlightlabs.com/blog/2010/im-kind-of-a-sucker-for-transit-data/</guid><feedburner:origLink>http://sunlightlabs.com/blog/2010/im-kind-of-a-sucker-for-transit-data/</feedburner:origLink></item><item><title>Sunlight Labs Community Survey</title><link>http://feedproxy.google.com/~r/sunlightlabs/blog/~3/snTZzjDkGFU/</link><description>&lt;p&gt;As we say goodbye to this summer's crop of interns it reminds me that it has been over three years since my own internship here began.  When I started as an intern back in 2007 Sunlight Labs consisted of less than half a dozen developers, our blog had 20 subscribers, and we had two open source projects on Google Code.&lt;/p&gt;
&lt;p&gt;Obviously, a lot has changed.  Sunlight Labs now has approximately as many employees as all of the foundation did three years ago, and between &lt;a href="http://sunlightlabs.com/people"&gt;our website&lt;/a&gt; and &lt;a href="http://groups.google.com/group/sunlightlabs"&gt;mailing list&lt;/a&gt; we've grown into a community of thousands.  &lt;a href="http://github.com/sunlightlabs"&gt;Our GitHub account&lt;/a&gt; now features almost 100 projects written in Ruby, Python, JavaScript, Java and even a bit of ActionScript and C.&lt;/p&gt;
&lt;p&gt;Even more impressively, you all have contributed another hundred or so open source projects, from libraries to help people use &lt;a href="http://github.com/cpharmston/python-opencongress"&gt;various&lt;/a&gt; &lt;a href="http://github.com/opengovernment/govkit"&gt;Open Government&lt;/a&gt; &lt;a href="http://github.com/dqminh/transparencydata-php"&gt;APIs&lt;/a&gt; to those &lt;a href="http://sunlightlabs.com/projects/"&gt;dreamt up entirely on your own&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One thing hasn't changed: everyone here views the community as an essential component of what we're trying to do.  ("If you build an API and nobody is there to use it, does it open the government at all?")  In light of the important role that all of you play in defining and building this community, it seems time for us to take a step back and look at what is and isn't working.  We want to find out what our community wants and needs from us to keep doing the important work that we're all invested in.&lt;/p&gt;
&lt;p&gt;We've put together a &lt;a href="https://spreadsheets0.google.com/viewform?formkey=dG5ydXBkZ2VSSThSR0RhLTlFM0Y0VUE6MQ"&gt;short survey&lt;/a&gt;, and we'd greatly appreciate your responses.  It shouldn't take more than ten minutes.  By answering you'll be a part of this re-evaluation of where we focus our efforts so that we can help ensure that this community stays focused and energized.  Tell us what you like about the community and where we're slacking, but most importantly tell us what you need.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://spreadsheets0.google.com/viewform?formkey=dG5ydXBkZ2VSSThSR0RhLTlFM0Y0VUE6MQ" style="font-size: 140%" &gt; Take the Survey&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The survey is open until August 27th and as a special thank you for participation we're going to have a random drawing for a $50 &lt;a href="http://www.makershed.com/"&gt;MakerShed&lt;/a&gt; gift certificate.  If you want to be entered into the drawing be sure to complete the survey and include your email address and check the box indicating you wish to be considered in the drawing.&lt;/p&gt;
&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/sunlightlabs/blog/~4/snTZzjDkGFU" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">James</dc:creator><pubDate>Thu, 12 Aug 2010 14:39:08 -0400</pubDate><guid isPermaLink="false">http://sunlightlabs.com/blog/2010/community-survey/</guid><feedburner:origLink>http://sunlightlabs.com/blog/2010/community-survey/</feedburner:origLink></item><item><title>Building Poligraft</title><link>http://feedproxy.google.com/~r/sunlightlabs/blog/~3/3tUkEZXPuxE/</link><description>&lt;p&gt;&lt;a href="http://poligraft.com"&gt;&lt;img src="http://assets.sunlightfoundation.com.s3.amazonaws.com/images/poligfaftlogosm.png" class="detailimage"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I'm happy to announce the newest project from Sunlight Labs, &lt;a href="http://poligraft.com"&gt;Poligraft&lt;/a&gt;. A utility built on top of &lt;a href="http://transparencydata.com"&gt;Transparency Data&lt;/a&gt;, Poligraft takes in a block of text, parses it for entities like politicians and corporations, and returns a result set representing the political influence contained in that text. I won't dwell on the features -- read Ellen Miller's &lt;a href="http://blog.sunlightfoundation.com/2010/08/05/poligraft-brings-politics-influences-together-in-one-click"&gt;announcement blog post&lt;/a&gt;  and the &lt;a href="http://poligraft.com/about"&gt;about&lt;/a&gt; page for more information. What I want to talk about instead is the development process.&lt;/p&gt;
&lt;h3&gt;Third Time's A Charm&lt;/h3&gt;
&lt;p&gt;The idea behind Poligraft is not new. Back in late 2007, well before I joined Sunlight, the nascent Labs team attempted an initial version of the concept that didn't pan out. Then in early 2009, I still wasn't at Sunlight, but I did develop an entry for the first &lt;a href="http://sunlightlabs.com/contests/appsforamerica/"&gt;Apps for America&lt;/a&gt; contest that was called &lt;a href="http://sunlightlabs.com/contests/appsforamerica/apps/defogger/"&gt;Defogger&lt;/a&gt;. Defogger was embarrassingly slow, didn't use any AJAX updating, and stopped short of making the connections between entities that Poligraft does today. Much more worthy apps placed at the top of Apps for America.&lt;/p&gt;
&lt;p&gt;But in developing Defogger, I did build a key piece of the puzzle used in Poligraft: what I now call the &lt;a href="http://github.com/sunlightlabs/poligraft/blob/master/app/models/content_plucker.rb"&gt;content plucker&lt;/a&gt;. Since the best way to use Poligraft is through the bookmarklet, the app needs to pull out the article content from an arbitrary URL. Thankfully, the &lt;a href="http://lab.arc90.com/experiments/readability/"&gt;Readability&lt;/a&gt; bookmarklet does exactly that, and the code is &lt;a href="http://code.google.com/p/arc90labs-readability/source/browse/trunk/js/readability.js"&gt;open source&lt;/a&gt;. The algorithm examines containing elements for paragraphs, and assigns more points to the containers that are more likely to contain the page's main content. Porting the algorithm from JavaScript to Ruby was a fun exercise in screen scraping.&lt;/p&gt;
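&lt;p&gt;To give a feel for the scoring idea, here is a toy sketch. It is not the content plucker itself or Readability's actual algorithm: the function names, the length threshold, and the point values are all assumptions picked for illustration, and the containers are passed in as plain strings rather than parsed from HTML.&lt;/p&gt;

```ruby
# Score a container by its paragraphs: one point per paragraph of
# reasonable length, plus one point per comma (a rough signal of prose).
# The threshold and weights are illustrative assumptions.
def score_container(paragraphs)
  paragraphs.inject(0) do |score, text|
    next score unless text.length > 25
    score + 1 + text.count(",")
  end
end

# containers maps a container name to the text of its paragraphs.
# Returns the name of the highest-scoring container, or nil if empty.
def pluck_main_container(containers)
  best = containers.max_by { |_name, paragraphs| score_container(paragraphs) }
  best ? best.first : nil
end
```

&lt;p&gt;Short navigation snippets score nothing under this scheme, while a container full of long, comma-laden paragraphs rises to the top.&lt;/p&gt;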
&lt;h3&gt;Harnessing APIs&lt;/h3&gt;
&lt;p&gt;With the content plucked, Poligraft extracts the entities (people, organizations, companies) from the text with the &lt;a href="http://opencalais.com"&gt;Calais&lt;/a&gt; API. A service by Thomson Reuters, Calais semantically processes any given text, and returns a rich representation of that text. It's very detailed, much more so than what Poligraft needs. Try out the &lt;a href="http://viewer.opencalais.com/"&gt;Calais Viewer&lt;/a&gt; to see what I mean.&lt;/p&gt;
&lt;p&gt;Using the people, companies, and organizations that Calais detected, Poligraft then uses the &lt;a href="http://transparencydata.com/api/"&gt;Transparency Data API&lt;/a&gt; in three steps. First, the Transparency Data &lt;a href="http://transparencydata.com/api/aggregates/contributions/#search-methods"&gt;entity search&lt;/a&gt; is called on each Calais entity. This will usually weed out the majority of entities detected by Calais, because we're only focusing on entities that have something to do with campaign contributions. These are the "Points of Influence" you see in the sidebar, and you can sometimes see the "weed out" step if you watch closely. Second, on that subset of entities, Poligraft uses the Transparency Data &lt;a href="http://transparencydata.com/api/aggregates/contributions/#politician-methods"&gt;aggregate endpoints&lt;/a&gt; to draw the graphs you see on the sidebar. Third, the "Aggregated Contributions" section in the sidebar is filled out using a pairwise aggregation endpoint that is not yet described in the official Transparency Data API documentation. It'll be ready for public use very soon.&lt;/p&gt;
&lt;h3&gt;Providing an API&lt;/h3&gt;
&lt;p&gt;Poligraft also has its own built-in API, which is used by Poligraft itself for dynamically populating the results page via AJAX. Specify a URL or text to be processed, and get back the results in JSON format. In fact, every result page in Poligraft has a corresponding JSON representation. Just append a &lt;code&gt;.json&lt;/code&gt; to the unique slug, &lt;a href="http://poligraft.com/vyJf.json"&gt;like so&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To process an article, use the &lt;code&gt;http://poligraft.com/poligraft&lt;/code&gt; endpoint in conjunction with a &lt;code&gt;POST&lt;/code&gt; or &lt;code&gt;GET&lt;/code&gt; request:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;http://poligraft.com/poligraft?url=ARTICLE-URL-HERE&amp;amp;json=1&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Be sure to pass in &lt;code&gt;json=1&lt;/code&gt; or else HTML will be returned, and use &lt;code&gt;url=&lt;/code&gt; to pass in a URL or &lt;code&gt;text=&lt;/code&gt; to pass in a selection of text. HTTP clients must have redirection enabled, as the response will be a redirect to a slug endpoint like &lt;code&gt;http://poligraft.com/ABCD.json&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Because Poligraft does processing asynchronously, this endpoint will return a &lt;code&gt;202 ACCEPTED&lt;/code&gt; code until processing is finished, when it returns a &lt;code&gt;200 OK&lt;/code&gt;. In addition to the HTTP response code, there's a top-level field in the JSON called &lt;code&gt;processed&lt;/code&gt; which is set to &lt;code&gt;false&lt;/code&gt; while processing is active. Poll the endpoint every few seconds until the return code is &lt;code&gt;200&lt;/code&gt; or the &lt;code&gt;processed&lt;/code&gt; value in the JSON is true. Both techniques will work.&lt;/p&gt;
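&lt;p&gt;A polling loop along those lines might look like the sketch below. The &lt;code&gt;poll_until_processed&lt;/code&gt; name and its parameters are hypothetical; the block is expected to perform one request against the slug endpoint and return the status code together with the parsed JSON.&lt;/p&gt;

```ruby
# Poll until processing finishes. The block performs one request and
# returns [status_code, parsed_json_hash]; passing it in keeps the loop
# independent of any particular HTTP client.
def poll_until_processed(max_attempts, delay_seconds)
  max_attempts.times do
    status, body = yield
    return body if status == 200 || body["processed"] == true
    sleep(delay_seconds)
  end
  raise "Result still processing after #{max_attempts} attempts"
end
```

&lt;p&gt;In real use the block would issue a &lt;code&gt;GET&lt;/code&gt; to the &lt;code&gt;.json&lt;/code&gt; slug URL and parse the body; capping the number of attempts avoids polling forever if something goes wrong.&lt;/p&gt;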
&lt;h3&gt;Open Source + Open Data&lt;/h3&gt;
&lt;p&gt;As usual, the code behind Poligraft is open source on &lt;a href="http://github.com/sunlightlabs/poligraft"&gt;GitHub&lt;/a&gt;. The APIs it uses are available for use, for free. Specifically, the Transparency Data API is incredibly valuable for building tools and apps that examine and visualize political influence. While building Poligraft, I was pleasantly surprised on many occasions by what Transparency Data provides. In the months and years to come, I hope we see many more apps built on top of it, not just from within the Labs, but from the wider community. &lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/sunlightlabs/blog/~4/3tUkEZXPuxE" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Luigi Montanez</dc:creator><pubDate>Fri, 06 Aug 2010 11:44:58 -0400</pubDate><guid isPermaLink="false">http://sunlightlabs.com/blog/2010/building-poligraft/</guid><feedburner:origLink>http://sunlightlabs.com/blog/2010/building-poligraft/</feedburner:origLink></item><item><title>We Don&amp;#39;t Need a GitHub for Data</title><link>http://feedproxy.google.com/~r/sunlightlabs/blog/~3/7kfNRNPScw4/</link><description>&lt;p&gt;&lt;img alt="picture of Lt. Commander Data standing in front of a screen with the GitHub log" title="corny, I know" src="http://assets.sunlightlabs.com/blog/github_for_data.jpg" class="detailimage_right"&gt;There was an interesting exchange this past weekend between Derek Willis of the New York Times and Sunlight's own Labs Director emeritus, Clay Johnson.  Clay wrote &lt;a href="http://infovegan.com/2010/07/30/github-for-data"&gt;a post&lt;/a&gt; arguing that we need a "GitHub for data":&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It's too hard to put data on the web. It’s too hard to get data off the web. We need a GitHub for data.&lt;/p&gt;
&lt;p&gt;With a good version control system like Git or Mercurial, I can track changes, I can do rollbacks, branch and merge and most importantly, collaborate. With a web counterpart like GitHub I can see who is branching my source, what’s been done to it, they can easily contribute back and people can create issues and a wiki about the source I’ve written. To publish source to the web, I need only configure my GitHub account, and in my editor I can add a file, commit the change, and publish it to the web in a couple quick keystrokes.&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;Getting and integrating data into a project needs to be as easy as integrating code into a project. If I want to interface with Google Analytics with ruby, I can type gem install vigetlabs-garb and I’ve got what I need to talk to the Google Analytics API. Why can I not type into a console gitdata install census-2010 or gitdata install census-2010 —format=mongodb and have everything I need to interface with the coming census data?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;On his own blog, &lt;a href="http://blog.thescoop.org/archives/2010/07/31/a-GitHub-for-data/"&gt;Derek pushed back a bit&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[...] The biggest issue, for data-driven apps contests and pretty much any other use of government data, is not that data isn’t easy to store on the Web. It’s that data is hard to understand, no matter where you get it.&lt;/p&gt;
&lt;p&gt;[...]&lt;/p&gt;
&lt;p&gt;What I’m saying is that the very act of what Clay describes as a hassle:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A developer has to download some strange dataset off of a website like data.gov or the National Data Catalog, prune it, massage it, usually fix it, and then convert it to their database system of choice, and then they can start building their app.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Is in fact what helps a user learn more about the dataset he or she is using. Even a well-documented dataset can have its quirks that show up only in the data itself, and the act of importing often reveals more about the data than the documentation does. We need to import, prune, massage, convert. It’s how we learn.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I think there's a lot to what Derek is saying.  Understanding what an MSA (metropolitan statistical area) is, or how to match Census data up against information that's been geocoded by ZIP code -- these are bigger challenges than figuring out how to get the Census data itself.  The documentation for this stuff is difficult to find and even harder to understand.  Most users are driven toward the &lt;a href="http://factfinder.census.gov/home/saff/main.html?_lang=en"&gt;American FactFinder tool&lt;/a&gt;, but if it can't tell you what you want, you're going to have to spend some time hunting down the appropriate FTP site and an explanation of its organization -- Clay's right that this is a pain.  But it's nothing compared to the challenge of figuring out how to use the data properly.  That can be daunting.&lt;/p&gt;
&lt;p&gt;But I think there are problems with the "GitHub for data" framing that go beyond the simple fact that the problems GitHub solves aren't the biggest problems facing analysts.  Before I talk about that, though, let me applaud Clay for something he did in his post: he used specific examples.  I have to admit I bristle when people start talking about what could or needs to be done with "data" in the abstract.  I think that level of vagueness makes it tough to reach useful conclusions.  As Derek pointed out, meaningful analysis requires that we roll up our sleeves and begin to deal with the specifics of a dataset.  A focus on the most abstract level of discussion can also sometimes be symptomatic of generalists who find the idea of big datasets exciting, but have never actually tried to conduct an analysis themselves.  I'm glad for their enthusiasm, but hesitant to spend much time listening to their recommendations.  If someone doesn't know what a normal distribution is, I have doubts about how much meaningful insight they can contribute to discussions about appropriate tools and policies for open data.  Anyway, cheers to Clay for avoiding this trap.&lt;/p&gt;
&lt;p&gt;So! On to my quibbles.  My sense is that the phrase "GitHub for [blank]" is getting thrown around a lot right now because GitHub is still relatively new, Git itself is innately mysterious and powerful, and there's a general sense that a lot of exciting things are happening within the GH community.  GitHub brought social features to the world of code repositories, and did so with impressive execution.  That's a real innovation, and people are justifiably excited about it.&lt;/p&gt;
&lt;p&gt;I'm all for taking advantage of social web innovations within analyst communities.  That's part of the vision for the &lt;a href="http://nationaldatacatalog.com"&gt;National Data Catalog&lt;/a&gt;, after all -- having a place where people interested in the same problems can find each other and share their work.  But if that's all we're after, I'd rather start talking about a "Flickr for data" or "Reddit for data" -- I think that both framings might offer a bit less novelty, but could save us from making some serious conceptual mistakes.&lt;/p&gt;
&lt;p&gt;And here I'm referring to the core functionality of GitHub: version control.  Simply put, &lt;strong&gt;version control doesn't make sense for data&lt;/strong&gt;.  In one sense it seems to, because both code and datasets demand auditability -- the ability to trace an artifact's current state back to its origin.  Version control can certainly do this.  For code, one can look at a diff and see what has changed and why.  The transformation between one revision and another will probably involve some lines being changed, and others not, and hopefully the insertion of inline comments.  That, plus the commit message attached to the revision, will usually be enough for a viewer to understand the transformation.&lt;/p&gt;
&lt;p&gt;Transformations on datasets aren't like that.  If you normalize a vector, &lt;strong&gt;every number in it will change&lt;/strong&gt;.  This is going to make your version control system suck up a ton of space, of course, but the bigger problem is that while you might be able to tease out the nature of the transformation by looking at the before and after snapshots, it's going to be much harder to do so than it is with code.  &lt;strong&gt;The transformation is the thing that is interesting, and the thing that may need revision in order to fix mistakes&lt;/strong&gt;.  And in most cases, the transformation is going to be written in code of one type or another.  That's what needs to go into the version control system -- not the data.&lt;/p&gt;
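&lt;p&gt;A minimal sketch of that point, in Python.  The &lt;code&gt;normalize&lt;/code&gt; function and the three sample values are made up for illustration; they don't come from any real dataset.  After the transformation, a diff of the data flags every single value, while the function that caused the change fits in a few reviewable lines.&lt;/p&gt;

```python
import math


def normalize(vector):
    """Scale a vector to unit length (L2 norm)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Made-up raw values, chosen so the norm works out to exactly 13.
raw = [3.0, 4.0, 12.0]
derived = normalize(raw)

# Count the values a before/after diff of the data would flag as changed.
changed = sum(1 for before, after in zip(raw, derived) if before != after)
print(changed, "of", len(raw), "values changed")  # prints: 3 of 3 values changed
```

&lt;p&gt;The function is the part worth committing: it is short, it diffs cleanly, and anyone holding the raw values can rerun it.&lt;/p&gt;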
&lt;p&gt;In fact, I'd argue that the data &lt;strong&gt;shouldn't&lt;/strong&gt; go into the VCS.  Because there's another big difference between evolving data and evolving code: &lt;strong&gt;worked-on code tends to get better, while worked-on data tends to get worse&lt;/strong&gt;.  Not for the person doing the work, maybe.  But for everyone else, I think this generally holds true.  The original data is sacred, in a way -- there may be effects hidden within it that aren't immediately apparent, and which ought to be preserved.  The transformations that the data must undergo will often lead to the loss of information -- a necessary step in service of analysis, but one that needs to be kept in mind, as that information may prove to be valuable, sometimes in ways that can't be anticipated.&lt;/p&gt;
&lt;p&gt;I'll make up an example: let's say someone's studying dolphin calls to see if they can be computationally differentiated.  They produce some high-resolution underwater audio recordings in a PCM format, then set about improving the dataset to facilitate others' analysis.  Maybe they do some filtering to remove noises outside the frequency band known to be used by dolphins, then they compress everything down to MP3 to facilitate distribution.&lt;/p&gt;
&lt;p&gt;That might be okay, but it's not sufficient.  We can't look at the difference between the MP3 and the source audio and know what we've lost, or what assumptions might be coloring our analysis of the derived data.  We need to know about the methodology that was used to get from one point to another, so that we can, for example, revisit the assumptions we made about the frequency range we're examining (maybe we discover something new about dolphin biology; or maybe there's a harmonic that travels farther underwater than we'd expected).  Ideally, we'll just share those transformations along with the source data and let other users run them themselves -- like &lt;strong&gt;make &amp;amp;&amp;amp; make install&lt;/strong&gt;.  We should only be shuttling around "improved" data when there are practical reasons for doing so (typically related to file size or the transformations being extremely computationally demanding) and when the transformations have been thoroughly reviewed.  I'm not confident that a freewheeling environment akin to GitHub can reach that level of control and rigor -- nor should it, in my opinion.&lt;/p&gt;
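&lt;p&gt;Here's a rough sketch of what sharing the transformation alongside the source data buys you.  Everything in it is hypothetical: the &lt;code&gt;band_filter&lt;/code&gt; function, the frequency and amplitude pairs, and both band limits are invented stand-ins, not real dolphin acoustics.  The point is only that a revised assumption becomes a changed argument and a rerun, which is impossible if all you kept was the filtered output.&lt;/p&gt;

```python
# Made-up (frequency in Hz, amplitude) pairs standing in for a raw recording.
# The raw data is kept untouched; only derived copies are ever filtered.
RAW_SPECTRUM = ((500, 0.2), (8000, 0.9), (40000, 0.7), (150000, 0.4))


def band_filter(spectrum, low_hz, high_hz):
    """Keep only the components whose frequency lies inside the band."""
    band = range(low_hz, high_hz + 1)  # whole-number Hz values only
    return [(hz, amp) for hz, amp in spectrum if hz in band]


# First pass, using an assumed (hypothetical) band of interest.
narrow = band_filter(RAW_SPECTRUM, 1000, 50000)

# Later the assumption is revised. Because the raw data and the function
# were both kept, we rerun instead of guessing at what was thrown away.
wide = band_filter(RAW_SPECTRUM, 1000, 200000)

print(len(narrow), "components kept with the narrow band")  # prints: 2 ...
print(len(wide), "components kept with the wide band")      # prints: 3 ...
```

&lt;p&gt;The 150000 Hz component never appears in the narrow result, and nothing in that result says it ever existed.  Only the raw data plus the filter code can bring it back.&lt;/p&gt;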
&lt;p&gt;So let's share our transformations -- our code -- in a social way.  I think that'll work fine, and convey real advantages.  Better still, the tools are already built.  But data is different than code, and we should think carefully before we jam it into the conceptual framework of VCS.  That isn't to say that we don't need better tools for managing it -- and the good news here is that people like &lt;a href="http://thedata.org/home"&gt;Harvard's Gary King are thinking hard about what those tools might look like&lt;/a&gt;.  But GitHub is the wrong model.&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/sunlightlabs/blog/~4/7kfNRNPScw4" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tom Lee</dc:creator><pubDate>Thu, 05 Aug 2010 12:04:53 -0400</pubDate><guid isPermaLink="false">http://sunlightlabs.com/blog/2010/we-dont-need-a-github-for-data/</guid><feedburner:origLink>http://sunlightlabs.com/blog/2010/we-dont-need-a-github-for-data/</feedburner:origLink></item></channel></rss>
