Trevor Muñoz

Making Digital Humanities Work

2014-07-14T00:00:00+00:00

Here is a lightly edited text of the long (20-minute) paper that Jennifer Guiliano and I co-authored and delivered at the Digital Humanities 2014 conference in Lausanne, Switzerland. As we noted at the time, this talk is extracted from a larger work in progress in which we are trying to get to grips with what seem to be persistent challenges in organizing the work of digital humanities research.

Since we were fortunate to receive so many excellent comments and questions from the audience who could attend in Lausanne, we wanted to offer the text here as a way of facilitating continuation of the conversation. We warmly welcome further comments, challenges, questions, corrections and disagreements.

Finally, some of the comments on Twitter made clear that we could have signposted more clearly that many of the issues we take up here might be particular to the context of U.S. universities and less relevant to different institutional or international contexts.…

Cross-posted at jguiliano.com

This paper, officially titled “Making Digital Humanities Work”, is on its face about an activity we created to engage librarians at the University of Maryland in digital humanities. This activity, which we called the “Digital Humanities Incubator”, might look outwardly like a training program, but that label would be slightly misleading for the purposes of this discussion. We’ll reveal what the Incubator encompassed and how it worked and didn’t work.

Yet this “resume” of the activity, which is usually considered the meat of a presentation, is provided here merely as background to an attempt to grapple with larger problems that arise in any effort to organize DH work.

Through discussing our experience with planning the incubator and then creating and running a first iteration of the activity, we hope to do four things: first, to argue for the continuing urgent need to reflectively critique the structure and practice of labor in the (digital) humanities; second, to suggest that there are valuable insights to be gained by taking a “deflationary” approach as part of this critical examination; third, to contribute a few insights (we hope) from our local situation; and finally, fourth, to encourage future work along these lines of theoretical engagement. Thus in this presentation, our treatment of the digital humanities incubator is both a reflective and prospective endeavor.

For us, the activity of trying to figure out how to bring new participants into the digital humanities has functioned as an exercise in critical design: we have used the making and re-making of this activity, which ultimately culminated in the Digital Humanities Incubator, as a space for thinking through alternate ways to make the digital humanities work. We hope it will be clear over the course of the talk that the primary change we were trying to effect was not really skill acquisition, which is why labeling the Incubator a training initiative is slightly misleading. The primary change we hoped for had more to do with empowering librarians to see their position differently with regard to digital humanities research.

While much has been said and written about securing more resources and recruiting and developing talented labor, this approach runs the risk of slipping into individual anecdote (isn’t each institutional situation unique?) when what is needed is structural critique of the larger system within which DH work exists. Privileging anecdote, even when cast as “advice” or “lessons learned”, risks falling into a constructionist fallacy of progress and improvement: more digital humanists and more money for digital humanities centers/labs/initiatives will lead to the Transformation (capital “T”) of the humanities and the academy. More simply, ignoring the system in favor of the individual fails our humanistic tradition of critical analysis and theoretical engagement.

Bringing the theoretical concerns of our disciplines (history and library and information science, in this case) to the daily practice of our digital humanities work, our initial goal with the structure and curriculum of the incubator was to attempt to more accurately communicate how digital humanities project work takes place rather than to retread pre-existing patterns of digital humanities research.

The doubled meaning of our title reflects this tension. ‘Work’ here means the labor involved in the production of both the incubator and the many collaborative digital humanities research projects it is intended to mimic.

It also has the sense of ‘to work’: a measure of activity and the value assigned to it within a given system of production, both internally and externally.

Background

We begin our presentation with where we ended—the structure and content of our first iteration of the Digital Humanities Incubator and then we will move backwards to unfold the motivations and decisions that led us to this conscious intervention around digital humanities work.

Participants in the Digital Humanities Incubator were librarians from the University of Maryland during the 2013-2014 year. The Incubator targeted academic librarians of all ranks, library staff, and graduate assistants who wished to gain new capabilities for developing, supporting, and leading digital projects. The Incubator activity was part of a conscious effort by library leadership to correct a lack of institutional support for, and engagement with, digital humanities research by librarians. Over the course of the Incubator, forty-two librarians, staff, and graduate students participated in at least one workshop and exercise set. A significant portion of this number participated in all the activities.

Workshop and exercise topics covered the process of developing digital humanities project ideas, finding data, evaluating tools, and crafting a compelling proposal for internal or external funding.

Our first workshop focused on providing attendees with a sense of contemporary definitions of DH, an overview of the scope and variety of DH approaches, and perspective on this sprawling landscape such that, by the close of the session, participants could articulate some ideas of how DH might be connected or applied to their own librarianship, research, or service.

Workshop two focused on the mechanisms of developing DH research ideas. Participants were encouraged to approach their own existing library work as material that could be framed as research ideas. This workshop also discussed strategies to determine the value of these research ideas as potential interventions in the ongoing discourses of digital humanities. Our goal was that by the close of the second workshop, participants could communicate at least one idea of something from their regular practice of librarianship as a research idea.

Workshop three focused on preparing and working with data—the basis of many digital humanities research projects. Participants were given a series of tutorials that allowed them to hunt and gather data, sort and improve it via common data tools, like Open Refine, and transform it for analytical and display use. The goal of the workshop was for each participant to gain experience in identifying data that might pertain to their own research interests and to explore the different ways common digital tools might analyse and visualize that data.

Workshop four aided participants in developing the research idea they’d been working with throughout the sequence into an externally-communicated project proposal. Participants were provided with an overview of the key components of project proposals (goals, workplan, staffing, budget, etc) and worked to elucidate those individually (or collectively where needed). The goal of this workshop was less a fully formulated project proposal than the identification by participants of the resources and partnerships they might need to bring their research idea to fruition.

Skills learned in this program were grounded in participants’ own project ideas and interests, supported by the brief but in-depth lectures described above. This approach was intentional because we wanted participants to focus on their roles as authors and leaders of the academic research mission of the university, marrying information organization and curation with scholarship and public engagement through the use of digital technologies.

The Incubator was implemented using in-kind support from MITH in the form of staff time. Each of us wrote the lectures, led office hours, and staffed the individual consultations leading up to the close of the workshop cycle, with one other MITH staff member contributing a limited amount of instructional support for the data workshop.

Low availability of resources (translated mechanistically into small staff commitment) meant that we were able to be flexible and agile in adapting the program to the needs of librarians. For example, we needed to schedule events with librarians differently than we might with other faculty participants due to different expectations of work schedules. Each workshop needed to be held twice—once in the morning and once in the afternoon on alternate days, spaced at least two days apart. This allowed part-time librarians, library division heads, and full-time reference/subject librarians to have the opportunity to attend without having to make special work arrangements. This also allowed for librarians to potentially hear the same workshop content twice—an allowance useful for differing types of learning styles. Librarians were able to attend the first workshop, listening and taking notes, then return for the second to ask questions and further understand what they might have missed the first time. This was particularly important for the “data” workshop where participants were encouraged to scrape, manipulate, and explore data in ways unfamiliar to many librarians but crucial to much digital humanities work.

Additionally, by interlacing office hours with each workshop, the co-instructors were able to revise content, post answers to questions, and facilitate immediate interventions on the project level where needed. In part, this blended structure combining instruction with hands-on work with participants’ own project ideas allowed for quick implementation, demonstration of core concepts, discussion and reflection. It also gave participants a sense of a dedicated individual with whom they could develop a personal relationship. This relationship-building was crucial to alleviate feelings of anxiety and to counteract assumptions that particular topics or projects were not “research” or not “digital humanities” enough. Simply, building up familiarity between librarians and instructors helped librarians feel welcome in the space of the digital humanities. Further, librarians attending without their own project were encouraged to join a colleague’s. This “learn and do” philosophy enabled collaboration across divisions in rewarding ways that we do not, unfortunately, have time to explore here.

Incubator participants were encouraged but not required to “pitch” an idea for a digital project during the capstone event for the Incubator program. These short (5-minute) presentations drew on skills acquired during the Incubator, forcing participants to frame their research interests in terms of potential audiences, available technologies, and overall feasibility of timeframes and budgets. Thirteen librarians or teams of librarians participated in the “pitch round” during the capstone event. These presentations were attended and evaluated by an audience of their peers from the Libraries as well as by MITH staff. Five finalists were chosen to continue refining their proposals for a potential longer term joint project. Putting potential project ideas in front of an audience both reflected back to the community at large the dynamism and diversity of research ideas coming from librarians and assured potential project leaders of an engaged and supportive audience for work of this kind. [Ed.: Prompted by an excellent question from the audience, we clarified that this project pitch was not meant to encourage participants to “package” or close off their ideas for an external requirement. Rather, we attempted to use the discipline of the proposal to help participants focus on distilling and communicating their ideas to a public audience. We also acknowledged that this way of wrapping up the Incubator sequence is a vestigial gesture that falls back into some of the habits we were trying to distance the new activity from.]

“Unsuccessful” proposals also generated new digital initiatives for the Libraries by being turned back to divisions that could assist in their development. These spin-off projects—one related to crowdsourcing and one related to acquisition of basic programming skills—found homes as internal library projects. These conjoined activities—workshop, office hours, consultation, and pitch—encapsulated the “work” of the Digital Humanities Incubator. To be clear, our intention is not to present the Incubator as an unqualified success. While very successful internally to the University of Maryland Libraries and MITH in terms of generating enthusiasm and contributing to a broader positive change in tenor of conversations around research, digital technologies, and collaboration across the institution, the activity has been difficult to sustain for a variety of reasons.

Need for Critique

The design of the activity that we’ve just described was an attempt to respond to shortcomings within the model of collaborative practice we had received and participated in as working digital humanists. Fellowships for faculty to conduct digital humanities research in partnership with librarians, technologists, or other university staff have been a common way of organizing DH work. Although faculty fellowships have been mothballed at many established digital humanities centers (e.g. the Center for Digital Research in the Humanities (CDRH) at the University of Nebraska, MATRIX at Michigan State, ICHASS at Illinois), many programs, centers, and initiatives continue to offer these opportunities as their main form of faculty engagement. The 5 College Program, the Tri-County Digital Humanities Program, the Institute for Digital Arts and Humanities at Indiana, as well as the much more established Institute for Advanced Technology in the Humanities (IATH) at UVA and MITH at Maryland provide these as a key component of digital humanities work on their campuses.

Recognizing that the faculty fellowship is often a first form of digital humanities work when new programs and initiatives are established, we seek here to contextualize historically the evolution of faculty fellowships and to problematize what we believe is the unthinking reproduction of a system with significant flaws. At MITH, while faculty fellowship opportunities have ostensibly been open to faculty from the College of Arts and Humanities and librarians, in practice, recipients have been overwhelmingly tenured or tenure-track faculty from the College of Arts and Humanities. How did MITH arrive at the point of only supporting this very narrow slice of digital humanities work?

One cause of this limitation can be found at the root of the model MITH adopted and adapted from the Institute for Advanced Technology in the Humanities (IATH) at the University of Virginia. When MITH initiated its fellowship program in 1999/2000, IATH was considered a leader in digital humanities. Its faculty fellows program had produced some of the most innovative digital projects of the time (Valley of the Shadow, The Rossetti Archive, The William Blake Archive, etc.). Drawn from humanities faculty, IATH fellows were selected through a competitive application process with the winners receiving two parallel packages. From their home departments, they were granted half-time teaching release as well as ten hours a week of student assistance during the academic year. From IATH, they received an additional ten hours of graduate assistance per week, a $10,000 project budget (1992 dollars), office space and equipment, programming, information systems design, and consulting for the life of the project. Adjusted for inflation and using an averaged cost for graduate student labor, faculty fellows were thus granted a one-time investment of roughly $50,000 plus additional support in personnel costs for programming, information systems, and consulting. John Unsworth, former director of IATH, estimates that over the course of a single fellowship, faculty fellows received “hundreds of thousands of dollars of support.” More significantly, though, IATH fellows were part of a much broader system of digital humanities work that drew upon expertise at the E-Text Center, the UVA Libraries, and other campus-based resources. Library staff were embedded within the work of these fellowships both implicitly and explicitly as they provided an expertise difficult to find elsewhere.

An overwhelming number of the initial IATH fellowships were awarded to full professors (all nine in the first two years) with the first two projects (Valley and Dante) being used as demonstrations to IBM of the viability of humanities-driven research involving graduate students. And just as significantly, many of the fellowships arose from canonical questions within their disciplines: the causes and effects of the U.S. Civil War, the evolution of Walt Whitman’s work, the trajectory and complexity of William Blake’s work, etc. Such fellowships fixed novel, alternative and contingent digital activities to existing structures under the rubric of innovation or enhancement to existing scholarly activities.

Under such a model, faculty can be recruited as participants, supporters, and advocates and digital humanities initiatives and centers can demonstrate their effects on the local academic community—as service, if nothing else. For a new endeavor in the fundamentally conservative academy, the adoption of a pre-made structure of organizing labor, the “fellowship”, was politically astute and laid much of the groundwork for the wide success of the field today. Fellows are advocates for the digital humanities and examples of its proliferation into humanities disciplines. These attractions—cultivating “star” faculty, advancing easily recognized “canonical” topics, and developing a constituency—still might suggest that the best move for newcomers to digital humanities is to adopt an old pattern of organizing work.

Most initiatives today that offer these types of opportunities do so using grant funding (e.g. from the Andrew W. Mellon Foundation or other funders) or using internal funds designed to be matched through external grants. Departments often struggle to recognize this type of fellowship where the department bears the burden of funding a replacement instructor rather than a grant or external fellowship providing those buyouts. Remembering our own caution against anecdote, we want to direct attention here not to the particularity of these details but to the ways in which the practice of digital humanities as “fellowships” was/is imbricated in the structures of the larger institutions.

At a broader level, this pattern of faculty fellowships—the subsidizing of the “work” of being a humanist—tends to only recognize a narrow construction of what being a humanist means and what is encompassed in humanities research.

That construction is often deeply conservative both in the sense of recognizing only the same hierarchies of status and privilege and being quantified in terms set by the prevailing values of the increasingly corporate university.

This is pernicious in itself, but faculty fellowships ultimately also devalue other kinds of work: the work produced by the rest of the economy that is required to create new knowledge. It is paramount to recognize here the ways in which the value of “work” has changed: the merits by which a project is considered complete have continued to shift away from experimentation and towards production. As digital humanities evolved, so too did the scale of work that individual fellows attempted during their fellowship period. Rising expectations, both on the part of the fellow in terms of what could be completed within the year and in terms of what constitutes good or interesting digital humanities scholarship, create internal and external pressures that work against critical awareness of complicity within the neoliberal university.

Value of a “Deflationary” Approach

While the project of applying reflexive critique to labor has long been an avenue for humanistic analysis of a number of domains, in the context of digital humanities research such critique has been woefully underdeveloped. The digital humanities has committed itself to critique and exploration of existing structures in scholarly communication (e.g. new modes of publishing) as well as tenure and promotion yet has not extended this scepticism to more kinds of work practices. This seems a logical next step and one more relevant to consideration of participants not in tenure-track faculty roles.

In the version of his essay “Where is cultural criticism in the digital humanities?” published in the Debates in the Digital Humanities volume, Alan Liu focuses on the production and consumption of “the digital humanities” in terms of what he calls “the great post-industrial, neoliberal, corporate, and global flows of information-cum-capital.” This commitment to examining the digital humanities from a world-system view does much to explain, if not excuse, the caricature of digital humanities work that follows. We would escape this rhetorical cycle by taking Liu’s question extremely literally.

We argue that focusing precisely on the where of the digital humanities, that is, on sets of material practices employed at a specific site during the production of digital humanities work, re-opens stale questions to new analytical approaches and, much better, to new avenues of intervention. Indeed, to save the question of “Where is cultural criticism in the digital humanities?” from tendentiousness requires richer, more detailed accounts of the work of digital humanities in materialist terms to supplement Liu’s high-level cultural studies perspective. Such a “deflationary” approach helps us escape what Latour referred to as the “cognitive fallacy”: treating research as though it occurs chiefly or wholly in the mind rather than as a set of trajectories through material practices. We advance this somewhat Latourian perspective as only one way of improving the critical attention paid to labor in the digital humanities. There are of course additional lenses that could be usefully applied. For instance, we have not yet treated the “affective labor” that plays a vital role in DH work nor undertaken to analyze the ways in which “professionalization” deforms labor and labor advocacy.

The subjects of our examination need not be radically different. We should discuss many of the same things we’ve discussed in the past—funding, staffing, space, tools, etc.—not as anecdotes or best practices but as effects of larger structural dynamics that we should regard critically.

Drawing attention to labor, to work, and to the materialist realities of both, allows us to underline more and different segments of the trajectories of what constitutes digital humanities (borrowing again from our keynote speaker). We believe this is an additional avenue of critique that should be explored alongside analyses like Liu’s or Matthew Kirschenbaum’s that treat “the digital humanities” chiefly as symbols or discourses in an agonistic world system. This presentation is a first effort to call out the need for approaches that neither cultural studies nor digital humanities usually provide and should serve as a provocation to our wider community to explore where digital humanities work takes place, especially that which is undertaken by other workers within the larger academic system of labor.

Data Driven but How Do We Steer This Thing?

2014-06-22T00:00:00+00:00

Here’s the abstract I submitted for the talk:

Much of the discussion of digital humanities in libraries is directed to programmatic questions: who to hire for library-based digital humanities work, what skills might these people need, how best to house and equip new (or old) digital initiatives, what projects and partnerships to pursue. When discussions do turn to the mission or purpose of digital humanities in libraries, these debates often seem drained of the animated specificity devoted to administrative, programmatic questions. Redressing this imbalance in our professional attention as a library profession can strengthen our planning for, participation in, and leadership of digital humanities scholarship. This talk then is intended as one contribution toward the project of better articulating a theory that can shape and guide libraries’ digital humanities practice. By tracing librarianship’s historical self-understanding and identifying points of connection between library theory and some of the major ideas of humanistic scholarship, it is possible to show how and why digital humanities research should be part of the core work of libraries.

And here are the slides from the talk I actually delivered:

View on Speaker Deck

When a Woman Collects Menus

2014-04-16T00:00:00+00:00

Borrow a Cup of Sugar? Or Your Data Analysis Tools? — More work with NYPL's open data, Part Three

2014-01-10T00:00:00+00:00

This is the third part of a continuing series on curating data from the New York Public Library’s What’s On the Menu? project. The first two parts can be found here and here.

At the end of my last post about attempting to curate the open data from New York Public Library’s What’s On the Menu? project, I described how I was using a Python client library to communicate with the Open Refine server as a way around the tendency of Refine’s standard graphical user interface (GUI) to crash when attempting to facet and cluster the values for dish names in the nearly four hundred thousand row data set. Using this method, I’d identified about twenty-five thousand clusters of potentially duplicate name values that were candidates to be “cleaned up” as part of curation. I promised to report on how my hybrid client-program-and-GUI-interface was working for this task.

For me, the overall goal of this data curation work is to find ways of making the data from What’s on the Menu? more valuable for investigating historical questions about the cultural role of food.

Like the earlier posts in this series, what follows will be a mix of technical notes (what technologies, workflow steps, fateful decisions) and any accompanying conceptual insights about data curation that strike me as I explore. Working with the data from What’s on the Menu? gives me a chance to experiment with a bit of “hands on” data curation. These write-ups serve as a kind of “think aloud” protocol for that process. Some meandering is to be expected. I hope I have given enough signposts that readers uninterested in one component or another can skip ahead.

From Refine to … Pandas?

Reader, I abandoned Open Refine.

For problems of the right size, Refine can be used to get a lot of work done without the need to program or to know how to program. I recommend it enthusiastically and we introduce participants in our Digital Humanities Data Curation Institute workshops to Open Refine as a great data curation tool.

However, under my hybrid approach, I wasn’t gaining much by using Refine and most of what I would lose in moving away from Refine I could quickly and easily re-implement in native Python. Part of the power of Refine resides in precisely the fact that it offers a GUI for performing powerful and flexible data cleaning tasks. Since the size of my data set was overwhelming the GUI, I was writing code to do what I wanted anyway. Therefore my solution was not benefiting from the “easy-to-use”, no-programming-required quality of the tool.

(In fairness to Open Refine, a functional GUI within a web browser for operations on tabular data of this size—a few hundred thousand rows—appears to be an unsolved problem. GitHub, which now allows users to display and edit CSV files online, punts on the problem of presenting data sets this size, saying “We can’t show files that are this big right now.” Microsoft Excel shudders a bit working with the larger CSV files in the data set but remains functional so I wonder if something about “web stack” UI code is the problem—in Refine’s case there’s no network load to consider since the server is on the same local machine. But I’m no software engineer and I digress …)

For the problem of normalizing a few hundred thousand mostly-unique names of dishes transcribed from historic menus, continuing with Refine was costing me more than it was benefiting me. Since I was scripting the Refine server, I ended up doing a lot of packing and unpacking of data structures—from the JSON requests and responses the server produced to Python objects created by the client library to my own improvised data structures and back again—all to accomplish any data cleaning task. My workflow using the Refine client library was to: load the data into Refine and make some simple global changes (stripping extra whitespace, lowercasing, etc.) directly in the GUI, then, proceeding with the (command line) Python script, cluster the data in the “name” column, and manipulate the clusters of potential duplicates independent of Refine until it was time to formulate a kind of patch and normalize the clustered values I’d reviewed “out of band.” I could check my work in the GUI afterward.

It's possible to write your own client application that can communicate with the Open Refine server. But is doing so worth it?

In this scenario, Refine functioned as an in-memory data store and a way of executing one method (the fingerprint clusterer) as a remote procedure call. Perhaps this scripted workflow could have been finessed but it would always be a workaround. By sticking with Refine I was giving up not only the built-in GUI interface but also libraries and other tools in Python that were specifically intended for the kind of task I was doing—the worst of both worlds.

A little reading of the Refine documentation and some helpful comments in the source code for the clustering function revealed that it wouldn’t be too hard to reimplement the basic fingerprint clustering algorithm. Removing the need to depend on the Refine server for that one remote procedure call was the last impetus to switch tools. In switching tools, I was free to move from a general-purpose tool (Refine) to a tool that I could shape exactly to the needs of my particular task.
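To give a sense of how small that reimplementation can be, here is a minimal sketch of a fingerprint-style key in plain Python. It follows the general recipe Refine documents for its fingerprint method (trim, lowercase, fold accents, strip punctuation, then sort and de-duplicate the words), but the function names `fingerprint` and `cluster` are my own, the sample names are illustrative, and the details may differ from what Refine actually does.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a clustering key loosely modeled on Refine's fingerprint method."""
    # Trim, lowercase, and fold accented characters to their ASCII base.
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    # Drop ASCII punctuation, then split on whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    # Sort and de-duplicate tokens so word order and repeats do not matter.
    return " ".join(sorted(set(tokens)))


def cluster(values):
    """Group values sharing a fingerprint; keep groups with 2+ distinct strings."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


# Illustrative sample values, not rows taken from the real data set.
names = [
    "French Fried Potatoes",
    "Potatoes, French fried",
    "french fried potatoes ",
    "Glace Framboise",
]
clusters = cluster(names)
```

Values that land under the same key are only candidates for merging; a person still has to decide whether they really name the same dish.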

I decided to see whether the Pandas Python library could help. Pandas is a Python data analysis library that is increasingly popular for scientific computing. Furthermore, the library integrates nicely with the IPython notebook, which has become my go-to environment for coding tasks. This move allowed me to take advantage of data structures and associated tools designed for working with large amounts of data (much larger than my dataset).

Before I describe more of the technical details of this new approach, I want to reflect a little on one of the other functionalities I was giving up in abandoning Refine: a robust logging mechanism. Poorer logging capability may not seem like much but this technical decision suggests some conceptual distinctions.

The Relation of the Curation to the Data

It turns out that I didn’t need Open Refine to cluster the values for dish names and identify duplicates—but what I will miss is Refine’s logging capabilities, which underlie the “Undo/Redo” functionality. Unpacking the reasons why I’ll miss Open Refine’s logging capability made me think about the relationship between the “raw” data that NYPL publishes and any curated data I might produce. Specifically, I had to admit that what I’m working to produce will not be a cleaned or processed version of the data set but will be more like a supplementary information resource about the original dataset that helps organize and connect it to other resources (one hopes). This distinction is worth unpacking a bit more.

Perhaps in some cases, what happens to data as part of curation is a progression through increasingly refined and improved iterations but the experience of working with the menu data complicates this commonplace (?) assumption. In my head, I’ve started to describe what I’m doing with the NYPL data as creating an index (loosely akin to back-of-the-book indexes) to the original data set rather than a new version or edition of that data. If we’re talking about data curation in terms of producing versions of data we might draw on scholarship from bibliography, textual criticism, and book history to inform new practices. If we’re talking about data curation in terms of producing new information resources about data sets then we may draw more from information science scholarship on cataloguing, abstracting, and indexing (Cf. Ron Murray’s discussions of the “Graph Theoretical Library”). Of course, both kinds of curation may happen over the entire lifecycle of maintaining data’s “interest and usefulness to scholarship” (Cragin et al 2007). What changes is the relation between “the curation” and the data.

This potential nuance arose too late to address in my recent piece on “Data Curation as Publishing for the Digital Humanities” but it merits further elaboration in that context and I remain interested in what the “bibliographical imagination” [paywall] has to say that might inform data curation work. I don’t want to belabor a minor point. A comparative examination of data curation in different domains would probably show a healthy diversity in the kind of relationships that subtend between less-curated and more-curated data sets. A vision of successive, increasingly-perfected-and-enriched versions of data marching into the future is perhaps only a lazy mental shorthand for “data curation.”

The Log as a Representation

What does this have to do with logging? In my case, grappling with whether and how to keep a log of the actions I was taking on the dataset (something that Open Refine provides but my bespoke solution would not) became a spur to think too about provenance and the distinctions between curation and preservation.

In the midst of considering this question, I happened to read a blog post from an unrelated context—software engineering for real-time distributed systems (in this case, the professional networking site LinkedIn)—that suggested some of the conceptual stakes to the question of whether and how to log changes. That post, by one of LinkedIn’s principal engineers, is a dense and fairly technical paean to “logs,” which the author defines as “perhaps the simplest possible storage abstraction. [The log] is an append-only, totally-ordered sequence of records ordered by time.” One of the most salient aspects of logs for the purposes of data curation is nicely summarized in a shorter post by the Head of the Data Engineering team at Etsy:

The point of [the LinkedIn] post is that a log containing all of the changes to a data set from the beginning is a valid representation of that data set. This representation offers some benefits that other representations of the same data set do not, especially when it comes to replicating and transforming that dataset.
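A toy illustration may make the claim more concrete. The sketch below is my own and is not taken from either post: the record shape (`op`, `id`, `value`) and the `replay` function are hypothetical, and the identifiers and values are made up. It simply shows that replaying an append-only list of changes from the start reproduces the data set, and that replaying a prefix of the list reproduces an earlier state.

```python
def replay(log):
    """Rebuild the state of a data set by applying logged changes in order."""
    state = {}
    for record in log:
        if record["op"] == "set":
            state[record["id"]] = record["value"]
        elif record["op"] == "delete":
            state.pop(record["id"], None)
    return state


# Append-only: new records go on the end; earlier ones are never edited.
# Identifiers and values are invented for illustration.
log = [
    {"op": "set", "id": 1, "value": "French Fried Potatoes "},
    {"op": "set", "id": 2, "value": "Glace Framboise"},
    {"op": "set", "id": 1, "value": "french fried potatoes"},
]

current = replay(log)      # state after every change
earlier = replay(log[:2])  # state before the last change was made
```

Because nothing is overwritten, the uncorrected value of record 1 stays recoverable, which is the property that matters for the discussion of provenance.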

For a data curator, one benefit of a log as a representation of a dataset (which the LinkedIn engineer does not mention because it’s not in his use-case) is that such a representation is well-suited to documenting the “provenance” of data. The notion of provenance is a big deal in archival theory (certainly one of data curation’s progenitor fields) and is defined, according to the Society of American Archivists' Glossary, as “a fundamental principle of archives, referring to the individual, family, or organization that created or received the items in a collection” (and accordingly archivists' obligation to maintain evidence of this context). So “provenance” often refers to both the original context of records and the chain of evidence demonstrating the same. The emphasis on provenance relates to the original mission of archives. As Lorraine Daston pithily explains in a fascinating essay on historical consciousness in science [paywall]: “Early modern archives were bastions of authenticity, places of proofs and pedigrees” (171). The weightiness of these social functions of early archives still trails along behind the concept of provenance like an ermine hem. Many digital preservation systems and standards, attempting to live up to this responsibility, perform some kind of logging or provide mechanisms for recording all the actions that affect data. I think it’s important to note that this kind of tracking and logging does not fully encompass both senses of archival provenance above—perhaps why the PREMIS preservation metadata standard refers more modestly to “digital provenance.” Nonetheless, giving great weight to provenance or even digital provenance might encourage us to think primarily about curation in terms of delineating and preserving versions of data.

In trying to square the notion of curation work that does not produce a (better) version of the NYPL data set but rather an information resource about the original, it is helpful to reassert that archival concerns are but one (surely vital) subset of curation. Librarians and other cultural heritage professionals, in addition to performing vital preservation functions, have “curated” in other ways for a couple hundred years—normalizing and increasing the value of information by building catalogs and indexes. Curation need not always be logged.

To put this question back into practical terms, I wondered how, without a log or some similar mechanism to track changes, I would best be able to submit the results of my work back to NYPL. And if I’m not even any longer working toward a cleaned up version of NYPL’s data set (without yet giving up hope of being useful)—what then?

The question cuts both ways. What value would an information resource about a messy data set hold and what value will the original transcriptions have if there is some more “correct” data somewhere else? In the same essay mentioned above, Daston describes how research on the ways that early modern “scientists” made use of archives suggests that one way that these scholars regarded archives and libraries was as “provisions laid up for future inquirers” (surely resonant with our current understanding of data curation and its aims). This sense was particularly acute with regard to observational data (as it still is). Daston tells the story of how French astronomer Jean-Dominique Cassini IV (great-grandson of the more-famous Italian astronomer for whom the unmanned spacecraft conducting the most in-depth study of Saturn is partly named) recommended that the Académie Royale des Sciences keep a few sheets of astronomical and meteorological data produced by an earlier scholar even though they were “in rather bad order” (172). One wonders how the archivist (?) on the other side of this appraisal conversation felt about Cassini’s recommendation. “Even rough, badly made observations were considered valuable enough to be preserved for posterity,” after all, Daston writes, “who knew when an apparently trivial or even sloppy observation might turn out to be invaluable?” For Daston, this is suggestive of how “early modern empirical inquiry was an archival science” but, more generally, I think it says something about the relationships between original (or at least earlier) data and curated data.

There are messes in the data from What’s on the Menu? and there are insights to be gained from cleaning them up—I’ll show some of these below—but there is also value in curatorial actions that expand the field of view beyond the single node of an original data set to a network or graph of representations, many of which may be new resources about the original data rather than versions of it. This second type of data curation is one that librarians may, in fact, be better equipped to accomplish than the original data curators or domain experts. Thus, in working with the NYPL data it became less urgent to “track changes” and thus the log as a representation of this particular data set diminished in importance for my use case.

Dataframes

That’s the theory but at the end of my last post it was not clear whether I had the (right) tools to do much curation in practice when working with a data set that, while not “big”, was no longer of a trivial size. In Pandas, the Python data analysis library, I think I have found some excellent tools for the job. I want to show a little of how these data analysis tools are useful for data curation.

Pandas works better as a computational tool for my use case than Open Refine because it is based on a more powerful abstraction for tabular data. Specifically, Pandas is built on a foundation of n-dimensional arrays, which some of us may have encountered before in high school math class. (There is however no math required to follow this discussion.) One advantage of using arrays as the underlying abstraction is that this makes it possible to perform operations on whole blocks of data very efficiently. Pandas builds on the array from Python’s numerical (read: statistical) computing library to (as it says on the tin) “provide sophisticated indexing functionality to make it easy to reshape, slice and dice, perform aggregations, and select subsets of data.” The main mechanism for accessing these additional capabilities in Pandas is via the DataFrame object (similar to the R language’s data.frame). For more detail, I recommend the book, Python for Data Analysis by Wes McKinney (source of the previous quote).

The "golden-tailed tree shrew book" from O'Reilly Media. Maybe not as catchy as "the polar bear book"?

By using a tool designed for high-performance analysis of large data sets, I gained speed and flexibility in manipulating the data from What’s On the Menu? even though the data in my case was not primarily numerical. At the most basic level, I could increase the amount of data I had loaded and available for work. When I was using Refine, each comma-separated value (CSV) file became a separate “project” and I was limited to working with one at a time. Also, the DataFrame is computationally efficient so the larger files in the NYPL data set no longer posed problems. The CSV file that maps dishes to the positions where they appear in the images of the scanned menus is a hefty file (113.6 MB in the latest download) and Refine is basically unable to load this file into memory. Furthermore, in Pandas I could recreate more of the linkages between related data in the various files. The CSV files in the NYPL data set are created from relational database tables and they contain keys that are intended to allow them to be linked together. Mimicking a relational database is obviously not what Refine was designed for but Pandas does provide SQL-like functionality that allowed me to take advantage of these linkages. Rather than being limited to one project representing one CSV file, I could quickly and easily load two or more CSV files into Pandas DataFrames and work with both or several at the same time.
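As a rough sketch of what working with several files at once looks like, the snippet below joins two small DataFrames on a key column. The tiny tables are stand-ins I made up so the example runs on its own; with the real download one would load each file with `pd.read_csv` instead. The column names (`id`, `name`, `dish_id`, `menu_page_id`) are assumptions about the CSV headers and should be checked against the actual files.

```python
import pandas as pd

# Toy stand-ins for two of the CSV files. Column names are assumptions.
dishes = pd.DataFrame({
    "id": [1, 2],
    "name": ["french fried potatoes", "glace framboise"],
})
menu_items = pd.DataFrame({
    "id": [10, 11, 12],
    "dish_id": [1, 1, 2],
    "menu_page_id": [100, 101, 100],
})

# SQL-like inner join: attach the dish name to every menu item via the key.
joined = menu_items.merge(
    dishes,
    left_on="dish_id",
    right_on="id",
    suffixes=("_item", "_dish"),
)
```

The `suffixes` argument keeps the two `id` columns apart after the join, as `id_item` and `id_dish`.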

Many operations which I formerly had to perform myself are simply built in to Pandas. A good example relates to doing basic inspection of the data set for quality issues. Under my old method, to see how many unique values there were in the “name” column of the data about dishes, I had to either call Refine’s faceting function (and wait) or loop through the various rows of a CSVReader object pushing each value of name into a dictionary as keys (which must be unique). After lowercasing or stripping whitespace, I’d do it again then see if the number of keys decreased as duplicates were found. In Pandas, I just call the “unique” function on the name “column” and get a count. As I learned to take advantage of the functionality Pandas provides, I began to find numerous examples like this.
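Here is roughly what that check looks like, as a minimal sketch on a handful of made-up values rather than the real file. In current versions of Pandas, `unique()` returns the distinct values themselves and `nunique()` returns the count directly.

```python
import pandas as pd

# Made-up sample values standing in for the "name" column.
dishes = pd.DataFrame({
    "name": [
        "Cold Roast Beef",
        "cold roast beef ",
        "cold  roast beef",
        "Glace Framboise",
    ]
})

raw_count = dishes["name"].nunique()

# Vectorized string methods: trim, collapse internal whitespace, lowercase.
cleaned = (
    dishes["name"]
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.lower()
)
clean_count = cleaned.nunique()
```

Comparing `raw_count` with `clean_count` shows how many apparent differences were only whitespace or capitalization.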

I did have to learn to think about how to accomplish my goals in a more functional rather than procedural style. Writing code that’s intended to loop through a bunch of rows (as though one were manipulating a CSV) basically vitiates the advantages of having data in a DataFrame. The “Pandas way” is to perform operations on all of the data in an array at one go. So, with a little mind-bending, I learned to think in terms of map-and-reduce paradigms for applying functions to data and to understand the power of “GroupBy.”
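A small sketch of that shift in style, using invented rows and a deliberately simplified key function (`simple_key` is hypothetical and much cruder than a real fingerprint): the key is applied to the whole column in one step, and GroupBy then does the counting that a hand-written loop and dictionary would otherwise do.

```python
import pandas as pd

# Invented rows for illustration.
dishes = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": [
        "French fried potatoes",
        "Potatoes, French Fried",
        "Glace Framboise",
        "french fried potatoes",
    ],
})


def simple_key(name):
    """Hypothetical simplified key: lowercase, drop commas, sort the words."""
    return " ".join(sorted(name.lower().replace(",", "").split()))


# "Map": apply the key function to every value in the column.
dishes["key"] = dishes["name"].map(simple_key)

# "Reduce": GroupBy collects rows that share a key; size() counts each group.
cluster_sizes = dishes.groupby("key").size()

# Keep only rows whose key is shared with at least one other row.
duplicates = dishes[dishes.groupby("key")["id"].transform("size") > 1]
```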

For those interested, I put together an IPython notebook demonstrating some uses of Pandas for clustering and finding duplicates in the dishes transcribed by the NYPL’s online collaborators.

At this point, the reader could be forgiven for asking (maybe not for the first time): so what? Pandas DataFrames are nifty and powerful but what’s the payoff? To wrap up this post, I want to show some small results and suggest how to generate more and larger ones.

How many French Fries?

All the clues that there is room for improvement in the What’s on the Menu? data can be found on the page for any dish, under the heading of “Related dishes”:

The "Related dishes" in the sidebar on the right look very closely related indeed.

These are not so much related dishes but un-normalized duplicates of what appears to be a single dish type. The duplication in the data set undermines the ability of interested users to view and browse meaningful patterns of co-occurrence among dishes. By extension, the ability to trust the summary information about first and last date of appearance, highest and lowest price, and total frequency of appearance over time is all thrown into doubt. In the accompanying IPython notebook, I used the case of “French-fried potatoes” to demonstrate how cleaning up duplicates allows us to revise this potentially analytically significant information. Instead of appearing on 1,321 menus, when all the variants are counted together, “French fried potatoes” appears on 2,140 menus. In comparison with the data on the page for Dish 1259 (above), the date range for the aggregate “French fried potatoes” appearance is very similar: 1884 to 1989 (instead of 1987). But, for other variants in the long tail of “errors,” e.g., Dish 14987, the changes are more dramatic—23 appearances to over two thousand, a date range that expands from 1896 to 1969 to the full 105 years above. For the rest of the twenty-five thousand sets of duplicates identified by the clustering algorithm it’s possible to go through and make similar adjustments. The analytical payoff of course is not in these individual changes but in the overall changes in food trends that appear.
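The kind of adjustment described here can be sketched as a single aggregation over clusters. Everything in the snippet is illustrative: the numbers are invented rather than taken from the NYPL data, the column names are assumptions, and the `cluster` column stands for whatever key the clustering step produced. Summing the per-dish menu counts also assumes that no single menu lists two variants of the same dish, which would need checking against the real data.

```python
import pandas as pd

# Invented numbers for illustration only; not the real NYPL figures.
dishes = pd.DataFrame({
    "id": [1, 2, 3],
    "cluster": ["french fried potatoes", "french fried potatoes", "glace framboise"],
    "menus_appeared": [100, 5, 7],
    "first_appeared": [1890, 1896, 1900],
    "last_appeared": [1980, 1969, 1950],
})

# One row per cluster: total appearances, earliest and latest year.
summary = dishes.groupby("cluster").agg(
    menus_appeared=("menus_appeared", "sum"),
    first_appeared=("first_appeared", "min"),
    last_appeared=("last_appeared", "max"),
)
```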

To realize these gains it is sufficient to be able to point at slices of the original data—the changes do not necessarily have to make it back into NYPL’s systems. The ability to perform a kind of non-custodial curation is a powerful benefit of the NYPL’s design decision to assign URLs to every part of the data. Thus, it is possible to express the curatorial work on the data set as linked data embedded in simple HTML. In terms from earlier in this post, the curatorial product can be a new information resource that points (using linked data standards) into the original NYPL data. Other agents on the web can then come along and take not only the original NYPL data but also, if they wish, the “corrections” we have made to that data without any need for pre-coordination. I’ll give one quick example of one form this might take. For the duplicates in the case of “French fried potatoes” I could publish a bunch of statements of the following form (here using schema.org microdata for clarity):

<div itemscope itemtype="http://schema.org/Thing">
    <a itemprop="url" href="http://menus.nypl.org/dishes/1259">French Fried Potatoes</a>
    <a itemprop="sameAs" href="http://en.wikipedia.org/wiki/index.html?curid=10885">French fries</a>
</div>

In this case, rather than asserting that every URI in the “French fried potatoes” cluster is the same as each of the others (very verbose), I am asserting that all of these dishes are the same as the dish identified by the Wikipedia URI. (The nuances of using “sameAs” for complicated historical information are fodder for another post.) The other benefit realized here is that it’s not strictly necessary to fuss over which string is the “best” representation of the French fried potatoes dish (so as to be able to change all the other values of name to that particular string). As a first pass, I can simply select the most common name value in a cluster and use that as the human-readable name for convenience. The ability to reliably reference sets of duplicates by their URIs is unaffected.
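Generating such statements for a whole cluster is mechanical. The sketch below is one way it might be done, under stated assumptions: the cluster is a made-up list of (identifier, name) pairs, the `microdata` helper is hypothetical, and the label is chosen by the "most common name" rule just described. The dish URL pattern and the Wikipedia URI are the ones used in the example above.

```python
from collections import Counter
from html import escape


def microdata(dish_id, label, same_as, same_as_label):
    """Render one schema.org microdata statement for a dish (hypothetical helper)."""
    return (
        '<div itemscope itemtype="http://schema.org/Thing">\n'
        f'    <a itemprop="url" href="http://menus.nypl.org/dishes/{dish_id}">'
        f"{escape(label)}</a>\n"
        f'    <a itemprop="sameAs" href="{escape(same_as)}">{escape(same_as_label)}</a>\n'
        "</div>"
    )


# Made-up cluster: identifiers and names are for illustration only.
cluster = [
    (1, "French fried potatoes"),
    (2, "French fried potatoes"),
    (3, "Potatoes, French fried"),
]

# First pass at a human-readable label: the most common name in the cluster.
label = Counter(name for _, name in cluster).most_common(1)[0][0]

target = "http://en.wikipedia.org/wiki/index.html?curid=10885"
statements = [microdata(dish_id, label, target, "French fries") for dish_id, _ in cluster]
```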

Is There More to Do?

In a future post, I want to talk more about the idea of non-custodial curation and how the availability of the NYPL data on the web makes new experiments in this direction possible.

New Publication in the Journal of Digital Humanities 2.3

2013-12-17T00:00:00+00:00

Refining the Problem — More work with NYPL's open data, Part Two

2013-08-19T00:00:00+00:00

In my last post, I described a speculative approach to the data generated by the New York Public Library’s What’s on the menu? project with the aim of identifying some data curation activities that might improve the usefulness of such an open data resource. After briefly assessing what I was seeing in the downloadable data set, I decided to work towards normalizing the names of the various “dishes” in the system. With this post, I want to talk about how I’m accomplishing this using Open Refine. What I’m describing here is not the construction of a tool or even a particularly robust workflow but hopefully a deepening of the exploration.

Strings are not what they appear

The comma-separated value (CSV) file on “dishes” in the downloadable data set contains around 395,000 rows. Each row contains information about one “dish” that users of What’s on the menu? have transcribed from the digitized historical menus on the library’s site. For my immediate purposes here, what’s most significant is that the data appears to contain an identifier (e.g., 470291), from which an HTTP URI for the item can be straightforwardly derived (http://menus.nypl.org/dishes/470291), as well as a string purportedly representing the name of the dish.
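Deriving the URI really is a one-liner. In the sketch below, the sample row is invented (only the identifier and the URL pattern come from the text above), and `dish_uri` is my own name for the helper.

```python
import csv
import io

BASE = "http://menus.nypl.org/dishes/"


def dish_uri(dish_id):
    """Derive the HTTP URI for a dish from its numeric identifier."""
    return f"{BASE}{int(dish_id)}"


# Invented one-row sample standing in for the real CSV file.
sample = io.StringIO("id,name\n470291,example dish\n")
uris = [dish_uri(row["id"]) for row in csv.DictReader(sample)]
```

Passing the identifier through `int` means a stray non-numeric value raises an error instead of silently producing a bad URI.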

I say “purportedly” because the only evidence we have that the data in the second column is a name is the label of that column. (We’re also assuming that one row in the CSV file maps to one entity.) Assuming data to be exactly what it purports to be is a shaky proposition. The values in the “id” column check out as identifiers when we test them by constructing HTTP URIs and requesting the designated content. We can check the values in the “name” column by inspecting a few of them to see whether they conform to a colloquial understanding of names. We can also examine the transcription interface where these values were collected/created to try to assess whether anything there clearly maps to a concept of names. First, in addition to values like “Glace Framboise,” which seem relatively straightforward, we also have as values of “name,” strings like “ham steak, glazed pineapple rings, sweet potatoes, timbale of spinach a la financière.” The concept of “dish” is accommodatingly loose but there is some ambiguity whether this is one dish or two or three or four. Second, we can see that the concept of “name” does not appear in the transcription interface. Volunteers are shown a section of the digitized image and given a text field with the instruction: “Write the dish here exactly as it appears. Don’t worry about accents.”

Screenshot of the transcription interface from What's on the menu?

The instruction to “write the dish” sidesteps tricky decision-making that might lower participation—volunteers only need to reproduce the string of characters in the image exactly as it appears (this is not as simple as it might seem). The designers of What’s on the menu? take the results of this human computation and store them in a field labeled “name.” This undocumented conceptual leap from “transcribing” to “naming” explains some of the quality issues with the data. Based on these two checks, we can’t assume that the values in the “name” column are all, in fact, names of dishes.

Of course, the New York Public Library (NYPL) Labs’ team is well aware that transcription interfaces need to be designed to scaffold complex tasks like the creation of high quality structured data. What’s on the menu? was the first of a string of successful projects. A more recent project using historical theater programs shows how NYPL Labs have designed a different interface to support volunteers in creating data about entities via transcription of images. The transcription interface of Ensemble asks volunteers to first decide “What type of info is this?” then displays relevant fields like “name” (explicitly labeled) based on the answer. In this later project, the data collection interface corresponds much more closely to the entity model represented in the underlying database fields. The greater sophistication of the transcription platform in Ensemble helps to bridge the gap between transcribing from images and creating data about entities. NYPL has improved the algorithm at work in their human computation.

Screenshot of the transcription interface from Ensemble

Understanding how the data was created helps the prospective curator to realize that the data set from What’s on the menu? comprises not structured information about entities like “dishes” but rather partially-processed observations (transcriptions) of regions of high-resolution imagery. (Indeed the largest data file in the downloadable set is a mapping between “dish” identifiers and regions of images of particular menu pages identified by x- and y-coordinates.) Taking the labels of fields and columns in “Dish.csv” as authoritative would assume a transformation of the data from one state to another (observation of image to property of entity) that has not actually occurred (or at least, has not occurred purposefully or uniformly throughout the data set). Getting to an authoritative index of “dish” entities will require further curation.

Space to improve

Knowing that we’re dealing with “observational data” in the form of transcriptions should incline us to treat the values of “name” skeptically. Also this knowledge does much to bolster the original assumption that the size of the curation challenge is not 1.2 million dishes to normalize as the What’s on the menu? site proclaims, or even 395,000-odd dishes as the number of rows in the CSV file would suggest, but some smaller proportion of even that smaller number.

Since the first step towards an index of dishes will involve cleaning up the variant strings found in the “name” column of “Dish.csv,” I turned to Open Refine, which proclaims itself “a free, open source, power tool for working with messy data,” as the most obvious fit for the job. A common workflow for using Open Refine (hereafter just “Refine”) to clean up messy data involves loading in data, optionally doing some global transformations, then using the tool’s powerful clustering functionality to group very similar values that may in fact be duplicates even if purely algorithmic processes can’t definitely identify them as such. To start, I’ll follow that general pattern.

Normalizing values using transformations will probably turn up some additional matches even prior to trying the clustering methods. These are basic things like trimming leading and trailing spaces and collapsing extraneous internal spaces (between words) but even small whitespace variations prevent the computer from making exact matches between strings. Transformations are available in the dropdown menus in the header for each column.

Accessing the common transformations from the Open Refine menus.

I also transformed all values to lowercase to eliminate variations due to irregular or inconsistent capitalization. After just these procedures, the percentage of duplicate values for dish “name” rises from 0.009% (hardly worth expending effort to fix) to a little over 6% (perhaps small in itself, but the leap suggests that we’re a long way from the ceiling for improvement). In specific terms, this means that there are actually thirteen different identifiers assigned to the dish “cold roast beef” based only on small variations in white space and capitalization not any genuine ontological difference.
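For readers who would rather see these steps as code than as menu clicks, here is a rough plain-Python equivalent on a few made-up values. The way the duplicate percentage is calculated here (rows that repeat an earlier value, as a share of all rows) is my assumption about a reasonable measure and may not match how the figures above were computed.

```python
import re


def normalize(value):
    """Trim, collapse runs of internal whitespace, and lowercase."""
    return re.sub(r"\s+", " ", value.strip()).lower()


def duplicate_share(values):
    """Percentage of rows whose value repeats that of an earlier row."""
    return 100 * (len(values) - len(set(values))) / len(values)


# Made-up sample values.
names = ["Cold Roast Beef", "cold roast beef", " cold  roast beef", "Glace Framboise"]

before = duplicate_share(names)
after = duplicate_share([normalize(name) for name in names])
```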

Powering the power tool

At 395,000 rows, the What’s on the menu? data set is substantial but can still be opened in a common application like Microsoft Excel. However, a data set of this size, with so many potentially-unique values does pose challenges for Refine when trying to use more powerful functions like clustering. To get the benefit of Refine’s capabilities for cleaning up messy data, I needed to be able to interact with the application programmatically rather than only through the standard graphical user interface (GUI).

The first step in clustering involves generating what Refine calls “facets”—a list of the unique values for “name” found in the data set. Right away we run up against apparent limitations of the tool. After churning away for a couple of seconds, Refine reports that there are “370004 choices total, too many to display” and offers a link where we could “Set choice count limit.” This is frustrating behavior—we can’t see even a partial list of facets and it’s not immediately obvious how to make forward progress from this point. If we click on the link to set a higher choice limit count—up to 371,000 from the default 2,000, to accommodate the variation in the menus data set—Refine will obligingly attempt to calculate and display this many facets. Most likely, the browser page will become unresponsive and raise an error. In modern browsers, processes in different tabs are isolated from each other so only the Refine tab should crash but, in older browsers, the whole application may crash at this point: caveat emptor.

The sidebar where we hoped to see a list of facets in the menus data.

With only a few other browser tabs open—rather than the usual tens—this process actually succeeds on my machine after a minute or so. Yet even if the faceting process succeeds, the result slows the interface to a sluggish crawl and clustering still crashes the whole process. In a section titled, “Miscellany” under the help documentation for the faceting function, the maintainers indicate that you can raise the choice count limit but only “if you think your computer can handle it.” More interestingly, the help documentation goes on to say, that “whether ‘your computer’ can handle it or not depends mostly on your web browser.” I’m using Google Chrome on a fast machine so clearly, for the menus data, raising the choice limit count so far above the default is an unworkable hack. Nonetheless, this particular failure of the “power tool,” with the blame falling on the web browser, contains the seed of a “better” workaround.

This better workaround is predicated on understanding how the Refine application works. Refine uses a client/server application architecture but both the client (the web browser window as GUI) and the server live on the user’s computer. The server component is written in Java, while the front-end interface is written in JavaScript. What’s failing when we try to cluster the values from the “name” column in the menus data is not the backend server application that is computing the facets but rather the JavaScript application that manages displaying and updating the information. So, I reasoned I might be able to work around the frustrations of the current GUI, if I could just interact programmatically with the server component. There are links to old discussion forum posts as well as a number of actual client libraries in various programming languages for interacting with the Refine server in the official documentation. Often the motivation for “scripting” an application like Refine is to enable it to be used in “batch” mode—that is, without human intervention. That’s not my motivation here. As the documentation points out, clustering, which relies on human value judgements for merging very similar data values, can’t be done in this kind of batch mode.

Since I was going to have to go to the trouble of this workaround for the standard Refine interface, I did ask myself whether I still needed Refine. The string manipulation above could have just as easily been accomplished using the basic libraries of almost any programming language. A little digging around in the source code and some strategic googling revealed enough to quickly re-implement the main clustering method. So, I could have written more code and ended up with the same functionality I’ve gotten out of Refine so far. I suspect that down this path lies more and more re-inventing of the wheel. For this little speculative project, I am content to have both the standard browser-based interface as well as a programmatic interface open at the same time.
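To give a sense of what such a re-implementation might involve, here is a minimal sketch of key-collision clustering on a normalized "fingerprint" of each value, on the assumption that this is roughly what Refine's default clustering method does. The function names are hypothetical, the normalization steps are an approximation rather than Refine's actual code, and the sample names (apart from "Brussel Sprouts") are invented for illustration.

```python
import re
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Hypothetical normalizing key (an approximation, not Refine's own code)."""
    # Trim surrounding whitespace and lowercase
    text = value.strip().lower()
    # Decompose accented characters and drop anything outside ASCII
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Remove punctuation characters
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    # Split on whitespace, de-duplicate, sort, and rejoin the tokens
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values whose fingerprints collide; keep groups with more than one member."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Invented sample values for illustration
names = ["Brussel Sprouts", "brussel sprouts", "Sprouts, Brussel", "Celery"]
clusters = cluster(names)
print(clusters)
```

Each cluster is only a candidate for merging; deciding which form (if any) should become the preferred one is still the human judgement that clustering depends on.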

Clusters on command

I’ll discuss what this process says about the potential for curating the data from What’s on the menu? but I want to offer a few more details about working with Refine. This section may drift into technical detail but the client libraries that can be used to programmatically drive the Refine server are very poorly documented.

Based on what I read in the discussion forums and on a “sniff test” of the few source code repositories, I settled on using one of the more recently updated forks of the Python Refine client library. These client libraries do not work with official APIs. As the creator of this Python library explains, these libraries reverse engineer the Refine application by snooping the traffic between the browser-based client and the local server using tools like HTTP Scoop. As I mentioned, documentation is very scant, and, at least in the case of the Python client, the code is not very idiomatic, which makes it a little more challenging to decipher. However, it works just fine for the purpose of patching around the problem of overloading the GUI with too many facets.

The Python client library assumes that Open Refine is installed and the server component is running at the default address (a different value can be passed in at initialization if necessary). Then, a couple of lines are sufficient to set up a connection to the Refine server.

from google.refine import refine, facet

server = refine.RefineServer()
grefine = refine.Refine(server)

Most of the ‘commands’ return the raw JSON output that the server sends back. So, for instance, once I’m set up I can list the projects in my copy of Refine (in this case just one, comprising the August 1st version of the data set from What’s on the menu?):

{u'2310205155087': {u'created': u'2013-08-16T20:45:49Z',
  u'customMetadata': {},
  u'modified': u'2013-08-16T20:56:56Z',
  u'name': u'2013_08_01_07_05_00_data'}}

Figuring out how to do what I wanted (faceting and clustering on the values of “name”) took some experimentation—particularly in figuring out how to pick my way through the objects returned by some of the functions. For instance, to inspect the facets of the data, I had to get access to a dictionary called ‘choices’ that is part of the response to the ‘compute_facets’ function:

facet_response = nypl_dishes.compute_facets(name_facet)
facets = facet_response.facets[0]

for k in sorted(facets.choices, key=lambda k: facets.choices[k].count, reverse=True)[:25]:
    print facets.choices[k].count, k

This produces a list of unique values and their associated raw counts:

13 potatoes hashed in cream
13 cold roast beef
11 club sandwich
10 lobster salad
10 hot roast beef sandwich
10 american cheese
10 clams: little necks
9 celery
9 american cheese sandwich
9 strawberry ice cream
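A tally like this is, in essence, a count of identical values, so it can be approximated without Refine. The sketch below assumes the dish names have already been loaded into a plain Python list (the sample values here are invented) and uses `collections.Counter` from the standard library in place of the client library's facet objects.

```python
from collections import Counter

# Invented sample; in practice this would be the values of the "name" column
names = [
    "celery", "club sandwich", "celery",
    "lobster salad", "celery", "club sandwich",
]

# Count identical values and keep the most frequent ones
counts = Counter(names)
top = counts.most_common(2)

for name, count in top:
    print(count, name)
```

This reproduces the counting but none of the interactive merging, which is the part of Refine that is harder to replace.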

I created a quick IPython notebook to demonstrate the details of how I’m using the client library to drive Refine.

Return on investment

The main thrust of this post has been demonstrating a “how to” for working around some of the limitations of Open Refine for cleaning and reconciling data like that from What’s on the menu?, where the size of the data set is in the hundreds of thousands of rows and where there is enough variation that the standard browser GUI cannot handle the load. The larger question is whether there is still a plausible vision for how a data curator could add value to this data set. The need to script around limitations of a tool increases the cost of normalizing the NYPL data. At the same time, the ability to see the clusters of similar values that Refine produces increases my confidence that the potential gain in data quality could be very substantial in going from the raw crowdsourced data to an authoritative index. Using just the default clustering method (usually the most effective) produced 25 thousand clusters that need to be evaluated and reconciled. In future posts, I’ll report on how well the combination of the standard graphical interface and programmatic control is working to help me improve the NYPL’s data.

What IS on the menu? More work with NYPL's open data, Part One

2013-08-08T00:00:00+00:00

Since I started teaching short courses on humanities data curation on a semi-regular basis (first as part of MITH’s digital humanities training institute and then as part of the Digital Humanities Data Curation institute), I’ve been looking around for suitable hands-on exercises to help people “get a feel for” different aspects of the work involved in curating data in a humanities context. Maintaining the usefulness of data to researchers can involve planning, describing, building collections, and even tasks that shade into digital preservation like migrating data to new media. Curation can also involve “cleaning,” normalizing, reconciling (what we might call “munging” data), probably (hopefully?) for the purpose of creating better search, retrieval, or indexing. The open data generated by the New York Public Library’s What’s on the Menu? project has been a great testbed for experimenting with these latter kinds of practical curation work.

Lydia Zvyagintseva, a Master’s student from the University of Alberta visiting MITH for a practicum, did great work exploring possibilities for how additional, curator-generated facets for browsing data about events and locations could add research value. In the most recent Digital Humanities Data Curation workshop (which I am fortunate to co-teach with Dorothea Salo and Julia Flanders), we worked through exercises on significant properties and potential user needs related to the menus data. In this post, I’ll describe some more exploratory work I’ve done since the most recent workshop.

Beyond Access and Preservation

What’s on the menu? is a useful curation testbed because the New York Public Library (NYPL) has already done a generally excellent job providing access to the data. The menu transcription project has been running since 2011, and it has been a wild success. Volunteers have used the site to transcribe almost 17,000 digitized historic menus (as of the time of this post). NYPL makes all the menu data available for bulk download and also provides an application programming interface (API).

Front page of What's on the Menu? (as of August 2013).

Beyond this impressionistic sense of “good access” to the data, we can evaluate the NYPL’s arrangements for access according to a commonly-used quality measure like Tim Berners-Lee’s 5 Star Linked Open Data scale. The What’s on the menu? data set scores fairly well by this measure, somewhere between 3 and 4 stars on the 5-star scale. The data is available on the web in a machine-readable, structured, non-proprietary format (stars 1-3). The criteria for the first star also specify that data should be distributed with an open license and here we could quibble a little with the existing provisions for access. The “Data” page at the What’s on the menu? site states that there are “No known copyright restrictions on this material” but asks those who use the data to “credit The New York Public Library as source on any applications or publications.” The phrase “no known copyright restrictions” echoes the language of the Public Domain Mark suggested by Creative Commons but it’s not entirely clear that NYPL’s intent with this data is wholly the same as that underlying the Public Domain Mark. Perhaps formally using the Public Domain Mark would help clarify that this is truly open data? (I offer this suggestion tentatively because I know that NYPL has some very good copyright advisors and so on the whole, I think we can give What’s on the menu at least 3 open data stars.) NYPL also uses HTTP URIs as identifiers for things, which is one of the criteria for 4-star linked open data, but the data is returned via the API in either JSON or (a custom) XML rather than using the most-W3C-blessed standards (RDF and SPARQL). For example, I can get back data about a particular dish by sending a request to the URI that identifies it (e.g., http://api.menus.nypl.org/dishes/1860):

{
    "description": null, 
    "first_appeared": 1887, 
    "highest_price": "$0.85", 
    "id": 1860, 
    "last_appeared": 1989, 
    "links": [
        {
            "href": "http://menus.nypl.org/api/dishes", 
            "rel": "index"
        }, 
        {
            "href": "http://menus.nypl.org/api/dishes/1860/menus", 
            "rel": "menus"
        }
    ], 
    "lowest_price": null, 
    "menus_appeared": 98, 
    "name": "Brussel Sprouts", 
    "times_appeared": 98
}
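Because the response is plain JSON, it can be picked apart with standard tools. The sketch below parses an abbreviated copy of the record above, pasted in as a string rather than fetched (a live request may need credentials), and looks up one of its links by "rel" value. The helper name `link_for` is hypothetical.

```python
import json

# Abbreviated copy of the dish record shown above
record_text = """
{
    "id": 1860,
    "name": "Brussel Sprouts",
    "first_appeared": 1887,
    "last_appeared": 1989,
    "links": [
        {"href": "http://menus.nypl.org/api/dishes", "rel": "index"},
        {"href": "http://menus.nypl.org/api/dishes/1860/menus", "rel": "menus"}
    ]
}
"""


def link_for(record, rel):
    """Return the href of the first link with the given rel, or None if absent."""
    for link in record["links"]:
        if link["rel"] == rel:
            return link["href"]
    return None


dish = json.loads(record_text)
menus_url = link_for(dish, "menus")
print(dish["name"], menus_url)
```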

The highest 5-star rating would apply to data that includes links to other datasets. One objective of further curation work might be to discover and contribute links between the NYPL data and other open data sets to create something like the concordances that the Cooper Hewitt Labs have built for entities in their collections.

In many data curation scenarios the most urgent tasks involve moving data from the original site of creation to a stable environment (like a repository) where it can be preserved, but also where it can be reliably accessed. Neither of these problems is at issue with the menus data, so the potential curator can consider what other activities might improve the usefulness of the data.

What’s a Data Curator To Do?

Looking just beyond the (valid, important) tasks of preservation and basic access that are currently occupying many academic libraries entering the realm of data curation, interesting additional possibilities emerge for constructing what the work of data curation can be. I’m particularly interested right now in work that data curators can do to build secondary and tertiary resources—reference materials, if you will—around data. I mean particularly reference materials that draw on the skills of people with training in library and information science, things like indexes. These types of organized systems of description can be one way to provide additional value over full text search (which, for many kinds of data sets, e.g., a table of numerical readings, is not particularly effective anyway).

How might this apply to the data from NYPL’s menu transcription site? For this exploratory data curation exercise, I’m setting myself the goal of seeing what can be done with the names of various dishes in What’s on the Menu? (surely, one of the main points of interest in this data set). The end product I’m imagining is a good index to the dishes represented in NYPL’s collection of menus. We could have an “authorized form” for each dish, keep track of any alternate forms, and begin to work out categories of related dishes. From this, we could make some headway toward producing linked data from the menus data set—via concordances like the Cooper Hewitt’s—and we could also make our index of dishes available to others as a reconciliation service for cleaning and normalizing other data sets (using tools like Open Refine and the Open Knowledge Foundation Labs’ Nomenklatura).

NYPL has a data set that scores very well on an established scale of openness—the library provides access to machine-readable, structured data in a non-proprietary format—but further curatorial work can still improve the usefulness of this data by ordering and systematizing it at a layer beyond the technical structure of file format. The reason additional curation is needed has to do with the difference between strings and ‘things.’

‘Strings Versus Things’

Advocates of linked open data often use some variation of the phrase ‘from strings to things’ in order to convey the basic motivation behind the technology. A Google search will turn up numerous examples. See, for example, this talk by Mia Ridge at a Linked Open Data in Libraries, Archives and Museums (LODLAM) workshop from last year (2012). As Ridge explains,

Computers think in strings (and numbers) where people think in ‘things’. If I say ‘Captain Cook’, we all know I’m talking about a person, and that it’s probably the same person as ‘James Cook’). The name may immediately evoke dates, concepts around voyages and sailing, exploration or exploitation, locations in both England and Australia… but a computer knows none of that context and by default can only search for the string of characters you’ve given it.

An inspection of the What’s on the menu? data set shows that we’re working with strings. A search for “Brussel sprouts” returns 611 results including at least 3 that look nearly identical but have different counts for the numbers of menus on which they appear. For our reliable index of dishes we want to be working with things (where we can leverage those nice HTTP URIs to convey our specific meaning in machine-parseable terms). In the era of Google, this type of variation and duplication in search results is something to which researchers are accustomed and perhaps it even re-introduces a kind of serendipity. However, this feature of full-text search, which operates on strings, does make asking other questions of the data more difficult.

To appreciate the effect that going from strings to things can have, I extended my method of inspection from a single search to the whole data set. The front page of What’s on the menu proclaims that 1,260,150 dishes have been transcribed to date (this was in late July so slightly higher now). A quick look at the downloaded data suggests that it might be more accurate to say there are 1,260,150 instances of dishes in NYPL’s system. There are only ~~469,357~~ 394,871 entries (rows) in “Dish.csv” (again, for the July data), each one representing a “dish” that has been given a unique identifier. (To quickly check myself, I looped through the CSV file and totaled up the values from the “times_appeared” column. The result, 1,257,525 dish instances, is close enough to the published value to confirm my assumption.) So, really we have 1.2 million instances of ~~469K~~ 394 thousand types of dishes. Given the example of the Brussel sprouts, I suspect that the number of types is actually lower still. ~~469K~~ 394 thousand “dishes” is large enough to make for an interesting challenge but curation of this data to create a reliable index is only half as big a job as it appears from the web site.
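The quick check mentioned in the parenthesis above could look something like the following sketch. It assumes a CSV with a header row that includes a "times_appeared" column, as in "Dish.csv". The function name and the tiny inline sample are hypothetical stand-ins so that the example runs without the real download.

```python
import csv
import io


def count_rows_and_instances(csv_file):
    """Return (number of rows, sum of the times_appeared column)."""
    rows = 0
    instances = 0
    for row in csv.DictReader(csv_file):
        rows += 1
        # Treat blank cells as zero (an assumption about how gaps should count)
        instances += int(row["times_appeared"] or 0)
    return rows, instances


# Invented three-row sample standing in for the real file
sample = io.StringIO(
    "id,name,times_appeared\n"
    "1,Brussel Sprouts,98\n"
    "2,Celery,9\n"
    "3,Club sandwich,\n"
)
rows, instances = count_rows_and_instances(sample)
print(rows, instances)
```

Against the real download, the same function could be handed a file opened with something like `open("Dish.csv", newline="", encoding="utf-8")`, assuming that encoding matches the file.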

Cracking open the data set and inspecting it is one way of assessing the need for curation and the likely amount of effort required. At ~~469K~~ 394 thousand data “points,” or even 1.2 million, the data set is small enough to do this without stretching common workflows or computational tools. (You can open “Dish.csv” in Microsoft Excel, for example.) You could make a similar determination about the curatorial actions needed and the rough scale of the challenge without opening any of the data files.

Part of the basic conceptual equipment of data curation (as a meta-discipline) is a rough taxonomy of types of data: observational, experimental, simulation, etc. Data curation researchers have also developed some cross-cutting ideas of “data levels”—from more “raw” to more “cooked.” These terms come from a techno-scientific context (data levels developed in the context of work with earth-observing satellite imagery) but we can also use them to reason about humanities data like that from What’s on the menu by thinking about the project like a system.

The NYPL has imaged a collection of physical objects producing a first level of data (though not really “raw” in any deep sense, cf. “Raw Data” Is an Oxymoron). Then, through the construction of the What’s on the menu? site/application, the Library processed this first level of data with the aid of online volunteers (another term for “crowdsourcing” being “human computation”). What we can download as a data set is roughly this second level of data—transcriptions based on the images. Treating the contents of the downloaded CSV files as a kind of partially-processed observational data both helps us estimate error and variation (hello “human error”) and also think about how to plan for transformations and changes to the data set (will “authorized” forms fully supersede original forms?). Thus, we can reason from our theoretical knowledge of data and data curation to guide practical “hands on” action.

Next Steps

In the next post, I’ll describe some of the actual data munging work I’m doing to get closer to an index of dishes for What’s on the menu? The data set of “dish” names is small enough to open in Excel but is big enough to challenge the normalizing functionalities of the more-powerful Open Refine. I found a way around this bottleneck and fell into a few other useful workflows along the way.

UPDATE (2013-08-17): Corrected the number of rows in the downloaded CSV file.

New Writing in Archive Journal 3

2013-07-12T00:00:00+00:00

In Service? A Further Provocation on Digital Humanities Research in Libraries

2013-06-19T00:00:00+00:00

Data curation as publishing for digital humanists

2013-05-30T00:00:00+00:00

Since I wrote much of the text of the talk I presented at the recent CIC Center for Library Initiatives conference, I thought I would share it here, somewhat edited. (I don’t usually write the “prose” of my talks beforehand.) My original slides are available on Speaker Deck with all of the leaps, shorthand, and repetitions inherent in being a product intended for verbal delivery.

I want to extend my thanks again to all the staff of the CIC Center for Library Initiatives and to the members of the Program Committee for the 2013 Annual Conference for inviting me to speak. The University of Maryland will become part of the CIC on July 1 of this year and attending this conference was a great way to preview some of the exciting work happening in the libraries of the CIC institutions.

When I was asked to participate on this panel about digital humanities and the alternative publishing needs of faculty, I felt obliged to temper my delighted acceptance of the invitation with a caveat that “scholarly publishing” (and its associated challenges) is not something I usually see as central to the work I do. One of the things I work on is data curation, and the theory and practice of data curation should be relevant to conversations about “emerging options for scholarly publishing.” So, to address the theme of this conference and this panel, I would like to talk about data curation as publishing. The work of curating data (the activities required to maintain the usefulness of information produced as part of research) should be legible as “publishing” work in much the same way that well-understood tasks related to preparing and circulating monographs or journals are publishing work. Data curation as a “publishing” activity is increasingly relevant to the working lives of digital humanities scholars. Moreover, articulating connections between “publishing” and data curation is important in the context of strategic decisions libraries might make and, in fact, are making about how to participate in “publishing.” Data curation as publishing is publishing work that draws directly on the unique skills of librarians and aligns directly with library missions and values in ways that other kinds of publishing endeavors may not.

In referring to “data curation” I am speaking specifically of information work that integrates closely with the disciplinary work practices and needs of researchers in order to “maintain digital information that is produced in the course of research in a manner that preserves its meaning and usefulness as a potential input for further research” (Munoz and Renear 2011). Data curation is “the active and on-going management of data through its lifecycle of interest and usefulness to scholarship, science, and education; curation activities enable data discovery and retrieval, maintain quality, add value, and provide for re-use over time” (Cragin et al. 2007). This distinguishes data curation from many near synonyms: digital curation, digital stewardship, digital preservation. Hopefully, this high-level description of what curation activities enable (discovery, assurance of quality, “added value”, and re-use) also suggests the points of connection (via their similar ends) with other activities that are considered part of “publishing.” I also want to emphasize, for the purpose of making clear what kind of project library-data-curation-as-library-publishing needs to be, that data curation work is active (activist?) and that it is informationist work.

The link between data curation and publishing is not new. Joyce Ray, Sayeed Choudhury, and Mike Furlough presented a paper in 2009, summarizing several strands of contemporaneous work. The paper was entitled “Digital Curation and E-Publishing: Libraries Make the Connection”. [ed. note: To suggest the multiple connections between data curation and publishing, I also pointed to a symposium on the “Now and Future of Data Publishing” taking place at Oxford the same day as the CIC Libraries conference.]

What is new—or at least newer—is data curation (in the sense above) as a part of humanities research. Digital humanists in particular are becoming increasingly aware of data curation issues and data curation needs as part of the way they (we) work. This element of digital humanities work is becoming prevalent enough that I’ve selected examples casually from things that I’ve come across in my professional networks and feeds recently. Lincoln Mullen, a PhD student at Brandeis University, posted on his blog about using the statistical programming language R for historical research. As part of his discussion, Mullen describes how he converted the tables from a monograph he found in his research to a series of comma-separated-values (CSV) files in order to produce graphs and charts of the changing demographics of American religion. Along with his analysis and the blog post about his methods, he posted the (small) data set to Github, a platform for sharing open source software code and open data. Ted Underwood, Associate Professor of English at the University of Illinois, has made the work he and a graduate assistant have done building, cleaning, normalizing, and labeling a data set drawn from the HathiTrust corpus a significant part of the output of his “Uses of Scale” project and other professional presentations. It is also increasingly common to see the release of open data sets as enticement to attract digital humanists to work on particular sets of questions, or in partnership with cultural heritage organizations—see, for example, the IndexCat data from the National Library of Medicine, a small collection of catalog records for a historical library of children’s literature, data from some of the crowdsourcing projects run by the New York Public Library, the Smithsonian Cooper-Hewitt, National Design Museum collection data, and many more examples.

At least part of the professional activity of the digital humanists and organizations above involves making data available and suitable for re-use. As any of the researchers involved would no doubt say, curation of these data sets takes time, effort, and money. Libraries getting involved to help digital humanists do this kind of work would be offering something of value. This would be “publishing” not only in the sense of registering and “making public” a product of scholarly work, but this data curation work would also be “publishing” in the sense of ensuring quality and disseminating outputs to interested communities. (Thanks are due to Shana Kimball for prompting this extension of the argument in discussion after my original talk.) By recognizing data curation work as a publishing activity, libraries would have a “market opportunity” to address unmet needs in the digital humanities community (among others).

In the paper by Choudhury, Furlough, and Ray mentioned above, the authors describe how data curation and publishing can be mutually-reinforcing activities. They write:

we have on the one hand, a community, or a subset of several communities, that has been working on the “back end” of digital production from the generation of raw data to the construction of an organized product that can be accessed, and, on the other hand, another community—publishers—who work on the “front end” of scholarly communications, from manuscripts to publication.

“Making the connection” involves bringing these communities together as complementary elements of a service portfolio that will help libraries justify their funding and their relevance amid changing scholarly practices. This is a good argument and some innovative libraries (among them Penn State, Johns Hopkins, and Purdue) seem to be having some success with this as a strategy. I would argue that it is possible, even preferable, to treat the connection between data curation and publishing as being more fundamental. Data curation is publishing—a form of publishing especially for digital scholarship—and libraries interested in investing in “publishing” as an innovative activity should take some of the resources allocated for such endeavors and devote them to paying for data curation work.

The discussion of “back end” and “front end” by Choudhury, Furlough, and Ray places the connections between data curation and “publishing” in the context of lifecycle models of data (this is explicit in the paper) and lifecycle models of existing scholarly publications like journals and monographs (from author to editor, publisher to library, etc.). As such, while the mutual reinforcement of curation and publishing is emphasized, the recommendations as to what activities libraries and publishers should undertake are (somewhat disappointingly) familiar. Publishers add value to end products through peer review and high quality production and presentation. Libraries standardize and preserve these outputs and continue to make them available to a community over time. Treating data curation and publishing as kindred services may offer the prospect of expanding a library’s stable of “innovative” offerings while not straining resources because there are management efficiencies in having both the “front end” and “back end” people in the library. However, in this model, neither libraries nor publishing seems truly transformed and this is a problematic mismatch when so many other aspects of scholarly work are being transformed.

So there is a need to step outside the lifecycle model inherited from other kinds of scholarly publications. The products, work practices, and exchanges involved in doing data curation as publishing activity will look different from those involved in other previous kinds of publishing. However since data curation work still fulfills the ends of registering, making public, ensuring quality, and disseminating to potential users, data curation should still be legible as publishing. In thinking of data curation as publishing, it is important to understand that this is not exactly the same as data publication.

In a recent publication in Data Science Journal, Mark Parsons and Peter Fox explore “data publication” as a metaphor for the kind of things that scholarly communities want to see happen with data. They explain that “Data Publication builds from the familiar and conceptually simple model of scholarly literature publication” and they capitalize the terms deliberately to indicate the status of this phrase as “a recognized metaphor and data management paradigm.” Parsons and Fox’s paper elaborates on some significant problems in adopting this metaphor. In the limited space available I want to focus on just one of these problems. Parsons and Fox note that under the model of Data Publication “publishers are distributed and can act autonomously or in concert.” Thus, they write, “there is … little emphasis on data discovery and interoperability across systems. Data are often presented as they were created without explicit considerations of data integration or significant reuse. … The attention is on preservation and formal recognized scholarly contribution with less attention to … issues such as latency, rapid versioning and reprocessing, and computational demands.” To understand data-curation-as-publishing (which I’m advocating as a way to serve digital humanities scholars) only as “Data Publication” expands recognizable publisher and library activities to a new class of scholarly objects (data) but in many ways perpetuates the (flawed) status quo. Libraries becoming data publishers has many of the same flaws as the model of libraries becoming journal and monograph publishers.

Within the critique of Data Publication there are glimpses of what it could mean to treat the activities of data curation as “publishing” activity in a way that would benefit both scholars and libraries. The first part of Parsons and Fox’s critique is that under the model of “Data Publication” there is “little emphasis on data discovery and interoperability across systems.” Various examples from the media landscape suggest the truth of this claim. In the realm of ebooks, the importance of outlets like Amazon and other digital dissemination channels has recently forced publishers to pay greater attention to “discovery” and to devote more resources to things like metadata, but at the same time, the fracturing and proliferation of ebook reading platforms is an ongoing example of problems of interoperability across systems in a publishing marketplace. (There is a similar shape to the story of the relative fortunes of the on-demand video company Netflix and various real or rumoured video platforms implemented by specific studios or content creators.) This leads to the question of whether lack of emphasis on discovery and interoperability is intrinsic to the business of publishing (presumably because the energies of publishers are directed elsewhere to activities considered more vital to mission and survival). Attention to “discovery” and related issues of interoperability across systems are traditional and persistent features of library work. There is a risk that the library, the more it acts as publisher, might get away from doing the valuable work it has done in the past. The flip side of this point is the opportunity, expressed in Choudhury, Furlough and Ray’s piece, to excel where traditional publishers have not. However, this alignment, just having both “back end” and “front end” of the process, may not be sufficient to avoid falling into the trap of neglecting discovery and interoperability if Data Publication is the governing metaphor rather than data curation being the predominant action.

This leads to the next part of the critique—that, in a model of Data Publication, “data are often presented as they were created without explicit considerations of data integration or significant reuse.” Data being “presented as they were created” sounds like a description of researcher self-deposit into (institutional) data repositories—currently the most common form of library engagement with data curation. That Parsons and Fox single this problem out in a discussion of why Data Publication is a problematic metaphor from the perspective of solving the real information needs of researchers suggests that while the provision of institutional data repositories is necessary and important it is not sufficient to support scholarship. So, libraries cannot stand pat; they cannot maintain only the “back end” of these processes but must make the connection to more active engagement. Libraries also cannot just adopt a position of becoming data publishers (via repository provision) in the way some are seeking to become journal publishers through the use of platforms like Digital Commons and similar initiatives. Data spread across institutional repositories becomes like a fragmented ebook market spread across proprietary reading platforms.

It is worth noting too that issues that a Data Publication model does not easily encompass—“issues such as latency, rapid versioning and reprocessing, and computational demands”—resemble precisely the kinds of demands that digital humanists are likely to make in the course of trying to do their work.

Treat data curation activities as “publishing”—worthy of new enthusiasm and new resources from libraries—but be wary of framing the endeavor as “data publishing” (an analog to journal and monograph publishing)? What form could this actually take? First, I return to part of the definition of data curation offered above: “curation activities enable data discovery and retrieval, maintain quality, add value, and provide for re-use over time.” Many of the kinds of work that librarians do meet this definition: creating metadata, building catalogs, developing and refining indexes, building, organizing, and maintaining collections. The extension of library, archive, and informationist practice into new forms of work also applies here: aggregating data, cleaning and normalizing data, annotating data with controlled vocabularies and ontologies. Second, and offering a specific example from the digital humanities, data-curation-as-publishing could look something like the Alexandria Archive Institute’s Open Context project. Open Context works on the review, documentation, and publication of research data, mostly in the discipline of archaeology. The first heading on the “About” page of the project web site speaks of “data sharing as publication” and a flavor of the work the project carries out can be gleaned from the editors’ blog. Open Context is hosted and administered by the non-profit Alexandria Archive Institute and thus represents a kind of freestanding example of an organization doing data-curation-as-publishing. This recalls an interesting remark that Choudhury, Furlough, and Ray make in passing. In describing the creation of the Data Conservancy architecture and service at Johns Hopkins, they write: “It is especially important to note the role of a particular individual at AAS who acted as the human “interface” between the various players.
This individual could easily be classified as a “data scientist” – an individual with knowledge of a specific domain or discipline yet also a deep knowledge of data management.” They go on to remark that “libraries would be wise to consider developing such expertise and capacity in-house.” I contend that Open Context, and its editors, represent another example of this kind and that libraries should be figuring out how to set up and host such activities. At the University of Maryland Libraries, those working on data curation are beginning to make the case to subject selectors (who control collection budgets to support various disciplines) to spend collection funds on curation work for significant data sets. These discussions are still at early stages—there is a lot to figure out, including what specifically should appear on “the invoices” for the data curation work that selectors are being asked to pay for—but libraries that wish to engage seriously with support for data-intensive research (like the digital humanities) will increasingly need to sell and buy such services.

Finally, why argue for framing this work and these transactions around data curation as “publishing” activity? Because I think data curation activities are fully legible as “publishing”—meeting the same ends and goals and potentially contributing to scholarship in the same kinds of ways. Also because “library publishing” is a site of buzz and activity and potential investment. Despite how it might sound, this is the opposite of cynical. I would argue that if libraries are going to invest resources in “publishing” then that money should be spent partly on doing data curation work because data-curation-as-publishing offers the most value to both researchers and libraries. (Note: my fellow panelists Matt Gold and Matthew Jockers also offered compelling visions of how to deploy some of the “publishing” resources to support digital humanities.) Data-curation-as-publishing is the right form of publishing for libraries to be in because the work of data curation aligns with libraries’ missions and values in ways that other kinds of publishing ventures do not. (There is much about scholarly “publishing” as it exists now that is not about making knowledge public or ensuring quality of that knowledge or disseminating it to those who need and could use it. There is a great deal of “publishing” that is about issues of prestige, labor, and equity of the disciplinary professions. In my opinion, libraries don’t really have a dog in that fight and shouldn’t spend resources trying to fix those problems.)
In a recent paper in the library and information science literature on assessing data value, Carole Palmer, Nic Weber, and Melissa Cragin remind us that “the library and information science meta-science perspective articulated by [Marcia] Bates (1999) has always been fundamental to the role of providing broad, useable information collections and services, especially to support interdisciplinary research.” Doing data curation work (like that described above) needs the unique training and skills of librarians and other information professionals and it supports the goals and values of the profession in making information accessible and usable to communities of users who need it. Making data curation fully legible as publishing, and investing in data-curation-as-publishing, can help make problems of data discovery, interoperability, and re-use less daunting and show a clear way for the library to be a publisher in ways that research communities like the digital humanities need.

Digital humanities in the library isn't a service

2012-08-19T00:00:00+00:00

Last week, Miriam Posner started an interesting discussion online with a blog post about some of the challenges of doing digital humanities in libraries. Posner identifies challenges rooted in the structure of libraries as organizations, in the organizational position of libraries within universities, and in (academic) library culture.

As a former Mellon Postdoctoral Fellow at Emory University Library, and someone who has done research on digital humanities efforts in libraries, Posner is qualified to speak on the subject but more importantly, her analysis proves itself to be insightful and, I think, helpful to the library community. Several smart comments have been posted on Posner’s original blog post, and two longer responses, by Michael Furlough, a library administrator at Penn State, and by the Library Loon, are also worth reading for their additional observations on challenges to doing digital humanities in libraries.

As a librarian working in a joint position spanning a digital humanities center and a university library, I spend a lot of my time thinking about how to overcome the kinds of challenges Posner, Furlough, and the Loon identify. While I was tempted to jump into the discussion about digital humanities in libraries last week, I didn’t make time to write this post sooner because I was pulling together the first workshop in a year-long program of activities supporting librarians, library staff, and graduate assistants who want to do digital humanities work here at Maryland.

Before I say more about the work we’re doing, I want to pick at two things that have been bothering me in this whole line of discussion. They’re related and both concern the way the discussion of digital humanities in libraries is being framed. A significant portion of the responses seem to assume that when we are talking about “doing digital humanities” in libraries, we are talking about some kind of service libraries might provide. Posner herself does not make this argument explicitly (perhaps reflecting her background as a DH scholar, not a librarian or library administrator). Yet Posner does place faculty members at the center of her picture of DH in libraries (see the tableau in the first paragraph of her post) and so it might be logical to assume that what she means is DH-as-a-service that libraries can provide for their faculty. Framing digital humanities in libraries as a service to be provided, and consequently centering the focus of the discussion on faculty members or others outside the library, seems likely to stall rather than foster libraries’ engagement with digital humanities.

Digital humanities in libraries isn’t a service and libraries will be more successful at generating engagement with digital humanities if they focus on helping librarians lead their own DH initiatives and projects. Digital humanities involves research and teaching and building things and participating in communities both online and off. In libraries, digital humanities should involve those same things.

I’m not arguing that digital humanities work can only take the form of one-off projects or ad hoc collaborations. I believe that good digital humanities work is exploratory and innovative, that it usually takes a false start or two on the way to its ultimate ends and that all of these qualities make it hard to conceptualize from scratch as a service. Something that starts out as a DH project might become a service, or it might generate enough mass to look like a collection we in libraries are more familiar with and that we can build other services around. Experience or tools from a DH project might be able to spin off into a new service or improve an existing service. Jenn Riley at the Carolina Digital Library and Archives at UNC, and Jennifer Vinopal and Monica McCormick at NYU are pursuing some interesting experiments with this approach.

Better, I think, for libraries to support space and resources for interesting, possibly risky DH projects and to think of “technology transfer” as the key service to develop. Bethany Nowviskie, from whom much of the smartest current work on libraries and DH comes, has described a “path to production for scholarly R&D” in libraries that is a must-read on this subject. The idea of “technology transfer” in this context comes from Babak Hamidzadeh, the Associate Dean for Information Technology at the University Libraries at Maryland (my boss), and it is important to acknowledge the point that Posner and Furlough make about the need for strong, creative leadership in libraries looking to support DH. Hamidzadeh and Dean of Libraries Pat Steele are crucial to DH in the libraries at Maryland in the same way that Nowviskie, in her role as an administrator, has helped shape and protect DH in the libraries at Virginia.

I’m not arguing that digital humanities work can’t be managed in a way that admits strategic direction and efficient marshaling of limited resources, but I love Steve Ramsay’s description of what digital humanities in the library can be, and I think it points away from services (at least as a starting point):

… Of all scholarly pursuits, Digital Humanities most clearly represents the spirit that animated the ancient foundations at Alexandria, Pergamum, and Memphis, the great monastic libraries of the Middle Ages, and even the first research libraries of the German Enlightenment. It is obsessed with varieties of representation, the organization of knowledge, the technology of communication and dissemination, and the production of useful tools for scholarly inquiry. But DH is also, itself, a scholarly activity – concerned not just with presenting knowledge or helping to locate it, but with creating it.

This is the good stuff. I choose to believe that this is why so many libraries want to “do digital humanities” even if they don’t feel ready to or don’t know how to get started. The challenges that Posner identifies are real but thinking about services (and “faculty”) will not help those of us in libraries who want to do more DH overcome them. Enabling anyone in the library who wants to “do DH” to be involved and to have at least some way for librarians, library staff, and GAs to start pursuing their own DH ideas will be a more productive starting point. As my MITH colleague Matthew Kirschenbaum writes in one of his contributions to Debates in the Digital Humanities, “At a moment when the academy in general and the humanities in particular are the objects of massive and wrenching changes, digital humanities emerges as a rare vector for jujitsu” (415-416).

Digital humanities in libraries isn’t a service, but then, digital humanities outside of libraries isn’t a service either. The field has worked very hard to correct the misconception that digital humanities is a service activity. For libraries to approach digital humanities as a service to be provided runs somewhat counter to the grain of the field at this stage in its development. Having those who work on digital projects claim identities as researchers rather than as some other kind of academic employees who serve faculty research is important for addressing the issues of power balance within the academy. Thinking critically about the organization of academic labor is an important part of digital humanities work as well.

Librarianship is intellectual work. Doing digital humanities in the library should be (re)centered on the research questions and intellectual agendas of librarians. I believe that the good work that comes from this re-centering will attract partners — whether they be faculty, other librarians, students, or the public.

So, while I disagree with some of the ways Posner’s piece frames the discussion of DH in libraries, this post is not intended as a rebuttal—her analysis is correct about many of the challenges within the library environment to getting DH done. I want to write instead about some of the ways we’re trying to overcome these challenges at Maryland.

Part of my role is to support joint initiatives between the University Libraries and the Maryland Institute for Technology in the Humanities (MITH). The University Libraries was one of the founding supporters of MITH in 1999, but over the course of more than a decade, the relationship between the center and the libraries had grown weaker than either side would desire. Dean Steele and Neil Fraistat, the Director of MITH, created a joint position to rejuvenate those ties.

I feel incredibly fortunate to be able to work every day on bridging a DH center and a university library with the help of great colleagues on both sides. Despite this, I was keenly aware that we could not make real change on the force of personalities alone and that one new hire could not solve the problem. We needed to build structures bigger than any one person — to make clear points of attachment where other people in the libraries and MITH who wanted to be involved could begin to participate. Our first attempt at making these structures visible was a new charter between the libraries and MITH, signed this spring, which spells out specific and reciprocal activities.

A major element of that charter is a re-imagining of MITH’s existing faculty fellowship program in a way that specifically serves our library partners. Jennifer Guiliano and I, with input from Travis Brown, Jim Smith, and the rest of our colleagues at MITH, recast a fellowship program for faculty projects into a program of workshops, tutorials, “office hours,” and project consultations intended to help introduce library faculty, staff, and graduate assistants to digital humanities. Participants who attend the workshop program will be guided through the process of developing digital humanities project ideas, finding data, evaluating tools, and crafting a compelling proposal for funding support (internal or external).

The MITH-University Libraries Digital Humanities Incubator is inspired by Nowviskie’s notion of the special place for a “skunkworks” in the library, but also by examples from the tech world that have always interested me (for instance, the startup incubator Y Combinator), and by some parallel experiments in scholarly communication and humanities graduate training. The Incubator occupies a space that is neither MITH, which must stay lean as a research center, nor the Libraries (which is to say, it is not a bread-and-butter, keep-the-lights-on service). I hope it will be a safe space that we as leaders in the libraries and MITH have carved out where new people can experiment with digital humanities and imagine new projects for librarians (and perhaps their future collaborators, whether they be faculty members or not). MITH is committing our roster of immensely capable DH practitioners to this effort. Dean Steele and the rest of the library leadership have committed to supporting librarian participation in the initiative with approved release time and resources. To be completely honest, this program has come about both through strong library and center leadership, and also through wielding the power of “lazy consensus” for good. The desire to plan for digital humanities can itself be an impediment to getting digital humanities done.

The Incubator will run until December when participants will have the opportunity to present ideas they’ve worked on in a pitch round where they can receive feedback from MITH and Libraries staff. The most compelling proposal will receive 9 months of more-focused support from the MITH team (akin to a traditional fellowship). Turnout for the first of the four major workshops was strong (21 attendees, with another 20 or so scheduled to attend the second time slot).

We’re also working to help create more training opportunities in digital humanities — not only for our campus but for the wider community.

The challenges to getting DH done in libraries (and outside them) are real. Setting aside the idea of digital humanities as a service and the concomitant focus on faculty (at least while DH is getting started in a new community) will be important for libraries who want to do DH. As a recipe for digital humanities in libraries, I propose instead: creating a space outside existing commitments, having administrator (and library opinion leader) support for librarians to participate in that space, wielding the power of lazy consensus to just get down to work, and making sure channels are open along paths to production for tech transfer back to existing mission-critical library services and activities.


Attending the Knowledge Organization and Data Modeling in the Humanities Workshop at Brown University, March 14-16, 2012

2012-03-12T00:00:00+00:00

cross-posted from the MITH blog

This week I will be one of the participants at a three-day workshop on “Knowledge Organization and Data Modeling in the Humanities” co-sponsored by the Centre for Digital Editions at the University of Würzburg and the Brown University Center for Digital Scholarship, and hosted by Brown. The workshop was organized by Julia Flanders (Brown University) and Fotis Jannidis (University of Würzburg) and is being supported through generous funding from the DFG/NEH Bilateral Digital Humanities Program.

The roster of other speakers at the event is top-notch and all the presentations promise to be engaging. For my part, I was asked to contribute to a more practically oriented section on “Research Ontologies”—focusing on how data modeling happens in humanities projects—and also to talk a little about the similarities and differences related to modeling data in a DH context versus a library one.

I’ll be speaking about the process we’re engaged in now here at MITH for developing the data models that underlie our transcriptions of materials for the Shelley-Godwin Archive. Data modeling is an immensely important but largely under-discussed topic in digital humanities (it’s certainly under-represented in the published literature—though this workshop is part of an effort to change that). Data modeling is part of the crucial “DH-specific” intellectual work of translating between the (often implicit) understandings that scholars have of the objects they study and the affordances of a particular digital technology (which might be a relational database or TEI XML or any number of things). For projects that are attempting to conform to standards and best practices, data modeling never begins from a blank slate. I’m particularly interested in the ways—social, conceptual, and technical—that we build (and hopefully share) data models for our projects within an ecology of “received ideas” in the form of something like the TEI Guidelines.

The workshop is intended to engage remote participants as well as those of us who will be in Providence this week. I strongly encourage interested members of MITH’s community to follow along on the web and via Twitter.