Collaborative Manuscript Transcription

THATCamp Gratitude

noreply@blogger.com (Ben W. Brumfield) — Mon, 02 Mar 2020 14:32:00 +0000

In 2012, I wrote this email to Dave Lester and Jeremy Boggs:

Yesterday, I was telling someone that I'd had three big breaks to be able to do DHish work as a career. Enumerating them I realized that two of them were or involved THATCamps!
Thanks, guys! I'm following my dreams thanks to y'all.

THATCamp was the first encounter I’d had with academia since graduation (aside from appeals to donate to my alma mater’s annual fund). I was astonished to discover that most of the other campers had (or were working on) PhDs, but their welcome and enthusiasm soon set me at ease. By the end, I’d made some good friends and gotten over most of my technical challenges.

I missed THATCamp in 2009, but Dave and Jeremy were enthusiastic about another THATCamp here in Austin, while Jeanne Kramer-Smythe would be in town, so--with Lisa Grimm and Peter Keane--the first regional THATCamp was born. I left the 2012 THATCamp AHA with an opportunity to quit my industry job and work on DH full time, and--with encouragement from yet more THATCampers--took it. Over the past dozen years, the generosity of the people I’ve met to share their wisdom and perspectives has been a model that my partner Sara and I have tried to emulate.

So thanks again to Jeremy and Dave, to RRCHNM, and to all the colleagues who have been part of the conversation at THATCamp.

Beyond Coocurrence: Network Visualization in the Civil War Governors of Kentucky Digital Documentary Edition

noreply@blogger.com (Ben W. Brumfield) — Tue, 15 Aug 2017 20:57:00 +0000

On August 10, 2017, my partner Sara Carlstead Brumfield and I delivered this presentation at Digital Humanities 2017 in Montreal. The presentation was coauthored by Patrick Lewis, Whitney Smith, Tony Curtis, and Jeff Dycus, our collaborators at Kentucky Historical Society.

This is a transcript of our talk, which has been very lightly edited. See also the Google Slides presentation and m4a and ogg audio files from the talk.

[Ben] We regret that our colleagues at the Kentucky Historical Society are not able to be with us; as a result, this presentation will probably skew towards the technical. Whenever you see an unattributed quotation, that will be by our colleagues at the Kentucky Historical Society.

The Civil War Governors of Kentucky Digital Documentary Edition was conceived to address a problem in the historical record of Civil War-era Kentucky that originates from the conflict between the slave-holding, unionist elite and the federal government. During the course of the war, they had fallen out completely. As a result, at the end of the war the people who wrote the histories of the war--even though they had been Unionists--ended up wishing they had seceded, so they wrote these pro-Confederate histories that biased the historical record. What this means is that the secondary sources are these sort-of Lost Cause narratives that don't reflect the lived experience of the people of Kentucky during the Civil War. So in order to find out about that experience, we have to go back to the primary sources.

The project was proposed about seven years ago; editorial work began in 2012 – gathering the documents, imaging them, and transcribing them in TEI-XML. In 2016, the Early Access edition published ten thousand documents on an Omeka site, discovery.civilwargovernors.org. Sara and I became involved around that time for Phase 2.

The goal of Phase 2 was to publish 1500 heavily annotated documents that had already been published on the Omeka site, and to identify people within them.

The corpus follows the official correspondence of the Office of the Governor. As Kentucky was a divided state, there were three Union governors during the Civil War, and there were also two provisional Confederate governors.

Fundamentally, the documentary edition is not about the governors. We want to look at the individual people and their experience of war-time Kentucky through their correspondence with the Office of the Governor. This correspondence includes details of everyday life from raids to property damage, to – all kinds of stuff: when people had problems, they wrote to the governor when they didn't know where else to go.

If we're trying to highlight the people within these documents, how do you do that within a documentary edition? In a traditional digital edition, you use TEI. Each individual entity that's recognized—whether it's a person, place, organization, or geographic feature—will have an entry about it created in the TEI header or some external authority file, and the times that they are mentioned within the text will be marked up.

Now when done well, this approach is unparalleled in its quality. When you get names that have correct references within the ref attribute of their placeName tags --- you really can't beat it. The problem with this approach is that it's very labor-intensive, and because it's done before publication, it adds an extra step before the readership can have access to the documents.

The alternative approach which we see in the digital humanities is the text mining approach, in which existing documents have Named Entity Recognition and other machine learning algorithms applied to them to attempt to find people who are mentioned within the documents.

Here's an example, looking for places, people, or concepts within a text.

The problem with this approach--while it's not labor-intensive at all--is that it doesn't produce very good quality. The third Union governor of Kentucky, Beriah Magoffin, appears with one hundred and seven variants within the text that have been annotated so far. These can be spelling variants, they can be abbreviations, and then there are all these periphrastic expressions like “your Excellency”, “Dear Sir”, “your predecessor” (in a letter to his successor). So there are all kinds of variation in the way this person appears [in the text].

Furthermore, even when you have consistency in the reference, the referent itself may be different. So, “his wife” appears in these documents as a reference to eight different people. What are you going to do with that? No clustering algorithm is going to figure out that “his wife” is one of these eight people.

Our goal was to try to reproduce the quality of the hand-encoded TEI-XML model in a less labor-intensive way.

[Sara] So how do we do that? We built a system, called Mashbill, for a cadre of eight GRAs, each assigned 150 documents from the corpus, who used a Chrome plug-in called hypothes.is to highlight every entity in the published version of the documents. So the documents are transcribed and published, and the GRAs highlight every instance of an entity.

If we look at the second [highlight] “Geo W. Johnson, Esq”, they highlight it, and then they use Mashbill, where we use the hypothes.is API to pull in all of their verbatim annotations.
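[Editorial illustration] As a rough idea of what "pulling in verbatim annotations" can look like, here is a minimal Python sketch that extracts the highlighted text from one annotation record. It assumes the record carries a list of targets, each with selectors, one of which is a TextQuoteSelector holding the exact quote. The sample record, its URL and the user name are invented, and this is not Mashbill's own code.

```python
def verbatim_quotes(annotation):
    """Collect the highlighted ("exact") text from one annotation record."""
    quotes = []
    for target in annotation.get("target", []):
        for selector in target.get("selector", []):
            if selector.get("type") == "TextQuoteSelector":
                quotes.append(selector["exact"])
    return quotes

# Invented sample record (assumed shape); only the quoted string comes from the talk.
sample = {
    "uri": "http://example.org/document/1",
    "user": "acct:example_gra@hypothes.is",
    "target": [{
        "source": "http://example.org/document/1",
        "selector": [
            {"type": "TextPositionSelector", "start": 120, "end": 139},
            {"type": "TextQuoteSelector", "exact": "Geo W. Johnson, Esq",
             "prefix": "Hon. ", "suffix": " of"},
        ],
    }],
}
print(verbatim_quotes(sample))
```

Keeping the extraction separate from any network call means the same function can be run against saved records while testing.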

Each GRA sees the annotations they have created, and [next to each] is an “identify” button. This pulls the verbatim text into a database search using Postgres's trigram library to look for closest matches within our database of known entities.
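[Editorial illustration] The idea behind trigram matching can be sketched in a few lines of Python. This is a simplification of what a database trigram library does: the word-splitting and padding rules below are approximations, and the `closest_matches` helper and the short entity list are hypothetical, not taken from Mashbill.

```python
import re

def trigrams(text):
    """Return the set of 3-character substrings of each padded, lowercased word."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing (approximation)
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams

def similarity(a, b):
    """Shared trigrams divided by all trigrams, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def closest_matches(verbatim, known_entities, limit=5):
    """Rank known entity names by similarity to the highlighted text."""
    scored = [(similarity(verbatim, name), name) for name in known_entities]
    scored.sort(reverse=True)
    return scored[:limit]

# A hypothetical slice of the entity database, using names mentioned in the talk.
entities = ["George Washington Johnson", "George Johnston", "Beriah Magoffin"]
for score, name in closest_matches("Geo W. Johnson, Esq", entities):
    print(f"{score:.2f}  {name}")
```

A score like this only narrows the list of candidates. As the talk goes on to say, a person still has to pick the right entity from it.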

“Geo W. Johnson, Esq” has the potential to match a lot of people—mostly based on surname. It looks like it might be the second one, George Johnston (a judge), but probably it's George Washington Johnson—halfway down the page—who was one of these provisional Confederate governors. The GRA would choose that to associate the string with the entity in the database, but if they couldn't find an entity—remember that the goal is to find all the people in the corpus who are not already known to historians—they have the ability to create an entity record.

When you create a new entity or when you're working with an entity, we flesh out a lot of really rich information about that entity within the tool. The GRAs would fill in attributes from their research into a set of approved references for Kentucky in this period, including dates, race, gender, geographic location (latitude/longitude). We also get short biographies which will be incorporated into the edition, and also a list of documents [mentioning the entity].

Once you have the information, you can do a lot with the entities really quickly. We can do rich entity visualizations: the big dot is people, places, organizations and geographic features; you can look at gender of entities within the corpus; we can look at entities that appear more often than others and who they are. You can do a lot of high-value work with the data.

We can also look at documents and the places that they mention – large dots are places that are mentioned more often in the documents.

[Ben] The last stage of this, is—once the entity research is finished and once the annotations for the document have all been identified—the Mashbill system will produce a TEI-XML file for every entity. It will also update the existing TEI documents that were created during the transcription process with the appropriate persName, placeName, orgName tags with references to [the entity files]. It will also automatically check those files into Github so that the Github browser interfaces will display the differences between [the versions].
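[Editorial illustration] A minimal Python sketch of the tagging step, assuming an identified annotation gives us the verbatim string, the kind of entity, and a reference to the entity file. The sample sentence and the `#entity-123` reference are invented, and the real application will need to handle cases this does not (repeated strings, text already inside other tags).

```python
from xml.sax.saxutils import escape, quoteattr

def tag_entity(text, verbatim, tag, entity_ref):
    """Wrap the first occurrence of `verbatim` in a TEI name element with a ref."""
    before, found, after = text.partition(verbatim)
    if not found:
        return escape(text)  # nothing to tag; return the text safely escaped
    element = f"<{tag} ref={quoteattr(entity_ref)}>{escape(verbatim)}</{tag}>"
    return escape(before) + element + escape(after)

# Invented sentence; only the highlighted name comes from the talk.
line = "Letter delivered by Geo W. Johnson, Esq yesterday."
print(tag_entity(line, "Geo W. Johnson, Esq", "persName", "#entity-123"))
```

The same function covers placeName and orgName by changing the `tag` argument.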

So we end up with an output that is equivalent to a hand-coded digital edition which is P5-compliant TEI, but which we hope takes a little bit less labor.

If we're trying to look at relationships between people in this corpus, we need to define those relationships. One traditional method--which we saw earlier in [François Dominic Laramée's presentation, "La Production de l’Espace dans l’Imprimé Français d’Ancien Régime : Le Cas de la Gazette"]--is cooccurrence: trying to identify entities that are mentioned within a block of text. Maybe [that block] is a page, maybe it's a paragraph, maybe it's a sentence or a word window.

But cooccurrence has a lot of challenges. For example, [pointing] if we look right here at “our Sheriff” (who is identified as Reuben Jones, I think) is mentioned within the same paragraph as these other names. But the reason he's mentioned is – it's just an aside: we sent a letter via our sheriff, now we're going to talk about these county officers. There's no relationship between the sheriff, Reuben Jones, and the officers of the county court. The only relationship that we know of is between Reuben Jones and the letter writer – and that's it. Cooccurrence would be completely misleading here.
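[Editorial illustration] To make the problem concrete, here is a minimal Python sketch of paragraph-level cooccurrence counting. The officer names are placeholders; only Reuben Jones is named in the talk.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(paragraphs):
    """Count every unordered pair of entities named in the same paragraph."""
    edges = Counter()
    for entities in paragraphs:
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges

# One paragraph: the sheriff is mentioned in passing, then the county officers.
paragraph = ["Reuben Jones", "County Officer A", "County Officer B"]
for pair, count in cooccurrence_edges([paragraph]).items():
    print(pair, count)
```

The counter links the sheriff to both officers just as strongly as it links the officers to each other, which is exactly the misleading result described above.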

[Sara] So what do we do instead?

Once you've identified the entities within the text, the next step in the Mashbill pipeline is to define the relationship that you're seeing. Those might be relationships that are attested to by the document itself, or they might be relationships that the GRAs found over the course of their research for the biographies.

Mashbill displays a list of all the entities that appear in a document, and the GRAs choose relationships for those entities based on their research. We have six different types of relationships—social, legal, political, slavery, military—and we also prompt the GRAs, showing them what we already know about the relationships of the [entities mentioned within a document].

So we have richer relationship data than a lot of traditional computational approaches, which means that you can do visualizations which have more data encoded within them, and can be more interesting.

This is Caroline Dennett, who was an enslaved woman who was brought [to Kentucky] as contraband with the Union Army, was “employed” by a family in Louisville, and was accused of poisoning their eighteen month old daughter. There are a lot of documents about her, because there are people writing to the governor about pardoning her, or attesting to her character (or lack of ability to do anything that horrible).

What we did with our network is not just Caroline and all the people and organizations she was related to, but rather we have different types of relationships. We have legal relationships, political relationships; we have social relationships. So a preacher in her town was one of the people who wrote to the governor on her behalf, so we show a social relationship with that person. We have about three different types [of relationships] displayed in different colors on this graph.
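[Editorial illustration] One way to hold typed relationships so that a graph can color edges by type is sketched below in Python. The record layout is an assumption, the type list contains only the five types named in the talk, and the second relationship is a placeholder; only the social tie to a preacher comes from the talk.

```python
from collections import defaultdict
from dataclasses import dataclass

# The five types named in the talk; the record layout itself is an assumption.
RELATIONSHIP_TYPES = {"social", "legal", "political", "slavery", "military"}

@dataclass(frozen=True)
class Relationship:
    source: str
    target: str
    kind: str

    def __post_init__(self):
        if self.kind not in RELATIONSHIP_TYPES:
            raise ValueError(f"unknown relationship type: {self.kind}")

def edges_by_type(relationships, entity):
    """Group the neighbours of one entity by relationship type."""
    grouped = defaultdict(list)
    for rel in relationships:
        if entity == rel.source:
            grouped[rel.kind].append(rel.target)
        elif entity == rel.target:
            grouped[rel.kind].append(rel.source)
    return dict(grouped)

rels = [
    Relationship("Caroline Dennett", "a preacher in her town", "social"),
    Relationship("Caroline Dennett", "placeholder court", "legal"),
]
print(edges_by_type(rels, "Caroline Dennett"))
```

Each group can then be drawn in its own color, which is the extra information a plain cooccurrence graph cannot carry.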

What are our results?

As of a week ago, this project had annotated 1228 documents with 15931 annotations. Of those annotations, 14470 have been identified as 8086 particular entities. On our right [pointing], we have the distribution of annotations on documents: some of them, like petitions have as many as 238 names, but our median is around eight entities named per document.

You can find the project at civilwargovernors.org. That's the Early Access version which is just the transcriptions; by October those will be republished with all the biography data and the links between the documents and the entity biographies.

The software is on Github. I'm Sara Brumfield, this is Ben Brumfield; we're with Brumfield Labs. Patrick Lewis is the PI on this project, Whitney Smith, Tony Curtis, and Jeff Dycus are editors and technologists at the Kentucky Historical Society. We also want to thank the graduate research assistants.

[applause]

Many questions were very faint in the audio recording; as a result, the following question texts should be regarded as paraphrase rather than transcripts.

Question: You mentioned the project's goals of trying to get beyond a pro-slavery, pro-Confederate historical record. Do you have an idea of how that's going?

Answer: [Ben] What we find is that the documents skew male; they skew white. So it's not like we can create documents that don't exist. But what we can do now is identify documents and people, so you can say “Show me all the women of color who are mentioned within the documents; I want to read about them.” So at least you can find them.

Question: Despite the workflow and process, it seems like there are still a lot of hours of labor involved in this. Can you give us an idea of the amount of labor involved in this project, outside of building the software?

Answer: [Sara] The budget for the labor was $40,000, which hired eight GRAs for the summer. [Ben] They're not done yet, but we think they will achieve the goal of 15,000 entities. It's hard to tell the difference between this and a TEI tagging project, in part because—in addition to identifying entities—every single entity had to be researched, and a biography had to be written for them if possible. [Sara] That's obviously labor-intensive. From a software perspective, we tried to think really hard about how to make this work go faster. So using hypothes.is for annotation: hypothes.is is really slick, and we also didn't have to build an annotator, so that keeps your costs of software development down. So that went really fast. Trying to match entities to choose; we tried to do a lot of that sort of work to make the GRAs as effective as possible. [Ben] But they still have to do the research; they still have to read the documents.

Question: All of your TEI examples focus on places – were you able to handle other kinds of entities?

Answer: [Ben] We concentrated on people, places, and organizations, but one interesting thing about this approach is that--if you look up here at entities mentioned more than ten times, and I'm sorry there's no label--the largest red blob and the largest blue blob are both Kentucky. One of them is the Government of Kentucky; the other is Kentucky as a place. Again, humans can differentiate that in a way that computers can't. [Sara] We did organizations, people, places, and geographic features.

Question: This is a fantastic resource for not just Kentucky Historical Society, but also in terms of thinking through history in the US. I was wondering what your data plan was, and how available and malleable is the data that you produce.

Answer: [Sara] The data itself is flushed to Github as TEI documents, so every entity will have a document there, as well as every document. The database itself is not published anywhere. [Ben] Our goal with this was that, by the time we got to the “pencils down” phase of the project, everything was interoperable, in Github, so that people could reconstruct the project from that, and that no information was lost – but that's the extent of it.

Question: A technical question – I missed the part with Github. How does that work?

Answer: [Ben] So the editors were looking for a way of exposing the TEI for reuse by other people. Doing all this work on TEI, then locking it away behind HTML is no fun. That said, they were not that happy with--and they loved the idea of Github as a repository; we had used it before for the Stephen F. Austin papers as a raw publication venue--they were really not comfortable with their graduate research assistants having to figure out how git works, and how do you resolve merge conflicts, and such. As a result, Mashbill--the Ruby on Rails application that we built--every time there's a change to a document or an entity, it does a checkout and merge, finds the TEI, adds the tags--essentially merges all that data in--and then checks that back into Github. As a result [the GRAs] are able to use the Github web interface to see the diffs and publish the data, but they didn't have to actually touch git. [Sara] Right, but the editors might, if they need to.

About us: Brumfield Labs, LLC is a software consultancy specializing in digital editions and adjacent methodologies like crowdsourced transcription, image processing/IIIF, and text mining. If you have a project you'd like to discuss, or just want to pick our brains, we'd love to talk to you. Just send a note to benwbrum@gmail.com or saracarl@gmail.com and we'll chat.

Tools and Techniques for Enhanced Encoding of Account Books from US Plantations (MEDEA2)

noreply@blogger.com (Ben W. Brumfield) — Sat, 10 Dec 2016 14:35:00 +0000

In April of 2016, Anna Agbe-Davies and I attended MEDEA2 (Modeling Semantically Enhanced Digital Editions of Accounts) at Wheaton College in Norton, Massachusetts to meet with other scholars and technologists working on digital editions of financial records. This is the talk we gave, compiled from Anna's prepared remarks and a transcript of my oral presentation.

Introduction

Agbe-Davies: Manuscript accounts from plantations come in a variety of forms: accounts of plantation residents (enslaved and free) as they frequented local stores; records of the daily expenditures and income realized by slave owners as a direct result of their human property; and accounts tracking economic exchanges between plantation owners and the laborers on whom they depended for their livelihood. The data recorded in these sources present an unparalleled opportunity for scholarly analysis of the economic and social structures that characterized the plantation for people throughout its hierarchy.

The properties of these manuscripts are simultaneously the source of their richness and the font of many challenges. The average American--for all we think we know about our recent plantation past--has little idea of the economic underpinnings of that regime and likewise little sense of how individual men and women may have navigated it. The idea that enslaved people engaged in commercial transactions, were consumers, at the same time that they were treated as chattel property, runs counter to our understanding of what slavery meant and how it was experienced.

Therefore, primary documents challenging these deeply-held beliefs are an important resource, not only for researchers, but the general public as well. We have set out to develop a mechanism that delivers these resources to a wider public, enables their participation in transcription, which makes these sources readable by machines and by people not well-versed in 18th- and 19th-century handwriting.

Just to review, for those of you who were not present for our paper in Regensburg. Neither of us is an historian. I am an archaeologist and, as is usual in the US, also an anthropologist. I came to texts such as the “Slave Ledger” discussed throughout this presentation with a straightforward question: what were enslaved people buying in 19th-century North Carolina? In this sense, the store records complement the archaeological record, which is my primary interest. Clearly, however, these texts have additional meanings and potential for addressing much more than material culture and consumption. This is exciting for the anthropologist in me. Ben is editing the account books of Jeremiah White Graves, a ledger and miscellany from a Virginia tobacco plantation. We are collaborating to extend the capabilities of Ben’s online transcription tool FromThePage, to unleash the full analytical possibilities embodied in financial records. This paper follows up on our previous contribution by showing how the new version of FromThePage meets the challenges that we outlined in October.

Sources

Stagville was the founding farm for a vast plantation complex assembled by several generations of Bennehans and Camerons. A local historian estimates that at their most powerful, the family owned around 900 men, women, and children (Anderson 1985:95). Some of the people at Stagville stayed on after Emancipation, allowing for a fascinating glimpse of the transition from slavery to tenancy and wage labor.

Daybooks and ledgers from plantation stores owned by the Bennehan-Cameron family cover the years 1773 to 1895. Many of the men and women whose purchases are recorded therein were the family’s chattel property and, in later years, their tenants or employees. There are forty-five daybooks and twenty ledgers in the family papers, which are collected in the University of North Carolina’s Southern Historical Collection. Eleven volumes are flagged by the finding aid as including purchases by “slaves” or “farm laborers,” though many volumes have no summary and may contain as-yet unidentified African American consumers.

My plan is to digitize and analyze a selection of these daybooks and ledgers. This project augments the Southern Historical Collection’s effort to make important manuscripts available via the Internet. My project not only increases the number of volumes online and in a format that enables analysis by users with varying levels of expertise, but makes the contents of these documents available as data, not merely images.

Aims and Problems

We wanted the tool to be user-friendly. We didn't want users to have to learn complicated or esoteric conventions for the transcription or marking of the texts. We also needed support for encoding, display, and export of tabular data appearing within documents. This feature allows Markdown encoding of tables appearing within documents, and enhances the semantic division of texts into sections, as Ben will discuss below.

In pilot studies I had done before adopting FromThePage, participants cited the need to have the manuscript being transcribed visible at the same time as the transcription window.

One of the problems with transcription is how to treat variations in terminology and orthography. This question was discussed at the last MEDEA meeting as a difference between historical and linguistic content.

An analyst has good reason to want to treat all variants of whiskey as one category. But by the same token, differences in how the word is rendered are important information for other forms of analysis: who wrote the entry; do differences signify levels of literacy or specialized knowledge?

Just like commodities, people’s names come in several forms. Unlike commodities, the same name may refer to more than one individual. When it is possible to merge multiple names into a single individual or distinguish two similar names as two different people, it’s important to be able to record that insight without doing violence to the original structure of the manuscript.

Encoding

Brumfield: We don't usually think about wiki encoding as encoding--we don't usually think about plain-text transcription as encoding--but fundamentally, it is.

The goal of wiki encoding is quick and easy data entry. What that means is that where possible, users are typing in plain text. Now this is a compromise. It is a compromise between presentational mark-up and semantic mark-up. But fundamentally all editions are compromises. [inaudible]

If the user encounters a line break in the original text, they hit carriage return. That encodes a line-break. For a paragraph break, they encode a blank line. So you end up with something very similar to the old-fashioned typographic facsimile in your transcript.

That's not enough, however -- you have to have some explicit mark-up. That's where we've added lightweight encoding. Most wiki systems support that, but the ones which we use most prominently are wikilinks that are backed by a relational database system that records all of the encoding.

What's a wiki-link? Here's an example: there are two square braces, and on one side of a pipe sign is the canonical name of the subject [while on the other side you have the verbatim text.] In this account, Anna has encoded a reference to gunflint, and the verbatim text is "flint", and it appears in the context of "at sixpence". That's not that complex, but it does give the opportunity to encode variation in terminology; to resolve all these terms into one canonical subject.
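[Editorial illustration] A minimal Python sketch of reading such a link, assuming the canonical name comes before the pipe and the verbatim text after it, as in the example just described. Treating a link with no pipe as using the same text for both is an assumption, and the regular expression is illustrative rather than the tool's own parser.

```python
import re

# [[canonical name|verbatim text]]; a bare [[name]] is assumed to use the name for both.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def parse_wikilinks(line):
    """Return (canonical, verbatim) pairs for each wiki-link in a line."""
    return [(m.group(1).strip(), (m.group(2) or m.group(1)).strip())
            for m in WIKILINK.finditer(line)]

def plain_text(line):
    """Render the line as a reader sees it: links replaced by their verbatim text."""
    return WIKILINK.sub(lambda m: m.group(2) or m.group(1), line)

line = "[[gunflint|flint]] at sixpence"
print(parse_wikilinks(line))
print(plain_text(line))
```

The transcript keeps the word the clerk wrote, while the pair records which canonical subject it resolves to.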

Because all of the tags are saved into the database, you can do some really interesting things, like dynamically generating an index whenever a page is saved. If you go to the entry on "gunflint", you can see all of the entries that mention gunflint. You can link directly to [the pages], and you can see the variations. So on page 3 we see "Gunflints", on page 4 we see "Gunflints", and page 5 we see "flints", which is the entry we just saw.
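[Editorial illustration] Once each link is stored as a (page, canonical subject, verbatim text) record, building the index is a simple grouping step. The Python sketch below assumes that record shape and uses the page numbers and variants just mentioned.

```python
from collections import defaultdict

def build_index(page_links):
    """Map each canonical subject to the (page, verbatim) pairs that mention it."""
    index = defaultdict(list)
    for page, canonical, verbatim in page_links:
        index[canonical].append((page, verbatim))
    return dict(index)

links = [
    (3, "gunflint", "Gunflints"),
    (4, "gunflint", "Gunflints"),
    (5, "gunflint", "flints"),
]
index = build_index(links)
print(index["gunflint"])
```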

When a transcriber records something like "flint", they are asked to categorize it and add it to an ontology that's specific to a project. Anna has developed this ontology of categories to put these subjects in. A subject can belong to more than one category. The subjects that are of particular importance to this project are the persons and their status: You can have a person who shows up as an account holder, you can have a person who shows up as enslaved, you can have a person who shows up as a creditor, and the same person can be all of these different things.

In this case, "gunflint" shows up in the category of "arms".

The other thing that you can do is mine this tagging to do some elementary network analysis by looking at colocation. When a subject like 'gunflint' appears most commonly in the same physical space on the page with another subject, you can tell that it's related to that subject more closely than it is to things that appear farther away. Unsurprisingly, 'gunflint' appears most commonly with 'shot' and with 'powder'. So people are going hunting and buying all their supplies at once. (This is why I'll never get a PhD: discovering that people buy powder and shot at the same time is not groundbreaking research! But it could be a useful tool for someone like Anna.)
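[Editorial illustration] A minimal Python sketch of this kind of colocation count. Here "the same physical space" is approximated by "tagged in the same entry", which is an assumption; the tool may use a different notion of proximity. The entries are invented.

```python
from collections import Counter

def related_subjects(entries, subject):
    """Count how often other subjects share an entry with `subject`."""
    counts = Counter()
    for subjects in entries:
        if subject in subjects:
            counts.update(s for s in set(subjects) if s != subject)
    return counts.most_common()

# Invented entries; each inner list holds the subjects tagged in one entry.
entries = [
    ["gunflint", "shot", "powder"],
    ["gunflint", "powder"],
    ["whiskey", "sugar"],
    ["gunflint", "shot", "whiskey"],
]
print(related_subjects(entries, "gunflint"))
```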

Another thing we can do: Even though we're not encoding in TEI--we're not using TEI for data entry--we can generate TEI-XML from the system. These are two sections of an export from the Jeremiah White Graves Diary, in which there's a single line with "rather saturday. Joseph A. Whitehead's little" and there's a line-break. That was encoded by someone typing in "Jos A. Whitehead" in angle brackets, then "little", and hitting carriage return.

The TEI exporter generates that text with a reference string and a line break. It also generates the personography entry for Joseph A Whitehead from the wiki-link connecting "Jos. A Whitehead" to the canonical name.
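[Editorial illustration] The Python sketch below shows one way such an export step could work: each wiki-link becomes a TEI referencing string (`rs`) pointing at a subject ID, each transcribed line ends with a line break (`lb`), and each canonical name yields a person entry. The ID scheme and the sample line are hypothetical, not the exporter's actual output.

```python
import re
from xml.sax.saxutils import escape

WIKILINK = re.compile(r"\[\[([^\]|]+)\|([^\]]+)\]\]")

def subject_id(canonical):
    """Hypothetical ID scheme: letters and digits only, prefixed with 'S'."""
    return "S" + re.sub(r"[^A-Za-z0-9]", "", canonical)

def line_to_tei(line):
    """Turn one transcribed line into TEI: links become <rs>, the line ends with <lb/>."""
    out, pos = [], 0
    for m in WIKILINK.finditer(line):
        out.append(escape(line[pos:m.start()]))
        out.append(f'<rs ref="#{subject_id(m.group(1))}">{escape(m.group(2))}</rs>')
        pos = m.end()
    out.append(escape(line[pos:]))
    return "".join(out) + "<lb/>"

def person_entry(canonical):
    """A minimal personography entry for the canonical name."""
    return (f'<person xml:id="{subject_id(canonical)}">'
            f'<persName>{escape(canonical)}</persName></person>')

print(line_to_tei("[[Joseph A. Whitehead|Jos A. Whitehead]]'s little"))
print(person_entry("Joseph A. Whitehead"))
```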

What does any of that have to do with accounts? That's really great for prose, but it doesn't help us deal with the kinds of tabular data that show up in financial records.

Here's an example from the Stagville Account books that Anna has encoded, which shows off the mark-up which we developed for this project. And I have to say that this is "hot code" -- this was developed in February, so we are nowhere near done with it yet.

We needed to come up with some way to encode these tabular records in a semantically meaningful way and to render them usefully. We chose the Markdown sub-flavor of wiki-markup to come up with this format which looks vaguely tabular.

Whenever you encounter a table, you create a section header which identifies the table, then a table heading for all the columns, and then you type in what you see as the verbatim text, separated by pipes. You can add lots of whitespace if you want all of your columns to line up in your transcription and make your transcript look really pretty, or not -- it doesn't really matter.
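[Editorial illustration] A minimal Python sketch of reading a pipe-separated table like the one described, assuming the first non-blank row holds the column headings. It tolerates extra whitespace and skips a Markdown-style separator row if one is present. The sample rows are invented, and the section header that identifies the table is left out because its exact syntax is not given here.

```python
def parse_table(lines):
    """Parse pipe-separated lines into a list of {heading: cell} rows."""
    def cells(line):
        return [c.strip() for c in line.strip().strip("|").split("|")]

    rows = [cells(l) for l in lines if l.strip()]
    # Skip a Markdown-style separator row such as |---|---| if one is present.
    rows = [r for r in rows if not all(c and set(c) <= set("-:") for c in r)]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]

# Invented sample table for illustration only.
table = """
| Date   | Item   | Amount |
|--------|--------|--------|
| Jan 3  | flint  | 6d     |
| Jan 9  | powder | 1s     |
""".splitlines()
print(parse_table(table))
```

Because cells are stripped, lining up the columns with spaces changes nothing in the parsed result.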

One of the hiccups we ran into pretty quickly was that not all of the tables in our documents were nicely formatted with headers. What we needed to do was encode the status of these columns. All these column headers are important -- they're applied to the role of the data cells below them. So we came up with this idea of "bang notation", which essentially strips this for display, but leaves it for semantic encoding.

That serves our purposes for display and for the textual side of the edition. What about the analytical side? Once all this is encoded, we're able to export--not just in TEI-XML--but we're able to export all of the tabular data that appears in one account book and create a single spreadsheet from it. Because the tables that appear in an account book can be really heterogeneous--some of us are dealing with sources in which you have a list of bacon shipments, and then you have an actual account to a creditor, and then you have an IOU--but you want to be able to track the same amounts from one table to another, when those values are actually relevant.

We do something really pretty simple: we look at the heading, and if something appears under the same heading in more than one table, we'll put it in the same column in the spreadsheet we generate. That means that some rows are going to have blank cells because they've had different headers. Filtering should allow you to put that together. Here you see Mrs. Henry's Abram's account, you have Frederick's account, so you can filter those and say you just want to study Frederick's account. You just want to study those two accounts together.
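[Editorial illustration] The heading-matching rule can be sketched in Python as follows. It assumes each table has already been parsed into rows keyed by heading; the extra "Table" column used for filtering, and all cell values, are invented. Only the two account names come from the talk.

```python
import csv
import io

def merge_tables(tables):
    """Combine tables with different headings into one column list plus rows.

    `tables` is a list of (table_name, rows), where rows are {heading: cell}
    dicts. Cells under the same heading share a column; gaps stay blank.
    """
    columns = ["Table"]
    for _, rows in tables:
        for row in rows:
            for heading in row:
                if heading not in columns:
                    columns.append(heading)
    merged = []
    for name, rows in tables:
        for row in rows:
            merged.append({"Table": name, **row})
    return columns, merged

def to_csv(columns, merged):
    """Write the merged rows as CSV text, leaving missing cells empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(merged)
    return buffer.getvalue()

tables = [
    ("Mrs. Henry's Abram", [{"Item": "powder", "Amount": "1s"}]),
    ("Frederick", [{"Date": "Jan 3", "Amount": "6d"}]),
]
columns, merged = merge_tables(tables)
print(to_csv(columns, merged))
```

Both accounts share the "Amount" column, while "Item" and "Date" are blank wherever a table lacked that heading, so filtering on the "Table" column recovers either account.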

Furthermore, we also have the ability to link back to the original page, so that you can get back from the spreadsheet to the textual edition.

We've got a lot more to do. We are probably less than fifty percent done with the software development side of this, and way less than that for the whole project.

We need to work on encoding dates that are useful for analysis.
We need to figure out how to integrate subjects and tables in ways that can be used analytically.
We need to add this table support to our TEI exports.

And then we need to do a lot more testing. We're still working on this.

With that, I turn it back to Anna.

Conclusion

Agbe-Davies: We have conceptualized usability in terms of both process and product. Because FromThePage is designed to facilitate crowdsourced transcription of manuscript accounts, the functionality of the input process is as important as the form the output will take. The resulting transcription will be exponentially more readable for nonspecialist users, while at the same time allowing researchers to perform quantitative analyses on data that would otherwise be inaccessible.

Each of these audiences can contribute to the development of these datasets and use them in creative ways. In my association with Stagville State Historic Site, I have the opportunity to share research findings with the general public and they are eager to explore these data for themselves and turn them to their own purposes. Teachers can use this material in their classes. Site interpreters and curators can enrich their museums’ content with it. History enthusiasts can get a sense of the primary data that underlies historical scholarship. Researchers can manipulate and examine transcriptions in ways that are both quantitative and qualitative. In a recent paper on crowdsourcing as it applies to archaeological research, a colleague and I wrote, “[there is an] urgent need for access to comparative data and information technology infrastructure so that we may produce increasingly synthetic research. We encourage increased attention not only the ways that technology permits new kinds of analyses, but also to the ways that we can use it to improve access to scattered datasets and bring more hands to take up these challenges.” A similar argument can be made for the modeling of historic accounts.

Accidental Editors and the Crowd at DiXiT 2

noreply@blogger.com (Ben W. Brumfield) — Thu, 09 Jun 2016 14:06:00 +0000

In March, I gave a "Club Talk" at DiXiT 2: Academia, Commons, Society, a conference of digital scholarly editors hosted by the Cologne Center for e-Humanities and the Institut für Dokumentologie und Editorik. The club talk was a real departure for me -- for the first time in my career, my name was on a concert poster!

Club lecture, sword fight & live music @StereoKoeln w/ @benwbrum @colognecommons Register @ https://t.co/Vn942KPONR pic.twitter.com/DTsNEyXDaI
— DiXiT (@DiXiT_eu) February 26, 2016

I tried to keep the material light, and was helped by some serious swordplay by Langes Schwert Köln, the local HEMA group. Langes Schwert had spent weeks preparing their demonstration, highlighting each phrase in a parallel text edition of a fighting manual as they worked through the moves it described. I really can't thank them enough.

Here are the video, slides, and transcript of my talk, followed by the video and transcript of the swordfighting demonstration.

Thanks to DiXiT for bringing me here and thank you all for coming. All right, my talk tonight is about accidental editors and the crowd. What is an accidental editor? Most of you people in this room are here because you're editors and you work with editions. So I ask you, look back, think back to when you decided to become an editor. Maybe you were a small child and you told your mother, “When I grow up I want to be an editor.” Or maybe it was just when you applied for a fellowship at DiXiT because it sounded like a good deal.

The fact of the matter is there are many editions that are happening by people who never decided to become an editor. They never made any intentional decision to do this and I'd like to talk about this tonight.

So all this week we've been talking digital scholarly editions, tonight, however, I'd like to take you on a tour of digital editions that have no connection whatsoever to the scholarly community in this room.

Torsten Schaßan yesterday defined digital editions saying that, “A digital edition is anything that calls itself a digital edition.” None of the projects that I'm going to talk about tonight call themselves digital editions. Many of them have never heard of digital editions.

So, we're going to need another definition. We're going to need a functional definition along the lines of Patrick Sahle's, and this is the definition I'd like to use tonight. So these are “Encoded representations of primary sources that are designed to serve a digital research need.”

All right, so the need is important. The need gives birth to these digital editions. So what is a need in the world of people who are doing editing without knowing they're doing editing?

Well, I'll start with OhNoRobot. Everyone is familiar with the digital editing platform OhNoRobot, right? Right?

All right, so let's say that you read a web comic. Here's my favorite web comic, Achewood, and it has some lovely dark humor about books being "huge money-losers" and everyone "gets burned on those deals". And now you have a problem which is that two years later a friend of yours says, “I'm going to write a book and it's going to be great.” And you'd say, “Oh, I remember this great comic I read about that. How am I going to find that, though?”

Well, fortunately you can go to the Achewood Search Service and you type in “huge money-loser” and you see a bit of transcript and you click on it...

And you have the comic again. You've suddenly found the comic strip from 2002 that referred to books as huge money losers. Now, how is that possible? See this button down here? This button here that says “Improve Transcription.” If you click on that button...

You'll get a place to edit the text and you'll get a set of instructions. And you'll get a format, a very specific format and encoding style for editing this web comic. All right? Where did that format--where did that encoding come from? Well, it came from the world of stage, the world of screenplays. So this reads like a script. And the thing is, it actually does work. It works pretty well. So that community has developed this encoding standard to solve this problem.

Let's say that you're a genealogist and you want to track records of burials from 1684 that are written in horrible old secretary hand and you want to share them with people.

No one is going to sit down and read that. They're going to interact with us through something like FreeReg. This is a search engine that I developed for Free UK Genealogy which is an Open Data genealogy non-profit in the U.K. And this is how they're going to interact with this data. But how's it actually encoded? How are these volunteers entering what is now, I'm pleased to say, 38 million records?

Well, they have rules. They have very strict rules. They have rules that are so strict that they are written in bold. “You transcribe what you read errors and all!”

And if you need help here is a very simple set of encoding standards that are derived from regular expressions from the world of computer programming. All right? This is a very effective thing to do.

One thing I'd like to point out is that in the current database records encoded using this encoding style are never actually returned. This is [encoded] because volunteers demand the ability to represent what they see and encoding that's sufficient to do that even if the results might even be lost, in the hope that some day in the future they will be able to retrieve them.
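To make the regular-expression heritage concrete, here is a minimal Python sketch of how such a notation could be turned into a search pattern. The notation itself is an assumption made for illustration (`_` for one unreadable character, `*` for an unreadable run); it is not taken from FreeReg's documentation, and the function names are hypothetical.

```python
import re

def uncertain_to_regex(transcribed):
    """Compile a transcription with uncertainty marks into a regex.

    Assumed (hypothetical) notation:
      _  exactly one unreadable character
      *  any run of unreadable characters, possibly empty
    Everything else is matched literally, ignoring case.
    """
    parts = []
    for ch in transcribed:
        if ch == "_":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def could_match(transcribed, candidate):
    """True if a clean candidate name is consistent with the transcription."""
    return uncertain_to_regex(transcribed).fullmatch(candidate) is not None

print(could_match("Sm_th", "Smith"))   # one uncertain letter
print(could_match("Sm_th", "Smooth"))  # too many letters for one "_"
```

A search engine that honored these marks could return a record transcribed as `Sm_th` for a query on either "Smith" or "Smyth", which is the kind of retrieval the volunteers are hoping for.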

Okay. So far I've been talking mainly about amateur editions. I'd like to talk about another set of accidental editors which are people in the natural sciences. For years and years naturalists have studied collections and they've studied specimens in museums and they've gotten very, very good at digitizing things like...

This is a "wet collection". It's a spider in a jar and it's a picture I took at the Peabody Museum.

In case you've ever wondered whether provenance can be a matter of horror [laughter] I will tell you that the note on this says, “Found on bananas from Ecuador.” Be careful opening your bananas from Ecuador!

Thanks to climate change and thanks to habitat loss these scientists are returning to these original field books to try to find out about the locations that these were collected from to find out what the habitats looked like 100 years ago or more. And for that these records need to be transcribed.

So here is the Smithsonian Institution Transcription Center. This is going to [be familiar] to a lot of people in the room. The encoding is something really interesting because we have this set of square notes: vertical notation in left margin, vertical in red, slash left margin, vertical in red all around "Meeker". The interesting thing about this encoding is that this was not developed by the Smithsonian. Where did they get this encoding from?

They got this encoding from a blog post by one of their volunteers. This is a blog post by Siobhan Leachman who spends a lot of time volunteering and transcribing for the Smithsonian. And because of her encounter with the text she was forced to develop a set of transcription encoding standards and to tell all of her friends about it, to try to proselytize, to convert all of the other volunteers to use these conventions.

And the conventions are pretty complete: They talk about circled text, they talk about superscript text, they talk about geographical names. I'm fairly convinced--and having met Siobhan I believe she can do it--that given another couple of years she will have reinvented the TEI. [laughter].

So you may ask me, “Why are [we] squished into the back of a room?” To make room for the swords. And we haven't talked about swords yet.

So I'd like to talk about people doing what's called Historical European Martial Arts. This is sword fighting. It's HEMA for short. So you have a group of people doing martial arts in the athletic tradition as well as in the tradition of re-enactors who are trying to recreate the martial arts techniques of the past.

So there are HEMA chapters all over. This is a map of central Texas showing the groups near me within about 100 kilometers and as you can see many clubs specialize in different traditions. There are two clubs near me that specialize in German long sword. There's one club that specializes in the Italian traditions and there are—there's at least one club I know of that specializes in a certain set of weapons from all over Europe.

So how do they use them? Right? How do they actually recreate the sword fighting techniques? They use the texts in training. And this is a scene from an excellent documentary called “Back to the Source,” which I think is very telling, talking about how they actually interact with these. So here we have somebody explaining a technique, explaining how to pull a sword from someone's hand...

And now they're demonstrating it.

So where do they get these sources from? For a long time they worked with 19th century print editions. For a long time people, including the group in this room, worked with photocopies or PDFs on forums. Really all of this stuff was very sort of separated and disparate until about five years ago.

So five or six years ago Michael Chidester, who was a HEMA practitioner who was bedridden due to a leg injury, had a lot of time on the computer to modify Wikisource, which is the best MediaWiki platform for creating digital editions, to create a site called Wiktenauer.

What can you find on Wiktenauer? Okay, here's a very simple example of a fighting manual. We've got the image on one side. We've got a facsimile with illustrations. We have a transcription, we have a translation in the middle. This is the most basic. This is something that people can very easily print out, use in the field in their training.

Still it's a parallel-text edition. If you click through any of those you get to the editing interface which has a direct connection between the facsimile and the transcript. And the transcript is done using pretty traditional MediaWiki mark up.

Okay. Now, and I apologize to the people in the back of the room because this is a complex document. We get into more complex texts. So this is a text by someone named Ringeck and here we have four variants of the same text because they're producing a variorum edition. In addition to producing the variants...

They have a nice introduction explaining the history of Ringeck himself and contextualizing the text.

What's more, they traced the text itself and they do stemmatology to explain how these texts developed.

And in fact they even come up with these nice stemmata graphs.

So how are they used? So, people study the text, they encounter a new text and then they practice. As my friends last night explained to me, the practice informs their reading of the text. They are informed deeply by die Körperlichkeit -- the actual physicality of trying out moves.

The reason that they're doing this is because they're trying to get back to the original text and the original text is not what was written down by a scribe the first time. The original text, this Urtext, is what was actually practiced 700 years ago and taught in schools. Much like Claire Clivaz mentioned talking about Clement of Alexandria: You have this living tradition, parts of it are written down, those parts are elaborated by members of that living tradition and now they're reconstructed.

What if your interpretation is wrong? Well, one way they find out is by fighting each other. You go to a tournament. You try out your interpretation of those moves. Someone else tries out their interpretation of those moves. If one of you would end up dead that person's interpretation is wrong. [laughter] (People think that the stakes of scholarly editing are high.)

What are the challenges to projects like Wiktenauer? So one of the projects—when I interviewed Michael Chidester he explained that they particularly, editors in the U.S., actually do struggle and they would love to have help from members of the scholarly community dealing with paleography, dealing with linguistic issues, and some of these fundamental issues.

One of the other big challenges that I found is--by contrast with some of the other projects we talked about--in many cases the texts on Wiktenauer are of highly varied quality. They try to adjust for this by giving each text a grade, but if an individual is going to contribute a text and they're the only one willing to do it, you sort of have to take what you get. My theory for why Wiktenauer transcripts may be of different quality from those that you see on the Smithsonian or that genealogists produce is that for those people the transcription--the act of working with the text--is an end in itself whereas for the HEMA community the texts are a way to get to the fun part, to get to the fighting.

And now--speaking of "the fun part"--it's time for our demonstration.

It gives me great pleasure to welcome Langes Schwert Cologne, with
Junior Instructor, Georg Schlager
Senior Instructor, Richard Strey
Head Instructor, Michael Rieck

START AUDIO: [0:02:00]

RICHARD: Okay, two things first: I will be presenting this in English even though the text is in German, but you won't really have to read the text to understand what's going on. Also, we will have to adjust for a couple of things. A sword fight usually starts at long range unless someone is jumping out from behind a bush or something. So we'll have to adjust for that. In reality moves would be quite large. All right. So he could actually go to the toilet and then kill me with just one step. So we will be symbolizing the onset of the fight by him doing a little step instead of us both taking several steps. All right.

So basically, this is what it's all about. These techniques also work in castles and small hallways. [laughter] All right. Now, again we are Langes Schwert we have been doing the historical German martial arts since 1992. We train here in Cologne and today we would like to show you how we get from the written source to an actual working fighting. You can all calm down now from now on it's going to be a slow match. So, in case you didn't get what happened the whole thing in slow.

Okay, so how do we know what to do? We have books that tell us. For this presentation we will be using four primary sources: fencing books from around 1450 to 1490 all dealing with the same source material. On the right hand side you can see our second source; the text you see there will be exactly what we are doing now. Also, we use a transcription by Didier de Conier from France. He did that in 2003, but since the words don't change we can still use it. All right, so how do we know what to do? I can talk in this direction.

He's the good guy, I'm the bad guy.

GEORG: We can see that. [laughter]

RICHARD: So how does he know what to do? I'll be basically reading this to you in English, we've translated it. In our group we have several historians. Several other members as well can actually read the original sources, but still in training we go the easy [way] and do the transcription. But still usually we have the originals with us in PDF format so in case we figure, “Well, maybe something's wrong there,” we can still look at it. For the most part that doesn't really matter, but the difference between seine rechte seite and deine rechte seite -- "his right side" and "your right side" -- can make a difference. [laughter]

Okay, so what we're dealing with today is the easiest cut, the wrath cut. The sources tell us that the wrath cut breaks all attacks that [come] from above with the point of the sword and yet it's nothing but the poor peasant's blow. Essentially what you would do with a baseball bat, all right? But very refined. [laughter] So, usually his plan would be to come here and kill me. Sadly, I was better than him. I have the initiative so he has to react. And it says, do it like this, if you come to him in the onset, symbolized, and he strikes at you from his right side with a long edge diagonally at your head like this...then also strike at him from your right side diagonally from above onto his sword without any deflection or defense.

So that's his plan. It would be a very short fight if I didn't notice that. [laughter] So, thirdly it says if he weak—oh no, if he is soft in the bind which implies that I survive the first part which looks like this then let your point shoot in long towards his face or chest and stab at him. See we like each other so he doesn't actually stab me. [laughter]

Okay, it says this is how you set your point on him. Okay, next part, if he becomes aware of your thrust and becomes hard in the bind and pushes your sword aside with strength...then you should grip your sword up along his blade back down the other side again along the blade and hit him in the head. [laughter] All right, this is called taking it all from above. Okay. This is the end of our source right here. Now we have left out lots of things. There are a lot of things that are not said in the text. For example it says if he's weak in the bind, or soft in the bind, actually I'm not, I'm neutral and then I become soft. How do I know this? It doesn't say, so right here. Well maybe I could just try it being soft. It would basically look like this.

It doesn't really work. [laughter] Now, if being soft doesn't really work maybe being hard does. So I'll try that next. It doesn't really work either. Okay. So this is an example from fencing, from actually doing it you know that in the bind you have to be neutral. If you decide too early he can react to it and you can change your mind too fast.

All right, now what we read here is just one possible outcome of a fight. A fight is always a decision tree. Whenever something happens you can decide to do this or do the other thing and if you fight at the master level, which is this, you [have] lots and lots and lots of actions that happen and the opponent notices it, reacts accordingly and now if you were to carry out your plan you would die. So you have to abort your plan and do something else instead. So now we'll show you what our actual plans were and what happened or didn't happen. So, I was the lucky guy who got to go first. I have the initiative. So my plan A is always this.

And then I'll go have a beer. [laughter] And the talk is over, but he's the good guy so he notices what happened and his plan is this. He hits my sword and my head at the same time. So if I just did what they do in the movies notice that he's going bang, he is not dead, I'll do something else, no it doesn't work. I'm already finished. So I have to notice this while I'm still in the air. Abort the attack, right? I was going to go far like this, now I'm not going to do that. I'm going to shorten my attack and my stab. And from here I'm going to keep going. He is strong in the bind so I will work around his sword and hit him in the face. How to do this is described I think two pages later or three.

All right. Now, remember I was supposed to be weak or soft in the bind. My plan was to kill him, it didn't work. I had to abort it. When I hit his blade he was strong, I go around it which makes me soft which is what it says there. Okay, sadly he doesn't really give me the time to do my attack and instead, he keeps going. I was going to hit him in the face, but since I was soft he took the middle, hits me in the head. All thrusts are depending on the range. In here we don't have much range so we'll always be hitting. If we're farther apart it will be a thrust. Okay, but I notice that, so I'm not afraid for my head I'm afraid for the people. [laughter]

MAN #1: Divine intervention!

RICHARD: So I notice him hitting me in the head so I'll just take the middle and hit him in the head. It usually works. Okay. Now, obviously, the smart move for him is to just take his sword away, let me drop into the hole because I was pushing in that direction anyway and then he'll keep me from getting back inside by just going down this way. And that, basically, is how the good guy wins. Except, of course, there is a page over there where it tells me how to win. And it says, well if he tries to go up and down you just go in. See, you have to stand here.

So, basically there is never any foolproof way to win. It's always a case of initiative and feeling the right thing. Actually, there is one foolproof way, but we're not going to tell you. [laughter] This concludes our small demonstration. [applause] I'm not finished yet. So, what we did was about 60 to 70% speed. We couldn't go full speed here because of the beam. Also it was just a fraction of the possible power we could do it at. We counted yesterday, we had been for the practice session what you just saw are nine different actions that are taken within about one-and-a-half seconds and that's not studied or choreography. In each instance you feel what is happening. You feel soft, hard, left, right, whatever, it's not magic, everyone does it. Well, all martial arts do it once they get to a certain level. So we would like to thank Didier de Conier who, unknowingly, provided us with the transcriptions. We would like to thank Wiktenauer even though we don't need it that often because we can actually read this stuff. It's a great resource for everyone else and as always we have that. And oh yeah, we train at the Uni Mensa every Sunday at 2. So whoever wants to drop by and join is invited to do so. [applause]

AUDIO END: [0:20:51]

Encoding Account Books Relating to Slavery in the U.S. South at MEDEA Regensburg

noreply@blogger.com (Ben W. Brumfield) — Sun, 25 Oct 2015 10:05:00 +0000

On October 22-24 of 2015, I was fortunate to attend the NEH/DFG-sponsored MEDEA workshop in Regensburg, Germany. The workshop gathered together American and European scholars, editors, and technicians working with digital editions of financial records, an often-overlooked type of textual source. I presented along with Anna Agbe-Davies, a faculty member at the University of North Carolina-Chapel Hill, with whom I am collaborating to extend FromThePage to support tabular data within texts. You can read background on the project at our abstract at the MEDEA website.

This document is a composite of the prepared text delivered by Anna Agbe-Davies and a transcript of the ex tempore talk by Ben Brumfield. Each section will be preceded by the name of the speaker in boldface, with editorial interventions in [brackets].

Agbe-Davies:

Neither of us is an historian. I am an archaeologist and, as is usual in the US, also an anthropologist. I came to texts such as the “Slave Ledger” discussed below with a straightforward question: what were enslaved people buying in 19th-century North Carolina? In this sense, the store records complement the archaeological record, which is my primary interest. Clearly, however, these texts have additional meanings and potential for addressing much more than material culture and consumption. This is exciting for the anthropologist in me. I have experience with the methods of historical analysis, but the technological advances of the last few years mean that I have much to learn about the best techniques for harnessing the potential of such documents.

Ben and I are collaborating to extend the capabilities of his online transcription tool FromThePage, to unleash the full analytical possibilities embodied in such texts, including the archive he will now describe.

Brumfield:
I'd like to introduce the papers of Jeremiah White Graves. These are three volumes that were bound posthumously from approximately thirty notebooks with roughly 1600 pages worth of diaries, formal accounts, and informal accounts that are held at the Alderman Library at the University of Virginia and which may be accessed online at http://tinyurl.com/JWGravesPapers in facsimile edition.

Jeremiah White Graves moved from Louisa County, Virginia to Pittsylvania County, Virginia when he was fifteen years old. In 1823, at the age of 22, using the skills that he learned as a store clerk, he began keeping accounts on his own. These accounts cover his activities trading with his neighbors, but primarily [cover his activities as] a plantation owner. He acquired the plantation of Aspen Grove, as well as inheriting other plantations. Aspen Grove is 120 kilometers north of Stagville, which is the plantation that Anna will discuss, and--like Stagville Plantation--it primarily produced tobacco crops for cash through the work of enslaved laborers.

Some of his accounts are formal. These may look very familiar to many of you. These are how he started his accounts in 1822, but he soon found that a formal accounting system did not serve his needs very well.

He started keeping informal accounts to track other activities, such as (in this case) visits by a doctor to treat members of his household, both slave and free. These informal accounts also cover shipments of logs, corn or cotton to mills. They cover days his children attended school. They also cover articles of clothing his children took with them to boarding schools.

One of the most interesting things about these accounts is the light they shed on the relationship between Graves and his enslaved laborers, and the relationships among them and the rest of the community. One of the challenges of the accounts is that they have a very complex topology. Because the accounts are informal, accounts will be written in separate, unrelated [inaudible].

In this case, we have a two-entry account between Graves and "my Henry"--who is one of his primary enslaved laborers--who he loans money to. Henry then pays him back. So we have two entries in this account.

This account is stuck between shipments of cotton and logs to mills in a previous year, sticks of tobacco [stripped], a later account of tobacco cut in fields, and then a much earlier account of tobacco [stripped] in prize barns.

You see a similar challenge over here [points to second page], where--over the intriguing entries on meat sent to laborers at Aspen Grove Plantation from a different plantation--you find this fascinating account with entries between "my Frederic" (another one of Graves's laborers) and Graves. One of the fascinating things about this account is that Frederic dies, and--in one of the only instances in which Graves records women in his informal accounts--Graves settles his account with Malissa, Frederic's enslaved widow.

Another challenge of the accounts is that they have a complex order. Graves began his notebooks with diary entries from front to back. He would write his accounts from back to front. Then when they met in the middle, he would start a new book, [though] sometimes returning to the older books.

As you see here, we have a four-year-long account that starts on the second-to-last page of the book, continues on page 18, then on page 17, and finally finishes up on page 5 of the volume.

While these accounts are complex, they are not unique, so I will hand this over to Anna.

Agbe-Davies:
Stagville was the founding farm for a vast plantation complex assembled by several generations of Bennehans and Camerons¹. A local historian estimates that at their most powerful, the family owned around 900 men, women, and children (Anderson 1985:95). Some of the people at Stagville stayed on after Emancipation, allowing for a fascinating glimpse of the transition from slavery to tenancy and wage labor.

1. The Bennehan/Cameron holdings included nearly 20,000 acres in Durham, Wake, and Granville Counties in 1890 (McDuffie 1890). Anderson estimated a peak 30,000 acres along the Flat, Eno, and Neuse Rivers, not to mention thousands more in western NC, plantations in Alabama and Mississippi, as well as residences in the county seat and the state capital.

Daybooks and ledgers from plantation stores owned by the Bennehan-Cameron family cover the years 1773 to 1895. Many of the men and women whose purchases are recorded therein were the family’s chattel property and, in later years, their tenants or employees. There are forty-five daybooks and twenty ledgers in the family papers, which are collected in the University of North Carolina’s Southern Historical Collection². Eleven volumes are flagged by the finding aid as including purchases by “slaves” or “farm laborers,” though many volumes have no summary and may contain as-yet unidentified African American consumers.

2. In addition to the daybooks and ledgers, there are also cash books, books of ready money sales, and personal/household account books, numbering 142 “financial volumes.” http://www2.lib.unc.edu/mss/inv/c/Cameron_Family.html#

My aim is to digitize and analyze a selection of these daybooks and ledgers. This project augments the Southern Historical Collection’s effort to make important manuscripts available via the Internet. My project not only increases the number of volumes online and in a format that enables analysis by users with varying levels of expertise, but makes the contents of these documents available as data, not merely images.

One of the questions guiding my research is this: What did it mean to shop in a store, if you yourself can be bought and sold? I am interested in both the financial and social aspects of accounting in the plantation context. Daybooks and ledgers offer an important complement to the archaeological record at Historic Stagville, in Durham, North Carolina.

[omitted from the oral presentation:] Archaeologists can speculate about, but seldom demonstrate, the paths by which goods reached the quarter. Artifacts may reflect the actions of the owner who issued clothing or tools and passed along hand-me-downs. Conversely, finds may speak to the agency of the owned, as when they hunted or grew food for their own consumption or purchased items of personal adornment with cash earned on the side. However, neither interpretation is evident in the artifacts themselves. Archaeologists need additional sources of information because these distinctions have implications for how we view material aspects of the relationship between owner and owned—how power was wielded, how demands were negotiated. The daybooks and ledgers are one way in which to capture how African American consumers at Stagville—pre-Emancipation and during the years of Jim Crow—fashioned lives with the things that they bought.

Brumfield: What we plan to do is to use the open-source digital edition tool FromThePage--which I run, though I welcome contributions from anyone else--to digitize these documents -- to transcribe them.

FromThePage already handles transcription and presentation online. The core functionality of FromThePage is the wiki-link. FromThePage handles mark-up using a wiki syntax that is backed by a relational database that suggests mark-up. So if a user sees the phrase "Renan" and they transcribe it, this then is expanded to the canonical name "Renan, Virginia".

This is used for presentation: Users who see Renan can see the explanation. If they explore the subject, they can see an automatically-generated index.
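The link-and-index idea can be sketched in a few lines of Python. This is an illustration rather than FromThePage's actual code: it assumes a double-bracket syntax of the form `[[Canonical Subject|verbatim text]]`, and the sample pages and the subject name "Henry (enslaved laborer)" are invented for the example.

```python
import re
from collections import defaultdict

# Assumed link syntax: [[Canonical Subject|verbatim text]] or [[Canonical Subject]]
WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def parse_links(text):
    """Return (canonical, verbatim) pairs for every link in the text."""
    return [
        (m.group(1).strip(), (m.group(2) or m.group(1)).strip())
        for m in WIKI_LINK.finditer(text)
    ]

def plain_text(text):
    """Replace each link with its verbatim reading, for display."""
    return WIKI_LINK.sub(lambda m: (m.group(2) or m.group(1)).strip(), text)

def build_index(pages):
    """Map each canonical subject to the sorted page ids that mention it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for canonical, _verbatim in parse_links(text):
            index[canonical].add(page_id)
    return {subject: sorted(ids) for subject, ids in index.items()}

# Invented sample pages, for illustration only.
pages = {
    "p1": "Went to [[Renan, Virginia|Renan]] with tobacco.",
    "p2": "Paid [[Henry (enslaved laborer)|my Henry]] at [[Renan, Virginia|Renan]].",
}

print(plain_text(pages["p1"]))
print(build_index(pages))
```

Keeping the verbatim reading and the canonical subject side by side is what lets the same transcript serve both a reader, who sees "Renan", and an index, which groups every mention under "Renan, Virginia".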

What we plan to do--now we're moving to the draft design--is to add new wiki mark-up to handle sections that will define different blocks within the text. To continue this, to use Markdown wiki mark-up to describe tables. This addresses data entry. (We're not big fans of hand-coded XML as a user interface; hand-coded wiki? We'll see how that works.)

But what's important and relevant here is that this [mark-up] is interpreted by the software and then displayed -- in HTML we have a display as simple HTML tables. For TEI, we'll expand to TEI tables with the wiki-links expanded using A tags for HTML or reference strings to elements within the TEI header.
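As a rough illustration of that pipeline, the Python sketch below parses a pipe-delimited Markdown table and renders it as a simple HTML table. It is a sketch under stated assumptions, not the FromThePage implementation: the function names are hypothetical, the sample rows are invented, and the draft design described here may differ in its exact table syntax.

```python
from html import escape

def parse_markdown_table(block):
    """Parse a pipe-delimited Markdown table into (headers, rows)."""
    lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]

    def cells(line):
        # Drop the outer pipes, then split on the inner ones.
        return [c.strip() for c in line.strip("|").split("|")]

    headers = cells(lines[0])
    # lines[1] is the |---|---| separator row and carries no data.
    rows = [cells(ln) for ln in lines[2:]]
    return headers, rows

def to_html(headers, rows):
    """Render parsed headers and rows as a simple HTML table."""
    head = "".join(f"<th>{escape(h)}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"

# Invented sample table, for illustration only.
sample = """
| Date | Item | Amount |
|------|------|--------|
| Jan 4 | shoes | 1.25 |
| Jan 9 | tobacco | 0.50 |
"""

headers, rows = parse_markdown_table(sample)
print(to_html(headers, rows))
```

One design wrinkle this sketch ignores: if wiki-links inside a cell also use a pipe to separate the canonical name from the verbatim text, the cell splitter would need to skip pipes that fall inside double brackets.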

We have further ideas for exports -- I'm very interested to see other presentations for ideas for those.

However, to serve Anna's analytical needs, we need to export these tables in CSV format. So what we have designed is the ability to export all records from the collection in a single spreadsheet. The spreadsheet will be sparse, so that entries from different tables that contained the same column header when they were encoded will appear in the same column on the spreadsheet. If one table contains an extra column that other tables did not, that will appear in the final spreadsheet, but tables that did not contain that column will [have blank cells] in the spreadsheet. We also plan to expand the data columns to handle the wiki text, so that both canonical subjects and verbatim text will be included.
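The sparse layout can be sketched with Python's standard `csv` module. The table contents and the function name below are invented for illustration; the point is only that the spreadsheet's columns are the union of all column headers, and that a table lacking a column gets a blank cell there.

```python
import csv
import io

def export_sparse_csv(tables):
    """Merge tables with differing headers into one sparse CSV string.

    `tables` is a list of (headers, rows) pairs. Columns appear in the
    order they are first seen; missing cells are written as blanks.
    """
    columns = []
    for headers, _rows in tables:
        for header in headers:
            if header not in columns:
                columns.append(header)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    for headers, rows in tables:
        for row in rows:
            writer.writerow(dict(zip(headers, row)))
    return out.getvalue()

# Invented sample tables: the second has an extra "Paid by" column.
tables = [
    (["Date", "Item", "Amount"], [["Jan 4", "shoes", "1.25"]]),
    (["Date", "Item", "Paid by", "Amount"], [["Mar 2", "corn", "cash", "0.75"]]),
]

csv_text = export_sparse_csv(tables)
print(csv_text)
```

Matching columns by header text is simple, but it means that two tables only line up if transcribers spell the header the same way, so some normalization of headers may be needed in practice.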

Agbe-Davies: I have transcribed one document called the “Slave Ledger,” but have found the result to be inadequate for the analyses I would like to perform. The combination of qualitative and quantitative research goals means that neither transcription, nor a spreadsheet can handle the range of analyses necessary.

The many goods listed in the document (spelled variously) need to be categorized in several ways. Sometimes they are purchases, other times, sources of credit. I would like to be able to find both instances of “shoes” but also other instances of “footwear” and “clothing” and “goods made by other members of the plantation community.” Not to mention being able to, in various circumstances, either merge or separate “shoes” from “repair of shoes.”

Another form of analysis enabled by tags is pulling out purchases by a single canonical individual, even when different names are used. Using my transcription of the Slave Ledger, I still had to pick out individuals for this chart by hand because no text search would pull out all and only references to Frank Kinnon, when there are multiple “Frank”s and his second name appears with several different spellings and grammatical constructions³.

As this slide also shows, the ability to pull together records by categories—with those categories being multiscalar—is important for the quantitative analyses that I perform. In order to examine both trends and change over time, I will be performing analyses within, across, and among manuscripts. Thus, these tags should live somewhere outside any single document.

I will be examining how people spent precious cash or credit to determine whether gaps were left by the provisioning system during slavery times. If the Bennehans’ and Camerons’ human property regularly purchased basic staples it would offer an interesting contrast to the paternalistic, “enlightened” slaveowner of their own imaginations (Anderson 1985:96). In addition, I want to know whether people on the Bennehan-Cameron farms were making similar purchases to folks elsewhere in the plantation South (Heath 2004; Martin 1993). Also what (dis)continuities exist between the pre- and post-Emancipation eras, as households assumed greater responsibility for their own sustenance?

3. For example, Frank Kinnon, Kennon Frank, and Frank Kennon, not to be confused with Old Frank/Old Frank Eno.

Because I am not an expert on account books, I don’t know how unusual this is, but I am finding in the Stagville accounts many instances of debtors trading credits among themselves, using them as a kind of currency unconnected to store purchases; also, instances of someone buying an item for another debtor, and even instances of cooperative purchase or credits. Again, these don’t fit neatly into a standardized recording structure, hence the need for something that is more flexible than a database or spreadsheet, but which nevertheless retains some of the qualities of those kinds of documents. I am as interested in Solomon’s relations with Britain, Mark, Sam, and Ben, as I am in his relationship to R. Bennehan & Son.

At the moment, I have to choose between capturing the qualities of this text as a physical document, or capturing the information that the text contains. It is doubtless significant that Ned’s and Miller George’s entries are off-set here. I don’t want to lose this information in an effort to fit these transactions into a one-size-fits-all structure, such as a database. Likewise, some accounts (like Davy’s, here) are reconciled frequently, others run for long periods of time without a full accounting of what is owed or credited. It will be important to be able to record interim calculations as well as individual debits and credits.

Once digitized, the resulting product will allow users easily to identify seasonal patterns in purchasing, follow individual shoppers, or discover the popularity of store-bought clothing over time, for example. Such resources can reach audiences with different levels of expertise or interest and provide them with rich, attractive materials for their own use, or let them explore the end result as a virtual museum to complement the physical museum experience. Users could easily search on characteristics of the transactions, such as individual account holder, item, or date, to independently answer their own questions about plantation life and modern consumerism. This exploration may even take place on-site. Historic Stagville has had great success with their genealogical database and the staff and board are eager to work together to develop more resources to share with their visitors and other stakeholders, such as the Stagville Descendants Council, an African American heritage group.

My aim is to open transcription up to include friends of, and visitors to, Stagville State Historic Site. My time in the museum world largely predates the blossoming of the digital humanities, but I do know how compelling interactive experiences can be, and that audiences understand and appreciate knowledge so much more when they have a hand in its creation (Smith 2014).

There is no conclusion. This project is an ongoing effort and we feel fortunate to engage with a community of like-minded researchers before we finalize the protocols for transcription and before Ben does additional programming for FromThePage. We have come to this meeting to learn from the successes, mistakes, and experience of others and look forward to many fruitful exchanges with you all.

WORKS CITED

Anderson, Jean Bradley

1985 Piedmont Plantation: the Bennehan-Cameron family and lands in North Carolina. Durham, North Carolina: Historic Preservation Society of Durham.

Heath, Barbara J.

2004 Engendering Choice: Slavery and Consumerism in Central Virginia. In Engendering African American Archaeology: A Southern Perspective. J.E. Galle and A.L. Young, eds. Pp. 19-38. Knoxville: The University of Tennessee Press.

Martin, Ann Smart

1993 Buying into the world of goods: Eighteenth-century consumerism and the retail trade from London to the Virginia frontier. Ph.D. dissertation, History, The College of William and Mary.

McDuffie, D. G.

1890 Map of Honorable Paul C. Cameron's Land on Flat, Eno, and Neuse Rivers in Durham, Wake, and Granville Counties, March 1890. http://dc.lib.unc.edu/cdm/singleitem/collection/00133/id/12258: Manuscript map in the Southern Historical Collection, University of North Carolina at Chapel Hill.

Smith, Monica L.

2014 Citizen Science in Archaeology. American Antiquity 79(4):749-762.

Day of DH 2015

noreply@blogger.com (Ben W. Brumfield) — Tue, 19 May 2015 13:19:00 +0000

For the fourth year, I'm participating in the Day of DH.

You can follow my day at the Day of DH blog.

Best Practices at Engaging the Public at CCLA

noreply@blogger.com (Ben W. Brumfield) — Fri, 08 May 2015 17:05:00 +0000

This is the text of my talk at the best practices panel at the Crowd Consortium for Libraries and Archives meeting Engaging the Public on May 8, 2015.

One caveat: most of my background is in crowdsourced manuscript transcription, though with the development of FromThePage 2 I've become involved in the related fields of collaborative document translation and crowd-sourced OCR correction. I hope that this is useful to non-textual projects as well.

The best practice I'd like to talk about is returning the product of crowd-sourcing to the volunteers that produced it.

What do I mean by product?

I'm not talking about what project managers consider the final product, whether that be item-level finding aids or peer-reviewed papers in the scholarly press. I'm talking about the raw product – the actual work that comes out of a volunteer's direct effort, or the efforts of their fellow volunteers – the transcript of a letter, the corrected text of a newspaper article, the translated photo captions, the carefully researched footnotes and often personal comments left on pages.

Why?

First, it's the right thing to do. Yesterday we talked about reciprocity and social justice. An older text says “Thou shalt not muzzle the oxen that tread out the corn.”

Crowdsourced transcription projects vary a lot on this. For wiki-like systems, displaying volunteer transcripts is built into the system – I know that's the case for FromThePage, TranscribeBentham and WikiSource, and suspect the same applies to Scripto and DIYHistory. For others, users can't even see their own contributions after they have submitted them. However, the Smithsonian Institution Transcription Center actually added this feature on purpose – the team implementing the center added the ability for users to download PDFs of transcribed documents specifically because they felt it was the Right Thing to Do.

Now that I've quoted the Bible, let's talk about purely instrumental reasons crowdsourcing projects should return volunteers' labor to them.

Incentives

For one thing, exposing the raw data early can better align our projects with the incentives that motivate many volunteers. Most volunteers are not participating because of their affiliation with an institution, nor because they treasure clean library metadata – at least not primarily! What keeps them coming back and contributing is their connection to the material – an intrinsic motivation of experiencing life as a bird-watcher in the 1920s, of marching alongside a Civil War soldier as they transcribe observation cards or diaries.

We should expose the texts volunteers have worked on in ways that are immediately usable to them – PDFs they can print out, texts they can email, URLs they can post on Facebook—to show their friends and families just what they've been up to, and why they're so excited to volunteer.

In some cases this may provide extrinsic rewards project managers can't envision. One of the first projects I worked on—the Zenas Matthews diary of the Mexican-American War—attracted a super-volunteer early on who transcribed the entire diary in two weeks. When I interviewed Scott Patrick, I learned that the biggest reward we could provide – the thing he'd treasure above badges or leader boards – would be the text itself in a printable and publishable format. You see, Mr. Patrick's heritage organization formally recognizes members who have written books, including editions of primary sources. His contribution to the project certainly matched his fellows' for quality, but access to a usable form of the text—the text he'd transcribed himself—was the thing that stood in his way.

Recruitment

Exposing raw transcripts online during the crowdsourcing process can actually enhance recruitment to crowd-sourcing projects. I've seen this in a personal project I worked on, in which one super-volunteer found the project by Googling his own name. You see, a previous volunteer had transcribed a lot of material that mentioned a letter carrier named Nat Wooding. So when Nat Wooding did a vanity search, he found the transcribed diaries, recognized the letter carrier as his great-uncle, and became a major contributor to the project. Had the user-generated transcripts been locked away for expert review, or even published online somewhere outside of the crowdsourcing tool, we would have missed the contributions of a new super-volunteer.

Engagement

For the past three years, I've been involved with a non-profit called Free UK Genealogy. They have volunteers around the world transcribe genealogical records using offline, spreadsheet-like tools so that they can be searched on a freely accessible website.

I spent several months building a new system for crowd-sourced transcription of parish registers, but encountered very little enthusiasm—actually some outright opposition—from the most active volunteers. They were used to their spreadsheets, and saw no value at all to changing what they were doing.

Eventually, we switched from improving the transcription tool-chain to improving the delivery system. We re-wrote the public-facing search engine from scratch, focusing on the product visible to the volunteers and their communities. When we launched the site in April, it received the most positive reviews of any software redesign I've been involved with in two decades in the industry. Best of all—although the time frame is too short to have hard numbers—the volunteer community seems to have been reinvigorated, as the FreeREG2 database passed 32 million records at the beginning of the month.

So that's my best practice: expose volunteer contributions online, within your crowdsourcing system, as they are produced. It will improve the quality and productivity of the project, and it's the right thing to do.

Collaborative Digitization at ALA 2014

noreply@blogger.com (Ben W. Brumfield) — Sun, 06 Jul 2014 19:01:00 +0000

This is a transcript of the talk I gave at the Collaborative Digitization SIG meeting at the American Library Association annual meeting on June 28, 2014 in Caesar's Palace casino in Las Vegas. I was preceded by Frederick Zarndt delivering his excellent talk on Crowdsourcing, Family History, and Long Tails for Libraries, which focused particularly on newspaper digitization and crowdsourced OCR correction. (See Laura McElfresh's notes [below] for a near-transcript of his talk.)

I'd like to thank Frederick for a number of reasons, one of them being that I don't need to define crowdsourcing, which gives me the opportunity to be a little more technical.

Before we start, I'd just like to make a quick note that all of the slides, the audio files in MP3 format, and a full transcript will be posted at my blog.

I can also direct you to the notes taken by Laura McElfresh [see pp. 19-22] over there who does an amazing job at these [conferences].

Finally, if you tweet about this, there's my handle.

Okay, so we've talked about OCR correction. What's the difference between OCR correction and manuscript transcription? Why would people transcribe manuscripts -- isn't OCR good enough?

I'd like to go into that and talk about the [effectiveness] of OCR on printed material versus handwritten materials.

We're going to go into detail on the results of running Tesseract--which is a popular, open-source OCR tool--on this particular herbarium specimen label.

I chose this one because it's got a title in print up here at the top, and then we've got a handwritten portion down here at the bottom.

So how does Tesseract do with these pieces?

With the print, it does a pretty good job, right? I mean, even though this is sort of an antique typeface, really every character is correct except that this period over here--for some reason--is OCRed as a back-tick.

So it's getting one character wrong out of--fifty, perhaps?

So how about the handwritten portion? What do you get when you run the same Tesseract program on that?

So here's the handwritten stuff, and the results are -- I'm actually pretty impressed -- I think it got the "2" right.

So in this case it got one character right out of the whole thing. So this is actually total garbage.

And my argument is that the quantitative difference in accuracy of OCR software between script versus print actually results in a qualitative difference between these two processes.

This has implications.

One of them is on methodology, which is that--as we've demonstrated--we can't use software to automatically transcribe (particularly joined-up, cursive) writing. You have to use humans.

There are a couple of other implications too, that I want to dive into a bit deeper.

One of them is the goal of the process. In the case of OCR correction, we're talking about improving accuracy of something that already exists. In the case of manuscript transcription, we're actually talking about generating a (rough) transcript from scratch.

The second one comes down to workflow, and I'll go into that in a minute.

Let's talk about findability.

Right now, if you put this page online--this manuscript image--no-one's going to find it. No-one's going to read it. Because Google cannot crawl it -- these are not words to Google, these are pixels. And without a transcript, without that findability, you miss out on the amazing serendipity that is a feature of the internet age. We don't have the serendipity of spotting books shelved next to each other anymore, but we do have the serendipity of--in this case--of a retired statistical analyst named Nat Wooding doing a vanity search on his name. And encountering a transcript of this diary--my great-great grandmother's diary--mentioning her mailman, Nat Wooding--and realizing that this is his great uncle.

Having discovered this, he started contributing to the project--not financially, but he went through and transcribed an entire year's worth of diaries. So he's contributing his labor.

Other people who've encountered these have made different kinds of contributions. These diaries were distributed on my great-great grandmother's death among her grandchildren. So they were scattered to the four winds. After putting these online, I received a package in the mail one day containing a diary from someone I'd never met, saying "looks like you'll do more with this than I will." So this element of user engagement in this case is bringing the collection back together.

Let's talk about the implications on workflow.

This is--I'm not going to say a typical--OCR correction workflow. The thing that I want to draw your attention to is that OCR correction of print can be done at a very fine grain. The National Library of Finland's Digitalkoot project is asking users to correct a small block of text: a single word, a single character even. This lends itself to gamification. It lends itself to certain kinds of quality control, in which maybe you show the same image to multiple people and compare them to see if they match.
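The quality-control idea of showing one snippet to several people and comparing their answers can be sketched as a simple agreement check. This is a generic Python illustration, not any particular project's implementation; the function name, the `min_agreement` threshold, and the sample readings are assumptions made for the example.

```python
from collections import Counter

def consensus(readings, min_agreement=2):
    """Return the reading most volunteers agree on, or None without agreement.

    Readings are compared after trimming whitespace. The min_agreement
    threshold is an arbitrary choice for this sketch.
    """
    counts = Counter(r.strip() for r in readings if r.strip())
    if not counts:
        return None
    ranked = counts.most_common(2)
    text, votes = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == votes
    if votes < min_agreement or tied:
        return None  # send the snippet to another volunteer or a reviewer
    return text

# Invented readings of the same word image from different volunteers.
print(consensus(["Helsinki", "Helsinki ", "Helsinld"]))
print(consensus(["kirje", "kirie"]))
```

A check like this only works when each task is small enough that independent answers can be expected to match exactly, which is why it suits word-level OCR correction better than page-level transcription.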

That really doesn't work very well with handwritten text, because readers have to get used to a script. Context is really important! And you find this when you put material online: people will go through and transcribe a couple of pages, then say "Oh, that's a 'W'!" And they go back and [correct earlier pages].

I want to tell the story of Page 19. This was a project that was a collaboration between me (and the FromThePage platform) and the Smith Library Special Collections at Southwestern University in Georgetown (Texas). They put a diary of a Texas volunteer in the Mexican-American War online--his name was Zenas Matthews. They found one volunteer who came online and transcribed the whole thing. He added all these footnotes. He did an amazing job.

But let's look at the edit history of one page, and what he did.

We put the material online in September. Two months later, he discovers it, and transcribes it in one session in the morning. Then he comes back in the afternoon and makes a revision to the transcript.

Time passes. Two weeks go by, and he's going back [over the text]. He makes six more revisions in one sitting on December 8, then he makes two more revisions on the next morning. Then another eight months go past, and he comes back in August in the next year, because he's thought of something -- he's reviewing his work and he improves the transcription again. He ends up with [an edition] that I'd argue is very good.

Well, this is very different from the one-time pass of OCR correction. This is, in my opinion, a qualitative difference. We have this deep, editorial approach with crowdsourced transcription.

I'm a tool maker; I'm a tool reviewer, and I'm here to try to give you some hands-on advice about choosing tools and platforms for crowdsourced transcription projects.

Now, I used to go through and review [all of the] tools. Well, I have some good news, which is that there are a lot of tools out there nowadays. There are at least thirty-seven that I'm aware of. Many of them are open source. The bad news is that there are thirty-seven to choose from, and many of them are pretty rough.

So instead of talking about the actual tools, I'm going to direct you to a spreadsheet -- a Google Doc that I put together that is itself crowdsourced. About twenty people have contributed their own tools, so it's essentially a registry of different software platforms for [crowdsourced transcription].

Instead, I'm going to discuss selection criteria -- things to consider when you're looking at launching a crowdsourced transcription project.

The first selection criterion is to look at the kind of material you're dealing with. And there are two broad divisions in source material for transcription.

This top image is a diary entry from Viscountess Emily Anne Strangford's travels through the Mediterranean in the 1850s. The bottom image is a census entry.

These are very different kinds of material. A plaintext transcript that could be printed out and read in bed is probably the [most appropriate purpose] for a diary entry. Whereas, for a census record, you don't really want plaintext -- you want something that can go into a structured database.

And there are a limited number of tools that nevertheless have been used very effectively to transcribe this kind of structured data. FamilySearch Indexing is one that we're all familiar with, as Frederick mentioned it. There are a few others from the Citizen Science world: PyBossa comes from the Open Knowledge Foundation, and Scribe and Notes From Nature both come out of GalaxyZoo. [The Zooniverse/Citizen Science Alliance.] I'm going to leave those, and concentrate on more traditional textual materials.

One of the things you want to ask is, What is the purpose of this transcript? Is mark-up necessary? These kinds of texts, as we're all aware, are not already edited, finished materials.

Most transcription tools which exist ask users for plain-text transcripts, and that's it. So the overwhelming majority of platforms support no mark-up whatsoever.

However, there are two families of mark-up [support] which do exist. One of them is a subset of TEI markup. It's part of this TEI Toolbar which was developed by Transcribe Bentham for their own platform [the Bentham Transcription Desk] which is a modification of MediaWiki. It then was later repurposed by the 1916 Letters project and used on top of a totally different software stack, ~~the NARA Transcribr Drupal module~~ [actually DIYHistory]. And what it does is give users a small series of buttons which can be used to mark up features within a text. So this is really useful if you're dealing with marginalia, with additions and deletions within the text, and you want to track all that. Not everybody wants to track all that, but if that's the kind of purpose that you have, you'll want to look at in-page mark-up.

The other form of mark-up is one that I've been using in FromThePage, using wiki-links to do subject identification within the text. [2-3 sentences inaudible: see "Wikilinks in FromThePage" for a detailed presentation given at the iDigBio Original Sources Digitization Workshop.]

What this means is that if users encounter "Irvin Harvey" and it's marked up like this:

The tool will automatically generate an index that shows every time that Irvin Harvey was mentioned within the texts, or read all the pages mentioning Irvin Harvey. You can actually do network analysis and other digital humanities stuff based on [mining the subject mark-up].
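To make the index idea concrete, here is a minimal Python sketch of how subject mark-up could be mined into an index. It is an illustration, not FromThePage's own code: it assumes links of the form `[[Irvin Harvey|Irvin]]` (canonical name, then the verbatim text), and the page labels and diary sentences are invented.

```python
import re
from collections import defaultdict

# Assumed link syntax: [[Canonical Name|verbatim text]] or [[Canonical Name]].
WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def build_index(pages):
    """Map each canonical subject to the pages that mention it.

    `pages` maps a page label to its transcript text.
    """
    index = defaultdict(list)
    for label, text in pages.items():
        for match in WIKI_LINK.finditer(text):
            canonical = match.group(1).strip()
            verbatim = (match.group(2) or canonical).strip()
            index[canonical].append((label, verbatim))
    return dict(index)

# Invented sample transcripts.
pages = {
    "page 1": "[[Irvin Harvey|Irvin]] came by this morning.",
    "page 2": "Wrote to [[Irvin Harvey|Mr. Harvey]] about the fence.",
}
for subject, mentions in build_index(pages).items():
    print(subject, mentions)
```

Because both mentions resolve to the same canonical subject, they land under one index entry even though the verbatim text differs from page to page. The same table of subjects and pages is the raw material for co-occurrence or network analysis.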

So that's a different flavor of mark-up to consider.

Another question to ask is, how open is your project? Right now I know of projects that are using my own FromThePage tool entirely for staff to use internally.

There are others in which they have students working on the transcripts. And in some cases, this is for privacy reasons. For example, Rhodes College Libraries is using FromThePage to transcribe the diaries of Shelby Foote. Well, Shelby Foote only died a few years ago. [His diaries] are private. So this installation is entirely internal. The transcriptions are all done by students. I've never seen it -- I don't have access to it because it's not on the broad Internet.

Then there's the idea of leveraging your own volunteers on-site, with maybe some [ancillary] openness on the Internet. San Diego Natural History Museum is doing this with the people who come in, and ordinarily will volunteer to clean fossils or prepare specimens for photographs. Well, now they're saying Can you transcribe these herpetology field notes?

So these kinds of platforms are not only wide-open crowdsourcing tools; they can be private, and you should consider this. In some cases, the same platform can support both private projects and crowdsourced projects simultaneously, so you can get all of your data in the same place. [One sentence inaudible.]

Branding! Branding may be very important.

Here are a couple of platforms, with screenshots of each.

The first one is the French-language version of Wikisource. Wikisource is a sister project to Wikipedia that was spun off around 2003 that allows people to transcribe documents and do OCR correction both. This is being used by the Departmental Archives of Alpes-Maritimes to transcribe a set of journals of episcopal visits. The bishop in the sixteenth century would go around and report on all the villages [in his diocese], so there's all this local history, but it's also got some difficult paleography.

So they're using Wikisource, which is a great tool! It has all kinds of version control. It has ways to track proofreading. It does an elegant job of putting together individual pages into larger documents. But, do you see "Departmental Archives of Alpes-Maritimes" on this page? No! You have no idea [who the institution is]. Now, if they're using this internally, that may be fine -- it's a powerful tool.

By contrast, look at the Letters of 1916. [Three sentences inaudible.] This is public engagement in a public-facing site.

Most platforms are somewhere between the two.

Integration: Let's say you've just done a lot of work to scan a lot of material, gather item-level metadata, and you've [ingested it] into CONTENTdm or another CMS. Now you want to launch a crowdsourcing project. Often, the first thing you have to do is get it all back out again and put it into your crowdsourcing platform.

So you need to look at integration. You need to ask the questions, How am I going to get data into the transcription platform? How am I going to get data back out? These may be totally different things: I know of one project that's trying to get data from Fedora into FromThePage, then trying to get it out of FromThePage by publishing to Omeka. There's a different project that wants to get data from Omeka into FromThePage. But these are totally different code paths! They have nothing to do with each other, believe it or not. So you really have to ask detailed questions about this.

Here are a few of the tools that exist, with what they support. (Or what they plan to support -- last week I was contacted about Fedora support and CONTENTdm support for FromThePage, one on Wednesday and one on Thursday, so if anyone has any advice on integration with those systems, please let me know.)

Hosting: Do you want to install everything on-site? Do you have sysadmins and servers? Is this actually a requirement? Or do you want this all hosted by someone else?

Right now you have pretty limited options for hosting. Notes from Nature and the GalaxyZoo projects host everything themselves. Wikisource and FromThePage can be either local or hosted. Everything else, you've got to download and get running on your servers.

Finally, I'd like to talk a little bit about asking yourself, what are yardsticks for success?

If you're doing this for volunteer engagement, what does successful engagement look like? I know of one project that launched a trial in which they put some material from 19th century Texas online. One volunteer found this and dove into it. He transcribed a hundred pages in a week, he started adding footnotes -- I mean he just plowed through this. After a couple of weeks, the librarians I was working with cancelled the trial, and I asked them to give me details. One of the things that they said was, We were really disappointed that only one volunteer showed up. Our goal for public engagement was to do a lot of public education and public outreach, and we wanted to reach out [to] a lot of people.

[For them,] a hundred pages transcribed by one volunteer is a failure compared with one page each transcribed by ten volunteers. So what are your goals?

Similarly, if you're using a platform that is a wiki-like platform--an editorial platform--you'll get obsessive users who will go back and revise page 19 over and over again. That may be fine for you. Maybe you want the highest quality transcripts and you don't mind that there's sort of spotty coverage because users come in and only transcribe the things that really interest them.

Other systems try to go for coverage over quality and depth. ProPublica developed the transcribable Ruby on Rails plugin for research on campaign contributions. They intentionally designed their tool with no back button -- there's no way for a user to review what they did. And they wrote a great article about this which is very relevant to this conference venue: it's called "Casino-Driven Design: One Exit, No Windows, Free Drinks". So for them, the page 19 situation would be an absolute failure, while for me I'm thrilled with it. So again there's this trade off of quality versus quantity in product as well as in engagement.

[Audio to follow.]

Wikilinks in FromThePage

noreply@blogger.com (Ben W. Brumfield) — Fri, 14 Mar 2014 18:47:00 +0000

From March 10-12, I got to participate in the iDigBio Original Sources Digitization Workshop, a gathering of natural history collections managers, archivists, and technologists. Although the focus of digitization within natural history has been on specimens or specimen labels, this workshop sought to address the challenges and opportunities involved in digitizing ledgers, field notes, and other non-specimen data. As usual for iDigBio events, the workshop was spectacular.

Carolyn Sheffield chaired a panel (video recording) on crowdsourcing which included Rob Guralnik discussing Notes From Nature, Christina Fidler talking about the Grinnell field notes on FromThePage, my talk, and a long, valuable discussion among all participants. My presentation covered the data model and uses of wiki links as I'm using them in FromThePage.

Video, slides, and transcript are below:

"From The Page" - Ben Brumfield from iDigBio on Vimeo.

I'm Ben Brumfield. You saw a little bit about FromThePage in Christina Fidler's presentation, so I wanted to talk about the internals -- the design and the data structures behind some of the things that make this a little bit different from Notes From Nature or the NARA Transcribr Drupal module.

This is the transcription screen. You've seen this with Christina, so I'll probably go over this pretty quickly. This is a full-text transcription, not individual records like you get with Notes From Nature.

The reason for that is that FromThePage was built to be a wiki-like tool, purpose-built for creating amateur editions. So we've got a text and we want to create an edition from the text that can then be re-used, printed, and analyzed.

I say "amateur" editions because we're not dealing with the kinds of things that textual scholars in the humanities are dealing with, where they're trying to compare different variant manuscript versions of Chaucer. [By contrast, we] have something that's very straightforward, and we're interested in some fairly simple annotations.

It's purpose-built -- free-standing on MySQL and Ruby on Rails, so it's not integrated with MediaWiki or anything like that.

So who's using it?

[FromThePage] was built originally for a set of my great-great grandmother's diaries.

Since then it's been used for military diaries by libraries and history departments.

It's been used for literary diaries--in this case for Shelby Foote's diaries--for literary drafts, and for punk rock fanzines. (Which is kind of awesome!)

So what does that have to do with the people in this room and the kind of material [we're working with]?

Here's an example: This is an 1859 journal from an expedition in which someone went out and made a number of observations and collected some things to bring back with them. There are scholars interested in mining those.

But it's not a naturalist expedition. This is Viscountess Emily Anne Smyth Strangford, who in this case is touring the Mediterranean and visiting a lot of classical monuments. The folks at the Duke Computational Classics Collaboratory are interested in finding all the places in which she recorded Latin and Greek inscriptions, coming up with her itinerary, and figuring out how [that data] connects to the objects her father-in-law had collected for the British Museum twenty years earlier.

So there's a lot of correspondence, I tend to think, with field notes.

The San Diego Natural History Museum started using FromThePage for field books in 2010. They're still working on the project.

They've identified ten thousand subjects worth classifying in their system.
Individual pages have been edited twenty-four thousand times. And this goes back to the wiki-like approach -- people transcribe a page, and then they revisit it. They make a number of edits to a page as they get comfortable with the handwriting.
And then they've linked individual observations, species mentioned, and people in the field notes to those subjects forty-two thousand times.

Then there are a couple of other projects working with field notes. [Museum of Vertebrate Zoology] obviously is in trial, and [the Museum of Comparative Zoology] and Missouri Botanical Gardens are just evaluating the software right now.

So, what is a wiki link?

Any of us who've edited Wikipedia may be used to this. I followed the same syntax [in FromThePage].

What we have here is a set of double square braces with the canonical name of the subject--this could be a formatted date, this could be a full name that's spelled out--and then the text that's actually used within the verbatim transcript.

So our example here -- this is when Grinnell meets Klauber. The field note actually says "L. M. Klauber", so the person transcribing has expanded this out to "Laurence M. Klauber". So we have the ability to handle variance in references to Klauber, but still identify them as Klauber.
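
To make the syntax concrete, here is a minimal sketch in Python. FromThePage itself runs on Ruby on Rails, so this is not its actual code: the regular expression, the function names, and the sample sentence are all assumptions made for illustration. A link without a pipe is assumed to use the canonical name as its verbatim text.

```python
import re

# Matches [[Canonical Name|verbatim text]] or [[Canonical Name]].
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def parse_links(transcript):
    """Return (canonical_name, verbatim_text) pairs found in a transcript."""
    links = []
    for match in WIKI_LINK.finditer(transcript):
        canonical = match.group(1).strip()
        # Without a pipe, the canonical name doubles as the verbatim text.
        verbatim = (match.group(2) or canonical).strip()
        links.append((canonical, verbatim))
    return links

def strip_markup(transcript):
    """Recover the verbatim transcript by keeping only the displayed text."""
    return WIKI_LINK.sub(lambda m: m.group(2) or m.group(1), transcript)

# Invented sample sentence, not a quotation from the field notes.
sample = "Met [[Laurence M. Klauber|L. M. Klauber]] at the museum."
print(parse_links(sample))
print(strip_markup(sample))
```

Keeping both halves of the link is what lets the verbatim transcript and the index of subjects be derived from the same text.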

Technically speaking, what's behind one of these wiki links?

There are a lot of tables in this database.

We know that there's this page that Klauber is mentioned on. It's S1 Page 3 in the Grinnell field notes that MVZ has online.
We've got a subject which is Laurence M. Klauber.
The subject is categorized as a person, which can be used for analysis and filtering, like Christina showed you.
And then the individual link between the page and the subject, that contains the variation, is also stored.
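
A rough sketch of those four pieces as Python dataclasses may help. The class and field names here are hypothetical and do not reflect the real FromThePage table or column names; only the relationships (page, subject, category, and a link that carries the variation) come from the description above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Page:
    work: str    # the work the page belongs to
    title: str   # e.g. "S1 Page 3"

@dataclass(frozen=True)
class Subject:
    canonical_name: str   # e.g. "Laurence M. Klauber"
    category: str         # e.g. "person", usable for analysis and filtering

@dataclass(frozen=True)
class Link:
    page: Page
    subject: Subject
    display_text: str     # the variation actually written on the page

page = Page(work="Grinnell field notes", title="S1 Page 3")
klauber = Subject(canonical_name="Laurence M. Klauber", category="person")
links = [Link(page, klauber, "L. M. Klauber")]

def pages_mentioning(subject, links):
    """All pages that carry at least one link to the subject."""
    return [link.page for link in links if link.subject == subject]

def variations_of(subject, links):
    """Every distinct spelling that has been linked to the subject."""
    return sorted({link.display_text for link in links if link.subject == subject})
```

Storing the variation on the link rather than on the subject is the design choice that makes the uses described next possible.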

So there are a lot of things you can do with that.

You can show all the pages that mention Laurence M. Klauber, and read the pages in context or just get a listing of them.
More helpfully, as you're transcribing we can mine those links to automatically suggest mark-up. So the next time we encounter "L. M. Klauber", we can push a button and that will automatically expand the mark-up of "L. M. Klauber" to "[[Laurence M. Klauber|L. M. Klauber]]".
You can also feed this to full-text searches. So if you've got a lot of plain-text transcripts which contain Laurence M. Klauber, we can automatically populate the search with those variations, creating an OR query with "Klauber", "L. M. Klauber".
And then we can mine the mark-up for correspondences [between subjects] as Christina showed.
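
The mark-up suggestion and the search expansion in the list above can both be sketched from a simple map of known variations. This is an illustration in Python under stated assumptions, not the tool's implementation: the function names are hypothetical, the quoted-phrase OR syntax is a generic one rather than that of any particular search engine, and the sample sentences are invented.

```python
import re

def suggest_markup(text, known_links):
    """Wrap known variations in wiki links, leaving existing links alone.

    known_links maps verbatim text to the canonical subject name.
    """
    if not known_links:
        return text
    # Longest variations first, so "L. M. Klauber" wins over a bare "Klauber".
    variations = sorted(known_links, key=len, reverse=True)
    pattern = re.compile(
        r"\[\[.*?\]\]|" + "|".join(re.escape(v) for v in variations)
    )

    def replace(match):
        found = match.group(0)
        if found.startswith("[["):
            return found  # already marked up
        return f"[[{known_links[found]}|{found}]]"

    return pattern.sub(replace, text)

def or_query(canonical, variations):
    """Build a quoted OR query from a subject and its known variations."""
    terms = [canonical] + [v for v in variations if v != canonical]
    return " OR ".join(f'"{term}"' for term in terms)

known = {
    "L. M. Klauber": "Laurence M. Klauber",
    "Klauber": "Laurence M. Klauber",
}
print(suggest_markup("Saw L. M. Klauber; Klauber left early.", known))
print(or_query("Laurence M. Klauber", ["L. M. Klauber", "Klauber"]))
```

Matching existing links in the same pass is what keeps the suggestion from nesting a new link inside text that has already been marked up.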

The last thing you can do with it is export.

Here is a TEI-XML export of the Joseph Grinnell notes. This is useful for interchange, but the most important thing this does is that it allows amateurs to create well-formatted, TEI P5-compliant XML. And it will handle one of the things that's very hard about creating TEI in an XML editor, which is associating reference strings with their entries over in the TEI header, which describes who the people are outside the text.
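
The association between a reference string and its header entry can be sketched as follows. This Python fragment is only an illustration: the choice of the TEI elements rs, listPerson, person, and persName, the generated identifiers, and the sample sentence are assumptions, and the real export may be structured differently. For brevity it treats every subject as a person.

```python
import re
from xml.sax.saxutils import escape, quoteattr

WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)\|([^\[\]]+)\]\]")

def to_tei(transcript):
    """Return (body_fragment, header_fragment) for one wiki-linked transcript."""
    ids = {}    # canonical name -> xml:id, assigned in order of first mention
    parts = []
    pos = 0
    for m in WIKI_LINK.finditer(transcript):
        canonical, verbatim = m.group(1), m.group(2)
        xml_id = ids.setdefault(canonical, f"S{len(ids) + 1}")
        parts.append(escape(transcript[pos:m.start()]))
        # The reference string points at the header entry by its xml:id.
        parts.append(f"<rs ref={quoteattr('#' + xml_id)}>{escape(verbatim)}</rs>")
        pos = m.end()
    parts.append(escape(transcript[pos:]))
    body = "<p>" + "".join(parts) + "</p>"
    people = "".join(
        f"<person xml:id={quoteattr(xml_id)}><persName>{escape(name)}</persName></person>"
        for name, xml_id in ids.items()
    )
    header = f"<listPerson>{people}</listPerson>"
    return body, header

body, header = to_tei("Met [[Laurence M. Klauber|L. M. Klauber]] & party.")
print(body)
print(header)
```

Generating both fragments from the same pass is what spares the transcriber from keeping identifiers in the text and the header in step by hand.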

This is a CSV export of the Grinnell field notes. Basically this is every observation and every person who's mentioned, exported as a CSV file with links back to the pages and URLs at which those pages can be found. This is the kind of thing that perhaps could be ingested into [museum collection management database] Arctos.
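
A sketch of such an export using Python's standard csv module is shown here. The column names, the sample row, and the example.org URL are placeholders chosen for illustration, not the columns of the real export.

```python
import csv
import io

# One row per link: subject, category, text as written, page, page URL.
# The URL is a placeholder, not a real FromThePage address.
rows = [
    ("Laurence M. Klauber", "person", "L. M. Klauber",
     "S1 Page 3", "https://example.org/pages/3"),
]

def export_csv(rows):
    """Serialize link rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["subject", "category", "text", "page", "url"])
    writer.writerows(rows)
    return buffer.getvalue()

print(export_csv(rows))
```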

Future plans:

We're going to be doing more CMS integrations. We're working on Omeka. The Internet Archive is done. There are a couple of grant applications that involve hooking FromThePage up to Fedora Commons.

We also really want to contextualize links in time and place. We want the ability for people to define where the person writing the journal is when they're writing, and then to apply those geotags and chronotags to the references. So you could map when species were mentioned. You could extract a visual itinerary.

We need more formatting options. One of our volunteers has found all kinds of crazy editorial issues for handling strike-outs and things like that.

And the last thing that we're looking for is more projects.

Code and Conversations in 2013

noreply@blogger.com (Ben W. Brumfield) — Wed, 01 Jan 2014 05:30:00 +0000

It's often hard to explain what it is that I do, so perhaps a list of what I did will help. Inspired by Tim Sherratt's "talking" and "making" posts at the end of 2012, here's my 2013.

Code

I work on a number of software projects, whether as contract developer, pro bono "code fairy", or product owner.

FromThePage

It's been a big year for FromThePage, my open-source tool for manuscript transcription and annotation. We started work upgrading the tool to Rails 3, and built a TEI Export (see discussion on the TEI-L) and an exploratory Omeka integration. Several institutions (including University of Delaware and the Museum of Vertebrate Zoology) launched trials on FromThePage.com for material ranging from naturalist field notes to Civil War diaries. Pennsylvania State University joined the ranks of on-site FromThePage installations with their "Zebrapedia", transcribing Philip K. Dick's Exegesis -- initially as a class project and now as an ongoing work of participatory scholarship.

One of the most interesting developments of 2013 was that customizations and enhancements to FromThePage were written into three grant applications. These enhancements--if funded--would add significant features to the tool, including Fedora integration, authority file import, redaction of transcripts and facsimiles, and support for externally-hosted images. All these features would be integrated into the FromThePage source, benefiting everybody.

Two other collaborations this year promise interesting developments in 2014. The Duke Collaboratory for Classics Computing (DC3) will be pushing the tool to support 19th-century women's travel diaries and Byzantine liturgical texts, both of which require more sophisticated encoding than the tool currently supports. (Expect Unicode support by Valentine's Day.) The Austin Fanzine Project will be using a new EAC-CPF export which I'll deliver by mid-January.

OpenSourceIndexing / FreeREG 2

Most of my work this year has been focused on improving the new search engine for the twenty-six million church register entries the FreeREG organization has assembled in CSV files over the last decade and a half. In the spring, I integrated the parsed CSV records into the search engine and converted our ORM to Mongoid. I also launched the Open Source Indexing GitHub page to rally developers around the project and began collecting case studies from historical and genealogical organizations.

In May, I built a parser for historical dates into the search engine I'm building for FreeREG. It handles split dates like "4 Jan 1688/9" and illegible date portions in UCF like "4 Jan 165_", preserving the verbatim transcription while still searching and sorting correctly. Eventually I'll incorporate this into an antique_date gem for general use.
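
To show what such a parser has to cope with, here is a small sketch in Python. It is not the FreeREG parser or the planned antique_date gem: the function name, the returned dictionary, and the rule of sorting split dates by their later (New Style) year are assumptions. An underscore is read as one illegible digit, so the date becomes a range.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

DATE = re.compile(r"^(\d{1,2}) ([A-Za-z]{3}) (\d[\d_]{3})(?:/(\d{1,2}))?$")

def parse_antique_date(verbatim):
    """Keep the verbatim text and add the earliest and latest dates it could mean."""
    m = DATE.match(verbatim.strip())
    if not m or m.group(2) not in MONTHS:
        raise ValueError(f"unrecognized date: {verbatim!r}")
    day, month = int(m.group(1)), MONTHS[m.group(2)]
    year, split = m.group(3), m.group(4)
    if split:
        if "_" in year:
            raise ValueError("split dates with illegible digits are not handled")
        # "1688/9" means 1688 Old Style, 1689 New Style; sort by the latter.
        old_style = int(year)
        new_style = int(year[:-len(split)] + split)
        if new_style < old_style:      # e.g. "1699/00" should give 1700
            new_style += 10 ** len(split)
        low = high = new_style
    else:
        # Each "_" is one illegible digit: take the widest possible range.
        low = int(year.replace("_", "0"))
        high = int(year.replace("_", "9"))
    return {
        "verbatim": verbatim,
        "earliest": date(low, month, day),
        "latest": date(high, month, day),
    }

print(parse_antique_date("4 Jan 1688/9"))
print(parse_antique_date("4 Jan 165_"))
```

Returning a range rather than a single date lets a search for any year in the 1650s find the partly illegible entry, while sorting can fall back on the earliest value.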

Most of the fall was spent adding GIS search capabilities to the search engine. In fact, my last commit of the year added the ability to search for records within a radius of a place. The new year will bring more developments on GIS features, since an effective and easy interface to a geocoded database is just as big a challenge as the geocoding logic itself.
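
For readers unfamiliar with radius searches, the core calculation can be sketched with the haversine formula. This Python example is a stand-in, not the search engine's implementation (a database-backed engine would more likely use a geospatial index); the record structure is hypothetical and the coordinates are approximate.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def records_within(records, lat, lon, radius_km):
    """Filter geocoded records to those within radius_km of a point."""
    return [r for r in records
            if haversine_km(lat, lon, r["lat"], r["lon"]) <= radius_km]

# Approximate coordinates, for illustration only.
records = [
    {"place": "London", "lat": 51.5074, "lon": -0.1278},
    {"place": "Oxford", "lat": 51.7520, "lon": -1.2577},
    {"place": "Edinburgh", "lat": 55.9533, "lon": -3.1883},
]
nearby = records_within(records, 51.5074, -0.1278, 100)
print([r["place"] for r in nearby])
```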

Other Projects

In January I added a command-line wrapper to Autosplit, my library for automatically detecting the spine in a two-page flatbed scan and splitting the image into recto and verso halves. In addition to making the tool more usable, it also added support for notebook-bound books which must be split top-to-bottom rather than left-to-right.
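
One plausible heuristic for this kind of spine detection is sketched below in Python. It is not Autosplit's actual algorithm, which is not described here: it simply assumes the gutter shows up as the darkest column of pixels near the middle of a grayscale scan. The function names, the search_fraction parameter, and the tiny synthetic image are all made up for illustration.

```python
def find_spine(pixels, search_fraction=0.25):
    """Return the x position of the darkest column near the centre.

    pixels is a list of rows of grayscale values (0 = black, 255 = white).
    Only the central search_fraction of the width is examined.
    """
    width = len(pixels[0])
    centre = width // 2
    margin = max(1, int(width * search_fraction / 2))
    candidates = range(centre - margin, centre + margin + 1)
    return min(candidates, key=lambda x: sum(row[x] for row in pixels))

def split_pages(pixels, spine):
    """Split the scan into left and right halves at the spine column."""
    left = [row[:spine] for row in pixels]
    right = [row[spine:] for row in pixels]
    return left, right

# Synthetic 4 x 16 scan: light paper with a dark gutter at column 9.
scan = [[200] * 16 for _ in range(4)]
for row in scan:
    row[9] = 20

spine = find_spine(scan)
left, right = split_pages(scan, spine)
print(spine, len(left[0]), len(right[0]))
```

Under the same assumption, a notebook-bound book could be handled by looking for the darkest row instead of the darkest column.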

For the iDigBio Augmenting OCR Hackathon in February, I worked on two exploratory software projects. HandwritingDetection (code, write-up) analyzes OCR text to look for patterns characteristically produced when OCR tools encounter handwriting. LabelExtraction (code, write-up) parses OCR-generated bounding boxes and text to identify labels on specimen images. To my delight, in October part of this second tool was generalized by Matt Christy at the IDHMC to illustrate OCR bounding boxes for the eMOP project's work tuning OCR algorithms for Early Modern English books.

In June and July, I started working on the Digital Austin Papers, contract development work for Andrew Torget at the University of North Texas. This was what freelancers call a "rescue" project, as the digital edition software had been mostly written but was still in an exploratory state when the previous programmer left. My job was to triage features, then turn off anything half-done and non-essential, complete anything half-done and essential, and QA and polish core pieces that worked well. I think we're all pretty happy with the results, and hope to push the site to production in early 2014. I'm particularly excited about exposing the TEI XML through the delivery system as well as via GitHub for bulk re-use.

Also in June, I worked on a pro bono project with the Civil War-era census and service records from Pittsylvania County, Virginia which were collected by Jeff McClurken in his research. My goal is to make the PittsylvaniaCivilWarVets database freely available for both public and scholarly use. Most of the work remaining here is HTML/CSS formatting, and I'd welcome volunteers to help with that.

In November, I contributed some modifications to Lincoln Mullen's Omeka client for Ruby. The client should now support read-only interactions with the Omeka API for files, as well as being a bit more robust.

December offered the opportunity to spend a couple of days building a tool for reconciling multi-keyed transcripts produced from the NotesFromNature citizen science UI. One of the things this effort taught me was how difficult it is to find corresponding transcripts to reconcile -- a very different problem from reconciliation itself. The project itself is over, but ReconciliationUI is still deployed on the development site.
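
As an illustration of the reconciliation step alone, here is a sketch in Python of a field-by-field majority vote. This is a deliberately simple stand-in and not ReconciliationUI's logic; the field names and values are invented, and the sketch assumes the harder problem of finding which transcripts correspond has already been solved.

```python
from collections import Counter

def reconcile(transcripts):
    """Merge several keyings of one record, field by field.

    Returns (consensus, conflicts): fields where a strict majority agree,
    and fields that need a human arbitrator, with the competing values.
    """
    consensus, conflicts = {}, {}
    fields = sorted({field for t in transcripts for field in t})
    for field in fields:
        values = [t.get(field, "").strip() for t in transcripts]
        counts = Counter(v for v in values if v)
        if not counts:
            continue
        top, votes = counts.most_common(1)[0]
        if votes > len(transcripts) / 2:
            consensus[field] = top
        else:
            conflicts[field] = sorted(counts)
    return consensus, conflicts

# Three invented keyings of the same specimen label.
keyings = [
    {"collector": "A. Smith", "date": "4 Jan 1913", "locality": "Oak Creek"},
    {"collector": "A. Smith", "date": "4 Jan 1918", "locality": "Oak Creek"},
    {"collector": "A Smith", "date": "4 Jan 1915", "locality": "Oak Creek"},
]
consensus, conflicts = reconcile(keyings)
print(consensus)
print(conflicts)
```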

Conversations

February 13-15 -- iDigBio Augmenting OCR Hackathon at the Botanical Research Institute of Texas. "Improving OCR Inputs from OCR Outputs?" (See below.)

February 26 -- Interview with Ngoni Munyaradzi of the University of Cape Town. See our discussion of his work with Bushman languages of southern Africa.

March 20-24 -- RootsTech in Salt Lake City. "Introduction to Regular Expressions"

April 24-28 -- International Colloquium Itinera Nova in Leuven, Belgium. "Itinera Nova in the World(s) of Crowdsourcing and TEI".

May 7-8 -- Texas Conference on Digital Libraries in Austin, Texas. I was so impressed with TCDL when Katheryn Stallard and I presented in 2012 that I attended again this year. While I was disappointed to miss Jennifer Hecker's presentation on the Austin Fanzine Project, I was so impressed with Nicholas Woodward's talk in the same time slot that I talked him into writing it up as a guest post.

May 22-24 -- Society of Southwestern Archivists Meeting in Austin, Texas. On a fun panel with Jennifer Hecker and Micah Erwin, I presented "Choosing Crowdsourced Transcription Platforms"

July 11-14 -- Social Digital Scholarly Editing at the University of Saskatchewan. A truly amazing conference. My talk: "The Collaborative Future of Amateur Editions".

July 16-20 -- Digital Humanities at the University of Nebraska, Lincoln. Panel "Text Theory, Digital Document, and the Practice of Digital Editions". My brief talk discussed the importance of blending both theoretical rigor and good usability into editorial tools.

July 23 -- Interview with Sarah Allen, Presidential Innovation Fellow at the Smithsonian Institution. Sarah's notes are at her blog Ultrasaurus under the posts "Why Crowdsourced Transcription?" and "Crowdsourced Transcription Landscape".

September 12 -- University of Southern Mississippi. "Crowdsourcing and Transcription". An introduction to crowdsourced transcription for a general audience.

September 20 -- Interview with Nathan Raab for Forbes.com. Nathan and I had a great conversation, although his article "Crowdsourcing Technology Offers Organizations New Ways to Engage Public in History" was mostly finished by that point, so my contributions were minor. His focus on the engagement and outreach aspects of crowdsourcing and its implications for fundraising is one to watch in 2014.

September 25 -- Wisconsin Historical Society. "The Crowdsourced Transcription Landscape". Same presentation as USM, with minor changes based on their questions. Contents: 1. Methodological and community origins. 2. Volunteer demographics and motivations. 3. Accuracy. 4. Case study: Harry Ransom Center Manuscript Fragments. 5. Case study: Itinera Nova at Stadsarchief Leuven.

September 26-27 -- Midwest Archives Conference Fall Symposium in Green Bay, Wisconsin. "Crowdsourcing Transcription with Open Source Software". 1. Overview: why archives are crowdsourcing transcription. 2. Selection criteria for choosing a transcription platform. 3. On-site tools: Scripto, Bentham Transcription Desk, NARA Transcribr Drupal Module, Zooniverse Scribe. 4. Hosted tools deep-dive: Virtual Transcription Laboratory, Wikisource, FromThePage.

October 9-10 -- THATCamp Leadership at George Mason University. In "Show Me Your Data", Jeff McClurken and I talked about the issues that have come up in our collaboration to put online the database he developed for his book, Take Care of the Living. See my summary or the expanded notes.

November 1-2 -- Texas State Genealogy Society Conference in Round Rock, Texas. Attempting to explore public interest in transcribing their own family documents, I set up as an exhibitor, striking up conversations with attendees and demoing FromThePage. The minority of attendees who possessed family papers were receptive, and in some cases enthusiastic about producing amateur editions. Many of them had already scanned in their family documents and were wondering what to do next. That said, privacy and access control were very big concerns -- especially with more recent material which mentioned living people.

November 7 -- THATCamp Digital Humanities & Libraries in Austin, Texas. Great conversations about CMS APIs and GIS visualization tools.

November 19-20 -- Duke University. I worked with my hosts at the Duke Collaboratory for Classics Computing to transcribe a 19th-century travel diary using FromThePage, then spoke on "The Landscape of Crowdsourcing and Transcription", an expansion of my talks at USM and WHS. (See a longer write-up and video.)

December 17-20 -- iDigBio Citizen Science Hackathon. Due to schedule conflicts, I wasn't able to attend this in person, but followed the conversations on the wiki and the collaborative Google docs. For the hackathon, I built ReconciliationUI, a Ruby on Rails app for reconciling different NotesFromNature-produced transcripts of the same image on the model of FamilySearch Indexing's arbitration tool.

2014

All these projects promise to keep me busy in the new year, though I anticipate taking on more development work in the summer and fall. If you're interested in collaborating with me in 2014--whether to give a talk, work on a software project, or just chat about crowdsourcing and transcription--please get in touch.

"The Landscape of Crowdsourcing and Transcription" at Duke University

noreply@blogger.com (Ben W. Brumfield) — Sat, 23 Nov 2013 13:28:00 +0000

I spent part of this week at Duke University with the Duke Collaboratory for Classics Computing -- Josh Sosin, Hugh Cayless, and Ryan Baumann. We discussed ideas for mobile epigraphy applications, argued about text encoding, and did some hacking. We loaded an instance of FromThePage onto the DC3's development machine and seeded it with the 1859 journal of Viscountess Emily Anne Beaufort Smyth Strangford (part of Duke Libraries' amazing collection of Women's Travel Diaries). Transcribing six pages of her tour through Smyrna and Syria together suggested some exciting enhancements for the transcription tool, revealing a few bugs along the way. I'm really looking forward to collaborating with the DC3 on this project.

On Wednesday, I gave an introductory talk on crowdsourced manuscript transcription at the Perkins Library: "The Landscape of Crowdsourcing and Transcription":

One of the most popular applications of crowdsourcing to cultural heritage is transcription. Since OCR software doesn’t recognize handwriting, human volunteers are converting letters, diaries, and log books into formats that can be read, mined, searched, and used to improve collection metadata. But cultural heritage institutions aren’t the only organizations working with handwritten material, and many innovations are happening within investigative journalism, citizen science, and genealogy.

This talk will present an overview of the landscape of crowdsourced transcription: where it came from, who’s doing it, and the kinds of contributions their volunteers make, followed by a discussion of motivation, participation, recruitment, and quality controls.

The talk and visit got a nice write-up in Duke Today, which includes this quote by Josh Sosin:

Sosin said that although many students and professors visit the library's collections and partially transcribe the sources that are pertinent to their research, nearly all of these transcripts disappear once the researchers leave the library.

"Scholars or students come to the Rubenstein, check out these precious materials, they transcribe and develop all sorts of interesting ideas about them," Sosin said. "Then they take their notebooks out of the library and we lose all the extra value-added materials developed by these students. If we can host a platform for students and scholars to share their notes and ideas on our collections, the library's base of knowledge will grow with every term paper or book that our scholars produce."

Video of "The Landscape of Crowdsourcing and Transcription" (by Ryan Baumann):

Slides from the talk:

Previous versions of this talk were delivered at University of Southern Mississippi (2013-09-12) and the Wisconsin Historical Society (2013-09-25). It differs substantially in the discussion of quality control mechanisms (on the video from 26:15 through 31:30, slides 37-40), an addition which was suggested by questions posed at USM and WHS.

Feature: TEI-XML Export

noreply@blogger.com (Ben W. Brumfield) — Fri, 25 Oct 2013 14:51:00 +0000

How do you get the data out?

This is a question I hear pretty often, particularly from professional archivists. If an institution and its users have put the effort into creating digital editions on FromThePage, how can they pull the transcripts out of FromThePage to back them up, repurpose them, or import them into other systems?

This spring, I created an XHTML exporter that will generate a single-page XHTML file containing transcripts of a work's pages, their version history, all articles written about subjects within the work, and internally-linked indices between subjects and pages. Inspired by conversations at the TEI and SDSE conferences and informed by my TEI work for a client project, I decided to explore a more detailed export in TEI.

This is the result, posted on GitHub for discussion:

https://gist.github.com/benwbrum/6933615
Zenas Matthews' Mexican War Diary was scanned and posted by Southwestern University's Smith Library Special Collections. It was transcribed, indexed, and annotated by Scott Patrick, a retired petroleum worker from Houston.

https://gist.github.com/benwbrum/6933603
Julia Brumfield's 1919 Diary was scanned and posted by me, transcribed largely by volunteer Linda Tucker, and indexed and annotated by me.

I requested comment on the TEI mailing list (see the thread "Draft TEI Export from FromThePage"), and got a lot of really helpful, generous feedback both on- and off-list. It's obvious that I've got more work to do for certain kinds of texts--which will probably involve creating a section header notation in my wiki mark-up--but I'm pretty pleased with the results.

One of the most exciting possibilities of TEI export is interoperability with other systems. I'd been interested in pushing FromThePage editions to TAPAS, but after I posted the TEI-L announcement, Peter Robinson pulled some of the exports into Textual Communities. We're exploring a way to connect the two systems, which might give editors the opportunity to do the sophisticated TEI editing and textual scholarship supported by Textual Communities starting from the simple UI and powerful indexing of FromThePage. I can imagine an ecosystem of tools good at OCR correction, genetic mark-up, display and analysis of correspondence, amateur-accessible UIs, or preservation -- all focusing on their strengths and communicating via TEI-XML.

I'm interested in more suggestions for ways to improve the exports, new things to do with TEI, or systems to explore integration options before I deploy the export feature on production.

A Gresham's Law for Crowdsourcing and Scholarship?

noreply@blogger.com (Ben W. Brumfield) — Sun, 20 Oct 2013 13:15:00 +0000

This is a comment I wanted to make at Neil Fraistat's "Participatory DH" session (proposal, notes) at THATCamp Leadership, but ended up having on Twitter instead.

Much of the discussion in the first half of the session focused on the qualitative difference between the activities we ask amateurs to do and the activities performed by scholars. One concern voiced was that we're not asking "citizen scholars" to do real scholarly work, and then labeling their activity scholarship -- a concern I share with regard to editing. If most crowdsourcing projects ask amateurs to do little more than wash test tubes, where are the projects that solicit scholarly interpretation?

The Harry Ransom Center's Manuscript Fragments Project is just such a crowdsourcing project, and I think the results may be disquieting. In this project, fragments of medieval manuscripts reused as binding for printed books are photographed and posted on Flickr. Volunteers use the comments to identify the fragments, discussing the scribal hand and researching the source texts. I'd argue that while this does not duplicate the full range of an academic medievalist's scholarly activities, it's certainly not just "bottle-washing" either.

The project has been very successful. (See organizer Micah Erwin's talks for details.) Most of the contributions to the project have been made on Flickr in the comments by a few "super volunteers" -- retired rare book dealers and graduate students among them. However, around 20% of the identifications were made by professional medievalists who learned about the project, visited the Flickr site, and then called or emailed the project organizer. None of their contributions were made on the public Flickr forum at all.

So why did professional scholars avoid contributing in public? I related this on Twitter, and got some interesting suggestions:

Either the forum (flickr) or the 'public' nature of the project made them lurkers and silent back-channel (email/phone) contributors.
— Ben W. Brumfield (@benwbrum) October 15, 2013

@benwbrum noticed phenomnw/ #keteweb collab. Many librarians seemed scared to put professional rep on the line by asking questions publicly.
— Walter McGinnis (@wtem) October 15, 2013

Maybe if you do involve/invite the public, you drive out the professionals for social reasons? #crowdsourcing #digitalhumanities
— Ben W. Brumfield (@benwbrum) October 15, 2013

@benwbrum or social networks of academia haven't yet made transition into digital? Also issues of peer review/perf frameworks. Interesting.
— Alexandra Eveleigh (@ammeveleigh) October 15, 2013

@benwbrum Does elitism factor in? Academics don't want to participate in a public comments section, to be seen as "non-scholarly?"
— Vitoriohno (@vitor_io) October 15, 2013

@benwbrum Or ego preservation? If "a bunch of amateurs" can do the same work at the same quality, what good was all this education?
— Vitoriohno (@vitor_io) October 15, 2013

@benwbrum Or effort avoidance? If a public discussion by amateurs *is* scholarly, then do they have to cite it? Provide credit? Preserve it?
— Vitoriohno (@vitor_io) October 15, 2013

Many of these suggest a sort of Gresham's Law of crowdsourcing, in which inviting the public to participate in an activity lowers that activity's status, driving out professionals concerned with their reputation.

There's a more reassuring explanation as well -- many people with domain expertise still aren't very comfortable with technology. Asking them to use a public forum puts additional pressure on them, as any mistakes typing, encoding, and using the forum will be public and likely permanent. This challenge is not confined to professionals, either -- I receive commentary on the Julia Brumfield Diaries via email from people without high school degrees, who have no professional reputation to protect.

University of Delaware and Cecil County Historical Society on FromThePage

noreply@blogger.com (Ben W. Brumfield) — Wed, 24 Jul 2013 19:12:00 +0000

Over the last few months, the University of Delaware and the Cecil County Historical Society have been using FromThePage to transcribe the diary of a minister serving in the American Civil War. They're using the project to expose undergraduates to primary sources while also improving access to an important local history document.

The county has documented the process with an extensive post on the Cecil County Historical Society Blog, which was picked up by the Cecil Daily.

The university also put together a lovely video providing background on the project and interviewing students and faculty members involved in the project:

One of the things I find most interesting about the project is the collaboration between digital humanities-focused university faculty and the county historical society:

Kasey Grier, director of the Museum Studies Program and the History Media Center at the university, says the transcription will be done by students in a process called “crowd sourcing.”

“Crowd sourcing,” according to Grier, “is when students in remote locations, review the handwritten text and try their hand at transcribing it. They then submit their contributions which are reviewed and put up online. Eventually, all of the diary entries will be available for anyone to access and read.”

Historical Society of Cecil County President Paul Newton says the society welcomes this collaboration with the University of Delaware and hopes to strengthen it because it broadens the society’s horizons and reach.

“The university’s focus is in the area of the digital humanities, which allows us to take largely unused and un-accessed collections and get the material out to a broader audience for study. It is also a preservation method as it reduces handling and makes interpretation much easier,” Grier said.

You can see the Joseph Brown Diary and the students' work on it at the project site on FromThePage.com.

The Collaborative Future of Amateur Editions

noreply@blogger.com (Ben W. Brumfield) — Sat, 13 Jul 2013 07:02:00 +0000

This is the transcript of my talk at Social Digital Scholarly Editing at the University of Saskatchewan in Saskatoon on July 11 2013.

I'm Ben Brumfield. I'm not a scholarly editor, I'm an amateur editor and professional software developer. Most of the talks that I give talk about crowdsourcing, and crowdsourcing manuscript transcription, and how to get people involved. I'm not talking about that today -- I'm here to talk about amateur editions.

So let's talk about the state of amateur editions as it was, as it is now, as it may be, and how that relates to the people in this room.

Let's start with a quote from the past. This was written in 1996, representing what I think may be a familiar sort of consensus [scholarly] opinion about the quality of amateur editions, which can be summed up in the word "ewww!"

So what's going on now? Before I start looking at individual examples of amateur editions, let's define--for the purpose of this talk--what an amateur edition is.

Ordinarily people will be talking about three different things:

They can be talking about projects like Paul's, in which you have an institution who is organizing and running the project, but all the transcription, editing, and annotation is done by members of the public.
Or, they can be talking about organizations like FreeREG, a client of mine which is a genealogy organization in the UK which is transcribing all the parish registers of baptisms, marriages, and burials from the Reformation up to 1837. In that case, all the material--all the documents--are held at local records offices and archives, who in many cases are quite hostile to the volunteer attempt to put these things online. Nevertheless, over the last fifteen years, they've managed to transcribe twenty-four million of these records, and are still going strong.
Finally, amateur-run editions of amateur-held documents. These are cases like me working on my great-great grandmother's diaries, which is what got me into this world [of editing].

I'm going to limit that [definition] slightly and get rid of crowdsourcing. That's not what I want to talk about right now. I don't want to talk about projects that have the guiding hand of an institutional authority, whether that's an archive or a [scholarly] editor.

So let's take a look at amateur editions. Here's a site called Soldier Studies. Soldier Studies is entirely amateur-run. It's organized by a high-school history teacher who got really involved in trying to rescue documents from the ephemera trade.

The sources of the transcripts of correspondence from the American Civil War are documents that are being sold on eBay. He sees the documents that are passing through--and many of them he recognizes as important, as an amateur military historian--and he says, I can't purchase all of these, and I don't belong to an institution that can purchase them. Furthermore, I'm not sure that it's ethical to deal in this ephemera trade--there is some correlation to the antiquities trade--but wouldn't it be great if we could transcribe the documents themselves and just save those, so that as they pass from a vendor to a collector, some of the rest of us can read what's on these documents?

So he set up this site in which users who have access to these transcripts can upload letters. They upload these transcripts, and there's some basic metadata about locations and subjects that makes the whole thing searchable.

But the things that I think people in here--and I myself--will be critical about are the transcription conventions that he chose, which are essentially none. He says, correspondence can be entered as-is--so maybe you want to do a verbatim transcript, but maybe not--and the search engines will be able to handle it.

A little bit more shocking is that -- you know, he's dealing with people who have scans--they have facsimile images--so he says, we're going to use that. Send us the first page, so that we know that you're not making this piece of correspondence up completely, fabricating it out of whole cloth.

So that's not a facsimile edition, and we don't have transcription conventions. He has this caveat, in which he explains that this [site] is reliable because we have "the first page of the document attached to the text transcription as verification that it was transcribed from that source." So you'll be able to read one page of facsimile from this transcript you have. We do our best, we're confident, so use them with confidence, but we can't guarantee that things are going to be transcribed validly.

Okay, so how much use is that to a researcher?

This puts me in mind of Peter Shillingsburg's "Dank Cellar of Electronic Texts", in which he talks about the world "being overwhelmed by texts of unknown provenance, with unknown corruptions, representing unidentified or misidentified versions."

He's talking about things like Project Gutenberg, but that's pretty much what we're dealing with right here. How much confidence could a historian place in the material on this site? I'm not sure.

Here's an example of an amateur edition which is in a noble cause, but which is really more ammunition for the earlier quote.

So what about amateur editions that are done well? This is the Papa's Diary Project, which is a 1924 diary of a Jewish immigrant to New York, transcribed by his grandson.

What's interesting about this -- he's just using Blogger, but he's doing a very effective job of communicating to his reader:

So here is a six-word entry. We have the facsimile--we can compare and tell [the transcript] is right: "At Kessler's Theater. Enjoyed Kreuzer Sonata."

So the amateur who's putting this up goes through and explains what Kessler's theater is, who Kessler was.

Later on down in that entry, he explains that Kessler himself died, and the Kreuzer Sonata is what he died listening to. Further down the page you can listen to the Kreuzer Sonata yourself.

So he's taken this six-word diary entry and turned it into something that's fascinating, compelling reading. It was picked up by the New York Times at one point, because people got really excited about this.

Another thing that amateurs do well is collaborate. Again: Papa's Diary Project. Here is an entry in which the diarist transcribed a poem called "Light".

Here in the comments to that entry, we see that Jerroleen Sorrensen has volunteered: Here's where you can find [the poem] in this [contemporary] anthology, and, by the way, the title of the poem is not "Light", but "The Night Has a Thousand Eyes".

So we have people in the comments who are going off and doing research and contributing.

I've seen this myself. When I first started work on FromThePage, my own crowdsourced transcription tool, I invited friends of mine to do beta testing.

I started off with an edition that I was creating based on an amateur print edition of the same diary from fifteen years previously.

If you look at this note here, what you see is Bryan Galloway looking over the facsimile and seeing this strange "Miss Smith sent the drugg... something" and correcting the transcript--which originally said "drugs"--saying, Well actually that word might be "drugget", and "drugget", if you look on Wikipedia, is a coarse woolen fabric. Which--since it's January and they're working with [tobacco] plant-beds--that's probably what it is.

Well, I had no idea--nobody who's read this had any idea--but here's somebody who's going through and doing this proofreading, and he's doing research and correcting the transcription and annotating at the same time.

Another thing that volunteers do well is translate. This is the Kriegstagebuch von Dieter Finzen, who was a soldier in World War I, and then was drafted in World War II. This is being run by a group of volunteers, primarily in Germany.

What I want to point out is, that here is the entry for New Year's Day, 1916. They originally post the German, and then they have volunteers who go online and translate the entry into English, French, and Italian.

So now, even though my German is not so hot, I can tell that they were stuck drinking grenade water.

So, what's the difference?

What's the difference between things that amateurs seem to be doing poorly, and things that they're doing well?

I think that it comes down to something that Gavin Robinson identified in a blog post that he wrote about six years ago about the difference between professional historians/academic historians and amateur historians. What he essentially says is that professionals--particularly academics, but most professionals--are particularly concerned with theory. They're concerned with their methodologies and with documenting their methodologies.

This is something that amateurs, in many cases, are not concerned with -- don't know exist -- maybe have never even been exposed to.

So, based on that, let's talk about the future.

How can we get amateurs--doing amateur editions on their own--to move from the things that they're doing well and poorly to being able to do everything well that's relevant to researchers' needs?

I see three major challenges to high-quality amateur editions.

The first one is one which I really want to involve this community in, which is ignorance of standards. The idea that you might actually include facsimiles of every page with your transcription -- that's a standard. I'm not talking about standards like TEI -- I'd love for amateur editions to be elevated to the point that print editions were in 1950 -- we're just talking about some basics here.

Lack of community and lack of a platform.

So let's talk about standards.

How does an amateur learn about editorial methodologies? How do they learn about emendations? How do they learn about these kinds of things?

Well, how do they learn about any other subject? How do they learn about dendrochronology if they're interested in measuring tree rings?

Wikipedia!

Let's go check out Wikipedia!

Wikipedia has a problem for most subjects, which is that Wikipedia is filled with jargon. If you look up dendrochronology, you don't really have a starting place, a "how to". If you look up the letter X, you get this wonderful description of how 'X' works in Catalan orthography, but it presupposes you being familiar with the International Phonetic Alphabet, and knowing that that thing which looks like an integral sign is actually the 'sh' sound.

Now if amateurs are trying to do research on scholarly editing and documentary editing in Wikipedia, they have a different problem:

There's nothing there. There's no article on documentary editing.

There's no article on scholarly editing.

These practices are invisible to amateurs.

So if they can't find the material online that helps them understand how to encode and transcribe texts, where are they going to get it?

Well--going back to crowdsourcing--one example is by participation in crowdsourcing projects. Crowdsourcing projects--yes, they are a source of labor; yes they are a way to do outreach about your material--but they are a way to train the public in editing. And they are training the public in editing whether that's the goal of the transcription project or not. The problem is that the teacher in this school is the transcription software--is the transcription website.

This means that the people who are teaching the public about transcription--the people who are teaching the public about editing--are people like me: developers.

So, how do developers learn about transcription?

Well, sometimes, as Paul [Flemons] mentioned, we just wing it. If we're lucky, we find out about TEI, and we read the TEI Guidelines, and we find out that there's so much editorial practice that's encoded in the TEI Guidelines that that's a huge resource.

If we happen to know the people in this room or the people who are meeting at the Association for Documentary Editing in Ann Arbor, we might discover traditional editorial resources like the Guide to Documentary Editing. But that requires knowing that there's a term "Documentary Editing".

So what does that mean? What that means is that people like me--developers with my level of knowledge or ignorance--are having a tremendous amount of influence on what the public is learning about editing. And that influence does not just extend to projects that I run -- that influence extends to projects that archives and other institutions using my software run. Because if an archive is trying to start a transcription project, and the archivist has no experience with scholarly editing, I say, You should pick some transcription conventions. You should decide how to encode this. Their response is, What do you think? We've never done this before. So I'm finding myself giving advice on editing.

Okay, moving on.

The other thing that amateurs need is community.

Community is important because community allows you to collaborate. Communities evaluate each [member's] work and say, This is good. This is bad. Communities teach each [member]. And communities create standards -- you don't just hang out on Flickr to share your photos -- you hang out on Flickr to learn to be a better photographer. People there will tell you how to be a better photographer.

We have no amateur editing community for people who happen to have an attic full of documents and want to know what to do with them.

So communities create standards, and we know this. Let me quote my esteemed co-panelist, Melissa Terras, who, in her interviews with the managers of online museum collections--non-institutional online "museums"--found that people are coming up with "intuitive metadata" standards of their own, without any knowledge or reference to existing procedures in creating traditional archival metadata.

The last big problem is that there's currently no platform for someone who has an attic full of documents that they want to edit. They can upload their scans to Flickr, but Flickr is a terrible platform for transcription.

There's no platform that will guide them through best practices of editing.

What's worse, if there were one, it would need a "killer feature", which is what Julia Flanders describes in the TAPAS project as a compelling reason for people to contribute their transcripts and do their editing on a platform that enforces rigor and has some level of permanence to it -- rather than just slapping their transcripts up on a blog.

So, let's talk about the future. In his proposal for this conference, Peter Robinson describes a utopia and dystopia: utopia in which textual scholars train the world in how to read documents, and a dystopia in which hordes of "well-meaning but ill-informed enthusiasts will strew the web willy-nilly with error-filled transcripts and annotations, burying good scholarship in rubbish."

This is what I think is the road to dystopia:

Crowdsourcing tools ignore documentary editing methodologies. If you're transcribing using the Transcribe Bentham tool, you learn about TEI. You learn from a good school. But almost all of the other crowdsourced transcription tools don't have that. Many of them don't even contain a place for the administrator to specify transcription conventions to their users!
As a result, the world remains ignorant of the work of scholarly editors, because we're not finding you online--because you're invisible on Wikipedia--and we're not going to learn about your work through crowdsourcing.
So you have the public get this attitude that, well, editing is easy -- type what you see. Who needs an expert? I think that's a little bit worrisome.
The final thing--which, when I started working on this talk, was a sort of wild bogeyman--is the idea that new standards come into being without any reference whatsoever to the tradition of scholarly or documentary editing.

I thought that [idea] was kind of wild. But, in March, an organization called the Family History Information Standards Organization--which is backed by Ancestry.com, the Federation of Genealogical Societies, BrightSolid, a bunch of other organizations--announced a Call for Papers for standards for genealogists and family historians to use -- sometimes for representing family trees, sometimes for source documents.

And, in May, Call for Papers Submission number sixty-nine, "A Transcription Notation for Genealogy", was submitted.

Let's take a look at it.

Here we have what looks like a fairly traditional print notation. It's probably okay.

What's a little bit more interesting, though, is the bibliography.

Where is your work in this bibliography? It's not there.

Where is the Guide to Documentary Editing? It's not there.

So here's a new standard that was proposed the month before last. Now, I hope to respond to this--when I get the time--and suggest a few things that I've learned from people like you. But these standards are forming, and these standards may become what the public thinks of as standards for editing.

All right, so let's talk about the road to utopia.

The road to the utopia that Peter described I see as in part through partnerships between amateurs and professionals: you get amateurs participating in projects that are well run -- that teach them useful things about editing and how to encode manuscripts.

Similarly, you get professionals participating in the public conversation, so that your methodologies are visible. Certainly your editions are visible, but that doesn't mean that editing is visible. So maybe someone here wants to respond to that FHISO request, or maybe they just want to release guides to editing as Open Access.

As a result, amateurs produce higher-quality editions on their own, so that they're more useful for other researchers; so that they're verifiable.

And then, amateurs themselves become advocates -- not just for their material and the materials they're working on through crowdsourcing projects, but for editing as a discipline.

So that's what I think is the road to utopia.

So what about the past?

Back in Shillingsburg's "Dank Cellar" paper, he describes the problems with the e-texts that he's seeing, and he really encourages scholarly editors not to worry about it -- to disengage -- [and] instead to focus on coming up with methodologies--and again, this is 2006--for creating digital editions. He says that these aren't well understood yet. Let's not get distracted by these [amateur] things -- let's focus on what's involved in making and distributing digital editions.

Is he still right? I don't know.

Maybe--if we're in the post-digital age--it's time to re-engage.

Crowdsourcing + Machine Learning: Nicholas Woodward at TCDL

noreply@blogger.com (Ben W. Brumfield) — Sun, 02 Jun 2013 11:47:00 +0000

I was so impressed by Nicholas Woodward's presentation at TCDL this year that I asked him if I could share "Crowdsourcing + Machine Learning: Building an Application to Convert Scanned Documents to Text" on this blog.

Hi. My name is Nicholas Woodward, and I am a Software Developer for the University of Texas Libraries. Ben Brumfield has been so kind as to offer me an opportunity to write a guest post on his blog about my approach for transcribing large scanned document collections that combines crowdsourcing and computer vision. I presented my application at the Texas Conference on Digital Libraries on May 7th, 2013, and the slides from the presentation are available on TCDL’s website. The purpose of this post is to introduce my approach along with a test collection and preliminary results. I’ll conclude with a discussion on potential avenues for future work.

Before we delve into algorithms for computer vision and what-not, I’d first like to say a word about the collection used in this project and why I think it’s important to look for new ways to complement crowdsourcing transcription. The Guatemalan National Police Historical Archive (or AHPN, in Spanish) contains the records of the Guatemalan National Police from 1882-2005. It is estimated that AHPN contains more than 80 million pages of documents (8,000 linear meters) such as handwritten journals and ledgers, birth certificate and marriage license forms, identification cards and typewritten letters. To date, the AHPN staff have processed and digitized approximately 14 million pages of the collection, and they are publicly available in a digital repository that was developed by UT Libraries.

While unique for its size, AHPN is representative of an increasingly common problem in the humanities and social sciences. The nature of the original documents precludes any economical OCR solution on the scanned images (See below), and the immense size of the collection makes page-by-page transcription highly impractical, even when using a crowdsourcing approach. Additionally, the collection does not contain sufficient metadata to support browsing via commonly used traits, such as titles or authors of documents.

These characteristics of AHPN informed my idea to develop a different method for transcribing a large scanned document collection that draws on the work of popular crowdsourcing transcription tools such as FromThePage, Scripto and Transcribe Bentham. These tools allow users to transcribe individual records of a collection, maintaining quality control largely through redundancy and error checking.

The only drawback of this approach is that the results of crowdsourcing are only applicable to the document being transcribed. In contrast, my approach looks to break up documents into individual words with the idea that though no two documents are exactly alike they are likely to contain similar words. And across an entire corpus, particularly a very large one such as AHPN, words are likely to appear many times. Consequently, if users transcribe the words of one document, then I can use image matching algorithms to find other images of the same words and apply the crowdsourced transcription to the new images. The process looks like this:

1. Segment scanned documents into words
2. Crowdsource the transcription of a fraction of those words
3. Use image matching to pair images of transcribed words with unknown images
4. In the case of a match, associate the crowdsourced text with the document containing the unknown image
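
As a rough illustration of the four steps above, here is a minimal Python sketch. It is not the code from the application: the function name `propagate_transcriptions`, the toy string "images", and the equality-based `is_match` are all assumptions made for the example. In the real workflow `is_match` would be the image matching step described later in this post.

```python
from collections import defaultdict


def propagate_transcriptions(labeled, unlabeled, is_match):
    """Attach crowdsourced text to documents containing matching word images.

    labeled:   list of (word_image, text) pairs from step 2
    unlabeled: list of (word_image, document_id) pairs from step 1
    is_match:  function(image_a, image_b) -> bool, standing in for step 3
    Returns a dict mapping document_id -> set of words found in that document.
    """
    index = defaultdict(set)
    for unknown_image, document_id in unlabeled:
        for known_image, text in labeled:
            if is_match(known_image, unknown_image):
                index[document_id].add(text)  # step 4
                break  # the first match is enough for this word image
    return dict(index)


# Toy stand-ins (hypothetical): "images" are strings that match when equal.
labeled = [("img-placas", "placas"), ("img-nacional", "nacional")]
unlabeled = [("img-placas", "doc-1"), ("img-nacional", "doc-2"),
             ("img-placas", "doc-2"), ("img-unknown", "doc-3")]
index = propagate_transcriptions(labeled, unlabeled, lambda a, b: a == b)
```

Here two transcribed words make two documents searchable, while `doc-3` stays out of the index because none of its word images matched anything the crowd had transcribed.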

Step 1 attempts to segment images like so.

The key point here is that due to a host of factors (smudges, speckles, light text, poor typewriters, etc.) the segmentation will not be 100% accurate. This is OK because the goal is really only to get as many words as possible, understanding that if we can successfully capture the type of terms that users of the collection will typically use to search, i.e. dates or names of people, places, and organizations, then the online version of AHPN will become much more useful than it is now. Here are a few typical examples of documents from AHPN with the segmentation algorithm I developed using OpenCV libraries.
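
The post does not show the segmentation algorithm itself, so the following is only a sketch of one common technique, a vertical projection profile, written in plain Python rather than OpenCV. It assumes a single text line that has already been binarized (1 for ink, 0 for background); the function name `segment_words` and the `min_gap` parameter are hypothetical.

```python
def segment_words(line, min_gap=2):
    """Split one binarized text line into word boxes.

    line:    list of rows, each a list of 0 (background) or 1 (ink)
    min_gap: number of consecutive blank columns treated as a word space
    Returns a list of (start_col, end_col) pairs, end exclusive.
    """
    width = len(line[0])
    # Vertical projection: how many ink pixels fall in each column.
    ink_per_column = [sum(row[x] for row in line) for x in range(width)]

    boxes = []
    start = None   # first column of the word currently being scanned
    blank_run = 0  # consecutive blank columns seen since the last ink
    for x, ink in enumerate(ink_per_column):
        if ink:
            if start is None:
                start = x
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= min_gap:
                # The word ended where the run of blank columns began.
                boxes.append((start, x - blank_run + 1))
                start = None
                blank_run = 0
    if start is not None:
        boxes.append((start, width - blank_run))
    return boxes


# A tiny made-up line: two "words" separated by three blank columns.
line = [
    [1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
]
boxes = segment_words(line)
```

A speckle in the gap between two words would merge them, and a light letter would split one word in two, which is the kind of imperfect segmentation the paragraph above accepts.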

The result is a folder of small images containing individual words.

The second step is to crowdsource the transcription of as many of these words as possible. I developed an online application using CakePHP and MySQL that allows users to select documents and then transcribe the words they see.

The crowdsourcing output from the testing phase of the project looked like this.

Because there were so few images per word, I abandoned the original plans to create a multi-class SVM classifier that would find matching words in bulk. Instead, I drew on OpenCV again to develop a workflow for image matching that consisted of six steps.

1. Compare the width and height of each image
2. Compare the histograms of each image
3. Extract ‘interest points’ of each image using SURF algorithm
4. Match interest points between them
5. Calculate average distance between interest points
6. If the average is below a certain threshold, consider the images a match

The first step is fairly intuitive. Longer images likely contain words with more letters than shorter images, and a different number of letters indicates they are clearly not the same word. This step is relatively basic, but it turns out that in many, many cases it obviates the subsequent more computationally intensive work.

The second step is also relatively straightforward. Image histograms are just the counts of pixels of different hues in an image. In a typical color image, a histogram could be the number of pixels of each color. In this case the scanned documents are grayscale TIFF images, and calculating histograms is simply a matter of adding up all the black and white pixels separately (See below). Two images of different words may contain the same number of letters but the letters are different shapes and so their ratio of black to white pixels will also differ. Note: this does not solve the problem of words with similar letters in a different order, e.g. “la” and “al”. This case is handled in the next step.
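
To make the first two steps concrete, here is a small sketch in plain Python. It assumes binarized word images stored as lists of rows (1 for a dark pixel); the function names and both tolerance values are assumptions chosen for the example, not values from the project.

```python
def similar_size(a, b, tolerance=0.15):
    """Step 1: widths and heights must agree within a relative tolerance."""
    def close(x, y):
        return abs(x - y) <= tolerance * max(x, y)
    return close(len(a[0]), len(b[0])) and close(len(a), len(b))


def ink_ratio(image):
    """Fraction of pixels that are ink in a binarized image (1 = ink)."""
    total = sum(len(row) for row in image)
    return sum(map(sum, image)) / total


def similar_histogram(a, b, tolerance=0.05):
    """Step 2: the share of dark pixels must be close in both images."""
    return abs(ink_ratio(a) - ink_ratio(b)) <= tolerance


def passes_cheap_checks(a, b):
    # Short-circuits: the histogram is only computed when the sizes agree.
    return similar_size(a, b) and similar_histogram(a, b)


# Made-up 2x4 and 2x8 "word images".
la = [[1, 0, 0, 1], [1, 0, 1, 1]]
al = [[1, 0, 0, 1], [1, 1, 0, 1]]
dense = [[1, 1, 1, 1], [1, 1, 1, 1]]
longer = [[1, 0, 0, 1, 0, 1, 1, 0], [1, 0, 1, 1, 0, 0, 1, 0]]
```

In this toy data `la` and `al` have the same size and the same share of dark pixels, so they pass both checks, which is exactly the case left to the interest-point step. `longer` is rejected on size alone and `dense` on its histogram.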

Steps 3 through 6 represent the most computationally intensive part of the entire matching process, and they are based on the Speeded Up Robust Features (SURF) implementation in OpenCV. The basic idea is to use the SURF algorithm to find the “interest points” of an image and then measure the distance between points in two separate images. Interest points are curves, blobs and points where two lines meet. Going this route is both a strength and a weakness of my approach. I’ll explain. Even if we don’t speak Spanish, it’s relatively easy to infer that the images below contain the same word, just in a different format. Whether they’re in all caps or all lowercase or the first letter is capitalized, the word is the same to a human.

But not to a computer.

And we see this when we use SURF to calculate the interest points of each image.

The final step to matching images involves computing the distance between interest points and finding the matches below a certain distance. In this case, I used the ratio test to eliminate poor interest point matches and then considered two images the same if their average distance was below .20. The example below attempts to match the word “placas” (plates) with several other words, and finds a match with another image of the word placas, despite the poor quality of the second image.
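
The ratio test and the average-distance check can be sketched without OpenCV. In the snippet below, descriptors are plain tuples of floats standing in for SURF descriptors, and distances are Euclidean. The .20 threshold comes from the text above; the 0.75 ratio, the function names, and the sample descriptors are assumptions for illustration, and real descriptor distances depend on how the descriptors are scaled.

```python
import math


def ratio_test_matches(desc_a, desc_b, ratio=0.75):
    """Keep distances of matches whose best candidate clearly beats the runner-up."""
    kept = []
    for d in desc_a:
        distances = sorted(math.dist(d, other) for other in desc_b)
        if len(distances) < 2:
            continue  # the ratio test needs two candidates
        best, second = distances[0], distances[1]
        if best < ratio * second:
            kept.append(best)
    return kept


def is_same_word(desc_a, desc_b, threshold=0.20, ratio=0.75):
    """Declare a match when the mean distance of surviving matches is small."""
    kept = ratio_test_matches(desc_a, desc_b, ratio)
    if not kept:
        return False  # no reliable interest point matches at all
    return sum(kept) / len(kept) < threshold


# Made-up 2-D descriptors: b_same is a noisy copy of a, b_other is unrelated.
a = [(0.0, 0.0), (1.0, 1.0)]
b_same = [(0.05, 0.0), (1.0, 0.9)]
b_other = [(0.5, 0.5), (0.6, 0.4)]
```

With these values `is_same_word(a, b_same)` holds, while for `b_other` every candidate is ambiguous, the ratio test discards all of them, and the images are treated as different words.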

The point here is that we will need to match each of these images separately, even though they contain the same word. This means we’ll need more images to train a classifier (or image match), and we will have to find every possible way that a word (WORD, Word, word, etc.) appears in a corpus and crowdsource the transcription of at least a few of them before we classify unknown words. Ergo more time and work to finish.

But it is also a strength of the approach, in at least two ways. First, the images above are Spanish words, but to a computer they’re just objects. They could really be in any language and the same basic workflow would apply. The only adjustments would be recruiting crowdsourcers who understand the language (helpful when a word is hard to make out and context helps) and tweaking the algorithm parameters in steps 3-6. Second, while the format of the word is not necessarily important for keyword searches, it is relevant to many computational research methods such as document clustering, classification and finding named entities.
The output of this approach on the test collection from AHPN is as follows:

There are several implications from these results. First, shorter images, i.e. one or two letters, basically bring us back to the same issues of OCR. In the case of AHPN the quality of the scanned documents varies, and in many instances there are just not enough interest points to differentiate between ‘a’, ‘o’ and ‘e’. So my approach may only be suitable for longer words. This may not be too serious of an issue since in most cases short words are also stopwords, which are generally not needed for either keyword searches or computational research. Second, without a doubt, I will need to do the crowdsourcing on a much larger scale. 2.33 images per word is not enough training data to create an image classifier capable of finding other images of that word. And there is also a greater need for transcribing output because this approach requires images of each word in all of its forms, along with the transcribed text.

The future steps of the project, then, must include an avenue for acquiring more crowdsourced transcription. I think the best approach for this will be to refactor the CakePHP code into modules for Omeka, Drupal and other popular open source CMSs. Similar to the crowdsourcing tools mentioned above, Scripto, FromThePage, etc., I’ll need to enlist the users of particular collections who may have the most vested interest in seeing them become more accessible and functional. Another important component will be to continue development on the algorithms for image segmentation and matching. I am interested in looking at how to segment handwritten text, especially cursive text where it is particularly difficult to determine the spaces between words. Additionally, the image matching detailed above may be useful going forward because it is not particularly memory intensive and the work can be divided into separate tasks for a distributed computing environment, maybe something along the lines of SETI@Home. But with more output from crowdsourcing it would be good to incorporate either a Bag of Words or SVM classifier approach to process more words at a time. So there’s definitely plenty to do. Stay tuned!

Choosing Crowdsourced Transcription Platforms at SSA 2013

noreply@blogger.com (Ben W. Brumfield) — Fri, 24 May 2013 15:14:00 +0000

This is a transcript of my talk at the Society of Southwestern Archivists 2013 Annual Meeting.

[Update 2013-05-28: The audio for the talk may be downloaded as an MP3.]

This talk is about choosing a crowdsourced transcription platform, but "choosing" means a couple of things. "Choosing" means which, and "choosing" can mean whether--should you do this at all? I'd like to address the latter and give a little background on crowdsourcing and transcription before I go into any kind of discussion of tool selection.

So the first question is, why transcribe? Because, after all, there are a lot of different crowdsourcing projects that are not transcription. You can do georectification. There are a lot of people doing tagging. After I'm done talking, Micah Erwin is going to give a presentation on his pretty amazing work doing crowdsourced identification of items within their collection. So why transcribe?

One reason to transcribe is that many of us face a problem, which is that if you have scanned documents, you have a problem:

Now what? The fundamental problem with this is that nobody's going to read it. Nobody's going to read this, because nobody's going to find it. Because Google cannot index handwritten materials. These are pixels; these aren't data -- they aren't words to search engines.

So all the serendipity that you get in the Internet age from search engines is not useful to you. Once you get these transcribed, you get the opportunity to connect with people who find you by searching for, say, their own name, and discover that you have material that contains their great-grandfather, who they're named after.

One of my most active volunteers is transcribing a diary that was written by someone he's not related to. He found out about the project because he is named after the diarist's mailman.

So why crowdsource, rather than doing all this transcription yourself?

Well, one argument is that it's free labor! You're getting people to do your work for you! This is a very powerful argument, and many of you may find it a very useful argument with your management. It may even be an argument for putting material online that you wouldn't otherwise.

Unfortunately, that's not true. It takes a lot of effort to run a crowdsourcing project.

Now, I'm an open source developer, and in the open source world we tend to differentiate between "free as in beer" and "free as in speech".

Crowdsourcing projects are really "free as in puppy". The puppy is free, but you have to take care of it; you have to do a lot of work. Because volunteers that are participating in these things don't like being ignored. They don't like having their work lost. They're doing something that they feel is meaningful and engaging with you, therefore you need to make sure their work is meaningful and engage with them.

So if free labor isn't the reason for crowdsourcing, why do a crowdsourcing project?

One of the most interesting perspectives on this comes from Trevor Owens at the Library of Congress. He wrote a blog post last spring called "Crowdsourcing Cultural Heritage: The Objectives are Upside Down", in which he looked at the experience of volunteers participating in these crowdsourcing projects. And he says that fundamentally, this isn't about getting free labor from the public. This is about offering people a brand new and deeper way to interact with your collections: getting them to produce knowledge. Getting them to engage with the material you put online more deeply than just a consumer experience of scanning through things.

One example of that I'd like to give is a citizen science project. The North American Bird Phenology Program is a crowdsourcing project that is inviting the public to transcribe bird observation cards that were made by amateur bird watchers a hundred years ago, over the course of about seventy years.

Now, I tried this project out because I'm interested in transcription tools. I'm not interested in birds. But as I was going through marking up these observation cards from these different observers, I'm not really quite sitting at my computer anymore -- I'm deeply immersed within the documents.

And suddenly, I'm sitting with these guys, who are sitting and writing up their observations.

So you have this opportunity to engage people very deeply -- to immerse them in your materials by offering them this kind of way of participating.

One of the examples that I like to use was a collaboration between me and Kathryn Stallard at Southwestern University--raise your hand, please Kathryn--in which she put online a diary of the Mexican-American War. One volunteer--before we had even announced the project--went online and transcribed the entire diary. But he didn't just transcribe it--he didn't just type what he saw. He went back and made multiple revisions. He corrected things. He identified names of materials and locations and battles. He did research on the life histories of the people who were mentioned there.

This is not a consumer experience -- it's a way of pulling people into your materials. And yes, Kathryn did get a transcript out of the results. But I'm not sure that that was more valuable than the experience that Scott Patrick got going through transcribing, researching, and immersing himself within this diary of this Texan soldier.

So why crowdsource? So if you're bringing people into this experience--you're engaging members of the public who may live hundreds of miles away from your institution--what you're doing is converting them from site visitors into another kind of relationship with your institution and with your materials.

Paul Flemons at the Atlas of Living Australia--the Australian Museum--describes it this way: Fundamentally, by engaging the public in digitizing their collections, they're educating the public and satisfying that part of their mission. They are providing increased access to their collections that they would not have, again, with just images. But most importantly, they're building an advocacy network for their collections, for their institution, for their discipline.

So, if crowdsourcing is a way to convert site visitors into volunteers, and to convert volunteers into advocates, what's next?

I'm not sure--this is all very new--but we're exploring this. I'm working with an archives that possesses a popular author's drafts, and they're starting a crowdsourced transcription project. Part of what we're trying to do with that is linking the transcription and the transcripts to a donation campaign that is designated for digitizing more of their material.

So, we don't know--I'd love to come back next year and tell you how it worked out--but I'm really interested to see if we can create a virtuous cycle among digitization, crowdsourcing, fundraising that funds digitization, and on back. I would love to see this [succeed].

Okay, so how do you choose a platform? I'm a software developer, and I usually get up here and say, well, you could use this tool or this tool or this tool. Unfortunately, there are a lot of tools, so I'm not going to stand up here and walk you through thirty tools; I'm not even going to talk about the two that I've been building.

What I want to talk about instead are things to consider when you're selecting a tool. I group selection factors into four categories. One of them is what kind of source material you're working with. Another is the purpose of the transcripts -- what you are going to use those for, and maybe what the public are going to use those for. Then there's the fit within your organization, then technical resources considerations.

So you really need to think about what source material you're working with before you choose your platform. People dealing with medieval manuscripts may want to work with a program which was developed for medieval manuscripts. That program may be totally unsuitable for nineteenth-century letters. So you really have to figure out what you're putting online first. (Fortunately, most of the people in this room are already starting with scans -- with things they've already digitized.)

There are a lot of other factors here, but what I really want to drive home is that there are fits between particular materials and particular tools, because there is no "one size fits all" tool for transcription.

The purpose: how are you going to be using the data? Are you going to be analyzing it? The people who are tracking these bird observations really want a searchable database that they can go through and do climate change and habitat change analysis on. People in this room may be more interested in extracting the subjects--the person names and place names that are mentioned within the documents. But there are a lot of different uses, so you need to think about that.

So how does this fit within your organization? There are platforms here which can be used behind closed walls. So maybe you have students who you want to give the job of transcribing, and you don't even want the public to be involved. Your goal is to improve undergraduate education by getting history students to interact with primary documents. Maybe, on the other hand, you really want to cast as wide a net as possible to engage people way outside your institution, and no one within your institution really cares about this particular material you have.

How long is the project going to last? Traditionally, crowdsourcing projects work well with the sorts of organic institutions that most people here [represent]. They work more poorly where they are, say, funded by a one-year grant -- where after a year of building up a community and working on the material, suddenly it's pencils down; lights off!

The final set of considerations are the financial and technical resources, which unfortunately may overwhelm all the other considerations.

I was talking to an archivist at a library in Belgium last month who had a set of medieval manuscripts and wanted to use T-PEN, a tool which was built specifically for medieval manuscripts. She knew all about it; she loved it. But her material was in Omeka; the tool didn't work with Omeka; so she was going to use Scripto. Which is a great tool, but it just supports plain text transcripts, which isn't really suited for the material. She knew that, but her material was here [gestures], so that was directing her decision.

I think that's a shame, but it's unfortunately an important factor. People don't want to have to set up multiple systems. If you have all your material in ContentDM, you don't really want step one [of a crowdsourcing project] to be "get it all back out again."

All of these tools require some customization and some technical experience to get set up and running, so you need to consider whether you have people on-site who can do that or can pay people off-site who can do that. And then there are all the digital preservation issues, which everyone here understands very well.

So rather than going through the tools, I want to direct you to a Google document which has been contributed to by about twenty-four people who have added their own projects, explaining whether their tools support TEI or EAD, whether they support mark-up that's semantic or genetic, what their platforms are, what their rates are -- things like that.

So this TranscriptionToolGDoc is something I recommend. I love having conversations about this, so send me email and we'll brainstorm about projects.

Typology of Quality Control Methods in Crowdsourcing Projects

noreply@blogger.com (Ben W. Brumfield) — Tue, 14 May 2013 14:29:00 +0000

A French translation of my 2012-03-05 post "Quality Control for Crowdsourced Transcription" appeared in "Etat de l’art en matière de Crowdsourcing dans les bibliothèques numériques" by Moirez, Moreaux, and Josse (2013). Its typology, in English:

"Single-track methods": the document is transcribed only once (by a single contributor, or collaboratively by several contributors working on the same document).
1. "Open-ended community revision" (Wikipedia): users can keep modifying the transcribed text with no time limit. A revision history makes it possible to revert to the previous version and to guard against vandalism.
2. "Fixed-term community revision" (Transcribe Bentham): suited to more traditional editing projects whose goal is the publication of a "final version". When a transcription reaches an acceptable level, validated by experts, it is closed and published.
3. "Community-controlled revision workflows" (Wikisource): the transcription is considered a "final version" not because experts have approved it, but because it has passed through a collaborative workflow of correction, revision, and validation.
4. "Transcriptions with 'known-bad' insertions before proofreading": in a first phase, correctors are invited to transcribe. Then other correctors revise the transcription by comparing it with the original text; to make sure that the second reading is actually carried out, errors are added to the text: if all the "fake errors" are corrected, the system infers that the "real errors" must have been corrected too.
5. "Single-keying with expert review": once a contributor has produced a transcription, it is validated or rejected by an expert (either a professional from the institution behind the project or a selected contributor). If the correction is rejected, it is either resubmitted for correction or corrected by the expert and validated.
"Multi-track methods": these methods are particularly suited to corrections of structured data or micro-tasks. The same source image is presented to several contributors, each of whom transcribes from scratch. Generally, contributors do not know whether they are the first correctors or whether other transcriptions have already been submitted. The data collected this way are then compared automatically.
1. "Triple-keying with voting" (Old Weather, ReCAPTCHA): the image is presented to 3 contributors and the majority wins (initially, Old Weather presented the image to 10 contributors, but they found that accuracy was roughly the same with 3 contributors as with 10).
2. "Double-keying with expert reconciliation": the same datum is presented to two contributors and, if they disagree, an expert decides.
3. "Double-keying with emergent community-expert reconciliation" (FamilySearch Indexing): nearly the same method as the previous one, except that the expert who decides between two divergent corrections is also a contributor, promoted to reconciler through automatic analysis of their contributions (volume, accuracy).
4. "Double-keying with N-keyed run-off votes": if the two contributors disagree, the correction is offered again to a new pair or trio of users.
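The multi-track methods above come down to comparing independent entries automatically. Here is a minimal sketch in Python of two of them, majority voting and double-keying with a reconciler. The function names and the exact-match comparison are assumptions made for illustration; they are not taken from Old Weather, FamilySearch Indexing, or any other project, and a real system would probably normalize whitespace and case before comparing.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def triple_key_vote(entries: Sequence[str]) -> Optional[str]:
    """Return the value keyed by a strict majority of contributors, or None.

    Hypothetical helper: entries are compared by exact string match.
    """
    if not entries:
        return None
    value, count = Counter(entries).most_common(1)[0]
    # A strict majority is required; a tie or three-way split yields None.
    return value if count * 2 > len(entries) else None


def double_key(first: str, second: str,
               reconcile: Callable[[str, str], str]) -> str:
    """Accept two matching entries; otherwise defer to a reconciler.

    The reconciler stands in for the expert (or promoted volunteer)
    who decides between two divergent transcriptions.
    """
    if first == second:
        return first
    return reconcile(first, second)
```

With three entries where two agree, `triple_key_vote` returns the agreed value; with three different entries it returns `None`, which is where a run-off with new contributors would start.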

Itinera Nova in the World(s) of Crowdsourcing and TEI

noreply@blogger.com (Ben W. Brumfield) — Mon, 29 Apr 2013 18:23:00 +0000

On April 25, 2013, I presented this talk at the International Colloquium Itinera Nova in Leuven, Belgium. It was a fantastic experience, which I plan to post (and speak) more about, but I wanted to get my slides and transcript online as soon as possible.

Abstract: Crowdsourcing for cultural heritage material has become increasingly popular over the last decade, but manuscript transcription has become the most actively studied and widely discussed crowdsourcing activity over the last four years. However, of the thirty collaborative transcription tools which have been developed since 2005, only a handful attempt to support the Text Encoding Initiative (TEI) standard first published in 1990. What accounts for the reluctance to adopt editorial best practices, and what is the way forward for crowdsourced transcription and community edition? This talk will draw on interviews with the organizers behind Transcribe Bentham, MoM-CA, the Papyrological Editor, and T-PEN as well as the speaker's own experience working with transcription projects to situate Itinera Nova within the world of crowdsourced transcription and suggest that Itinera Nova's approach to mark-up may represent a pragmatic future for public editions.

I'd like to talk about Itinera Nova within the world of crowdsourced transcription tools, which means that I need to talk a little bit about crowdsourced transcription tools themselves, and their history, and the new things that Itinera Nova brings.

Crowdsourced transcription has actually been around for a long time. Starting in the 1990s we see a number of what are called "offline" projects. This is before the term crowdsourcing was invented.

A Dutch initiative: Van Papier naar Digitaal which is transcribing primarily genealogy records.
FreeBMD, FreeREG, and FreeCEN in the UK, transcribing church registers and census records.
Demogen in Belgium -- I don't know a lot about this -- it appears to be dead right now, but if anyone can tell me more about this, I'd like to talk after this.
Archivalier Online--also transcribing census records--in Denmark,
And a series of projects by the Western Michigan Genealogy Society to transcribe local census records and also to create indexes of obituaries.

One thing these have in common, you'll notice, is that these are all genealogists. They are primarily interested in person names and dates. And they emerge out of an (at least) one hundred year old tradition of creating print indexes to manuscript sources which were then published. Once the web came online, the idea of publishing these on the web [instead] became obvious. But the tools that were used to create these were spreadsheets that people would use on their home computers. Then they would put CD-ROMs or floppy disks in the post and send them off to be published online.

Really the modern era of crowdsourced transcription begins about eight years ago. There are a number of projects that begin development in 2005. They are released (even though they've been in development for a while) starting around 2006. FamilySearch Indexing is, again, a genealogy system primarily concerned with records of genealogical interest which are tabular. It is put up by the Mormon Church.

Then things start to change a little bit. In 2008, I publish FromThePage, which is not designed for genealogy records per se -- rather it's designed for 19th and 20th century diaries and letters. (So here we have more complex textual documents.) Also in 2008, Wikisource--which had been a development of Wikipedia to put primary sources online--starts using a transcription tool. But initially, they're not using it for manuscripts because of policy in the English, French, and Spanish language Wikisources. The only people using it for manuscripts are the German Wikisource community, which has always been slightly separate. So they start transcribing free-form textual material like war journals [ed: memoirs] and letters. But again, we have a departure from the genealogy world.

In 2009, the North American Bird Phenology Program starts transcribing bird observations. So in the 1880s you had amateur bird-watchers who would go into the field and they would record their sightings of certain ducks, or geese, or things like that, and they would record the location and the birds they had observed. So we have this huge database of the presences of species throughout North America that is all on index cards. And as the climate changes and habitats change, those species are no longer there. So scientists who want to study bird migration and climate change need access to these. But they're hand-written on 250,000 index cards, so they need to be transformed. So that requires transcription, also by volunteers. [ed: The correct number of cards is over 6 million, according to Jessica Zelt's "Phenology Program (BPP): Reviving a Historic Program in the Digital Era"]

2010 is the year that crowdsourced transcription really gets big. The first big development is the Old Weather project, which comes out of the Citizen Science Alliance and the Zooniverse team that got started with GalaxyZoo. The problem with studying climate change isn't knowing what the climate is like now. It is very easy to point a weather satellite at the South Pacific right now. The problem is that you can't point a weather satellite at the South Pacific in 1911. Fortunately, in many of the world's navies, the officer of the watch would, every four hours, record the barometric pressure, the temperature, the wind speed and direction, the latitude and the longitude in the ships' logs. So all we have to do is type up every weather observation for all the navies' ships, and suddenly we know what the climate was like. Well, they've actually succeeded at this point -- in 2012 they finished transcribing all the weather observations in the British Royal Navy's ships' logs from World War I. So this has been very successful -- it's a monumental effort: they have over six hundred thousand registered accounts--not all of those are active, but they have a very large number of volunteers.

Also in 2010 in the UK, Transcribe Bentham goes live. (We'll talk a lot more about this -- it's a very well documented project.) This is a project to transcribe the notes and papers of the utilitarian philosopher Jeremy Bentham. It's very interesting technically, but it was also very successful drawing attention to the world of crowdsourced transcription.

In 2011, the Center for History and New Media at George Mason University in northern Virginia publishes the Papers of the United States War Department, and builds a tool called Scripto that plugs into it. Now this is primarily of interest to military and social historians, but again we're getting away from the world of genealogy, we're getting away from the world of individual tabular records, and we're getting into dealing with documents.

Once we get there, we have a tension. And this is a pretty common tension. There's an institutional tension, in that editing of documents has historically been done by professionals, and amateur editions have very bad reputations. Well now we're asking volunteers to transcribe. And there's a big tension between, well how do volunteers deal with this [process], do we trust volunteers? Wouldn't it be better just to give us more money to hire more professionals? So there's a tension there.

There's another tension that I want to get into here, since today is the technical track, and that's the difference between easy tools and powerful tools, and [the question of] making powerful tools easy to use. This is common to all technology--not just software, and certainly not just crowdsourced transcription--but it's new because this is the first time we're asking people to do these sorts of transcription projects.

Historically these professional [projects] have been done using mark-up to indicate deletions or abbreviations or things like that.

So there's this fear: what happens when you take amateurs and add mark-up?

Well, what is going to happen? Well, one solution--and it's a solution that I'm distressed to say is becoming more and more popular in the United States--is to get rid of the mark-up, and to say, well, let's just ask them to type plain text.

There's a problem with this. Which is that giving users power to represent what they see--to do the tasks that we're asking them to do--enables them. Lack of power frustrates them. And when you're asking people to transcribe documents that are even remotely complex, mark-up is power.

So I'm going to tell a little story about scrambled eggs. These are not the scrambled eggs that I ate this morning--which were delicious by the way--but they're very similar.

I'm going to pick on my friends at the New York Public Library, who in 2011 launched the "What's on the Menu?" project. They have an enormous collection of menus from around the world, and they want to track the culinary history of the world as dishes originate in one spot and move to other locations, the change in dishes--when did anchovies become popular? Why are they no longer popular?--things like that. So they're asking users to transcribe all of these menu items. They developed a very elegant and simple UI. This UI did not involve mark-up; this is plain-text. In fact--I'm going to get over here and read this--if you look at this instruction, this is almost stripped text: "Please type the text of the indicated dish exactly as it appears. Don't worry about accents."

Well, this may not be a problem for Americans, but it turns out that some of their menus are in languages that contain things that American developers might consider accents. This is a menu that was published on their site in 2011. They sent out an appeal asking, "can anyone read Sütterlin or old German Kurrentschrift"? I saw this and I went over to a chat channel for people who are discussing German and the German language, because I knew that there were some people familiar with German paleography there, and I wanted to try it out.

So the transcribers are going through and they're transcribing things, and they get to this entry: Rühreier. All right, let's transcribe that without accents. So they type in what they see. Rühreier is scrambled eggs. And what they type is converted to "Ruhreier", which are... eggs from the Ruhrgebiet? I don't know? This is not a dish. I'm not familiar with German cuisine, but I don't think that the Ruhr valley is famous for its eggs.

And this is incredibly frustrating! We see in the chat room logs: "Man, I can't get rid of 'Ruhreier' and this (all-capital) 'OMELETTE'! What's going on? Is someone adding these back? Can you try to change 'Ruhreier' to 'Rühreier'? It keeps going back!"

So we have this frustration. We have this potential to lose users when we abandon mark-up; when we don't give them the tools to do the job that we're asking them to do.
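To make the Rühreier problem concrete, here is a small Python sketch of one common way that software strips accents: decompose each character and throw away the combining marks. I don't know how the menu site actually processed its input, so treat this purely as an illustration of the failure mode; both function names are invented for the example.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters, then drop the combining marks (lossy)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks such as the diaeresis have a non-zero combining class.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preserve_text(text: str) -> str:
    """Normalize to composed form only, keeping diacritics intact."""
    return unicodedata.normalize("NFC", text)


print(strip_accents("Rühreier"))   # the umlaut is lost
print(preserve_text("Rühreier"))   # the umlaut survives
```

The first function turns scrambled eggs into "Ruhreier" every time, no matter how often the volunteer retypes the word. The second only normalizes how the characters are stored, so what the transcriber typed is what gets saved.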

Okay. Let's shift gears and talk about a different world. This is the world of TEI, the Text Encoding Initiative. It's regarded as the ultimate in mark-up -- Manfred [Thaller] mentioned it some time earlier. It's been a standard since 1990, and it's ubiquitous in the world of scholarly editing.

Remember, up until recently, all scholarly editing was done by professionals. These professionals were using offline tools to edit this XML which Manfred described as a "labyrinth of angle brackets." It was never really designed to be hand-edited, but that's what we're doing.

And because it's ubiquitous and because it's old, there's a perception among at least some scholars, some editors, that this is just a 'boring old standard'. I have a colleague who did a set of interviews with scholars about evaluating digital scholarship, and not all but some of the responses she got when she brought up TEI were "TEI? Oh, that's just for data entry."

Well, not quite. TEI has some strengths. It is an incredibly powerful data model. The people who are doing this--these professionals who have been working with manuscripts for decades--they've developed very sophisticated ways of modeling additions to texts, deletions to texts, personal names, foreign terms -- all sorts of ways of marking this up.

It has great tools for presentation and analysis. Notice I didn't say transcription.

And it has a very active community, and that community is doing some really exciting things.

I want to use just one example of something that has only been developed within the last four years. It's a module that was created for TEI called the Genetic Edition module. A "genetic edition" is the idea of studying a text as it changes -- studying the changes that an author has made as they cross through sections and create new sections, or over-write pieces.

So it's very sophisticated, and I want to show you the sorts of things you can do [with it] by demonstrating an example of one of these presentation tools by Elena Pierazzo and Julie Andre. Elena's at King's College London, and they developed this last year.

This is a draft of--I believe it's Proust's Recherches du Temps Perdu--unfortunately I can't see up there. But as you can see, this is a very complicated document. The author has struck through sections and over-written them. He's indicated parts moved. He's even -- if you look over here -- he's pasted on an extra page to the bottom of this document. So if you can transcribe this to indicate those changes, then you can visualize them.

[Demo screenshots from the Proust Prototype.] And as you slide, you see transcripts appear on the page in the order that they're created,

And in the order that they're deleted even.

There's even rotation and stuff --

It's just a brilliant visualization!

So this is the kind of thing that you can do with this powerful data model.

But how was that encoded? How did you get there?

Well, in this case, this is an extension to that thousand-page book. It's only about fifty pages long, printed, and it contains individual sets of guidelines. In this case, this is how Henrik Ibsen clarified a letter. In order to encode this, you use this rewrite tag with a cause... And this is that forest of angle brackets; this is very hard. And this is only one item from this document of instructions, which was small enough that I could cut it out and fit it on a slide.

So this is incredibly complex. So if TEI is powerful; and if, as it gets more complex, it becomes harder to hand-encode; and as we start inviting members of the public and amateurs to participate in this work, how are we going to resolve this?

If there's a fear about combining amateurs and mark-up, what do we do when we combine amateurs with TEI? This is panic!

And it is very rarely attempted. I maintain a directory of crowdsourced transcription tools, with multiple projects per tool. And of the 29 projects in this directory, only 7 claim to support TEI.

One of them is Itinera Nova. I found out about this when I was preparing a presentation for the TEI conference last year, in which I interviewed people running projects doing this crowdsourcing, and found out about their experience of users trying to encode in TEI, and asked, "Do you know anyone else?"

And that's how I found out about Itinera Nova, which is unfortunately not very well known outside of Belgium. This is something that I hope to be part of correcting, because you have a hidden gem here -- you really do. It is amazing.

So how do you support TEI? Well, one approach--the most common approach--is to say we'll have our users enter TEI, but we'll give them help. We'll create buttons that add tags, or menus that add tags. This has been the approach taken by T-PEN (created by the Center for Digital Theology out of Saint Louis University), and a project associated with them, the Carolingian Canon Law Project. It's also the approach taken by Transcribe Bentham with their TEI toolbar. Menus are an alternative, but essentially they do the same thing -- they're a way of keeping users from typing angle brackets. So the Virtuelles deutsches Urkundennetzwerk is one of those, as well as the Papyrological Editor which is used by scholars studying Greek papyri.

So how well does that work? You provide users with buttons that add tags to their text. Here's an example from Transcribe Bentham.

Here's an example from Monasterium. And the results are still very complicated. The presentation here is hard. It's hard to read; it's hard to work with.

That does not mean that amateurs cannot do it at all! Certainly the experience of Transcribe Bentham proves that amateurs can work to the same level as any professional transcriber, using these tools and coding these manuscripts, even without the background.

But there are limitations. One limitation is that users outgrow buttons. In Transcribe Bentham, [the most active] users eventually just started typing the angle brackets themselves -- they returned to that labyrinth of angle brackets of TEI tags.

Another problem is more interesting to me, which is when users ignore buttons. Here we have one editor who's dealing with German charters, who uses these double-pipes instead of the line break tag, because this is what he was used to from print. This speaks to something very interesting, which is that we have users who are used to their own formats, they're used to their own languages for mark-up, they're used to their own notations from print editions that they have either read or created themselves. And by asking them to switch over to this style of tagging, we're asking them not just to learn something new, but also to abandon what they may already know.

And, frankly, it's really hard to figure out which buttons [to support]. Abigail Firey of the Carolingian Canon Law Project talks about how when they were designing their interface, they had 67 buttons. This is very hard to navigate, and the users would just give up and start typing angle brackets instead, because buttons aren't a magic solution.

This is where Itinera Nova comes in. The "intermediate notation" that Professor Thaller was talking about is quite clear-cut, and it maps well to the print notations that volunteers are already used to.

And what's interesting about this is that what many people may not realize is that Itinera Nova--despite having a very clear, non-TEI interface--has full TEI under the hood.

Everything is persisted in this TEI database, so the kinds of complex analysis that we talked about earlier--not necessarily the Proust genetic editions, but this kind of thing--is possible with the data that's being created. It's not idiosyncratic.
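As a rough sketch of what "a simple notation on top, TEI underneath" can look like, here is a few lines of Python that accept the double-pipe line-break convention from print editions and store TEI's line-beginning element, `<lb/>`, instead. This is not Itinera Nova's code, and I am not showing their actual notation; the function name, the choice of `||`, and the wrapping `<p>` element are all assumptions made for the example.

```python
from xml.sax.saxutils import escape


def notation_to_tei(line_text: str) -> str:
    """Convert a '||'-delimited transcript into a TEI paragraph.

    Hypothetical mapping: each '||' typed by the volunteer becomes an
    empty <lb/> element, and the remaining text is XML-escaped.
    """
    segments = [escape(segment.strip()) for segment in line_text.split("||")]
    return "<p>" + "<lb/>".join(segments) + "</p>"


print(notation_to_tei("In nomine domini || amen & cetera"))
```

The volunteer never sees an angle bracket, but what gets saved is well-formed XML that TEI tools can read. The trade-off is that every piece of notation needs an explicit mapping like this one, so the notation has to stay small.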

So as a result, I really think that in this, Itinera Nova points the way to the future. Which is to abandon this idea that TEI is just for data entry, or that amateurs cannot do mark-up. Both of those ideas are bogus! Instead, let's say: use TEI for the data model; for the presentation, so we have these beautiful sliders. And whatever else will get created out of the annotation tool, out of the transcription tool, let's use that for the data model and for the presentation. But let's consider hooking up these--I don't want to say "easier"--but these more straightforward, these more traditional user interfaces [for transcription].

This is something that I think is really the way forward for crowdsourced transcription. It is being done right now by the Papyrological Editor, it has been done by Itinera Nova for a long time. And there are now some incipient projects to move forward with this. One of these is a new project at the University of Maryland, Maryland Institute for Technology in the Humanities, the Skylark project, in which they are taking those same transcription tools that were used for Old Weather to allow people to mark up and transcribe portions of an image of a literary text that has been heavily annotated--like that Proust--to create data using the data model that can be viewed with tools like the Proust viewer.

So this is, I think, the technical contribution that Itinera Nova is making. Obviously there are a lot more contributions--I mean I'm absolutely stunned by the interaction with the volunteer community that's happening here--but I'm staying on the technical track, so I'm not going to get into that.

Are there any questions? No? Keep up the great work -- you folks are amazing.

Ngoni Munyaradzi on Transcribe Bleek and Lloyd

noreply@blogger.com (Ben W. Brumfield) — Tue, 26 Feb 2013 20:59:00 +0000

Ngoni Munyaradzi is a Master's student in Computer Science at the University of Cape Town, South Africa, working on a research project on the transcription of the Digital Bleek and Lloyd collection. He kindly agreed to an interview over email, which I present below:

Your website does an excellent job explaining the background and motivation of Transcribe Bleek and Lloyd. Can you tell us more about the field notebooks you are transcribing?

The Digital Bleek and Lloyd Collection is composed of dictionaries, artwork and notebooks documenting stories about the earliest inhabitants of Southern Africa, the Bushman people. The notebooks were written by Wilhelm Bleek, his sister-in-law Lucy Lloyd, and Dorothea Bleek (Wilhelm's daughter) in the 19th century, with the help of a number of Bushman people who were prisoners in the Western Cape region of South Africa at the time. The notebooks were recorded in the |Xam and !Kun languages, and English translations of these languages are available in the notebooks.

Link to the collection: http://lloydbleekcollection.cs.uct.ac.za/

Correct me if I'm wrong, but it seems like at least in the case of |Xam, you are working with one of the only representatives of an extinct language. Are there any standard data models for these kinds of vocabularies/bilingual texts which you're using?

There are no complete models - the best known models are still only partial.

I suspect that I'm not alone in wondering why these Bushman people were prisoners during the writing of these texts. Can you tell us a bit more about the Bleek/Lloyd informants, or point us to resources on the subject?

The Bushman people were prisoners because of petty crimes and a grossly unfair colonial government. On the Bleek and Lloyd website there is a story on each contributor. There is information in various books on the subject as well, but I am not sure there is more that is known than what is on the website. See:
http://lloydbleekcollection.cs.uct.ac.za/xam.html
http://lloydbleekcollection.cs.uct.ac.za/kun.html

This is the first transcription project I'm aware of using the Bossa Crowd Create platform. What are the factors that led you to choose that platform and what's been your experience setting it up?

In 2011, when our project began, Bossa was the most mature open-source crowdsourcing framework available that was tailored for volunteer projects. Because of this, Bossa suited the project's requirements well. The alternative crowdsourcing frameworks available at the time used payment methods.

Setting up the Bossa framework was a relatively straightforward task. The documentation online is very thorough, with examples of how to set up test applications. I also got assistance from David Anderson, the developer of Bossa.

The Bushman writing system seems extremely complex with its special characters and multiple diacritics. I see that you are using LaTeX macros to encode these complexities. Why did you decide on LaTeX and what has been the user response to using that notation?

So the project is part of ongoing research related to the Bleek and Lloyd Collection within our Digital Libraries Laboratory at the University of Cape Town. Credit for developing the encoding tool goes to Kyle Williams. He chose LaTeX because custom LaTeX macros allowed both the encoding and the visual rendering of the text to be solved in a single step. Developing a unique font for the Bushman script is something we might look at in the future!

Here's a link to a paper published on the encoding tool developed by Kyle Williams: http://link.springer.com/chapter/10.1007%2F978-3-642-24826-9_28

Overall the user feedback has been good, as most users are able to complete transcriptions using the LaTeX macros. We have gotten suggestions from users to use glyphs to encode the complexities. Currently the scope of my masters research project does not include that. There are talks in our research group to develop a unique font to represent the |Xam and !Kun languages, as this is not supported by Unicode.

User 1 Comment: "I think the palette handles the complexity of the character set very well. This material is inherently difficult to transcribe. The tool has, on the whole, been well thought out to meet this challenge. I think it needs to be improved in some ways, but considering the difficulties it is remarkably well done."

User 2 Comment: "VERY intuitive, after a few practice transcriptions. I actually enjoyed using the tool after a page was done."

This is incredibly useful. So far as I'm aware, yours is only the third crowdsourced transcription project that's surveyed users seriously (after the North American Bird Phenology Project and Transcribe Bentham). Do you have any advice on collecting user feedback at such an early stage?

Collecting user feedback in the early stages will tremendously help project administrators determine whether the setup of the project is easy to follow for participants. One can easily pick up any hindrances to user participation and address these early. From our project, I've found that participants can actually suggest very helpful ideas that will make the data collection process better.

Crowdsourced citizen science and cultural heritage projects have mostly been based in the USA, Northern Europe and Australia until recently -- in fact, yours is the first that I'm aware of originating in sub-Saharan Africa. I'd really like to know which projects inspired your work with Transcribe Bushman, and what your hopes are for crowdsourced transcription projects focusing on Africa?

Our work was mostly inspired by the success of GalaxyZoo at recruiting volunteers, and also by the Transcribe Bentham project, which explored the feasibility of volunteers performing transcription. I hope that more crowdsourced transcription projects will start up within Africa in the near future. It would be interesting to see a transcription project for the Timbuktu manuscripts of Mali. Beyond transcription, I would like to see other researchers adopting crowdsourcing in fields of specialty within Africa.

Thanks so much for this interview. If people want to help out on the project, what's the best way for them to contribute?

Interested participants can simply:

Create an account on the project website.
Watch a 5 minute video tutorial on how to transcribe the Bushman languages.
With that, you are ready to start transcribing pages.

Detecting Handwriting in OCR Text

noreply@blogger.com (Ben W. Brumfield) — Mon, 25 Feb 2013 14:56:00 +0000

This is my fourth and final post about the iDigBio Augmenting OCR Hackathon. Prior posts covered the hackathon itself, my presentation on preliminary results, and my results improving the OCR on entomology specimens. The other participants are slowly adding their results to the hackathon wiki, which I recommend checking back with (their efforts were much more impressive than mine).

Clearly handwritten: T=8, N=78% from terse and noisy OCR files

Let's say you have scanned a large number of cards and want to convert them from pixels into data. The cards--which may be bibliography cards, crime reports, or (in this case) labels for lichen specimens--have these important attributes:

They contain structured data (e.g. title of book, author, call number, etc. for bibliographies) you want to extract, and
They were part of a living database built over decades, so some cards are printed, some typewritten, some handwritten, and some with a mix of handwriting and type.

The structured aspect of the data makes it quite easy to build a web form that asks humans to transcribe what they see on the card images. It also allows for sophisticated techniques for parsing and cleaning OCR (which was the point of the hackathon). The actual keying-in of the images is time consuming and expensive, however, so you don't want to waste human effort on cards which could be processed via OCR.

Since OCR doesn't work on handwriting, how do you know which images to route to the humans and which to process algorithmically? It's simple: any images that contain handwriting should go to the humans. Detecting the handwriting on the images is unfortunately not so simple.

I adopted a quick-and-dirty approach for the hackathon: if OCR of handwriting produces gibberish, why not send all the images through a simple pass of OCR and look in the resulting text files for representative gibberish? In my preliminary work, I pulled 1% of our sample dataset (all cards ending with "11") and classified them three ways:

Visual inspection of the text files produced by an ABBY OCR engine,
Visual inspection of the text files produced by the Tesseract OCR engine, and
Looking at the actual images themselves.

To my surprise, I was only able to correctly classify cards from OCR output 80% of the time -- a disappointing finding, since any program I produced to identify handwriting from OCR output could only be less accurate. More interesting was the difference between the kinds of files that ABBY and Tesseract produced. Tesseract produced a lot more gibberish in general--including on card images that were entirely printed. ABBY, on the other hand, scrubbed a lot of gibberish out of its results, including that which might be produced when it encountered handwriting.

This suggested an approach: look at both the "terse" results from ABBY and the "noisy" results from Tesseract to see if I could improve my classification rate.

Easily classified as type-only, despite (non-characteristic) gibberish: T=0,N=0 from terse and noisy OCR files.

But what does it mean to "look" at a file? I wrote a program to loop through each line of an OCR file and check for the kind of gibberish characteristic of OCR and handwriting. Inspecting the files reveals some common gibberish patterns, which we can sum up as regular expressions:

# Patterns of gibberish that OCR tends to emit when it encounters handwriting.
GARBAGE_REGEXEN = {
  'Four Dots' => /\.\.\.\./,
  'Five Non-Alphanumerics' => /\W\W\W\W\W/,
  'Isolated Euro Sign' => /\S€\D/,
  'Double "Low-Nine" Quotes' => /„/,
  'Anomalous Pound Sign' => /£\D/,
  'Caret' => /\^/,
  'Guillemets' => /[«»]/,
  'Double Slashes and Pipes' => /(\\\/)|(\/\\)|([\/\\]\||\|[\/\\])/,
  'Bizarre Capitalization' => /([A-Z][A-Z][a-z][a-z])|([a-z][a-z][A-Z][A-Z])|([A-LN-Z][a-z][A-Z])/,
  'Mixed Alphanumerics' => /(\w[^\s\w\.\-]\w).*(\w[^\s\w]\w)/
}

However, some of these expressions match non-handwriting features like geographic coordinates or bar codes. Handling these requires a white list of regular expressions for gibberish we know not to be handwriting:

# Patterns that look like gibberish but are known not to be handwriting.
WHITELIST_REGEXEN = {
  'Four Caps' => /[A-Z]{4,}/,
  'Date' => /Date/,
  'Likely year' => /1[98]\d\d|2[01]\d\d/,
  'N.S.F.' => /N\.S\.F\.|Fund/,
  'Lat Lon' => /Lat|Lon/,
  'Old style Coordinates' => /\d\d°\s?\d\d['’]\s?[NW]/,
  'Old style Minutes' => /\d\d['’]\s?[NW]/,
  'Decimal Coordinates' => /\d\d°\s?[NW]/,  
  'Distances' => /\d?\d(\.\d+)?\s?[mkf]/,  
  'Caret within heading' => /[NEWS]\^s/,
  'Likely Barcode' => /[l1\|]{5,}/,
  'Blank Line' => /^\s+$/,
  'Guillemets as bad E' => /d«t|pav«aont/  
}

With these on hand, we can calculate a score for each file based on the number of occurrences of gibberish we find per line. That score can then be compared against a threshold to determine whether a file contains handwriting. Due to the noisiness of the Tesseract files, I found it most useful to calculate their score N as a percentage of non-blank lines, while the score for the terse files T worked best as a simple count of gibberish matches.
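As a rough illustration of the two scores, here is a minimal sketch. It uses shortened, stand-in pattern lists rather than the full hashes, and it assumes that a line matching any whitelist pattern is skipped entirely and that a "match" means one gibberish pattern hitting one line. The helper names (`gibberish_hits`, `count_score`, `percent_score`) are hypothetical, not taken from the hackathon code.

```ruby
# Shortened, hypothetical pattern lists for illustration only.
GARBAGE = {
  'Four Dots' => /\.\.\.\./,
  'Caret'     => /\^/
}
WHITELIST = {
  'Likely year' => /1[98]\d\d|2[01]\d\d/,
  'Lat Lon'     => /Lat|Lon/
}

# Assumption: a whitelisted line contributes no gibberish at all.
def gibberish_hits(line, garbage: GARBAGE, whitelist: WHITELIST)
  return 0 if whitelist.values.any? { |re| line =~ re }
  garbage.values.count { |re| line =~ re }
end

# T-style score: a simple count of gibberish matches in the file.
def count_score(text)
  text.each_line.sum { |line| gibberish_hits(line) }
end

# N-style score: percentage of non-blank lines that contain gibberish.
def percent_score(text)
  lines = text.each_line.reject { |l| l.strip.empty? }
  return 0.0 if lines.empty?
  100.0 * lines.count { |l| gibberish_hits(l) > 0 } / lines.size
end

sample = "Lichens of Texas\n^ ....\n\nLat 30 ^ Lon 97\nColl. J. Smith\n"
puts count_score(sample)    # 2 (four dots and a caret on the second line)
puts percent_score(sample)  # 25.0 (one of four non-blank lines)
```

The fourth line of the sample contains a caret but is skipped because it matches the "Lat Lon" whitelist entry, which is the kind of coordinate noise the whitelist is meant to absorb.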

Threshold	Correct	False Positives	False Negatives
T > 1 and N > 20%	82%	10 of 45	8 of 60
T > 0 and N > 20%	84%	13 of 45	4 of 60
T > 1	79%	10 of 45	12 of 60
N > 20%	75%	8 of 45	18 of 60
N > 10%	81%	14 of 45	6 of 60

One interesting thing about this approach is that adjusting the thresholds lets us tune the classifications for resources and desired quality. If our humans doing data entry are particularly expensive or impatient, raising the thresholds should ensure that they are only very rarely sent typed text. On the other hand, lowering the thresholds would increase the human workload while improving quality of the resulting text.
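The tuning described here can be sketched as a single predicate with adjustable thresholds. This is only an illustration: `handwriting?` is a hypothetical name, the defaults mirror the "T > 0 and N > 20%" row, and the "and" in the table is read literally. The three sample cards reuse the T and N values quoted in the image captions of this post.

```ruby
# Hypothetical helper: t is the match count from the terse OCR file,
# n is the percentage score from the noisy OCR file.
def handwriting?(t:, n:, t_min: 0, n_min: 20.0)
  t > t_min && n > n_min
end

cards = {
  'clearly handwritten' => { t: 8, n: 78.0 },
  'type only'           => { t: 0, n: 0.0 },
  'false negative'      => { t: 0, n: 10.0 }
}

cards.each do |name, scores|
  route = handwriting?(**scores) ? 'humans' : 'OCR'
  puts "#{name}: #{route}"
end
```

Raising `t_min` or `n_min` sends fewer cards to humans; lowering them sends more, which would catch the third card at the cost of more typed cards in the human queue.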

One of the false negatives: T=0, N=10% from parsing terse and noisy text files.

I'm really pleased with this result. The combined classifications are slightly better than I was able to accomplish by looking at the OCR myself. The experience of a volunteer presented with 56 images containing handwriting and 13 which don't may necessitate a "send to OCR" button in the user interface, but must be less frustrating than the unclassified ratio of 45 in 105 from the sample set. With a different distribution of handwriting-to-type in the dataset, the process might be very useful for extracting rare typed material from a mostly-handwritten set, or vice versa.
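The arithmetic behind those figures can be checked with a few lines. The counts come from the "T > 0 and N > 20%" row of the table; the variable names are just for illustration.

```ruby
handwritten = 60  # cards containing handwriting in the sample
typed       = 45  # type-only cards in the sample
false_pos   = 13  # typed cards wrongly flagged as handwriting
false_neg   = 4   # handwritten cards that were missed

total            = handwritten + typed           # 105
sent_handwritten = handwritten - false_neg       # 56
sent_total       = sent_handwritten + false_pos  # 69

accuracy     = 100.0 * (total - false_pos - false_neg) / total
typed_before = 100.0 * typed / total
typed_after  = 100.0 * false_pos / sent_total

puts format('accuracy: %.1f%%', accuracy)                     # accuracy: 83.8%
puts format('typed share, unclassified: %.1f%%', typed_before) # 42.9%
puts format('typed share, after routing: %.1f%%', typed_after) # 18.8%
```

So a volunteer would see typed cards a bit less than one time in five, instead of more than two times in five.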

All of the datasets, code, and scored CSV files are in iDigBio AOCR Hackathon's HandwritingDetection repository on GitHub.

Results of the "Ocrocrop" Approach to Improving OCR

noreply@blogger.com (Ben W. Brumfield) — Fri, 15 Feb 2013 22:21:00 +0000

This project attempted to improve the quality of OCR applied to difficult entomology images[*] by cropping labels from the images to run through OCR separately. In order to identify labels on the image to crop, an initial, 'naive' pass of OCR was made over the whole image, generating both

A) a set of rectangles on the image defined as word bounding boxes by the OCR engine, and
B) a control OCR text file to be used for comparing the 'naive' model with the methodology.

Those word rectangles were then filtered, consolidated, and filtered again to identify the labels on the image, which were then extracted and run through the OCR engine separately. The resulting OCR output files were then concatenated into a single text file, which was compared against the 'naive' output described in B (above).

I'll call this method "ocrocrop". (For more detail on method, see the transcript of my preliminary presentation.)

The results were encouraging. (See CSV file listing results for each file, and the directory containing "naive" output, annotated JPGs, and cleaned output files for each test.)

Of 80 files tested, 20 experienced a decrease in score (see Alex Thomson's scoring service), but most (14/20) of those were on OCR output below 10% accuracy in the first place, and the remainder were at or below 20% accuracy. So it is reasonable to say that the ocrocrop method only degraded the quality of texts that were unusable in the first place.

40 of the 80 files tested showed more promising results, showing improvements from one to twenty percentage points -- in some cases only marginally improving unusable (below 10% accurate) outputs, but in many cases improving the scores more substantially (say from 25% to 35% in the case of EMEC609908_Stigmus_sp).

Most of the top quartile of results saw improvements on texts that were already scoring above 10% accuracy rates (16 of 20), so it appears that the effectiveness of the ocrocrop method is correlated to the quality of the naive input data -- garbage is degraded or only minimally improved, while OCR that is merely bad under the naive approach can be significantly improved.

The ocrocrop method saw the greatest improvement in cases where the naive OCR pass was effective at identifying word bounding boxes, but ineffective at translating their contents into words. Taking EMEC609928_Stigmus_sp, the case of greatest improvement (naive: 18.9%, ocrocrop: 70.5%), we see that all words on the labels except for the collector name were recognized as words (in purple), making the cropped label images (in blue) good representatives of the actual labels on the image.

The cropped image was more easily processed by our OCR engine, so that we may compare the naive version of the second label:

 CALIF:Hunbo1dt Co. ;‘ ~
 3 m1.N' Garbervﬂle ,::f< '_- '
 v—23~75 n.n1e:z.' 9 ._ ’

with the ocrocrop version of the second label:

 CALIF:Humboldt Co.
 3 mi.N Garberville
 V-23-76 R.Dietz,'

One of the problems with the OCR-based pre-processing which may be hidden by the scores is that many labels are entirely missed by the ocrocrop if the first, naive OCR pass failed to identify any words at all on the label. In cases such as EMEC609651_Cerceris_completa, the determination label was not cropped (indicated by blue rectangles) because no words (purple rectangles) were detected by the original. As a result, while the ocrocrop OCR is an improvement over the naive OCR (6.6% vs. 6.5%), substantial portions of text on the image are unimproved because they are unattempted.

There are two possible ways to solve this problem. One is to abandon the ocrocrop model entirely, switching back to a computer vision approach -- either by programmatically locating rectangles on the image (as Phuc Nguyen demonstrated) or by asking humans to identify regions of interest for OCR processing (as demonstrated by Jason Best in Apiary and by Paul and Robin Schroeder in ScioTR). The other option is to improve the naive OCR -- perhaps by swapping out the engine (e.g. use ABBY instead of Tesseract), perhaps by using a different image pre-processor (like ocropus's front-end to Tesseract), perhaps by re-training Tesseract.

I suspect that a computer vision approach to extracting entomology labels (or similar pieces of paper photographed against a noisy background) will provide a more effective eventual solution than the ocrocrop method. Nevertheless, the ocrocrop "bang it with a rock until it works" approach has a lot of potential to take entomology-style OCR from worse to bad.

[*]In addition to the difficulties typical of specimen labels--mix of typefaces, handwritten material, typewritten material, text inventory with few overlaps with a dictionary of literary English--the entomology dataset contained additional challenges. Difficulties included the following:

Images containing specimens and rulers as well as labels.
Labels casually arranged for photography, so that text orientation was not necessarily aligned.
Labels photographed against a background of heavily pin-pricked styrofoam rather than a black or neutral background.
3-d images including what appear to be shadows, which soften the contrast differences around borders.

iDigBio Augmenting OCR Hackathon

noreply@blogger.com (Ben W. Brumfield) — Fri, 15 Feb 2013 22:00:00 +0000

I spent the last three days at the iDigBio Augmenting OCR Hackathon working alongside mycologists, botanists, entomologists, herbarium managers, and bioinformaticians to explore ways to improve parsing of digitized specimen labels. While I'm pleased with the results of my own contribution, I'd like to take a minute to talk about the hackathon process itself before I post them.

This was my first hackathon--a condition which seemed to be the rule among the participants--and I was really impressed with it. The iDigBio folks defined a clear set of goals (improve OCR parsing of specimen labels) with clear metrics (these datasets, these output formats, this scoring algorithm) a couple of months beforehand, and organized five weekly videoconferences before the event. Most important of all, the participants were encouraged to prepare a 10-minute lightning talk on their efforts and preliminary results. (See below for the transcript of my talk, see the notes document for descriptions of all talks.)

In my opinion, these preliminary talks were critical to the success of the project. The preliminary nature relaxed pressure on participants, so we were able to experiment beyond the target of the hackathon (as I did with my handwriting detection digression, a related, but un-scorable effort). On the other hand, they did provide enough impetus to get many of us looking at the data, working with the tools, and thinking about approaches. This meant that even before the hackathon started, many of us were familiar enough with the materials to have a real 'meeting of the minds' experience during the pre-event supper: "Did you just say 'the contrast difference between the print and the label is higher than the difference between the label and the background'? We ran into that too, and here's what we did..."

The experience was a real education in OCR for me, and I feel like I picked up techniques I can apply directly to projects I've discussed with clients and potential clients. In particular, I got a real appreciation for how interrelated image preparation, OCR, and parsing are. One participant had created separate libraries of regular expressions to clean up each kind of field, having discovered that latitude/longitude coordinates require different error correction than personal names or herbarium catalog numbers do. Another group had built a touch-screen tool for classifying segments of the image before submitting them to OCR. My own project required a first pass of OCR to clean images before sending them to a second, 'real' pass of OCR. A simple 1,2,3 workflow just isn't sufficient!

iDigBio itself is an NSF-funded attempt to advance digitization practices on natural history collections, combining disciplinary "thematic collection networks" and methodologically focused working groups on topics like georeferencing, crowdsourcing, and OCR. Aware that they're not the only people digitizing things, they have been reaching out beyond the natural sciences to the library and information science community at the iConference this year. This rejection of "not invented here" siloing was a big part of the hackathon, and I hope that more people from outside the natural sciences will get involved.

Improving OCR Inputs from OCR Outputs?

noreply@blogger.com (Ben W. Brumfield) — Thu, 14 Feb 2013 15:32:00 +0000

This is a transcript of my talk at the iDigBio Augmenting OCR Hackathon, presenting preliminary results of my efforts before the event.

For my preliminary work, I tried to improve the inputs to our OCR process through looking at the outputs of a naive OCR.

One of the first things that we can do to improve the quality of our inputs to OCR is to not feed them handwriting. To quote Homer Simpson, "Remember son, if you don't try, you can't fail." So let's not try feeding our OCR processes handwritten materials.

To do this, we need to try to detect the presence of handwriting. When you try to feed handwriting to OCR, you get a lot of gibberish. If we can detect handwriting, we can route some of our material to "humans in the loop" -- not wasting their time with things we could be OCRing. So how do we do this?

My approach was to use the outputs of [naive] OCR to detect the gibberish it produces when it sees handwriting to try to determine when there was handwriting present in the images. The first thing I did before I started programming was classifying OCR output from the lichen samples by visual inspection: whether I thought there was handwriting present or not, based on looking at the OCR outputs. Step two was to automate the classifications.

I tried this initially on the results that came out of ABBY and then the results that came out of Tesseract, and I was really surprised by how hard it was for me as a human to spot gibberish. I could spot it, but in a lot of cases -- ABBY does a great job of cleaning up its OCR output -- so in a lot of cases, particularly the labels that were all printed with the exception of some species name that was handwritten, ABBY generally misses those. Tesseract, on the other hand, does not produce outputs that are quite as clean.

So the really interesting thing about this to me is that while we were able to get 70-75% accuracy on both ABBY and Tesseract, if you look at the difference between the false positives that come out of ABBY and Tesseract and the false negatives, I think there is some real potential here for making a much more sophisticated algorithm. Maybe the goal is to pump things through ABBY for OCR, but beforehand look at Tesseract output to determine whether there is handwriting or not.

The next thing I did was try to automate this. I just used some regular expressions to look for representative gibberish, and then based on the number of matches got results that matched the visual inspection, though you do get some false positives.

The next thing I want to do with this is to come up with a way to filter the results based on doing a detection on ABBY [output] and doing a detection on Tesseract [output].

The next thing that I wanted to work on was label extraction.

We're all familiar with the entomology labels and problems associated with them.

So if you pump that image of Cerceris through Tesseract, you end up with a lot of garbage. You end up with a lot of gibberish, a lot of blank lines, some recognizable words. That "Cerceris compacta" is, I believe, the result of a post-digitization process: it looks like an artifact of somebody using Photoshop or ImageMagick to add labels to the image. The rest of it is the actual label contents, and it's pretty horrible. We've all stared at this; we've all seen it.

So how do you sort the labels in these images from rulers, holes in styrofoam, and bugs? I tried a couple of approaches. I first tried to traverse the image itself, looking for contrast differences between the more-or-less white labels and their backgrounds. The problem I found with that was that the highest contrast regions of the image are the difference between print and the labels behind the print. So you're looking for a fairly low-contrast difference--and there are shadows involved. Probably, if I had more math I could do this, but this was too hard.

So my second try was to use the output of OCR that produces these word bounding boxes to determine where labels might be, because labels have words on them.

If you run Tesseract or Ocropus with an "hocr" option, you get these pseudo-HTML files that have bounding boxes around the text. Here you see this text element inside a span; the span has these HTML attributes that say "this is an OCR word". Most importantly, you have the title attribute as the bounding box definition of a rectangle.
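As an illustration of pulling those rectangles out of an hOCR file, here is a minimal sketch that uses a regular expression rather than a real HTML parser. It assumes word spans carry a class of `ocrx_word` (or `ocr_word`), that `class` precedes `title` inside the tag, that the title contains `bbox x0 y0 x1 y1`, and that the span holds plain text with no nested tags. `Box`, `WORD_SPAN`, and `word_boxes` are hypothetical names, and the sample markup is made up for the example.

```ruby
Box = Struct.new(:text, :x0, :y0, :x1, :y1)

# Captures the four bbox coordinates and the word text of each word span.
WORD_SPAN = /<span[^>]*class=['"]ocrx?_word['"][^>]*title=['"][^'"]*bbox (\d+) (\d+) (\d+) (\d+)[^'"]*['"][^>]*>([^<]*)<\/span>/

def word_boxes(hocr)
  hocr.scan(WORD_SPAN).map do |x0, y0, x1, y1, text|
    Box.new(text.strip, x0.to_i, y0.to_i, x1.to_i, y1.to_i)
  end
end

sample = <<~HOCR
  <span class='ocr_line' title='bbox 10 10 300 40'>
    <span class='ocrx_word' id='word_1' title='bbox 10 12 90 38'>CALIF:</span>
    <span class='ocrx_word' id='word_2' title='bbox 100 12 220 38'>Humboldt</span>
  </span>
HOCR

word_boxes(sample).each do |b|
  puts "#{b.text}: (#{b.x0}, #{b.y0}) to (#{b.x1}, #{b.y1})"
end
```

A proper HTML parser would be more robust against attribute reordering and nested markup; the regular expression just keeps the sketch dependency-free.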

If you extract that and re-apply it to an image, you see that there are a lot of rectangles on the image, but not all the rectangles are words. You've got bees, you've got rulers; you've got a lot of random trash in the styrofoam.

So how do we sort good rectangles from bad rectangles? First I did a pass looking at the OCR text itself. If the bounding box was around text that looked like a word, I decided that this was a good rectangle. Next, I did a pass by size. A lot of the dots in the styrofoam come out looking suspiciously word-like for reasons I don't understand. So if the area of the rectangle was smaller than .015% of the image, I threw it away.
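The two filtering passes might look something like the sketch below. The .015% area cutoff comes from the talk; the "looks like a word" test (three or more letters in a row) is purely an assumption for illustration, as are the names `Box`, `wordlike?`, and `good_boxes`.

```ruby
Box = Struct.new(:text, :x0, :y0, :x1, :y1) do
  def area
    (x1 - x0) * (y1 - y0)
  end
end

MIN_AREA_FRACTION = 0.015 / 100 # .015% of the whole image

# Assumption: "word-like" means at least three consecutive letters.
def wordlike?(text)
  text.match?(/[A-Za-z]{3,}/)
end

def good_boxes(boxes, image_width, image_height)
  min_area = MIN_AREA_FRACTION * image_width * image_height
  boxes.select { |b| wordlike?(b.text) && b.area >= min_area }
end

boxes = [
  Box.new('Humboldt', 100, 12, 220, 38), # a real word, large enough
  Box.new(';~',       400, 50, 480, 90), # gibberish from a bee or ruler
  Box.new('abc',        5,  5,  15, 15)  # word-like, but a styrofoam dot
]

puts good_boxes(boxes, 2000, 1000).map(&:text).inspect
```

On a 2000 by 1000 pixel image the cutoff works out to 300 square pixels, so only the first box survives both passes.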

The result was [above]: you see rectangles marked with green that pass my filter and rectangles marked with red that don't. So you get rid of the bee, you get rid of part of the ruler -- more important, you get rid of a lot of the trash over here. [Pointing to small red rectangles on styrofoam.] There are some bugs in this--we end up getting rid of "Arizona" for reasons I need to look at--but it does clean the thing up pretty nicely.

Question: A very simple solution to this would be for the guys at Berkeley to take two photographs -- one of the bee and ruler, one of the labels. I'm just thinking how much simpler that would be.

Me: If the guys in Berkeley had a workflow that took the picture--even with the bee--against a black background, that would trivialize this problem completely!

Question: If the photos were taken against a background of wallpaper with random letters, it couldn't be much worse than this [styrofoam]. The idea is that you could make this a lot easier if you would go to the museums and say, we'll participate, we'll do your OCRing, but you must take photographs this way.

Me: You're absolutely right. You could even hand them a piece of cardboard that was a particular color and say, "Use this and we'll do it for you, don't use it and we won't." I completely agree. But this is what we're starting with, so this is what I'm working on.

The next thing is to aggregate all those word boxes into the labels [they constitute]. For each rectangle, look at all of the other rectangles in the system, expand them both a little bit, determine if they overlap, and if they do, consolidate them into a new rectangle, and repeat the process until there are no more consolidations to be done. [Thanks to Sara Brumfield for this algorithm.]

If you do that, the blue boxes are the consolidated rectangles. Here you see a rectangle around the U.C. Berkeley label, a rectangle around the collector, and a pretty glorious rectangle around the determination that does not include the border.

Having done that, you want to further filter those rectangles. Labels contain words, so you can reject any rectangles that were "primitives" -- you can get rid of the ruler rectangle, for example, because it was just a single [primitive] rectangle that was pretty large.
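The consolidation loop and the "no primitives" filter described above could be sketched roughly as follows. This is an illustration under assumptions, not the code used in the talk: the 10-pixel padding, the `Rect` struct, and the `members` counter (how many word boxes were merged into a rectangle) are all invented for the example.

```ruby
Rect = Struct.new(:x0, :y0, :x1, :y1, :members) do
  # Grow the rectangle by pad pixels on every side.
  def expand(pad)
    self.class.new(x0 - pad, y0 - pad, x1 + pad, y1 + pad, members)
  end

  def overlaps?(other)
    x0 <= other.x1 && other.x0 <= x1 && y0 <= other.y1 && other.y0 <= y1
  end

  # Smallest rectangle covering both, remembering how many words went in.
  def merge(other)
    self.class.new([x0, other.x0].min, [y0, other.y0].min,
                   [x1, other.x1].max, [y1, other.y1].max,
                   members + other.members)
  end
end

# Merge any two rectangles that overlap once expanded; repeat until none do.
def consolidate(rects, pad: 10)
  rects = rects.dup
  loop do
    pair = rects.combination(2).find { |a, b| a.expand(pad).overlaps?(b.expand(pad)) }
    break unless pair
    a, b = pair
    rects.delete_at(rects.index { |r| r.equal?(a) })
    rects.delete_at(rects.index { |r| r.equal?(b) })
    rects << a.merge(b)
  end
  rects
end

# Keep only rectangles built by consolidation, dropping lone "primitives".
def labels(word_rects, pad: 10)
  consolidate(word_rects, pad: pad).reject { |r| r.members == 1 }
end

words = [
  Rect.new(100, 10, 180, 30, 1),   # first word on a label
  Rect.new(190, 10, 260, 30, 1),   # second word on the same line
  Rect.new(100, 40, 200, 60, 1),   # word on the next line of the label
  Rect.new(600, 500, 1200, 560, 1) # a lone, large box such as the ruler
]

labels(words).each { |r| puts r.to_a.inspect }
```

Each pass through the loop removes two rectangles and adds one, so the loop always terminates. In the sample, the three nearby word boxes collapse into one label rectangle and the isolated ruler box is rejected.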

So you make sure that all of your rectangles were created through consolidation, then you crop the results. And you end up automatically extracting these images from that sample -- some of which are pretty good, some of which are not. We've got some extra trash here, we cropped the top of "Arizona" here. But for some of the labels -- I don't think I could do better than that determination label by hand.

Then you feed the results back into Tesseract one by one, then we combine the text files in Y-axis order to produce a single file for all those images. (Not something that's a necessary step, but that does allow us to compare the results with the "raw" OCR.) How did we do?
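Combining the per-label OCR output in Y-axis order is a simple sort on the top edge of each crop. The sketch below leaves out the OCR call itself and uses placeholder strings; `Crop` and `combine` are hypothetical names.

```ruby
# y0 is the top edge of the cropped label on the original image;
# text is whatever OCR returned for that crop.
Crop = Struct.new(:y0, :text)

def combine(crops)
  crops.sort_by(&:y0).map { |c| c.text.strip }.join("\n\n") + "\n"
end

crops = [
  Crop.new(420, "determination label text\n"),
  Crop.new(35,  "locality label text\n"),
  Crop.new(210, "collection label text\n")
]

puts combine(crops)
```

Sorting top to bottom roughly matches the order in which a whole-image OCR pass reads the picture, which is what makes the combined file comparable with the "raw" output.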

This is a resulting text file -- we've got a date that's pretty recognizable, we've got a label that's recognizable, and the determination is pretty nice.

Let's compare it to the raw result. In the cropped results, we somehow missed the "Cerceris compacta", we did a much nicer job on the date, and the determination is actually pretty nice.

Let's try it on a different specimen image.

We run the same process over this Stigmus image. We again find labels pretty well.

When we crop them out, the autocrop pulls them out into these three images.

Running those images through OCR, we get a comparison of the original, which had a whole lot of gibberish.

The original did a decent job with the specimen number, but the autocrop version does as well. In particular, for this location [field], the autocrop version is nearly perfect, whereas the original is just a mess.

My conclusion is that we can extract labels fairly effectively by first doing a naive pass of OCR and looking at the results of that, and that the results of OCR over the cropped images are less horrible than those of running OCR over the raw images -- though still not great.

[2013-02-15 update: See the results of this approach and my write-up of the iDigBio Augmenting OCR Hackathon itself.]

What does it mean to "support TEI" for manuscript transcription?

noreply@blogger.com (Ben W. Brumfield) — Sat, 10 Nov 2012 14:57:00 +0000

This is a transcript of my talk at the 2012 TEI meeting at Texas A&M University, "What does it mean to 'support TEI' for manuscript transcription: a tool-maker's perspective."

You can download an MP3 recording of the talk here.

Let's get started with a couple of definitions. All the tools and the sites that I'm reviewing are cloud based, which means that I'm ruling out--perhaps arbitrarily--any projects that involve people doing offline edition and then publishing that on the web. I'm only talking about online-based tools.

So that's a very strict definition of clouds, and I'm going to have a very loose and squishy definition of crowds, in which I'm talking about any sort of tool that allows collaborative editing of manuscript material, and not just ones that are directed at amateurs. That's important for a couple of reasons: one, because it gave me a sample size that was large enough to find out how people are using TEI, but--for another reason--because "amateurs" aren't really amateurs. What we see with crowdsourcing projects is that amateurs become experts very quickly. And given that your average user of any citizen science or historical crowdsourcing project is a woman over 50 who has at least a Master's degree, this isn't sort of the unwashed masses.

Okay, so crowdsourced transcription has been going on for a while, and it's been happening in four different traditions that developed this all independently. You have genealogists who are doing this, primarily with things like census records. The 1940 census is the most prominent example: they have volunteers transcribing as many as ten million records a day. The natural sciences are doing something similar, particularly GalaxyZoo, the OldWeather people are looking at climate change data, where you have to look at old, handwritten records to figure out how the climate has changed, because you need to know how the climate used to be. And then there are also some projects going on in the Open Source/Creative Commons world: the Wikisource people--particularly the German language Wikisource community--and libraries, archives, and museums have jumped into this recently.

So here are a couple of examples from the citizen science world. OldWeather has a tool that allows people to record ship log book entries and weather observations. As you can see, this is all field based -- this isn't quite an attempt to represent a document. We'll get back to this in a minute.

The North American Bird Phenology Program is transcribing old bird[-watching] observation cards from about a hundred years ago. They're recording species names and all sorts of other things about this particular Grosbeak in 1938.

All of these--this is the majority of the crowdsourced transcription that's happening out there; there are millions of records being transcribed--are record-based. These are not document-based, they aren't page-based. They're dealing with data that is fundamentally tabular -- those are their inputs. Their outputs are databases that they want to be able to either search or analyze. So we're producing nothing that anyone would ever want to print out.

And another interesting thing about this is that these record-based transcription projects--the uses are understood in advance. If you're building a genealogy index, you know that people are going to want to search for names and be able to see the results. And that's it -- you're not building something that allows someone to go off and do some other kind of analysis.

Now what kind of mark-up are these record-based transcription projects using? Well, it's kind of idiosyncratic, at best.

Here's an example from my client FreeREG. This is a mark-up language that they developed about ten years ago for indicating unclear readings of manuscripts. It's actually fairly sophisticated--it's based on the regular expression programming sub-language--but it's not anything that's informed by the TEI world.

On the other hand, here is the mark-up that the New York Public Library is using. Let me read this out to you: "Please type the text of the indicated dish exactly as it appears. Don't worry about accents." This is almost an anti-markup.

So what about free-form transcription? There's a lot of development of people doing free-form transcription. You have Scripto out of CHNM. You have a couple of different (perhaps competing) NARA initiatives. Wikisource. There's my own FromThePage. What kind of mark-up are they doing? Well, for the most part, none!

Here's Scripto--the Papers of the War Department-- and you type what you see, and that's what you get.

Here is the French-language Wikisource, hosting materials from the Archives départementales du Cantal (who are doing some very cool things here). But this is just typing things into a wiki and not even internally using wiki links. This is almost pre-formed text -- it's pretty much plaintext.

My own project, FromThePage.

I'm internally using wiki-links, but really only for creating indexes and annotations, not for indicating...any of the power that you have with TEI.

So if no one is using TEI, why is TEI important? I think that TEI is important because crowdsourced transcription projects are how the public is interacting with edition. This is how people are learning what editing is, what the editing process is, and why and whether it's important. And they're using tools that are developed by people like me. Now how do people like me learn about edition?

The answer is, by reading the TEI Guidelines. The TEI Guidelines have an impact that goes far beyond people who are actually implementing TEI. I started work on FromThePage in complete isolation in 2005. By 2007, I was reading the TEI Guidelines. I wasn't implementing TEI, but the questions that were asked--these notions of "here's how you expand abbreviations", "here's how you regularize things"--had a tremendous impact on me. By contrast, the Guide to Documentary Editing--which is a wonderful book!--I only found out in January of this year.

TEI is online, it's concise, it's available. And when I talk to people in the genealogy development world, they know about TEI. They've heard of it. They have opinions. They're not using it, but -- you people are making an impact on how the world does edition!

Okay, so if all of these people aren't using TEI, who is doing it?

I run a transcription tool directory that is itself crowdsourced. It's been edited by 23 different people who've entered information about 27 different tools. Of those 27 tools, 7 are marked as "supporting TEI". There's a little column, "does it support TEI?", and seven of them say "Yes".

Actually, that's not true. Some of them say "yes", but some of those seven say "well, sort of". So what does that mean?

To find that out, I interviewed five of those seven projects.

Transcribe Bentham.
T-PEN (which there's a poster session about tonight), which is a line-based system for medieval manuscripts.
A customization of T-PEN, the Carolingian Canon Law project, out of the University of Kentucky.
Our own Hugh Cayless for the Papyrological Editor, which is dealing with papyri.
And then MOM-CA is one of these "sort of"s. You have two implementations of it.

One of them is the Virtuelles deutsches Urkundennetzwerk, which is a German charter collection. It supports "TEI, sort-of" -- actually it supports CEI and EAD.
But it's been customized for extensive TEI support for the Itinera Nova project which is out of the archive of Leuven, Belgium.

I'm going to talk about what I found out, but I'm going to emphasize Transcribe Bentham. Not because it's better than the other tools, but because they actually ran their transcription project as an experiment. They wanted to know, can the public do TEI? Can the public handle it? And they've published their results: they've conducted user surveys of what was your experience using TEI? Which makes it particularly useful for those of us who are trying to figure out how it's being used.

Okay, so there's a lot of variation among these projects. You've got a varied commitment to TEI. Transcribe Bentham: Yes, we're going to use TEI! You see Melissa Terras here saying that "it was untenable" that we'd ask for anything else. These people know how to do it; why would we depart from that?

For T-PEN, James Ginther says: Hey, I'm kind of skeptical. We'll support any XSD you want to upload, if it happens to be TEI, that's okay.

Abigail Firey, who's using T-PEN, basically says: look, it's probably necessary. It's very useful. It lets us develop these valuable intellectual perspectives on our text. And she considered it important that their text encoding was done within the community of practice represented by the people in this room.

Okay, so more variation between these. Where's the TEI located within these projects? Where does it live? I'm a developer; I'm interested in the application stack.

It turns out that there's no agreement at all. Transcribe Bentham has people entering TEI in person. And then it's storing it off in a MediaWiki, using MediaWiki versioning, not actually putting [...] pages in one big TEI document.

On the other hand, Itinera Nova is actually storing everything in an XRX-based XML database. I mean, it is pure TEI on the back end. But none of the volunteers using Itinera Nova actually are typing any angle brackets. So we have a lot of variation here.

However, there was no variation when I asked people about encoding. There is a perfectly common perception that is: Encoding is hard!

And there are these great responses--that you can see both on the Transcribe Bentham blog and in their DHQuarterly paper that just came out, which I highly recommend--describing it as "too much markup", "unnecessarily complicated", "a hopeless nightmare", and the entire transcription process is "a horror."

But, lots of things are hard.

In my own experience with FromThePage, I have one user who has transcribed one thousand pages, but she does not like using any mark-up at all. She's contributing! She's contributing plaintext transcriptions, but I'm going back to add wikilinks. So it's not about the angle brackets. (Maybe square brackets have a problem too, I don't know.)

And fundamentally, transcribing--reading old manuscripts--is hard. "Deciphering Bentham's hand took longer than encoding," for over half of the Bentham respondents.

So there's more commonality: everyone wants to make encoding easier. How do we do that? There's a couple of different approaches. One approach--the most common approach--is using different kinds of buttons and menus to automate the insertion of tags. Which gets around (primarily) the need for people to memorize tag names and attributes, and--God help us--close tags.

So these are implemented--we've got buttons on T-PEN and CCL. We've got buttons on the TEI Toolbar. We've got menus on VdU and the Papyrological Editor.

And you can see them. Here's a screenshot of Jeremy Bentham. A couple of interesting things about this: it's very small, but we've got a toolbar at the top. We've got TEI text: angle-bracket D.E.L. Angle-bracket, slash, D.E.L. So we're actually exposing the TEI to users in Transcribe Bentham, though we're providing them with some buttons.
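
As a rough illustration of what such a button does, here is a minimal Python sketch. It is not Transcribe Bentham's code; the function name `wrap_selection` and the sample line are invented for the example. The point is only that the button supplies both the opening and the closing tag, so the transcriber never has to remember to close anything.

```python
def wrap_selection(text: str, start: int, end: int, tag: str) -> str:
    """Wrap text[start:end] in <tag>...</tag>, as a toolbar button might.

    Hypothetical helper: the real toolbars work on a browser text area,
    but the string manipulation is the same idea.
    """
    if not 0 <= start <= end <= len(text):
        raise ValueError("selection is outside the text")
    return f"{text[:start]}<{tag}>{text[start:end]}</{tag}>{text[end:]}"


# Invented sample line with an accidentally repeated word.
line = "the greatest happiness of the the greatest number"
start = line.index("the the") + 4  # position of the second "the"
print(wrap_selection(line, start, start + 3, "del"))
# the greatest happiness of the <del>the</del> greatest number
```

Because the opening and closing tags are always emitted together, a button like this cannot produce an unclosed tag on its own; mis-nesting only becomes possible once users edit the tags by hand.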

Those buttons represent a subset--I'll get to the selection of those tags later. Here's a more detailed description of what they do.

Here's what's going on with VdU. Only in this case, they're not actually exposing the angle brackets to the user. They're replacing all of these in a pseudo-WYSIWYG that allows people to choose from a menu and select text that then gets tagged.

Okay -- limitations of the buttons. There's a good limitation, which is that as users become more comfortable with TEI, they outgrow buttons. And this is something that the people at Transcribe Bentham reported to me. They're seeing a fair number of people just skip the buttons altogether and type angle brackets. Remember: these are members of the public who have never met any of the Transcribe Bentham people.

On the down side, users also ignore the buttons. Again users ignoring encoding, but in this case we've got something that's a little bit worse. Georg Vogeler is reporting something very interesting, which is that in a lot of cases, they were seeing users who were using print apparatus for doing this kind of work, and just ignoring the buttons -- going around them.

So the problem with using print-style notations. People are dealing with these print editions [notations] -- this can be a problem or it can be an opportunity. Papyri.info is viewing it that way. Itinera Nova is using it that way.

Papyri.info, their front-end interface for most users is Leiden+, which is a standard for marking up papyri. And, as you can see, users enter text in Leiden+, and that generates TEI. (EpiDoc TEI, I believe.)

This is the same kind of process that's done in Itinera Nova. In that case, they're using for notation whatever it is that the Leuven archives uses for their mark-up. And they're doing the same kind of transposition [ed: translation] of replacing their notation with TEI tags before they save it.

And this is actually what users see as they're typing. They don't see the TEI tags -- we're hiding the angle brackets from them.

So this is an alternative to buttons. And in my opinion, it's not that bad an alternative.
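
To make the translation step concrete, here is a small Python sketch of replacing a print-style notation with TEI tags before saving. The rule it uses is an assumption for illustration only: text enclosed between two equals signs is treated as struck through. Neither Itinera Nova's actual notation nor the Leiden+ grammar is reproduced here, and `notation_to_tei` is a made-up name.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical convention: =text= marks a deletion in the manuscript.
DELETION = re.compile(r"=([^=\n]+)=")


def notation_to_tei(line: str) -> str:
    """Translate one line of the assumed notation into a TEI fragment."""
    # Escape &, < and > first so anything the transcriber typed stays text.
    escaped = escape(line)
    # Then replace each =...= pair with a TEI <del> element.
    return DELETION.sub(r"<del>\1</del>", escaped)


print(notation_to_tei("paid =twelve= ten pounds & costs"))
# paid <del>twelve</del> ten pounds &amp; costs
```

The transcriber only ever sees and types the equals signs; the angle brackets exist solely in what gets stored.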

This hasn't been a problem for the Bentham people, however. It's a non-problem for them. And they are the most "crowdy", the most amateur-focused, and the most committed to a TEI interface.

Tim Causer went through and reviewed all of this and said, you know, it just doesn't happen. People are not using any print notation at all. They're using buttons. They're using angle-brackets by hand. They're not even using plaintext. They're using TEI. Their users are comfortable with TEI.

So what accounts for the difference between the experience of the VdU and the Transcribe Bentham people? I don't know. I've got a couple of theories about what might be going on.

One of them is really the corpus of texts we're working with. If you're only dealing with papyrus fragments, and you're used to a well-established way of notating them--that's been around since 1935 in the case of Leiden+--well, it's kind of hard to break out of that. On the other hand, there's not a single convention for print editions. There's all sorts of ways of indicating additions and deletions for print editions of more modern texts. So maybe it's a lack of a standard.

Or, maybe it's who the users are. Maybe scholars are stubborner, and amateurs are more tractable and don't have bad habits to break. I don't know! I don't know, but I'd be really interested in any other ideas.

Okay, how do these projects choose the tags that they're dealing with? We've got a very long quote, but I'm just going to read out a couple of little bits of them.

Really, choosing a subset of tags is important. Showing 67 buttons was not a good usability thing for T-PEN. And in particular, what they ended up doing was getting rid of the larger, structural set of markup, and focusing just on sort of phrase-level markup.

This is also, I think, true if we go back a minute and look at Bentham. Here, again, we're talking phrase-level tags. We're not talking about anything beyond that.

Justin Tonra said that it was actually really hard to pare down the number of tags for Transcribe Bentham. He wanted to do more, but, you know, he's pleased with what they got. They didn't want to "overcomplicate the user's job."

Richard Davis, also with Transcribe Bentham, had a great deal of experience dealing with editors for EAD and other XML. And he said you're always dealing with this balance between usability and flexibility, and there's just not much way of getting around it. It's going to be a compromise, no matter what.

So what's the future for these projects that are using TEI for crowds? Well, if getting people up to speed is hard, and if nobody reads the help--as Valerie Wallace at one time said about their absolutely intimidating help page for Transcribe Bentham (you should look at it -- it's amazing!)--then what are the alternatives for getting people up to speed?

Georg Vogeler says that they are trying to come up with a way of teaching people how to use the tool and how to use the markup in almost a game-like scenario. We're not talking about the kind of Whac-A-Mole things that we sometimes see, but really just sort of leading people through: "Let's try this. Now let's try this. Now let's try this. Okay, now you know how to deal with this [tool]." It's something that I think we're actually pretty familiar with from any other kinds of projects dealing with historic handwriting: people have to come up to speed.

Another possibility is a WYSIWYG. Tim Causer announced the idea of spending their new Mellon grant on building a WYSIWYG for Transcribe Bentham's TEI. The blog entry is fascinating because he gets about seven user comments, some of which express a whole lot of skepticism that a WYSIWYG is going to be able to handle nested tagging in particular. Other ones of which make comments about the whole XML system and its usability in vivid prose, which is very worth reading.

And maybe combinations of these. So we have these intermediate notations -- Itinera Nova, for example, they're using this let's begin a strike-through with an equals sign (which is apparently what they've been using at that archive for a while). And the minute you type that equals sign in, you actually get a WYSIWYG strike-through that runs all the way through your transcript.

That may be the future. We'll see. I think that we have a lot of room for exploring different ways for handling this.

So let me wrap up and thank my interviewees.

Transcribe Bentham: Melissa Terras, Justin Tonra, Tim Causer, Richard Davis.

T-PEN: James Ginther, Abigail Firey
Papyri.info: Hugh Cayless, Tom Elliott
MOM-CA: Georg Vogeler and Jochen Graf

Questions

[All questions will be paraphrased in the transcript due to sound quality, and are not to be regarded as direct quotations without verification via the audio.]

Syd Bauman: Of the systems which allow users to type tags free-hand, what percentage come out well-formed?

Me: The only one that presents free-hand [tagging] is Transcribe Bentham. Tim [Causer] gets well-formed XML for most everything he gets. There is no validation being performed by that wiki, but what he's getting is pretty good. He says that the biggest challenge when he's post-processing documents is closing tags and mis-placed nesting.

Syd Bauman: I'd be curious about the exact percentages.

Me: Right. I'd have to go back and look at my interview. He said that it represents a pretty small percentage, like single digits of the submissions they get.
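
For anyone who wants to measure this on their own project, a well-formedness check on hand-typed tags is easy to sketch. The Python below is an illustration, not anything Transcribe Bentham runs: it wraps a page's transcription in a made-up `<page>` element and asks the standard-library XML parser whether the result parses. It tests well-formedness only, not validity against a TEI schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the transcription fragment parses as XML.

    The <page> wrapper is an assumption: a transcription is usually mixed
    text and tags with no single root element, so one is supplied here.
    """
    try:
        ET.fromstring(f"<page>{fragment}</page>")
    except ET.ParseError:
        return False
    return True


print(is_well_formed("a <del>struck <add>added</add></del> word"))  # True
print(is_well_formed("a <del>struck <add>added</del></add> word"))  # False: mis-nested
print(is_well_formed("a <del>struck word"))                         # False: unclosed tag
```

Running something like this over every submitted page and counting the failures would give the kind of percentage asked about above.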

John Unsworth: Do any of the systems use keyboard short-cuts?

Me: I know of none that use hot-keys.

John Unsworth: Do you think that would be more or less desirable than the systems you've described?

Me: I really only see hot-keys as being desirable for projects that are using more recent and clearer documents. Speed of data-entry from the keyboard perspective doesn't help much when you're having to stare and zoom and scroll on a document that is as dense and illegible as Bentham or Greek papyri.

Elena Pierazzo [very faint audio]: In some cases it's hard to define which is the error: choosing the tags or reading the text. I've been working with my students on Transcribe Bentham--they're all TEI-aware--and to be honest it was hard. The difficulty was not the mark-up. In a sense we do sometimes forget in these crowdsourcing projects, that the text itself is very hard, so probably adding a level of complexity to the task via the mark-up is very difficult.

I have all respect and sympathy for the people who stick to the ideal of doing TEI, which I commend entirely. But in some cases, it may be that asking amateur people to do [the decipherment] and do the mark up is a pretty strong request, and makes a big assumption about what the people "out there" are capable of without formation.

Me: I'd agree with you. However, there have been some studies on these users' ability to produce quality transcripts outside of the TEI world.... Old Weather did a great deal of research on that, and they found that individual users tended to submit correct transcripts 97% of the time. They're doing blind triple-keying, so they're comparing people's transcripts against others. [They found] that of 1000 different entries, typically on average 13 will be wrong. Of those thirteen, three will be due to user error--so it does happen; I'm not saying people are perfect. Three will be generally [ed: genuinely] illegible. And the remaining seven will be due to the officer of the watch having written the wrong thing down and placing the ship in Afghanistan instead of in the Indian Ocean. So there are errors everywhere. [I mis-remembered the numbers here: actually it's 3 errors due to transcriber error, 10 genuinely illegible, and 3 due to error at time of inscription.]
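
Blind triple-keying can be sketched in a few lines of Python. This is a simplified illustration, not Old Weather's actual reconciliation procedure, which is not described here: three volunteers key the same entry without seeing each other's work, and a value is accepted only when a strict majority agree. The function name `reconcile` and the sample readings are invented.

```python
from collections import Counter
from typing import Optional


def reconcile(keyings: list[str]) -> Optional[str]:
    """Return the majority reading of independent keyings, or None.

    Assumed rule: accept a value only if more than half of the
    transcribers typed exactly the same thing (ignoring outer whitespace).
    """
    if not keyings:
        return None
    counts = Counter(k.strip() for k in keyings)
    value, votes = counts.most_common(1)[0]
    return value if votes * 2 > len(keyings) else None


print(reconcile(["29.92", "29.92", "29.82"]))  # 29.92
print(reconcile(["29.92", "29.82", "29.72"]))  # None: no agreement, needs review
```

Note what this can and cannot catch: a slip by one transcriber is outvoted, but an error written into the log book itself is faithfully keyed by all three and passes straight through.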

Lou Burnard: The concept of error is a nuanced one. I would like to counter-argue Elena's [point]. I think that one of the reasons that Bentham has been successful is precisely because it's difficult material. Why do I think that? Because if you are faced with something difficult, you need something powerful to express your understanding of it. The problem with not using something as rich and semantically expressive as TEI when you're doing your transcription is that it doesn't exist! All you can do is type in the words you think it might have been, and possibly put in some arbitrary code to say, "Well, I'm not sure about that." Once you've mastered the semantics of the TEI markup--which doesn't actually take that long, if you're interested in it--now you can express yourself. Now you can communicate in a [...] satisfactory way. And I think that's why people like it.

Me: I have anecdotal, personal evidence to agree with you. In my own system (that does not use TEI), I have had users who have transcribed several pages, and then they'd get to a table in some biologist's field notes, for example, and they stop. And they say, "well, I don't know what to do here." So they're done.

Lou Burnard: The example you cite of the erroneous data in the source is a very good one, because if you've mastered TEI then you know how to express in markup: 'this is what it actually says but clearly he wasn't in Afghanistan.' And that isn't the case in any other markup system I've ever heard of.
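
The TEI construct for this is a `<choice>` element pairing `<sic>` (what the source says) with `<corr>` (the editor's correction). As a small illustration, here is a Python sketch that builds such a fragment with the standard library; the function name `sic_corr` and the sample values are illustrative, and a real log entry would more likely record a position than a place name.

```python
import xml.etree.ElementTree as ET


def sic_corr(source_reading: str, correction: str) -> str:
    """Build a TEI <choice> recording both the source text and a correction."""
    choice = ET.Element("choice")
    ET.SubElement(choice, "sic").text = source_reading   # what was written
    ET.SubElement(choice, "corr").text = correction      # what was meant
    return ET.tostring(choice, encoding="unicode")


print(sic_corr("Afghanistan", "Indian Ocean"))
# <choice><sic>Afghanistan</sic><corr>Indian Ocean</corr></choice>
```

Both readings survive in the transcript, so a later user can search or display either one.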

[I welcome corrections to my transcript or the contents of the talk itself at benwbrum@gmail.com or in the comments to this post.]