<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:blogger="http://schemas.google.com/blogger/2008" xmlns:georss="http://www.georss.org/georss" xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr="http://purl.org/syndication/thread/1.0" version="2.0"><channel><atom:id>tag:blogger.com,1999:blog-5730930067468816440</atom:id><lastBuildDate>Thu, 23 May 2013 10:06:45 +0000</lastBuildDate><category>images</category><category>paper</category><category>acquisition</category><category>mediawiki</category><category>blogland</category><category>names</category><category>tei</category><category>inspirations</category><category>requirements sxsw</category><category>programming</category><category>deployment</category><category>business plan</category><category>videos</category><category>indexing</category><category>open source</category><category>press</category><category>structured transcription</category><category>interview</category><category>hackathon</category><category>feature plan</category><category>licensing</category><category>rails</category><category>similar projects</category><category>fromthepage projects</category><category>features</category><category>nabpp</category><category>open access</category><category>ocr</category><category>podcasts</category><category>requirements</category><category>crowdsourcing</category><category>risks</category><category>progress</category><category>subject links</category><category>client projects</category><category>velehanden</category><category>thatcamp</category><category>money</category><category>presentations</category><title>Collaborative Manuscript Transcription</title><description /><link>http://manuscripttranscription.blogspot.com/</link><managingEditor>noreply@blogger.com 
(Ben W. Brumfield)</managingEditor><generator>Blogger</generator><openSearch:totalResults>120</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/CollaborativeManuscriptTranscription" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="collaborativemanuscripttranscription" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-792232024726417127</guid><pubDate>Tue, 14 May 2013 14:29:00 +0000</pubDate><atom:updated>2013-05-14T10:03:57.583-05:00</atom:updated><title>Typologie des méthodes de contrôle de la qualité dans les projets de crowdsourcing</title><description>&lt;i&gt;A translation of my 2012-03-05 post &lt;a href="http://manuscripttranscription.blogspot.com/2012/03/quality-control-for-crowdsourced.html"&gt;"Quality Control for Crowdsourced Transcription"&lt;/a&gt; which appeared in &lt;a href="http://www.bnf.fr/documents/crowdsourcing_rapport.pdf"&gt;"Etat de l’art en matière de Crowdsourcing dans les bibliothèques numériques"&lt;/a&gt; by Moirez, Moreaux, and Josse (2013), reproduced for Francophone readers:&lt;/i&gt;&lt;br /&gt;
&lt;ol type="A"&gt;
&lt;li&gt;&lt;b&gt;«Single-track methods»&lt;/b&gt;: le document ne fait l’objet que d’une seule
transcription (par un seul contributeur ou de façon collaborative ensemble sur le
même document)&amp;nbsp;&amp;nbsp;&lt;ol&gt;
&lt;li&gt;&lt;b&gt;«Open-ended community revision»&lt;/b&gt;: (Wikipédia) les utilisateurs peuvent continuer à modifier le texte transcrit, sans limite dans le temps. Un historique des modifications permet de revenir à la version précédente et d’éviter le vandalisme.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;«Fixed-term community revision»&lt;/b&gt; (Transcribe Bentham) : convient pour des projets d’édition plus traditionnels, dont l’objectif est la publication d’une “version finale”. Quand une transcription atteint un niveau acceptable, validée par les experts, elle est close et publiée.&amp;nbsp;&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;«Community-controlled revision workflows»&lt;/b&gt; (Wikisource) : la transcription est considérée comme une “version finale” non plus par des experts, mais parce qu’elle a traversé un workflow collaboratif de correction/révision/validation.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;«Transcriptions with "known-bad" insertions before proofreading»&lt;/b&gt; : dans une première phase, les correcteurs sont invités à transcrire. Puis d’autres correcteurs révisent la transcription en la comparant au texte original; pour s’assurer que la seconde lecture est bien réalisée, des erreurs sont ajoutées dans le texte: si toutes les «fausses erreurs» sont corrigées, le système déduit que les «vraies erreurs» ont dû être corrigées aussi.&amp;nbsp;&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;«Single-keying with expert review»&lt;/b&gt; : lorsqu’une transcription a été réalisée par un contributeur, elle est validée ou rejetée par un expert (soit un professionnel de l’institution à l’origine du projet, soit un contributeur sélectionné). Si la correction est rejetée, elle est soit à nouveau soumise à correction, soit corrigée par l’expert et validée.&amp;nbsp;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;«Multi-track methods»&lt;/b&gt;: ces méthodes conviennent particulièrement à des corrections portant sur des données structurées ou des micro-tâches. La même image de départ est présentée à plusieurs contributeurs qui transcrivent chacun à partir de zéro. Généralement, les contributeurs ne savent pas s’ils sont les premiers correcteurs ou si d’autres transcriptions ont déjà été soumises. Puis les données ainsi collectées sont comparées automatiquement.&amp;nbsp;
&lt;ol start="6"&gt;
&lt;li&gt;&lt;b&gt;«Triple-keying with voting»&lt;/b&gt; (Old Weather, ReCAPTCHA) : l’image est présentée à 3 contributeurs, la majorité l’emporte (au départ, Old Weather proposait l’image à 10 contributeurs, mais ils se sont aperçus que la pertinence était sensiblement la même avec 3 qu’avec 10 contributeurs).&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;«Double-keying with expert reconciliation»&lt;/b&gt;: la même donnée est présentée à deux contributeurs, et, s’ils ne sont pas d’accord entre eux, un expert tranche. &lt;/li&gt;
&lt;li&gt;&lt;b&gt;«Double-keying with emergent community-expert reconciliation»&lt;/b&gt; (FamilySearch Indexing): la méthode est presque similaire à la précédente, sauf que l’expert qui tranche entre deux corrections divergentes est lui-même un contributeur, qui a été promu conciliateur grâce à l’analyse automatique de ses contributions (volume, pertinence).&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;«Double-keying with N-keyed run-off votes»&lt;/b&gt;: si les deux contributeurs ne sont pas d’accord, la correction est re-proposée à un nouveau duo/trio d’usagers. &lt;/li&gt;
&lt;/ol&gt;
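&lt;i&gt;[ed: A minimal Python sketch of the two simplest multi-track rules listed above, majority voting and double-keying with a referee. The function names, the whitespace normalization, and the sample barometer readings are illustrative assumptions; this is not code from Old Weather or FamilySearch Indexing.]&lt;/i&gt;

```python
from collections import Counter


def normalize(text):
    # Assumed normalization: collapse runs of whitespace so that trivial
    # spacing differences do not count as a disagreement.
    return " ".join(text.split())


def majority_reading(transcriptions):
    """Return the reading keyed by a strict majority, or None if there is none."""
    if not transcriptions:
        return None
    counts = Counter(normalize(t) for t in transcriptions)
    reading, votes = counts.most_common(1)[0]
    if votes * 2 > len(transcriptions):
        return reading
    return None


def double_keying(first, second, referee):
    """Accept two matching readings; otherwise let a referee callback decide."""
    a, b = normalize(first), normalize(second)
    return a if a == b else referee(a, b)


# Three volunteers key the same (invented) barometer reading.
print(majority_reading(["29.84", "29.84 ", "29.34"]))
# Two volunteers disagree, so the hypothetical referee picks one reading.
print(double_keying("29.84", "29.34", lambda a, b: a))
```

&lt;i&gt;[ed: When no reading wins a majority the sketch returns None, which is the point where a run-off round with new contributors could be started.]&lt;/i&gt;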
&lt;/li&gt;
&lt;/ol&gt;</description><link>http://manuscripttranscription.blogspot.com/2013/05/typologie-des-methodes-de-controle-de.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-8526608883141007000</guid><pubDate>Mon, 29 Apr 2013 18:23:00 +0000</pubDate><atom:updated>2013-05-07T10:57:55.628-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">presentations</category><category domain="http://www.blogger.com/atom/ns#">crowdsourcing</category><category domain="http://www.blogger.com/atom/ns#">similar projects</category><category domain="http://www.blogger.com/atom/ns#">tei</category><title>Itinera Nova in the World(s) of Crowdsourcing and TEI</title><description>On April 25, 2013, I presented this talk at the &lt;a href="http://www.leuven.be/itineranova-english"&gt;International Colloquium Itinera Nova&lt;/a&gt; in Leuven, Belgium.  It was a fantastic experience, which I plan to post (and speak) more about, but I wanted to get my slides and transcript online as soon as possible.&lt;br /&gt;
&lt;br /&gt;
&lt;span style="font-size: x-small;"&gt;&lt;i&gt;Abstract: Crowdsourcing for cultural heritage material has become increasingly
popular over the last decade, but manuscript transcription has become
the most actively studied and widely discussed crowdsourcing activity
over the last four years. However, of the thirty collaborative
transcription tools which have been developed since 2005, only a
handful attempt to support the Text Encoding Initiative (TEI) standard
first published in 1990. What accounts for the reluctance to adopt
editorial best practices, and what is the way forward for crowdsourced
transcription and community edition? This talk will draw on interviews
with the organizers behind Transcribe Bentham, MoM-CA, the
Papyrological Editor, and T-PEN as well as the speaker's own
experience working with transcription projects to situate Itinera Nova
within the world of crowdsourced transcription and suggest that
Itinera Nova's approach to mark-up may represent a pragmatic future
for public editions.&lt;/i&gt;&lt;/span&gt;&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-GlceWoSo9fw/UX5KK_TxXhI/AAAAAAAAApE/Jpx0vqOPdG4/s1600/itinera_nova+-+01.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-GlceWoSo9fw/UX5KK_TxXhI/AAAAAAAAApE/Jpx0vqOPdG4/s320/itinera_nova+-+01.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
I'd like to talk about &lt;a href="http://freyja.uni-koeln.de:8585/in/home"&gt;Itinera Nova&lt;/a&gt; within the world of crowdsourced transcription tools, which means that I need to talk a little bit about crowdsourced transcription tools themselves, and their history, and the new things that Itinera Nova brings.
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-N0tpYxhoctk/UX5KLERLkUI/AAAAAAAAApM/FAaY6XJDAxA/s1600/itinera_nova+-+02.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-N0tpYxhoctk/UX5KLERLkUI/AAAAAAAAApM/FAaY6XJDAxA/s320/itinera_nova+-+02.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Crowdsourced transcription has actually been around for a long time.  Starting in the 1990s we see a number of what are called "offline" projects.  This is before the term crowdsourcing was invented.&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;A Dutch initiative: &lt;a href="http://www.vpnd.nl/"&gt;Van Papier naar Digitaal&lt;/a&gt; which is transcribing primarily genealogy records.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://freebmd.org.uk/"&gt;FreeBMD&lt;/a&gt;, &lt;a href="http://www.freereg.org.uk/index.shtml"&gt;FreeREG&lt;/a&gt;, and &lt;a href="http://www.freecen.org.uk/"&gt;FreeCEN&lt;/a&gt; in the UK, transcribing church registers and census records.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://demogen.arch.be/info_demogen.php"&gt;Demogen&lt;/a&gt; in Belgium -- I don't know a lot about this -- it appears to be dead right now, but if anyone  can tell me more about this, I'd like to talk after this.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.sa.dk/content/dk/ao-forside"&gt;Archivalier Online&lt;/a&gt;--also transcribing census records--in Denmark,&amp;nbsp;&lt;/li&gt;
&lt;li&gt;And a series of projects by the &lt;a href="http://data.wmgs.org/"&gt;Western Michigan Genealogy Society&lt;/a&gt; to transcribe local census records and also to create indexes of obituaries.  &lt;/li&gt;
&lt;/ul&gt;
One thing these have in common, you'll notice, is that these are all genealogists.  They are primarily interested in person names and dates.  And they emerge out of an (at least) one hundred year old tradition of creating print indexes to manuscript sources which were then published.  Once the web came online, the idea of publishing these on the web [instead] became obvious.  But the tools that were used to create these were spreadsheets that people would use on their home computers.  Then they would put CD-ROMs or floppy disks in the post and send them off to be published online.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-UnpaSK9lV7o/UX5KLK4XvnI/AAAAAAAAApI/x0wVkdkkqYc/s1600/itinera_nova+-+03.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-UnpaSK9lV7o/UX5KLK4XvnI/AAAAAAAAApI/x0wVkdkkqYc/s320/itinera_nova+-+03.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Really the modern era of crowdsourced transcription begins about eight years ago.&amp;nbsp; There are a number of projects that begin development in 2005.&amp;nbsp; They are released (even though they've been in development for a while) starting around 2006.&amp;nbsp; &lt;a href="https://familysearch.org/indexing/"&gt;FamilySearch Indexing&lt;/a&gt; is, again, a genealogy system primarily concerned with records of genealogical interest which are tabular.&amp;nbsp; It is put up by the Mormon Church.&amp;nbsp; &lt;br /&gt;
&lt;br /&gt;
Then things start to change a little bit.&amp;nbsp; In 2008, I publish &lt;a href="http://fromthepage.com/"&gt;FromThePage&lt;/a&gt;, which is not designed for genealogy records per se -- rather it's designed for 19th and 20th century diaries and letters.&amp;nbsp; (So here we have more complex textual documents.)&amp;nbsp; Also in 2008, &lt;a href="http://en.wikisource.org/"&gt;Wikisource&lt;/a&gt;--which had been a development of Wikipedia to put primary sources online--starts using a transcription tool.&amp;nbsp; But initially, they're not using it for manuscripts because of policy in the English, French, and Spanish language Wikisources.&amp;nbsp; The only people using it for manuscripts are the German Wikisource community, which has always been slightly separate.&amp;nbsp; So they start transcribing free-form textual material like war journals &lt;i&gt;[ed: memoirs]&lt;/i&gt; and letters.&amp;nbsp; But again, we have a departure from the genealogy world.&lt;br /&gt;
&lt;br /&gt;
In 2009, the &lt;a href="http://www.pwrc.usgs.gov/bpp/"&gt;North American Bird Phenology Program&lt;/a&gt; starts transcribing bird observations.&amp;nbsp; So in the 1880s you had amateur bird-watchers who would go into the field and they would record their sightings of certain ducks, or geese, or things like that, and they would record the location and the birds they had observed.&amp;nbsp; So we have this huge database of the presences of species throughout North America that is all on index cards.&amp;nbsp; And as the climate changes and habitats change, those species are no longer there.&amp;nbsp; So scientists who want to study bird migration and climate change need access to these.&amp;nbsp; But they're hand-written on 250,000 index cards, so they need to be transformed.&amp;nbsp; So that requires transcription, also by volunteers. &lt;i&gt;[ed: The correct number of cards is over 6 million, according to Jessica Zelt's &lt;a href="https://www.idigbio.org/content/idigbio-public-participation-digitization-biodiversity-specimens-workshop-phenology-program"&gt;"Phenology Program (BPP):  Reviving a Historic Program in the Digital Era"&lt;/a&gt;]&lt;/i&gt;&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-0r3ydtmrzSw/UX5KLs6geII/AAAAAAAAApQ/m2gqpsMeqnE/s1600/itinera_nova+-+04.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-0r3ydtmrzSw/UX5KLs6geII/AAAAAAAAApQ/m2gqpsMeqnE/s320/itinera_nova+-+04.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
2010 is the year that crowdsourced transcription really gets big.&amp;nbsp; The first big development is the &lt;a href="http://oldweather.org/"&gt;Old Weather&lt;/a&gt; project, which comes out of the Citizen Science Alliance and the Zooniverse team that got started with Galaxy Zoo.&amp;nbsp; The problem with studying climate change isn't knowing what the climate is like now.&amp;nbsp; It is very easy to point a weather satellite at the South Pacific right now.&amp;nbsp; The problem is that you can't point a weather satellite at the South Pacific in 1911.&amp;nbsp; Fortunately, in many of the world's navies, the officer of the watch would, every four hours, record the barometric pressure, the temperature, the wind speed and direction, the latitude and the longitude in the ship's log.&amp;nbsp; So all we have to do is type up every weather observation for all the navies' ships, and suddenly we know what the climate was like.&amp;nbsp; Well, they've actually succeeded at this point -- in 2012 they &lt;a href="http://blog.oldweather.org/2012/07/23/one-million-six-hundred-thousand-new-observations/"&gt;finished&lt;/a&gt; transcribing all the weather observations in the British Royal Navy's ships' logs from World War I.&amp;nbsp; So this has been very successful -- it's a monumental effort: they have over six hundred thousand registered accounts--not all of those are active, but they have a very large number of volunteers.&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-KZQMvpnwSz4/UX5KLx_X33I/AAAAAAAAApU/6SvAAMGzPwA/s1600/itinera_nova+-+05.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-KZQMvpnwSz4/UX5KLx_X33I/AAAAAAAAApU/6SvAAMGzPwA/s320/itinera_nova+-+05.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Also in 2010 in the UK, &lt;a href="http://www.transcribe-bentham.da.ulcc.ac.uk/td/Transcribe_Bentham"&gt;Transcribe Bentham&lt;/a&gt; goes live.&amp;nbsp; (We'll talk a lot more about this -- it's a very well documented project.)&amp;nbsp; This is a project to transcribe the notes and papers of the utilitarian philosopher Jeremy Bentham.&amp;nbsp; It's very interesting technically, but it was also very successful drawing attention to the world of crowdsourced transcription.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-p17aA9QbptE/UX5KMmkh7XI/AAAAAAAAAps/jighuEZID9E/s1600/itinera_nova+-+06.jpg" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-p17aA9QbptE/UX5KMmkh7XI/AAAAAAAAAps/jighuEZID9E/s320/itinera_nova+-+06.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
In 2011, the Center for History and New Media at George Mason University in northern Virginia publishes the &lt;a href="http://wardepartmentpapers.org/"&gt;Papers of the United States War Department&lt;/a&gt;, and builds a tool called &lt;a href="http://scripto.org/"&gt;Scripto&lt;/a&gt; that plugs into it.&amp;nbsp; Now this is primarily of interest to military and social historians, but again we're getting away from the world of genealogy, we're getting away from the world of individual tabular records, and we're getting into dealing with documents. &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-JuO6JKqrHXc/UX5KMD2-i-I/AAAAAAAAApg/QyVDcJ4NJ_U/s1600/itinera_nova+-+07.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-JuO6JKqrHXc/UX5KMD2-i-I/AAAAAAAAApg/QyVDcJ4NJ_U/s320/itinera_nova+-+07.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Once we get there, we have a tension.&amp;nbsp; And this is a pretty common tension.&amp;nbsp; There's an institutional tension, in that editing of documents has historically been done by professionals, and amateur editions have very bad reputations.&amp;nbsp; Well now we're asking volunteers to transcribe.&amp;nbsp; And there's a big tension between, &lt;i&gt;well how do volunteers deal with this [process], do we trust volunteers?&amp;nbsp; Wouldn't it be better just to give us more money to hire more professionals?&lt;/i&gt;&amp;nbsp; So there's a tension there.&lt;br /&gt;
&lt;br /&gt;
There's another tension that I want to get into here, since today is 
the technical track, and that's the difference between easy tools and 
powerful tools, and [the question of] making powerful tools easy to 
use.&amp;nbsp; This is common to all technology--not just software, and certainly
 not just crowdsourced transcription--but it's new because this is the 
first time we're asking people to do these sorts of transcription 
projects.&amp;nbsp; &lt;br /&gt;
&lt;br /&gt;
Historically these professional [projects] have been done using mark-up 
to indicate deletions or abbreviations or things like that.&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-yHv_U_vo6MA/UX5KMXoef1I/AAAAAAAAApk/YDGgAXauWnk/s1600/itinera_nova+-+08.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-yHv_U_vo6MA/UX5KMXoef1I/AAAAAAAAApk/YDGgAXauWnk/s320/itinera_nova+-+08.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So there's this fear: &lt;i&gt;what happens when you take amateurs and add mark-up?&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
Well, what is going to happen?&amp;nbsp; Well, one solution--and it's a 
solution that I'm distressed to say is becoming more and more popular in
 the United States--is to get rid of the mark-up, and to say, &lt;i&gt;well, let's just ask them to type plain text&lt;/i&gt;.&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-73I5yhe016s/UX5KMiLgI3I/AAAAAAAAAp0/0yOxYJWuTVE/s1600/itinera_nova+-+09.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-73I5yhe016s/UX5KMiLgI3I/AAAAAAAAAp0/0yOxYJWuTVE/s320/itinera_nova+-+09.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
There's a problem with this.&amp;nbsp; Which is that giving users power to represent what they see--to do the tasks that we're asking them to do--enables them.&amp;nbsp; Lack of power frustrates them.&amp;nbsp; &lt;b&gt;And when you're asking people to transcribe documents that are even remotely complex, mark-up is power.&lt;/b&gt;&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-YZGUO7KgHWc/UX5KMt5L6cI/AAAAAAAAApw/6SaHeajKeDU/s1600/itinera_nova+-+10.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-YZGUO7KgHWc/UX5KMt5L6cI/AAAAAAAAApw/6SaHeajKeDU/s320/itinera_nova+-+10.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So I'm going to tell a little story about scrambled eggs.&amp;nbsp; These are not the scrambled eggs that I ate this morning--which were delicious by the way--but they're very similar.&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-TY6L09MQUeE/UX5KNCfdL2I/AAAAAAAAAp4/s86vGXEKAyY/s1600/itinera_nova+-+11.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-TY6L09MQUeE/UX5KNCfdL2I/AAAAAAAAAp4/s86vGXEKAyY/s320/itinera_nova+-+11.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
I'm going to pick on my friends at the New York Public Library, who in 2011 launched the &lt;a href="http://menus.nypl.org/"&gt;"What's on the Menu?" project&lt;/a&gt;.&amp;nbsp; They have an enormous collection of menus from around the world, and they want to track the culinary history of the world as dishes originate in one spot and move to other locations, the change in dishes--&lt;i&gt;when did anchovies become popular?&amp;nbsp; Why are they no longer popular?&lt;/i&gt;--things like that.&amp;nbsp; So they're asking users to transcribe all of these menu items.&amp;nbsp; They developed a very elegant and simple UI.&amp;nbsp; This UI did not involve mark-up; this is plain-text.&amp;nbsp; In fact--I'm going to get over here and read this--if you look at this instruction, this is almost &lt;b&gt;stripped&lt;/b&gt; &lt;b&gt;text&lt;/b&gt;: &lt;i&gt;"Please type the text of the indicated dish exactly as it appears.&amp;nbsp; Don't worry about accents."&lt;/i&gt;&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-HP617ZZzElE/UX5KNFnDFlI/AAAAAAAAAqA/Si6ZaSlR-g4/s1600/itinera_nova+-+12.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-HP617ZZzElE/UX5KNFnDFlI/AAAAAAAAAqA/Si6ZaSlR-g4/s320/itinera_nova+-+12.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Well, this may not be a problem for Americans, but it turns out that some of their menus are in languages that contain things that American developers might consider accents.&amp;nbsp; This is a menu that was published on their site in 2011.&amp;nbsp; They sent out an appeal asking, &lt;i&gt;"can anyone read Sütterlin or old German Kurrentschrift"?&lt;/i&gt;&amp;nbsp; I saw this and I went over to a chat channel for people who are discussing German and the German language, because I knew that there were some people familiar with German paleography there, and I wanted to try it out.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-v47aXOa7wcY/UX5KNf8guBI/AAAAAAAAAp8/27pWKgaQbWg/s1600/itinera_nova+-+13.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-v47aXOa7wcY/UX5KNf8guBI/AAAAAAAAAp8/27pWKgaQbWg/s320/itinera_nova+-+13.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So the transcribers are going through and they're transcribing things, and they get to this entry: &lt;b&gt;Rühreier&lt;/b&gt;.&amp;nbsp; All right, let's transcribe that without accents.&amp;nbsp; So they type in what they see.&amp;nbsp; &lt;a href="http://de.wikipedia.org/wiki/R%C3%BChrei"&gt;&lt;i&gt;Rühreier&lt;/i&gt;&lt;/a&gt; is scrambled eggs.&amp;nbsp; And what they type is converted to &lt;b&gt;"Ruhreier"&lt;/b&gt;, which are... eggs from the &lt;a href="http://en.wikipedia.org/wiki/Ruhrgebiet"&gt;Ruhrgebiet&lt;/a&gt;?&amp;nbsp; I don't know?&amp;nbsp; This is not a dish.&amp;nbsp; I'm not familiar with German cuisine, but I don't think that the Ruhr valley is famous for its eggs.&lt;br /&gt;
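&lt;i&gt;[ed: A minimal Python sketch of how this kind of conversion can happen. It assumes a clean-up step that decomposes each character and drops the combining marks; that is a common way to "ignore accents", but I don't know how the menu site actually did it, so treat the function as hypothetical.]&lt;/i&gt;

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "R\u00fchreier" is Ruehreier written with u-umlaut (U+00FC).
typed = "R\u00fchreier"
stored = strip_accents(typed)

print(stored)           # Ruhreier
print(stored == typed)  # False: the umlaut is gone and cannot be recovered
```

&lt;i&gt;[ed: Once two different spellings collapse to the same stored string, no later correction by a volunteer can bring the original back, which matches the frustration in the chat logs.]&lt;/i&gt;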
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-WfdGO6ohXyI/UX5KNhOxHNI/AAAAAAAAAqE/DbBgxdTafUU/s1600/itinera_nova+-+14.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-WfdGO6ohXyI/UX5KNhOxHNI/AAAAAAAAAqE/DbBgxdTafUU/s320/itinera_nova+-+14.jpg" style="cursor: move;" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
And this is incredibly frustrating!&amp;nbsp; We see in the chat room logs: &lt;i&gt;"Man, I can't get rid of 'Ruhreier' and this (all-capital) 'OMELETTE'!&amp;nbsp; What's going on?&amp;nbsp; Is someone adding these back?&amp;nbsp; Can you try to change 'Ruhreier' to 'Rühreier'?&amp;nbsp; It keeps going back!"&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
So we have this frustration.&amp;nbsp; We have this potential to lose users when we abandon mark-up; when we don't give them the tools to do the job that we're asking them to do.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-Iz2IDjX1UTI/UX5KNtyRuvI/AAAAAAAAAqI/QN8sOnBb8NI/s1600/itinera_nova+-+15.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-Iz2IDjX1UTI/UX5KNtyRuvI/AAAAAAAAAqI/QN8sOnBb8NI/s320/itinera_nova+-+15.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Okay.&amp;nbsp; Let's shift gears and talk about a different world.&amp;nbsp; This is the world of TEI, the Text Encoding Initiative.&amp;nbsp; It's regarded as the ultimate in mark-up -- &lt;a href="http://www.hki.uni-koeln.de/manfred-thaller-dr-phil-prof"&gt;Manfred [Thaller]&lt;/a&gt; mentioned it some time earlier.&amp;nbsp; It's been a standard since 1990, and it's ubiquitous in the world of scholarly editing.&amp;nbsp; &lt;br /&gt;
&lt;br /&gt;
Remember, up until recently, all scholarly editing was done by professionals.&amp;nbsp; These professionals were using offline tools to edit this XML which Manfred described as a "labyrinth of angle brackets."&amp;nbsp; It was never really designed to be hand-edited, but that's what we're doing.&amp;nbsp; &lt;br /&gt;
&lt;br /&gt;
And because it's ubiquitous and because it's old, there's a perception among at least some scholars, some editors, that this is just a 'boring old standard'.&amp;nbsp; I have a colleague who did a set of interviews with scholars about evaluating digital scholarship, and not all but some of the responses she got when she brought up TEI were &lt;i&gt;"TEI?&amp;nbsp; Oh, that's just for data entry."&lt;/i&gt;&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-vMmBUPmU7Xw/UX5KN3UC-UI/AAAAAAAAAqQ/MGeRmrGEmeY/s1600/itinera_nova+-+16.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-vMmBUPmU7Xw/UX5KN3UC-UI/AAAAAAAAAqQ/MGeRmrGEmeY/s320/itinera_nova+-+16.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Well, not quite.&amp;nbsp; TEI has some strengths.&amp;nbsp; It is an incredibly powerful data model.&amp;nbsp; The people who are doing this--these professionals who have been working with manuscripts for decades--they've developed very sophisticated ways of modeling additions to texts, deletions to texts, personal names, foreign terms -- all sorts of ways of marking this up.&amp;nbsp; &lt;br /&gt;
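&lt;i&gt;[ed: A minimal Python sketch of what that data model buys you. It builds one invented diary line using the TEI element names del, add, persName, and foreign, then derives a reading text by skipping deletions. The sentence, the helper name final_reading, and the omission of the TEI namespace and header are simplifying assumptions.]&lt;/i&gt;

```python
import xml.etree.ElementTree as ET

# Fully qualified name that ElementTree serializes as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Build the line element by element (invented example text).
line = ET.Element("p")
line.text = "Dined with "
name = ET.SubElement(line, "persName")
name.text = "Mr. Smith"
name.tail = " on "
deleted = ET.SubElement(line, "del")
deleted.text = "Monday"
added = ET.SubElement(line, "add", {"place": "above"})
added.text = "Tuesday"
added.tail = "; the "
foreign = ET.SubElement(line, "foreign", {XML_LANG: "fr"})
foreign.text = "potage"
foreign.tail = " was cold."


def final_reading(element):
    """Concatenate the text of an element, leaving out anything inside del."""
    parts = [element.text or ""]
    for child in element:
        if child.tag != "del":
            parts.append(final_reading(child))
        parts.append(child.tail or "")
    return "".join(parts)


tei_fragment = ET.tostring(line, encoding="unicode")
print(tei_fragment)         # the marked-up transcription
print(final_reading(line))  # the text as the author finally left it
```

&lt;i&gt;[ed: The same marked-up line could just as easily yield the first draft (keep deletions, skip additions) or an index of personal names, which plain text cannot do.]&lt;/i&gt;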
&lt;br /&gt;
It has great tools for presentation and analysis.&amp;nbsp; Notice I didn't say transcription.&lt;br /&gt;
&lt;br /&gt;
And it has a very active community, and that community is doing some really exciting things.&lt;br /&gt;
&lt;br /&gt;
I want to use just one example of something that has only been developed in the last four years.&amp;nbsp; It's a module that was created for TEI called the &lt;a href="http://wiki.tei-c.org/index.php/Genetic_Editions"&gt;Genetic Edition&lt;/a&gt; module.&amp;nbsp; A "genetic edition" is the idea of studying a text as it changes -- studying the changes that an author has made as they crossed through sections, created new sections, or over-wrote pieces.&amp;nbsp; &lt;br /&gt;
&lt;br /&gt;
So it's very sophisticated, and I want to show you the sorts of things you can do [with it] by demonstrating an example of one of these presentation tools by Elena Pierazzo and Julie Andre.&amp;nbsp; Elena's at King's College London, and they developed this last year.&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-QNHM_5jD-GY/UX5KOGsvVQI/AAAAAAAAAqU/U5DpSnRc-xQ/s1600/itinera_nova+-+17.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-QNHM_5jD-GY/UX5KOGsvVQI/AAAAAAAAAqU/U5DpSnRc-xQ/s320/itinera_nova+-+17.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
This is a draft of--I believe it's Proust's &lt;i&gt;Recherches du Temps Perdu&lt;/i&gt;--unfortunately I can't see up there.&amp;nbsp; But as you can see, this is a very complicated document.&amp;nbsp; The author has struck through sections and over-written them.&amp;nbsp; He's indicated parts moved.&amp;nbsp; He's even -- if you look over here -- he's pasted on an extra page to the bottom of this document.&amp;nbsp; So if you can transcribe this to indicate those changes, then you can visualize them.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-6iJym-5Zn2k/UX5KOOdX9kI/AAAAAAAAAqY/64AosrLQw_0/s1600/itinera_nova+-+18.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-6iJym-5Zn2k/UX5KOOdX9kI/AAAAAAAAAqY/64AosrLQw_0/s320/itinera_nova+-+18.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;i&gt;[Demo screenshots from the &lt;a href="http://research.cch.kcl.ac.uk/proust_prototype/"&gt;Proust Prototype&lt;/a&gt;.]&lt;/i&gt; And as you slide, you see transcripts appear on the page in the order that they're created, &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-X-4IK9K1KV4/UX5KObdqMHI/AAAAAAAAAqc/FbmQUEjYKxY/s1600/itinera_nova+-+19.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-X-4IK9K1KV4/UX5KObdqMHI/AAAAAAAAAqc/FbmQUEjYKxY/s320/itinera_nova+-+19.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
And in the order that they're deleted even.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-IhAAYlEebek/UX5KauwjNMI/AAAAAAAAArY/2fFQhCY2oDM/s1600/itinera_nova+-+23.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-IhAAYlEebek/UX5KauwjNMI/AAAAAAAAArY/2fFQhCY2oDM/s320/itinera_nova+-+23.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
There's even rotation and stuff -- &lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-7rwFxK94xXs/UX5Kb6FJ0sI/AAAAAAAAAr8/K3wfHHf5WHE/s1600/itinera_nova+-+26.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-7rwFxK94xXs/UX5Kb6FJ0sI/AAAAAAAAAr8/K3wfHHf5WHE/s320/itinera_nova+-+26.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
It's just a brilliant visualization!&lt;br /&gt;
&lt;br /&gt;
So this is the kind of thing that you can do with this powerful data model.&amp;nbsp;&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-MrQVfZPE1yU/UX5KbcFD34I/AAAAAAAAArw/biqsZ8KHs9Y/s1600/itinera_nova+-+27.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-MrQVfZPE1yU/UX5KbcFD34I/AAAAAAAAArw/biqsZ8KHs9Y/s320/itinera_nova+-+27.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
But how was that encoded? How did you get there?&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-G_wsWjKvklo/UX5Kbzg0JuI/AAAAAAAAAsE/195X5kDwJEM/s1600/itinera_nova+-+28.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-G_wsWjKvklo/UX5Kbzg0JuI/AAAAAAAAAsE/195X5kDwJEM/s320/itinera_nova+-+28.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Well, in this case, this is &lt;a href="http://www.tei-c.org/Activities/Council/Working/tcw19.html"&gt;an extension&lt;/a&gt; to that thousand-page book.&amp;nbsp; It's only about fifty pages long, printed, and it contains individual sets of guidelines.&amp;nbsp; In this case, this is how Henrik Ibsen clarified a letter.&amp;nbsp; In order to encode this, you use this &lt;code&gt;rewrite&lt;/code&gt; tag with a &lt;code&gt;cause&lt;/code&gt;...&amp;nbsp; And this is that forest of angle brackets; this is very hard.&amp;nbsp; And this is only one item from this document of instructions, which was small enough that I could cut it out and fit it on a slide.&amp;nbsp; &lt;br /&gt;
&lt;br /&gt;
So this is incredibly complex.&amp;nbsp; So if TEI is powerful; and if, as it gets more complex, it becomes harder to hand-encode; and as we start inviting members of the public and amateurs to participate in this work, how are we going to resolve this?&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-T85AI_BO5Hk/UX5Kb-tS-FI/AAAAAAAAAsA/VgAXedczFmk/s1600/itinera_nova+-+29.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-T85AI_BO5Hk/UX5Kb-tS-FI/AAAAAAAAAsA/VgAXedczFmk/s320/itinera_nova+-+29.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
If there's a fear about combining amateurs and mark-up, what do we do when we combine amateurs with TEI?&amp;nbsp; This is panic!&amp;nbsp; &lt;br /&gt;
&lt;br /&gt;
And it is very rarely attempted.&amp;nbsp; I maintain a directory of crowdsourced transcription tools, with multiple projects per tool.&amp;nbsp; And of the 29 projects in this directory, only 7 claim to support TEI.&amp;nbsp; &lt;br /&gt;
&lt;br /&gt;
One of them is Itinera Nova.&amp;nbsp; I found out about this when I was preparing a presentation for the TEI conference last year, in which I interviewed people running projects doing this crowdsourcing, and found out about their experience of users trying to encode in TEI, and asked, "Do you know anyone else?"&lt;br /&gt;
&lt;br /&gt;
And that's how I found out about Itinera Nova, which is unfortunately not very well known outside of Belgium.&amp;nbsp; This is something that I hope to be part of correcting, because you have a hidden gem here -- you really do.&amp;nbsp; It is amazing.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-1Y_zvuYiix8/UX5KcNyjl6I/AAAAAAAAAsI/UixawS1u8Hw/s1600/itinera_nova+-+30.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-1Y_zvuYiix8/UX5KcNyjl6I/AAAAAAAAAsI/UixawS1u8Hw/s320/itinera_nova+-+30.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So how do you support TEI?&amp;nbsp; Well, one approach--the most common approach--is to say we'll have our users enter TEI, but we'll give them help.&amp;nbsp; We'll create buttons that add tags, or menus that add tags.&amp;nbsp; This has been the approach taken by &lt;a href="http://t-pen.org/TPEN/"&gt;T-PEN&lt;/a&gt; (created by the Center for Digital Theology out of Saint Louis University), and a project associated with them, the &lt;a href="http://ccl.rch.uky.edu/"&gt;Carolingian Canon Law Project&lt;/a&gt;.&amp;nbsp; It's also the approach taken by Transcribe Bentham with their TEI toolbar.&amp;nbsp; Menus are an alternative, but essentially they do the same thing -- they're a way of keeping users from typing angle brackets.&amp;nbsp; So the &lt;a href="http://www.vdu.uni-koeln.de/vdu/home"&gt;Virtuelles deutsches Urkundennetzwerk&lt;/a&gt; is one of those, as well as the &lt;a href="http://papyri.info/"&gt;Papyrological Editor&lt;/a&gt; which is used by scholars studying Greek papyri. &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-5atiw-EL3Qs/UX5KcaNQi2I/AAAAAAAAAsQ/32je-hGtA9Y/s1600/itinera_nova+-+31.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-5atiw-EL3Qs/UX5KcaNQi2I/AAAAAAAAAsQ/32je-hGtA9Y/s320/itinera_nova+-+31.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So how well does that work?&amp;nbsp; You provide users with buttons that add tags to their text.&amp;nbsp; Here's an example from Transcribe Bentham.&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-uUhl5w5vq3M/UX5KcnJ0ZcI/AAAAAAAAAsU/KOY5DnA_QrU/s1600/itinera_nova+-+32.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-uUhl5w5vq3M/UX5KcnJ0ZcI/AAAAAAAAAsU/KOY5DnA_QrU/s320/itinera_nova+-+32.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Here's an example from &lt;a href="http://www.monasterium.net/"&gt;Monasterium&lt;/a&gt;.&amp;nbsp; And the results are still very complicated.&amp;nbsp; The presentation here is hard.&amp;nbsp; It's hard to read; it's hard to work with.&lt;br /&gt;
&lt;br /&gt;
That does not mean that amateurs cannot do it at all!&amp;nbsp; Certainly the experience of Transcribe Bentham proves that amateurs can perform to the same level as any professional transcriber, using these tools and coding these manuscripts, even without the background.&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-wgn15tandSk/UX5Kj-KlpvI/AAAAAAAAAtc/M8qKxd_8sw8/s1600/itinera_nova+-+33.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-wgn15tandSk/UX5Kj-KlpvI/AAAAAAAAAtc/M8qKxd_8sw8/s320/itinera_nova+-+33.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
But there are limitations.&amp;nbsp; One limitation is that users outgrow buttons.&amp;nbsp; In Transcribe Bentham, [the most active] users eventually just started typing the angle brackets themselves -- they returned to that labyrinth of angle brackets of TEI tags.&amp;nbsp; &lt;br /&gt;
&lt;br /&gt;
Another problem is more interesting to me, which is when users ignore buttons.&amp;nbsp; Here we have one editor who's dealing with German charters, who uses these double-pipes instead of the line break tag, because this is what he was used to from print.&amp;nbsp; This speaks to something very interesting, which is that we have users who are used to their own formats, they're used to their own languages for mark-up, they're used to their own notations from print editions that they have either read or created themselves.&amp;nbsp; And &lt;b&gt;by asking them to switch over to this style of tagging, we're asking them not just to learn something new, but also to abandon what they may already know.&lt;/b&gt;&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-pjTSkxzBWtg/UX5KdFtNyUI/AAAAAAAAAso/z_vWCmUe6dU/s1600/itinera_nova+-+34.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-pjTSkxzBWtg/UX5KdFtNyUI/AAAAAAAAAso/z_vWCmUe6dU/s320/itinera_nova+-+34.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
And, frankly, it's really hard to figure out which buttons [to support].&amp;nbsp; Abigail Firey of the Carolingian Canon Law Project talks about how when they were designing their interface, they had 67 buttons.&amp;nbsp; This is very hard to navigate, and the users would just give up and start typing angle brackets instead, because buttons aren't a magic solution.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-NSmaV2jZ7vU/UX5KdHRVA4I/AAAAAAAAAsk/0bsO6mExAMA/s1600/itinera_nova+-+35.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-NSmaV2jZ7vU/UX5KdHRVA4I/AAAAAAAAAsk/0bsO6mExAMA/s320/itinera_nova+-+35.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
This is where Itinera Nova comes in.&amp;nbsp; The "intermediate notation" that Professor Thaller was talking about is quite clear-cut, and it maps well to the print notations that volunteers are already used to.&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-cXT-26Za4cI/UX5KdxQjACI/AAAAAAAAAtA/16vUKxwaXuw/s1600/itinera_nova+-+36.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-cXT-26Za4cI/UX5KdxQjACI/AAAAAAAAAtA/16vUKxwaXuw/s320/itinera_nova+-+36.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
And what's interesting about this, and what many people may not realize, is that Itinera Nova--despite having a very clear, non-TEI interface--has full TEI under the hood.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-2C97_IClhQ8/UX5Kd1wKOAI/AAAAAAAAAtE/6mkXtYFLR5U/s1600/itinera_nova+-+37.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-2C97_IClhQ8/UX5Kd1wKOAI/AAAAAAAAAtE/6mkXtYFLR5U/s320/itinera_nova+-+37.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Everything is persisted in this TEI database, so the kinds of complex analysis that we talked about earlier--not necessarily the Proust genetic editions, but this kind of thing--is possible with the data that's being created.&amp;nbsp; It's not idiosyncratic.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-4RtnZj3-520/UX5Krn7CVkI/AAAAAAAAAto/HQsCfLKPuvc/s1600/itinera_nova+-+38.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-4RtnZj3-520/UX5Krn7CVkI/AAAAAAAAAto/HQsCfLKPuvc/s320/itinera_nova+-+38.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So as a result, I really think that in this, Itinera Nova points the way to the future.&amp;nbsp; Which is to &lt;b&gt;abandon this idea that TEI is just for data entry, or that amateurs cannot do mark-up&lt;/b&gt;.&amp;nbsp; Both of those ideas are bogus!&amp;nbsp; Instead, let's say: use TEI for the data model; for the presentation, so we have these beautiful sliders.&amp;nbsp; And whatever else will get created out of the annotation tool, out of the transcription tool, let's use that for the data model and for the presentation.&amp;nbsp; But let's consider hooking up these--I don't want to say "easier"--but these more straightforward, these more traditional user interfaces [for transcription]. &lt;br /&gt;
&lt;br /&gt;
This is something that I think is really the way forward for crowdsourced transcription.&amp;nbsp; It is being done right now by the Papyrological Editor, it has been done by Itinera Nova for a long time.&amp;nbsp; And there are now some incipient projects to move forward with this.&amp;nbsp; One of these is a new project at the University of Maryland, Maryland Institute for Technology in the Humanities, the Skylark project, in which they are taking those same transcription tools that were used for Old Weather to allow people to mark up and transcribe portions of an image of a literary text that has been heavily annotated--like that Proust--to create data using the data model that can be viewed with tools like the Proust viewer.&lt;br /&gt;
&lt;br /&gt;
So this is, I think, the technical contribution that Itinera Nova is making.&amp;nbsp; Obviously there are a lot more contributions--I mean I'm absolutely stunned by the interaction with the volunteer community that's happening here--but I'm staying on the technical track, so I'm not going to get into that.&amp;nbsp; &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Are there any questions?&amp;nbsp; No?&amp;nbsp; Keep up the great work -- you folks are amazing.&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/x-kLbWrVHu4" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2013/04/itinera-nova-in-worlds-of-crowdsourcing.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://4.bp.blogspot.com/-GlceWoSo9fw/UX5KK_TxXhI/AAAAAAAAApE/Jpx0vqOPdG4/s72-c/itinera_nova+-+01.jpg" height="72" width="72" /><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-6822027080827186327</guid><pubDate>Tue, 26 Feb 2013 20:59:00 +0000</pubDate><atom:updated>2013-02-27T12:38:13.508-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">crowdsourcing</category><category domain="http://www.blogger.com/atom/ns#">similar projects</category><category domain="http://www.blogger.com/atom/ns#">interview</category><title>Ngoni Munyaradzi on Transcribe Bleek and Lloyd</title><description>Ngoni Munyaradzi is a Master's student in Computer Science at the University of Cape Town, South Africa, working on a research project on the transcription of the Digital Bleek and Lloyd collection.  He kindly agreed to an interview over email, which I present below:&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;Your website does an excellent job explaining the background and
motivation of &lt;a href="http://boinc.cs.uct.ac.za/transcribe_bushman/"&gt;Transcribe Bleek and Lloyd&lt;/a&gt;.&amp;nbsp; Can you tell us more about the field
notebooks you are transcribing?&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
The Digital Bleek and Lloyd Collection is composed of dictionaries,
artwork and notebooks documenting
stories about the earliest inhabitants of Southern Africa, the Bushman
people. The notebooks were written
by &lt;a href="http://en.wikipedia.org/wiki/Wilhelm_Bleek"&gt;Wilhelm Bleek&lt;/a&gt;, his sister-in-law, &lt;a href="http://en.wikipedia.org/wiki/Lucy_Lloyd"&gt;Lucy Lloyd&lt;/a&gt; and Dorothea Bleek
(Wilhelm's daughter) in the 19th century,
with the help of a number of Bushmen people who were prisoners in the
Western Cape region of South Africa
at the time. The notebooks were recorded in the &lt;a href="http://en.wikipedia.org/wiki/%C7%80xam_language"&gt;|Xam&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/%C7%83Kung_language"&gt;!Kun&lt;/a&gt; languages
and English translations of these
languages are available in the notebooks.&lt;br /&gt;
&lt;br /&gt;
Link to the collection: &lt;a href="http://lloydbleekcollection.cs.uct.ac.za/"&gt;http://lloydbleekcollection.cs.uct.ac.za/&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;Correct me if I'm wrong, but it seems like at least in the case of
|Xam, you are working with one of the only representatives of an
extinct language.  Are there any standard data models for these kinds
of vocabularies/bilingual texts which you're using?&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
There are no complete models - the best known models are still only partial.
&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;I suspect that I'm not alone in wondering why these Bushman people
were prisoners during the writing of these texts.  Can you tell us a
bit more about the Bleek/Lloyd informants, or point us to resources
on the subject?&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
The Bushman people were prisoners because of petty crimes and a grossly
unfair colonial government.&amp;nbsp; On the Bleek and Lloyd website there is a
story on each contributor.&amp;nbsp; There is information in various books on the
subject as well, but I am not sure there is more that is known than what
is on the website. See:
&lt;br /&gt;
&lt;a href="http://lloydbleekcollection.cs.uct.ac.za/xam.html"&gt;http://lloydbleekcollection.cs.uct.ac.za/xam.html&lt;/a&gt;&lt;br /&gt;
&lt;a href="http://lloydbleekcollection.cs.uct.ac.za/kun.html"&gt;http://lloydbleekcollection.cs.uct.ac.za/kun.html&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;This is the first transcription project I'm aware of using the
&lt;a href="http://boinc.berkeley.edu/trac/wiki/BossaIntro"&gt;Bossa&lt;/a&gt; Crowd Create platform.  What are the factors that led you to
choose that platform and what's been your experience setting it up?&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
In 2011, when our project began, Bossa was the most mature open-source
crowdsourcing framework available that was tailored for volunteer projects,
so it suited the project's requirements well.
The alternative crowdsourcing frameworks available at the time were
payment-based.&lt;br /&gt;
&lt;br /&gt;
Setting up the Bossa framework was a relatively straightforward task.
The online documentation is very thorough and includes examples of how
to set up test applications. I also got assistance from David Anderson,
the developer of Bossa.&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;The Bushman writing system seems extremely complex with its
special characters and multiple diacritics.  I see that you are using
LaTeX macros to encode these complexities.  Why did you decide on
LaTeX and what has been the user response to using that notation?&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
So the project is part of ongoing research related to the Bleek and
Lloyd Collection within our
Digital Libraries Laboratory at the University of Cape Town. Credit for
developing the encoding
tool goes to Kyle Williams. He chose LaTeX because custom
LaTeX macros allowed both the encoding and the visual
rendering of the text to
be solved in a single step. Developing a unique font for the Bushman
script is something we
might look at in the future!&lt;br /&gt;
&lt;br /&gt;
Here's a link to a paper published on the encoding tool developed by
Kyle Williams:
&lt;a href="http://link.springer.com/chapter/10.1007%2F978-3-642-24826-9_28"&gt;http://link.springer.com/chapter/10.1007%2F978-3-642-24826-9_28&lt;/a&gt;&lt;br /&gt;
&lt;br /&gt;
Overall the user feedback has been good, as most users are able to
complete transcriptions using the LaTeX macros. We have gotten
suggestions from users to use glyphs to encode the complexities.
Currently the scope of my master's research project does not include
that. There are talks in our research group to develop a unique font to
represent the |Xam and !Kun languages, as this is not supported by
Unicode.&lt;br /&gt;
&lt;br /&gt;
User 1 Comment: "I think the palette handles the complexity of the
character
set very well. This material is inherently difficult to transcribe. The
tool has, on
the whole, been well thought out to meet this challenge. I think it
needs to be
improved in some ways, but considering the difficulties it is remarkably
well done."&lt;br /&gt;
&lt;br /&gt;
User 2 Comment: "VERY intuitive, after a few practice transcriptions. I
actually
enjoyed using the tool after a page was done."&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;This is incredibly useful.  So far as I'm aware, yours is only the
third crowdsourced transcription project that's surveyed users
seriously (after the North American Bird Phenology Project and
Transcribe Bentham).  Do you have any advice on collecting user
feedback at such an early stage?&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
Collecting user feedback in the early stages will tremendously
help project administrators determine whether the setup of the project
is easy to follow for participants. One can easily pick up any hindrances
to user participation and address these early. From our project, I've found
that participants can actually suggest very helpful ideas that will make
the data collection process better.
&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;Crowdsourced citizen science and cultural heritage projects have
mostly been based in the USA, Northern Europe and Australia until
recently -- in fact, yours is the first that I'm aware of originating
in sub-Saharan Africa.  I'd really like to know which projects
inspired your work with Transcribe Bushman, and what your hopes are
for crowdsourced transcription projects focusing on Africa?
&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
Our work was mostly inspired by the success of GalaxyZoo at recruiting
volunteers, and also the Transcribe Bentham project that explored the
feasibility of volunteers performing transcription. I hope that more
crowdsourced
transcription projects will start up within Africa in the near future.
What would be
interesting is to see a transcription project for the Timbuktu
manuscripts of Mali.
Beyond transcription, I would like to see other researchers adopting
crowdsourcing
in fields of specialty within Africa.
&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;Thanks so much for this interview.  If people want to help out on the
project, what's the best way for them to contribute?&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
Interested participants can simply:&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Create an account on the project website.&lt;/li&gt;
&lt;li&gt;Watch a 5 minute video tutorial on how to transcribe the Bushman languages.&lt;/li&gt;
&lt;li&gt;With that, you are ready to start transcribing pages.&lt;/li&gt;
&lt;/ol&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/E5JzQYog6UQ" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2013/02/ngoni-munyaradzi-on-transcribe-bleek.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-5602784745298633630</guid><pubDate>Mon, 25 Feb 2013 14:56:00 +0000</pubDate><atom:updated>2013-02-26T19:34:41.592-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">ocr</category><category domain="http://www.blogger.com/atom/ns#">hackathon</category><title>Detecting Handwriting in OCR Text</title><description>&lt;i&gt;This is my fourth and final post about the iDigBio Augmenting OCR Hackathon.&amp;nbsp; Prior posts covered the &lt;a href="http://manuscripttranscription.blogspot.com/2013/02/idigbio-augmenting-ocr-hackathon.html"&gt;hackathon itself,&lt;/a&gt; my &lt;a href="http://manuscripttranscription.blogspot.com/2013/02/improving-ocr-inputs-from-ocr-outputs.html"&gt;presentation&lt;/a&gt; on preliminary results, and my results &lt;a href="http://manuscripttranscription.blogspot.com/2013/02/results-of-ocrocrop-approach-to.html"&gt;improving the OCR&lt;/a&gt; on entomology specimens.&amp;nbsp; The other participants are&amp;nbsp; slowly adding their results to the &lt;a href="https://www.idigbio.org/wiki/index.php/2013_AOCR_Hackathon_Wiki"&gt;hackathon wiki&lt;/a&gt;, which I recommend checking back with (their efforts were much more impressive than mine).&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-1qHq0VPzwCM/USrM-Xx0lZI/AAAAAAAAAnM/5EUF9ohld84/s1600/TENN-L-0003111_lg.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="245" src="http://1.bp.blogspot.com/-1qHq0VPzwCM/USrM-Xx0lZI/AAAAAAAAAnM/5EUF9ohld84/s320/TENN-L-0003111_lg.jpg" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Clearly handwritten: &lt;i&gt;T&lt;/i&gt;=8, &lt;i&gt;N&lt;/i&gt;=78% from &lt;a href="https://github.com/idigbio-aocr/HandwritingDetection/blob/master/lichen/abby/TENN-L-0003111_lg.txt"&gt;terse&lt;/a&gt; and &lt;a href="https://github.com/idigbio-aocr/HandwritingDetection/blob/master/lichen/tesseract/TENN-L-0003111_lg.txt"&gt;noisy&lt;/a&gt; OCR files&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;br /&gt;
Let's say you have scanned a large number of cards and want to convert them from pixels into data.&amp;nbsp; The cards--which may be bibliography cards, crime reports, or (in this case) labels for lichen specimens--have these important attributes:&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;They contain structured data (e.g. title of book, author, call number, etc. for bibliographies) you want to extract, and&lt;/li&gt;
&lt;li&gt;They were part of a living database built over decades, so some cards are printed, some typewritten, some handwritten, and some with a mix of handwriting and type.&lt;/li&gt;
&lt;/ol&gt;
The structured aspect of the data makes it quite easy to build a web form that asks humans to transcribe what they see on the card images.&amp;nbsp; It also allows for sophisticated techniques for parsing and cleaning OCR (which was the point of the hackathon).&amp;nbsp; The actual keying-in of the images is time consuming and expensive, however, so you don't want to waste human effort on cards which could be processed via OCR.&lt;br /&gt;
&lt;br /&gt;
Since OCR doesn't work on handwriting, how do you know which images to route to the humans and which to process algorithmically?&amp;nbsp; It's simple: any images that contain handwriting should go to the humans.&amp;nbsp; Detecting the handwriting on the images is unfortunately not so simple.&lt;br /&gt;
&lt;br /&gt;
I adopted a quick-and-dirty approach for the hackathon: if OCR of handwriting produces gibberish, why not send all the images through a simple pass of OCR and look in the resulting text files for representative gibberish?&amp;nbsp; In my preliminary work, I pulled 1% of our sample dataset (&lt;a href="https://github.com/idigbio-aocr/HandwritingDetection/tree/master/lichen"&gt;all cards ending with "11"&lt;/a&gt;) and classified them three ways:&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Visual inspection of the text files produced by an ABBYY OCR engine,&lt;/li&gt;
&lt;li&gt;Visual inspection of the text files produced by the Tesseract OCR engine, and&lt;/li&gt;
&lt;li&gt;Looking at the actual images themselves.&lt;/li&gt;
&lt;/ol&gt;
&lt;a href="http://1.bp.blogspot.com/-MMlIWVCarlw/URz8wMLVhkI/AAAAAAAAAjM/NbLMu9ktnYQ/s1600/HackathonPresentation+-+05.jpg" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="297" src="http://1.bp.blogspot.com/-MMlIWVCarlw/URz8wMLVhkI/AAAAAAAAAjM/NbLMu9ktnYQ/s400/HackathonPresentation+-+05.jpg" width="400" /&gt;&lt;/a&gt;&lt;br /&gt;
To my surprise, I was only able to correctly classify cards from OCR output 80% of the time -- a disappointing finding, since any program I produced to identify handwriting from OCR output could only be less accurate.&amp;nbsp; More interesting was the difference between the kinds of files that ABBYY and Tesseract produced.&amp;nbsp; Tesseract produced a lot more gibberish in general--including on card images that were entirely printed.&amp;nbsp; ABBYY, on the other hand, scrubbed a lot of gibberish out of its results, including that which might be produced when it encountered handwriting.&lt;br /&gt;
&lt;br /&gt;
This suggested an approach: look at both the "terse" results from ABBYY and the "noisy" results from Tesseract to see if I could improve my classification rate.&lt;br /&gt;
&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-Vet6NACqe1A/USrOrGz2JRI/AAAAAAAAAn4/dMKl3jqCrAw/s1600/NY01075911_lg.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="225" src="http://2.bp.blogspot.com/-Vet6NACqe1A/USrOrGz2JRI/AAAAAAAAAn4/dMKl3jqCrAw/s320/NY01075911_lg.jpg" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Easily classified as type-only, despite (non-characteristic) gibberish: &lt;i&gt;T&lt;/i&gt;=0,&lt;i&gt;N&lt;/i&gt;=0 from &lt;a href="https://github.com/idigbio-aocr/HandwritingDetection/blob/master/lichen/abby/NY01075911_lg.txt"&gt;terse&lt;/a&gt; and &lt;a href="https://github.com/idigbio-aocr/HandwritingDetection/blob/master/lichen/tesseract/NY01075911_lg.txt"&gt;noisy&lt;/a&gt; OCR files.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;br /&gt;
But what does it mean to "look" at a file?&amp;nbsp; I wrote a program to loop through each line of an OCR file and check for the kind of gibberish that OCR characteristically produces when it encounters handwriting.&amp;nbsp; Inspecting the files reveals some common gibberish patterns, which we can sum up as regular expressions:&lt;br /&gt;
&lt;pre&gt;&lt;code&gt;GARBAGE_REGEXEN = {
  'Four Dots' =&amp;gt; /\.\.\.\./,
  'Five Non-Alphanumerics' =&amp;gt; /\W\W\W\W\W/,
  'Isolated Euro Sign' =&amp;gt; /\S€\D/,
  'Double "Low-Nine" Quotes' =&amp;gt; /„/,
  'Anomalous Pound Sign' =&amp;gt; /£\D/,
  'Caret' =&amp;gt; /\^/,
  'Guillemets' =&amp;gt; /[«»]/,
  'Double Slashes and Pipes' =&amp;gt; /(\\\/)|(\/\\)|([\/\\]\||\|[\/\\])/,
  'Bizarre Capitalization' =&amp;gt; /([A-Z][A-Z][a-z][a-z])|([a-z][a-z][A-Z][A-Z])|([A-LN-Z][a-z][A-Z])/,
  'Mixed Alphanumerics' =&amp;gt; /(\w[^\s\w\.\-]\w).*(\w[^\s\w]\w)/
}&lt;/code&gt;&lt;/pre&gt;
&lt;br /&gt;
However, some of these expressions match non-handwriting features like geographic coordinates or bar codes.&amp;nbsp; Handling these requires a whitelist of regular expressions for gibberish we know not to be handwriting:&lt;br /&gt;
&lt;pre&gt;&lt;code&gt;WHITELIST_REGEXEN = {
  'Four Caps' =&amp;gt; /[A-Z]{4,}/,
  'Date' =&amp;gt; /Date/,
  'Likely year' =&amp;gt; /1[98]\d\d|2[01]\d\d/,
  'N.S.F.' =&amp;gt; /N\.S\.F\.|Fund/,
  'Lat Lon' =&amp;gt; /Lat|Lon/,
  'Old style Coordinates' =&amp;gt; /\d\d°\s?\d\d['’]\s?[NW]/,
  'Old style Minutes' =&amp;gt; /\d\d['’]\s?[NW]/,
  'Decimal Coordinates' =&amp;gt; /\d\d°\s?[NW]/,
  'Distances' =&amp;gt; /\d?\d(\.\d+)?\s?[mkf]/,
  'Caret within heading' =&amp;gt; /[NEWS]\^s/,
  'Likely Barcode' =&amp;gt; /[l1\|]{5,}/,
  'Blank Line' =&amp;gt; /^\s+$/,
  'Guillemets as bad E' =&amp;gt; /d«t|pav«aont/
}&lt;/code&gt;&lt;/pre&gt;
&lt;br /&gt;
With these on hand, we can calculate a score for each file based on the number of occurrences of gibberish we find per line.&amp;nbsp; That score can then be compared against a threshold to determine whether a file contains handwriting.  Due to the noisiness of the Tesseract files, I found it most useful to calculate their score &lt;i&gt;N&lt;/i&gt; as a percentage of non-blank lines, while the score for the terse files &lt;i&gt;T&lt;/i&gt; worked best as a simple count of gibberish matches.&lt;br /&gt;
&lt;table border="1"&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Threshold
&lt;/th&gt;
&lt;th&gt;Correct
&lt;/th&gt;
&lt;th&gt;False&lt;br /&gt;
Positives
&lt;/th&gt;
&lt;th&gt;False&lt;br /&gt;
Negatives
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;

&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;i&gt;T&lt;/i&gt; &amp;gt; 1 and &lt;i&gt;N&lt;/i&gt; &amp;gt;  20%&lt;/td&gt;
&lt;td&gt;82%&lt;/td&gt;
&lt;td&gt;&lt;b&gt;10&lt;/b&gt; of 45&lt;/td&gt;
&lt;td&gt;&lt;b&gt;8&lt;/b&gt; of 60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;i&gt;T&lt;/i&gt; &amp;gt; 0 and &lt;i&gt;N&lt;/i&gt; &amp;gt;  20%&lt;/td&gt;
&lt;td&gt;84%&lt;/td&gt;
&lt;td&gt;&lt;b&gt;13&lt;/b&gt; of 45&lt;/td&gt;
&lt;td&gt;&lt;b&gt;4&lt;/b&gt; of 60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;i&gt;T&lt;/i&gt; &amp;gt; 1&lt;/td&gt;
&lt;td&gt;79%&lt;/td&gt;
&lt;td&gt;&lt;b&gt;10&lt;/b&gt; of 45&lt;/td&gt;
&lt;td&gt;&lt;b&gt;12&lt;/b&gt; of 60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;i&gt;N&lt;/i&gt; &amp;gt; 20%&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;&lt;b&gt;8&lt;/b&gt; of 45&lt;/td&gt;
&lt;td&gt;&lt;b&gt;18&lt;/b&gt; of 60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;i&gt;N&lt;/i&gt; &amp;gt; 10%&lt;/td&gt;
&lt;td&gt;81%&lt;/td&gt;
&lt;td&gt;&lt;b&gt;14&lt;/b&gt; of 45&lt;/td&gt;
&lt;td&gt;&lt;b&gt;6&lt;/b&gt; of 60&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
One interesting thing about this approach is that adjusting the thresholds lets us tune the classifications for resources and desired quality.  If our humans doing data entry are particularly expensive or impatient, raising the thresholds should ensure that they are only very rarely sent typed text.  On the other hand, lowering the thresholds would increase the human workload while improving quality of the resulting text.&lt;br /&gt;
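To make the scoring concrete, here is a minimal Ruby sketch rather than the code from the repository.  It uses only a few of the patterns above and assumes that a line counts as gibberish when it matches at least one garbage pattern and no whitelist pattern; the method names are hypothetical, and the default thresholds are simply the &lt;i&gt;T&lt;/i&gt; &amp;gt; 0 and &lt;i&gt;N&lt;/i&gt; &amp;gt; 20% row of the table.&lt;br /&gt;

```ruby
# Hypothetical sketch: the pattern lists are abbreviated and every method
# name here is illustrative, not taken from the HandwritingDetection code.
GARBAGE = [/\.\.\.\./, /\W\W\W\W\W/, /\^/]
WHITELIST = [/[A-Z]{4,}/, /1[98]\d\d|2[01]\d\d/, /Lat|Lon/, /[l1\|]{5,}/]

# Assumption: a line is gibberish if it matches any garbage pattern
# and none of the whitelist patterns.
def gibberish_lines(text)
  text.lines.select do |line|
    GARBAGE.any? { |re| line =~ re } and WHITELIST.none? { |re| line =~ re }
  end
end

# T: a simple count of gibberish lines in the terse OCR file.
def terse_score(text)
  gibberish_lines(text).size
end

# N: gibberish lines as a percentage of non-blank lines in the noisy file.
def noisy_score(text)
  non_blank = text.lines.reject { |line| line.strip.empty? }
  return 0.0 if non_blank.empty?
  100.0 * gibberish_lines(text).size / non_blank.size
end

# A card is flagged as containing handwriting only if both scores
# exceed their thresholds.
def handwriting?(terse_text, noisy_text, t_min: 0, n_min: 20.0)
  terse_score(terse_text) > t_min and noisy_score(noisy_text) > n_min
end
```

Passing a larger t_min or n_min to handwriting? flags fewer cards, so volunteers see less typed text; smaller values flag more cards and catch more handwriting.&lt;br /&gt;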
&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-P5kdZQYOu9M/USrLNz2f3rI/AAAAAAAAAnA/oQbB_eHyXOk/s1600/WIS-L-0013011_lg.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="213" src="http://3.bp.blogspot.com/-P5kdZQYOu9M/USrLNz2f3rI/AAAAAAAAAnA/oQbB_eHyXOk/s320/WIS-L-0013011_lg.jpg" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;One of the false negatives: &lt;i&gt;T&lt;/i&gt;=0, &lt;i&gt;N&lt;/i&gt;=10% from parsing &lt;a href="https://github.com/idigbio-aocr/HandwritingDetection/blob/master/lichen/abby/WIS-L-0013011_lg.txt"&gt;terse&lt;/a&gt; and &lt;a href="https://github.com/idigbio-aocr/HandwritingDetection/blob/master/lichen/tesseract/WIS-L-0013011_lg.txt"&gt;noisy&lt;/a&gt; text files.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;br /&gt;
I'm really pleased with this result.&amp;nbsp; The combined classifications are slightly better than I was able to accomplish by looking at the OCR myself.&amp;nbsp; A volunteer presented with 56 images containing handwriting and 13 that don't may need a "send to OCR" button in the user interface, but that experience must be less frustrating than working through the unclassified sample set, in which 45 of 105 images contain no handwriting.&amp;nbsp; With a different distribution of handwriting-to-type in the dataset, the process might be very useful for extracting rare typed material from a mostly-handwritten set, or vice versa. &lt;br /&gt;
&lt;br /&gt;
All of the datasets, code, and scored CSV files are in the iDigBio AOCR Hackathon's &lt;a href="https://github.com/idigbio-aocr/HandwritingDetection"&gt;HandwritingDetection repository&lt;/a&gt; on GitHub.&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/z4v1zXVIfA4" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2013/02/detecting-handwriting-in-ocr-text.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/-1qHq0VPzwCM/USrM-Xx0lZI/AAAAAAAAAnM/5EUF9ohld84/s72-c/TENN-L-0003111_lg.jpg" height="72" width="72" /><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-1288323735824653084</guid><pubDate>Fri, 15 Feb 2013 22:21:00 +0000</pubDate><atom:updated>2013-02-26T19:35:24.077-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">ocr</category><category domain="http://www.blogger.com/atom/ns#">hackathon</category><title>Results of the "Ocrocrop" Approach to Improving OCR</title><description>This project attempted to improve the quality of OCR applied to difficult
entomology images[*] by cropping labels from the images to run through OCR
separately.  In order to identify labels on the image to crop, an initial, 
'naive' pass of OCR was made over the whole image, generating both&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;A) a set of rectangles on the image defined as word bounding boxes by
the OCR engine, and&amp;nbsp;&lt;/li&gt;
&lt;li&gt;B) a control OCR text file to be used for comparing the 'naive' model with
the methodology.&lt;/li&gt;
&lt;/ul&gt;
Those word rectangles were then filtered, consolidated, and filtered again
to identify the labels on the image, which were then extracted and run through
the OCR engine separately.  The resulting OCR output files were then concatenated
into a single text file, which was compared against the 'naive' output described
in B (above).&lt;br /&gt;
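The consolidation step is not spelled out in this post, so the following Ruby sketch shows only one plausible way to do it: word boxes that lie within a small gap of each other are merged into a single label-sized box.  The 20-pixel gap, the hash layout, and all method names are assumptions for illustration, not values or code from the project.&lt;br /&gt;

```ruby
# Hypothetical consolidation: merge word boxes that sit within `gap`
# pixels of each other. Boxes are hashes with :x0, :y0, :x1, :y1.
def near?(a, b, gap)
  a[:x1] + gap >= b[:x0] and b[:x1] + gap >= a[:x0] and
    a[:y1] + gap >= b[:y0] and b[:y1] + gap >= a[:y0]
end

# Smallest rectangle that covers both boxes.
def union(a, b)
  { x0: [a[:x0], b[:x0]].min, y0: [a[:y0], b[:y0]].min,
    x1: [a[:x1], b[:x1]].max, y1: [a[:y1], b[:y1]].max }
end

# Fold each box into the running list, absorbing any box it touches
# and repeating until the grown box has no remaining neighbours.
def consolidate(boxes, gap = 20)
  merged = []
  boxes.each do |box|
    current = box
    loop do
      hit = merged.find { |other| near?(current, other, gap) }
      break if hit.nil?
      merged.delete(hit)
      current = union(current, hit)
    end
    merged.push(current)
  end
  merged
end
```

Each consolidated box would then be a candidate label to crop and send through OCR on its own.&lt;br /&gt;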
&lt;br /&gt;
I'll call this method "ocrocrop".  (For more detail on method, see the 
&lt;a href="http://manuscripttranscription.blogspot.com/2013/02/improving-ocr-inputs-from-ocr-outputs.html"&gt;transcript of my preliminary presentation.&lt;/a&gt;)&lt;br /&gt;
&lt;br /&gt;
The results were encouraging.  

(See &lt;a href="https://github.com/idigbio-aocr/LabelExtraction/blob/master/metrics/metrics.csv"&gt;CSV file listing results for each file&lt;/a&gt;, and the &lt;a href="https://github.com/idigbio-aocr/LabelExtraction/tree/master/metrics"&gt;directory&lt;/a&gt; containing "naive" output, annotated JPGs, and cleaned output files for each test.)&lt;br /&gt;
&lt;br /&gt;
Of 80 files tested, 20 experienced a 
decrease in score (see &lt;a href="https://docs.google.com/document/d/1rH5a8Qee2hwzlGZPisEr5-tV8nmDa50j1dxyUu0pY84/edit"&gt;Alex Thomson's scoring service&lt;/a&gt;), but most (14/20) of those were on
OCR output below 10% accuracy in the first place, and the remainder 
were at or below 20% accuracy.  So it is reasonable to say that the
ocrocrop method only degraded the quality of texts that were unusable
in the first place.&lt;br /&gt;
&lt;br /&gt;
40 of the 80 files tested showed more promising results, with improvements
of one to twenty percentage points -- in some cases only marginally improving
unusable (below 10% accurate) outputs, but in many cases improving the scores
more substantially (say from 25% to 35% in the case of EMEC609908_Stigmus_sp).&lt;br /&gt;
&lt;br /&gt;
Most of the top quartile of results saw improvements on texts that were already
scoring above 10% accuracy rates (16 of 20), so it appears that the effectiveness
of the ocrocrop method is correlated with the quality of the naive input data --
garbage is degraded or only minimally improved, while OCR that is merely bad
under the naive approach can be significantly improved.&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="https://raw.github.com/idigbio-aocr/LabelExtraction/master/metrics/EMEC609928_Stigmus_sp.hocr.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="https://raw.github.com/idigbio-aocr/LabelExtraction/master/metrics/EMEC609928_Stigmus_sp.hocr.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
The ocrocrop method saw the greatest improvement in cases
where the naive OCR pass was effective at identifying word bounding boxes,
but ineffective at translating their contents into words.  Taking
EMEC609928_Stigmus_sp, the case of greatest improvement (naive: 18.9%, ocrocrop: 70.5%),
we see that all words on the labels except for the collector name were recognized as
words (in purple), making the cropped label images (in blue) good representatives of the 
actual labels on the image.&lt;br /&gt;
&lt;br /&gt;
The cropped image was more easily processed by our OCR engine, so that we may
compare the naive version of the second label:
&lt;br /&gt;
&lt;pre&gt; CALIF:Hunbo1dt Co. ;‘ ~
 3 m1.N' Garbervﬂle ,::f&amp;lt; '_- '
 v—23~75 n.n1e:z.' 9 ._ ’&lt;/pre&gt;
with the ocrocrop version of the second label:
&lt;br /&gt;
&lt;pre&gt; CALIF:Humboldt Co.
 3 mi.N Garberville
 V-23-76 R.Dietz,'&lt;/pre&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="https://raw.github.com/idigbio-aocr/LabelExtraction/master/metrics/EMEC609651_Cerceris_completa.hocr.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="https://raw.github.com/idigbio-aocr/LabelExtraction/master/metrics/EMEC609651_Cerceris_completa.hocr.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
One of the problems with the OCR-based pre-processing which may be hidden
by the scores is that many labels are entirely missed by ocrocrop if the
first, naive OCR pass fails to identify any words at all on the label.  In cases
such as EMEC609651_Cerceris_completa, the determination label was not cropped (crops are indicated by blue rectangles) because
no words (purple rectangles) were detected by the naive pass.  As a result,
while the ocrocrop OCR is a slight improvement over the naive OCR (6.6% vs. 6.5%),
substantial portions of text on the image are unimproved because they are
unattempted.&lt;br /&gt;
&lt;br /&gt;
There are two possible ways to solve this problem.  One is to abandon the
ocrocrop model entirely, switching back to a computer vision approach -- 
either by programmatically locating rectangles on the image (as Phuc Nguyen demonstrated)
 or by asking humans to identify regions of interest for OCR processing (as 
 demonstrated by Jason Best in Apiary and by Paul and Robin Schroeder in ScioTR).  The other 
 option is to improve the naive OCR -- perhaps by swapping out the engine
 (e.g. use ABBYY instead of Tesseract), perhaps by using a different image
 pre-processor (like ocropus's front-end to Tesseract), perhaps by re-training
 Tesseract.&lt;br /&gt;
&lt;br /&gt;
I suspect that a computer vision approach to extracting entomology labels (or 
similar pieces of paper photographed against a noisy background) will provide 
a more effective eventual solution than the ocrocrop method.  Nevertheless, 
the ocrocrop "bang it with a rock until it works" approach has a lot of potential
to take entomology-style OCR &lt;b&gt;to&lt;/b&gt; bad &lt;b&gt;from&lt;/b&gt; worse.&lt;br /&gt;
&lt;br /&gt;
[*] In addition to the difficulties typical of specimen labels--a mix of typefaces,
handwritten material, typewritten material, text inventory with few overlaps with
a dictionary of literary English--the entomology dataset contained additional
challenges.  Difficulties included the following:&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;Images containing specimens and rulers as well as labels.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;Labels casually arranged for photography, so that text orientation was not necessarily aligned.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;Labels photographed against a background of heavily pin-pricked styrofoam rather than a black or neutral background.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;3-D images including what appear to be shadows, which soften the contrast differences around borders.&lt;/li&gt;
&lt;/ul&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/uj1oqhYITP4" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2013/02/results-of-ocrocrop-approach-to.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-5982152672460529955</guid><pubDate>Fri, 15 Feb 2013 22:00:00 +0000</pubDate><atom:updated>2013-02-26T19:35:50.652-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">ocr</category><category domain="http://www.blogger.com/atom/ns#">hackathon</category><title>iDigBio Augmenting OCR Hackathon</title><description>I spent the last three days at the &lt;a href="https://www.idigbio.org/wiki/index.php/2013_AOCR_Hackathon_Wiki"&gt;iDigBio Augmenting OCR Hackathon&lt;/a&gt; working alongside mycologists, botanists, entomologists, herbarium managers, and bioinformaticians to explore ways to improve parsing of digitized specimen labels.&amp;nbsp; While I'm pleased with the results of my own contribution, I'd like to take a minute to talk about the hackathon process itself before I post them.&lt;br /&gt;
&lt;br /&gt;
This was my first hackathon--a condition which seemed to be the rule among the participants--and I was really impressed with it.&amp;nbsp; The iDigBio folks defined a clear set of goals (improve OCR parsing of specimen labels) with clear metrics (these datasets, these output formats, this scoring algorithm) a couple of months beforehand, and organized five weekly videoconferences before the event.&amp;nbsp; Most important of all, the participants were encouraged to prepare a 10-minute lightning talk on their efforts and preliminary results.&amp;nbsp; (See below for the &lt;a href="http://manuscripttranscription.blogspot.com/2013/02/improving-ocr-inputs-from-ocr-outputs.html"&gt;transcript of my talk&lt;/a&gt;, see the &lt;a href="https://docs.google.com/document/d/1rH5a8Qee2hwzlGZPisEr5-tV8nmDa50j1dxyUu0pY84/edit"&gt;notes document&lt;/a&gt; for descriptions of all talks.)&lt;br /&gt;
&lt;br /&gt;
In my opinion, these preliminary talks were critical to the success of the project.&amp;nbsp; The preliminary nature relaxed pressure on participants, so we were able to experiment beyond the target of the hackathon (as I did with my handwriting detection digression, a related, but un-scorable effort).&amp;nbsp; On the other hand, they did provide enough impetus to get many of us looking at the data, working with the tools, and thinking about approaches.&amp;nbsp; This meant that even before the hackathon started, many of us were familiar enough with the materials to have a real 'meeting of the minds' experience during the pre-event supper:&amp;nbsp; &lt;i&gt;"Did you just say 'the contrast difference between the print and the label is higher than the difference between the label and the background'?&amp;nbsp; We ran into that too, and here's what we did..."&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
The experience was a real education in OCR for me, and I feel like I picked up techniques I can apply directly to projects I've discussed with clients and potential clients.&amp;nbsp; In particular, I got a real appreciation for how interrelated image preparation, OCR, and parsing are.&amp;nbsp; One participant had created separate libraries of regular expressions to clean up each kind of field, having discovered that latitude/longitude coordinates require different error correction than personal names or herbarium catalog numbers do.&amp;nbsp; Another group had built a touch-screen tool for classifying segments of the image before submitting them to OCR.&amp;nbsp; My own project required a first pass of OCR to clean images before sending them to a second, 'real' pass of OCR.&amp;nbsp; A simple 1,2,3 workflow just isn't sufficient!&lt;br /&gt;
&lt;br /&gt;
&lt;a href="https://www.idigbio.org/content/about-idigbio"&gt;iDigBio&lt;/a&gt; itself is an NSF-funded attempt to advance digitization practices on natural history collections, combining disciplinary "thematic collection networks" and methodologically focused working groups on topics like georeferencing, crowdsourcing, and OCR.&amp;nbsp; Aware that they're not the only people digitizing things, they have been reaching out beyond the natural sciences to the library and information science community at the &lt;a href="https://www.idigbio.org/wiki/index.php/Five_iConference2013_Talks"&gt;iConference&lt;/a&gt; this year.&amp;nbsp; This rejection of "not invented here" siloing was a big part of the hackathon, and I hope that more people from outside the natural sciences will get involved.&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/F47gdvy2Tso" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2013/02/idigbio-augmenting-ocr-hackathon.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-8413741380955571419</guid><pubDate>Thu, 14 Feb 2013 15:32:00 +0000</pubDate><atom:updated>2013-02-26T19:36:41.209-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">presentations</category><category domain="http://www.blogger.com/atom/ns#">ocr</category><category domain="http://www.blogger.com/atom/ns#">hackathon</category><title>Improving OCR Inputs from OCR Outputs?</title><description>This is a transcript of my talk at the &lt;a href="https://www.idigbio.org/wiki/index.php/IConference_2013_iDigBio_AOCR_WG_Wiki"&gt;iDigBio Augmenting OCR Hackathon&lt;/a&gt;, presenting preliminary results of my efforts before the event.&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-Sk1_ud5Sm5k/URz8vupvm8I/AAAAAAAAAi8/-QvKR7DSmlw/s1600/HackathonPresentation+-+01.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://2.bp.blogspot.com/-Sk1_ud5Sm5k/URz8vupvm8I/AAAAAAAAAi8/-QvKR7DSmlw/s320/HackathonPresentation+-+01.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
For my preliminary work, I tried to improve the inputs to our OCR process through looking at the outputs of a naive OCR.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-zELTfJgfG4U/URz8vlRNaNI/AAAAAAAAAi0/pF5a2g8PFvo/s1600/HackathonPresentation+-+02.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://1.bp.blogspot.com/-zELTfJgfG4U/URz8vlRNaNI/AAAAAAAAAi0/pF5a2g8PFvo/s320/HackathonPresentation+-+02.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
One of the first things that we can do to improve the quality of our inputs to OCR is to not feed them handwriting.&amp;nbsp; To quote Homer Simpson, "Remember son, if you don't try, you can't fail."&amp;nbsp; So let's not try feeding our OCR processes handwritten materials.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-NVqW703Kbgo/URz8vie72VI/AAAAAAAAAi4/f7q0Qlnlkoo/s1600/HackathonPresentation+-+03.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://3.bp.blogspot.com/-NVqW703Kbgo/URz8vie72VI/AAAAAAAAAi4/f7q0Qlnlkoo/s320/HackathonPresentation+-+03.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
To do this, we need to try to detect the presence of handwriting.&amp;nbsp; When you try to feed handwriting to OCR, you get a lot of gibberish.&amp;nbsp; If we can detect handwriting, we can route some of our material to "humans in the loop" -- not wasting their time with things we could be OCRing.&amp;nbsp; So how do we do this?&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-aDRuN7yudaY/URz8wK8SE3I/AAAAAAAAAjA/nJtKXHwOMYA/s1600/HackathonPresentation+-+04.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://3.bp.blogspot.com/-aDRuN7yudaY/URz8wK8SE3I/AAAAAAAAAjA/nJtKXHwOMYA/s320/HackathonPresentation+-+04.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
My approach was to use the outputs of [naive] OCR to detect the gibberish it produces when it sees handwriting to try to determine when there was handwriting present in the images.&amp;nbsp; The first thing I did, before I started programming, was to classify OCR output from the lichen samples by visual inspection: whether I thought there was handwriting present or not, based on looking at the OCR outputs.&amp;nbsp; Step two was to automate the classifications.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-MMlIWVCarlw/URz8wMLVhkI/AAAAAAAAAjM/NbLMu9ktnYQ/s1600/HackathonPresentation+-+05.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://1.bp.blogspot.com/-MMlIWVCarlw/URz8wMLVhkI/AAAAAAAAAjM/NbLMu9ktnYQ/s320/HackathonPresentation+-+05.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
I tried this initially on the results that came out of ABBYY and then the results that came out of Tesseract, and I was really surprised by how hard it was for me as a human to spot gibberish.&amp;nbsp; I could spot it, but in a lot of cases -- ABBYY does a great job of cleaning up its OCR output -- so in a lot of cases, particularly the labels that were all printed with the exception of some species name that was handwritten, ABBYY generally misses those.&amp;nbsp; Tesseract, on the other hand, does not produce outputs that are quite as clean.&lt;br /&gt;
&lt;br /&gt;
So the really interesting thing about this to me is that while we were able to get 70-75% accuracy on both ABBYY and Tesseract, if you look at the difference between the false positives that come out of ABBYY and Tesseract and the false negatives, I think there is some real potential here for making a much more sophisticated algorithm.&amp;nbsp; Maybe the goal is to pump things through ABBYY for OCR, but beforehand look at Tesseract output to determine whether there is handwriting or not.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-w2f1W3DurAI/URz8wYYm3cI/AAAAAAAAAjE/SwcBIBhcsNo/s1600/HackathonPresentation+-+06.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://2.bp.blogspot.com/-w2f1W3DurAI/URz8wYYm3cI/AAAAAAAAAjE/SwcBIBhcsNo/s320/HackathonPresentation+-+06.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
The next thing I did was try to automate this.&amp;nbsp; I just used some regular expressions to look for representative gibberish, and then based on the number of matches got results that matched the visual inspection, though you do get some false positives.&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-iBu0mT3ScRU/URz8whCcA4I/AAAAAAAAAjI/vDNnpFzpErg/s1600/HackathonPresentation+-+07.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://1.bp.blogspot.com/-iBu0mT3ScRU/URz8whCcA4I/AAAAAAAAAjI/vDNnpFzpErg/s320/HackathonPresentation+-+07.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
The next thing I want to do with this is to come up with a way to filter the results based on doing a detection on ABBYY [output] and doing a detection on Tesseract [output].&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-_b429XTiHVA/URz8w_NlEPI/AAAAAAAAAjQ/Gk493rnzJgc/s1600/HackathonPresentation+-+08.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://4.bp.blogspot.com/-_b429XTiHVA/URz8w_NlEPI/AAAAAAAAAjQ/Gk493rnzJgc/s320/HackathonPresentation+-+08.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
The next thing that I wanted to work on was label extraction.&lt;br /&gt;
&lt;br /&gt;
We're all familiar with the entomology labels and problems associated with them.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-Kygc6ZH8AbE/URz8xPU_TEI/AAAAAAAAAjg/-76lboCZF8Q/s1600/HackathonPresentation+-+09.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://3.bp.blogspot.com/-Kygc6ZH8AbE/URz8xPU_TEI/AAAAAAAAAjg/-76lboCZF8Q/s320/HackathonPresentation+-+09.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So if you pump that image of &lt;i&gt;Cerceris&lt;/i&gt; through Tesseract, you end up with a lot of garbage. You end up with a lot of gibberish, a lot of blank lines, some recognizable words.&amp;nbsp; That "Cerceris compacta" is, I believe, the result of a post-digitization process: it looks like an artifact of somebody using Photoshop or ImageMagick to add labels to the image.&amp;nbsp; The rest of it is the actual label contents, and it's pretty horrible.&amp;nbsp; We've all stared at this; we've all seen it.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-z-DurWlq-4k/URz8xCfrjmI/AAAAAAAAAjc/YX56f10VsQ4/s1600/HackathonPresentation+-+10.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://2.bp.blogspot.com/-z-DurWlq-4k/URz8xCfrjmI/AAAAAAAAAjc/YX56f10VsQ4/s320/HackathonPresentation+-+10.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So how do you sort the labels in these images from rulers, holes in styrofoam, and bugs?&amp;nbsp; I tried a couple of approaches.&amp;nbsp; I first tried to traverse the image itself, looking for contrast differences between the more-or-less white labels and their backgrounds.&amp;nbsp; The problem I found with that was that the highest contrast regions of the image are the difference between print and the labels behind the print.&amp;nbsp; So you're looking for a fairly low-contrast difference--and there are shadows involved.&amp;nbsp; Probably, if I had more math I could do this, but this was too hard.&lt;br /&gt;
&lt;br /&gt;
So my second try was to use the output of OCR that produces these word bounding boxes to determine where labels might be, because labels have words on them.&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-UfaX_0HxFD8/URz8xf2rSPI/AAAAAAAAAjY/Kk40eeSYtm8/s1600/HackathonPresentation+-+11.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://4.bp.blogspot.com/-UfaX_0HxFD8/URz8xf2rSPI/AAAAAAAAAjY/Kk40eeSYtm8/s320/HackathonPresentation+-+11.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
If you run Tesseract or Ocropus with an "hocr" option, you get these pseudo-HTML files that have bounding boxes around the text.&amp;nbsp; Here you see this text element inside a span; the span has these HTML attributes that say "this is an OCR word".&amp;nbsp; Most importantly, you have the title attribute as the bounding box definition of a rectangle.&amp;nbsp; &lt;br /&gt;
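As a small illustration, the rectangle can be pulled out of that title attribute with a regular expression.  This Ruby sketch assumes the title value contains the usual hOCR form "bbox x0 y0 x1 y1" (upper-left and lower-right corners, in pixels); parse_bbox is a hypothetical helper name, not something from the talk.&lt;br /&gt;

```ruby
# Hypothetical helper: read the rectangle from an hOCR title value such as
# "bbox 100 200 300 260". Anything after the four numbers is ignored.
BBOX_PATTERN = /bbox (\d+) (\d+) (\d+) (\d+)/

def parse_bbox(title)
  match = BBOX_PATTERN.match(title)
  return nil if match.nil?
  x0, y0, x1, y1 = match.captures.map { |n| n.to_i }
  { x0: x0, y0: y0, x1: x1, y1: y1, width: x1 - x0, height: y1 - y0 }
end
```

The width and height are derived here because later steps care about the area of each rectangle.&lt;br /&gt;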
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-opK46w_pvHA/URz8x9-Sv7I/AAAAAAAAAjo/U9lwWBkcocc/s1600/HackathonPresentation+-+12.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://3.bp.blogspot.com/-opK46w_pvHA/URz8x9-Sv7I/AAAAAAAAAjo/U9lwWBkcocc/s320/HackathonPresentation+-+12.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
If you extract that and re-apply it to an image, you see that there are a lot of rectangles on the image, but not all the rectangles are words.&amp;nbsp; You've got bees, you've got rulers; you've got a lot of random trash in the styrofoam.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-YSyi1njD2F4/URz8yIOUdII/AAAAAAAAAjw/oNdQt6zwHF0/s1600/HackathonPresentation+-+13.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://4.bp.blogspot.com/-YSyi1njD2F4/URz8yIOUdII/AAAAAAAAAjw/oNdQt6zwHF0/s320/HackathonPresentation+-+13.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So how do we sort good rectangles from bad rectangles?&amp;nbsp; First I did a pass looking at the OCR text itself.&amp;nbsp; If the bounding box was around text that looked like a word, I decided that this was a good rectangle.&amp;nbsp; Next, I did a pass by size.&amp;nbsp; A lot of the dots in the styrofoam come out looking suspiciously word-like for reasons I don't understand.&amp;nbsp; So if the area of the rectangle was smaller than 0.015% of the image, I threw it away. &lt;br /&gt;
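The two passes can be sketched roughly as below.&amp;nbsp; The 0.015% area threshold comes from the talk; the "looks like a word" test (three consecutive letters) is a hypothetical stand-in, since the talk does not say which rule was actually used.&lt;br /&gt;

```python
import re

MIN_AREA_FRACTION = 0.00015  # 0.015% of the image area, the threshold from the talk


def looks_like_word(text):
    # Hypothetical heuristic: at least three consecutive ASCII letters.
    return re.search(r"[A-Za-z]{3}", text) is not None


def is_good_rectangle(text, box, image_width, image_height):
    """Keep a box only if its text is word-like and the box is not tiny."""
    left, top, right, bottom = box
    area = (right - left) * (bottom - top)
    big_enough = area >= MIN_AREA_FRACTION * image_width * image_height
    return looks_like_word(text) and big_enough


# Illustrative values: a 2000 x 1500 image gives a cutoff of about 450 square pixels.
print(is_good_rectangle("Arizona", (0, 0, 100, 30), 2000, 1500))
print(is_good_rectangle("Arizona", (0, 0, 10, 10), 2000, 1500))
```

A letters-only test like this one would reject purely numeric text such as dates, so a real filter would probably need a looser rule.&lt;br /&gt;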
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-mrm-6-xuqa4/URz8yNoMOLI/AAAAAAAAAj0/YVQTOfHEmYQ/s1600/HackathonPresentation+-+14.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://3.bp.blogspot.com/-mrm-6-xuqa4/URz8yNoMOLI/AAAAAAAAAj0/YVQTOfHEmYQ/s320/HackathonPresentation+-+14.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
The result was [above]: you see rectangles marked with green that pass my filter and rectangles marked with red that don't.&amp;nbsp; So you get rid of the bee, you get rid of part of the ruler -- more important, you get rid of a lot of the trash over here. [Pointing to small red rectangles on styrofoam.] There are some bugs in this--we end up getting rid of "Arizona" for reasons I need to look at--but it does clean the thing up pretty nicely.&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;Question:&lt;/i&gt; A very simple solution to this would be for the guys at Berkeley to take two photographs -- one of the bee and ruler, one of the labels.&amp;nbsp; I'm just thinking how much simpler that would be.&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;Me:&lt;/i&gt; If the guys in Berkeley had a workflow that took the picture--even with the bee--against a black background, that would trivialize this problem completely!&amp;nbsp; &lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;Question:&lt;/i&gt; If the photos were taken against a background of wallpaper with random letters, it couldn't be much worse than this [styrofoam].&amp;nbsp; The idea is that you could make this a lot easier if you would go to the museums and say, we'll participate, we'll do your OCRing, but you must take photographs this way.&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;Me:&lt;/i&gt; You're absolutely right.&amp;nbsp; You could even hand them a piece of cardboard that was a particular color and say, "Use this and we'll do it for you, don't use it and we won't."&amp;nbsp; I completely agree.&amp;nbsp; But this is what we're starting with, so this is what I'm working on.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-o3CcCwIhjnM/URz8yRM3D1I/AAAAAAAAAj8/pc5mmgBLO2k/s1600/HackathonPresentation+-+15.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://1.bp.blogspot.com/-o3CcCwIhjnM/URz8yRM3D1I/AAAAAAAAAj8/pc5mmgBLO2k/s320/HackathonPresentation+-+15.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
The next thing is to aggregate all those word boxes into the labels [they constitute]. For each rectangle, look at all of the other rectangles in the system, expand them both a little bit, determine if they overlap, and if they do, consolidate them into a new rectangle, and repeat the process until there are no more consolidations to be done. &lt;i&gt;[Thanks to Sara Brumfield for this algorithm.]&lt;/i&gt;&lt;br /&gt;
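Here is one way the consolidation loop described above might look.&amp;nbsp; This is a sketch under assumptions, not the original implementation: boxes are (left, top, right, bottom) tuples, the 10-pixel margin is an arbitrary illustrative value, and the "merged" flag is an addition that records whether a box came out of a consolidation.&lt;br /&gt;

```python
def expand(box, margin):
    left, top, right, bottom = box
    return (left - margin, top - margin, right + margin, bottom + margin)


def overlaps(a, b):
    # Two boxes overlap when they intersect on both the x axis and the y axis.
    return a[2] >= b[0] and b[2] >= a[0] and a[3] >= b[1] and b[3] >= a[1]


def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def consolidate(boxes, margin=10):
    """Merge boxes until no expanded pair overlaps.

    Returns (box, merged) pairs; merged is True if the box was built
    by consolidating two or more input boxes.
    """
    items = [(box, False) for box in boxes]
    changed = True
    while changed:
        changed = False
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                first, second = items[i][0], items[j][0]
                if overlaps(expand(first, margin), expand(second, margin)):
                    # Replace the pair with their union, then rescan from the start.
                    rest = [item for k, item in enumerate(items) if k not in (i, j)]
                    items = rest + [(union(first, second), True)]
                    changed = True
                    break
            if changed:
                break
    return items


# Two neighbouring word boxes and one isolated box (illustrative coordinates).
print(consolidate([(0, 0, 50, 20), (55, 0, 100, 20), (500, 500, 600, 520)]))
```

Restarting the scan after every merge is simple rather than fast, which should be acceptable for the few hundred word boxes on a single specimen image.&amp;nbsp; Boxes whose flag is still False were never merged with anything.&lt;br /&gt;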
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-Oif2nY6ndYE/URz8ygytdgI/AAAAAAAAAkE/VezUnKzuZdA/s1600/HackathonPresentation+-+16.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://4.bp.blogspot.com/-Oif2nY6ndYE/URz8ygytdgI/AAAAAAAAAkE/VezUnKzuZdA/s320/HackathonPresentation+-+16.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
If you do that, the blue boxes are the consolidated rectangles.&amp;nbsp; Here you see a rectangle around the U.C. Berkeley label, a rectangle around the collector, and a pretty glorious rectangle around the determination that does not include the border.&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-F1RD8W1_B04/URz8y06m-VI/AAAAAAAAAkA/84pG5nyKoOA/s1600/HackathonPresentation+-+17.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://3.bp.blogspot.com/-F1RD8W1_B04/URz8y06m-VI/AAAAAAAAAkA/84pG5nyKoOA/s320/HackathonPresentation+-+17.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Having done that, you want to further filter those rectangles.&amp;nbsp; Labels contain words, so you can reject any rectangles that were "primitives" -- you can get rid of the ruler rectangle, for example, because it was just a single [primitive] rectangle that was pretty large.&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-RcKK07OJjc4/URz8zKTCibI/AAAAAAAAAkM/YLSCTJ_A9gI/s1600/HackathonPresentation+-+18.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://2.bp.blogspot.com/-RcKK07OJjc4/URz8zKTCibI/AAAAAAAAAkM/YLSCTJ_A9gI/s320/HackathonPresentation+-+18.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So you make sure that all of your rectangles were created through consolidation, then you crop the results.&amp;nbsp; And you end up automatically extracting these images from that sample -- some of which are pretty good, some of which are not.&amp;nbsp; We've got some extra trash here, we cropped the top of "Arizona" here.&amp;nbsp; But for some of the labels -- I don't think I could do better than that determination label by hand.&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-o5iZI1Qo99k/URz8zMFAAqI/AAAAAAAAAkI/WHBXDtPcPv8/s1600/HackathonPresentation+-+19.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://4.bp.blogspot.com/-o5iZI1Qo99k/URz8zMFAAqI/AAAAAAAAAkI/WHBXDtPcPv8/s320/HackathonPresentation+-+19.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Then you feed the results back into Tesseract one by one, then we combine the text files in Y-axis order to produce a single file for all those images.&amp;nbsp; (Not something that's a necessary step, but that does allow us to compare the results with the "raw" OCR.)&amp;nbsp; How did we do?&lt;br /&gt;
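Combining the per-label text in Y-axis order amounts to a sort on the top edge of each crop.&amp;nbsp; A minimal sketch, assuming each crop is kept as a (box, text) pair with box = (left, top, right, bottom); the function name and sample strings are illustrative.&lt;br /&gt;

```python
def combine_in_y_order(crops):
    """Join the OCR text of each crop, ordered from the top of the image down."""
    ordered = sorted(crops, key=lambda crop: crop[0][1])
    return "\n".join(text.strip() for _box, text in ordered)


# Illustrative crops, listed out of order on purpose.
crops = [
    ((40, 300, 200, 340), "determination label\n"),
    ((30, 20, 180, 60), "collection label\n"),
]
print(combine_in_y_order(crops))
```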
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-dYiLSY-HSl8/URz8zgEK2bI/AAAAAAAAAkY/UVdXww83aLQ/s1600/HackathonPresentation+-+20.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://1.bp.blogspot.com/-dYiLSY-HSl8/URz8zgEK2bI/AAAAAAAAAkY/UVdXww83aLQ/s320/HackathonPresentation+-+20.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
This is a resulting text file -- we've got a date that's pretty recognizable, we've got a label that's recognizable, and the determination is pretty nice.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-1quej69m-LY/URz8zvtmKqI/AAAAAAAAAkU/l9psSkMIUkE/s1600/HackathonPresentation+-+21.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://1.bp.blogspot.com/-1quej69m-LY/URz8zvtmKqI/AAAAAAAAAkU/l9psSkMIUkE/s320/HackathonPresentation+-+21.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Let's compare it to the raw result.&amp;nbsp; In the cropped results, we somehow missed the "Cerceris compacta", we did a much nicer job on the date, and the determination is actually pretty nice.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-ef_UAF8-qPw/URz8zsn1WeI/AAAAAAAAAkg/Feaw0yIrMD0/s1600/HackathonPresentation+-+22.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://2.bp.blogspot.com/-ef_UAF8-qPw/URz8zsn1WeI/AAAAAAAAAkg/Feaw0yIrMD0/s320/HackathonPresentation+-+22.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Let's try it on a different specimen image.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-sNAVDp_DSAU/URz8z7xccGI/AAAAAAAAAkc/bHtcSWqUoeM/s1600/HackathonPresentation+-+23.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://2.bp.blogspot.com/-sNAVDp_DSAU/URz8z7xccGI/AAAAAAAAAkc/bHtcSWqUoeM/s320/HackathonPresentation+-+23.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
We run the same process over this &lt;i&gt;Stigmus&lt;/i&gt; image.&amp;nbsp; We again find labels pretty well.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-l_pdD96TFsM/URz80AcP82I/AAAAAAAAAkk/tFA40m10etA/s1600/HackathonPresentation+-+24.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://4.bp.blogspot.com/-l_pdD96TFsM/URz80AcP82I/AAAAAAAAAkk/tFA40m10etA/s320/HackathonPresentation+-+24.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
When we crop them out, the autocrop pulls them out into these three images. &lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-UiD8Y27Q3as/URz803t0zeI/AAAAAAAAAk8/8CmOkV0RBBI/s1600/HackathonPresentation+-+26.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://1.bp.blogspot.com/-UiD8Y27Q3as/URz803t0zeI/AAAAAAAAAk8/8CmOkV0RBBI/s320/HackathonPresentation+-+26.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Running those images through OCR, we get a comparison with the original, which had a whole lot of gibberish.&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-0vfGrUAgn7o/URz800_XRtI/AAAAAAAAAk0/iImjQJdtAyY/s1600/HackathonPresentation+-+27.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://2.bp.blogspot.com/-0vfGrUAgn7o/URz800_XRtI/AAAAAAAAAk0/iImjQJdtAyY/s320/HackathonPresentation+-+27.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
The original did a decent job with the specimen number, but the autocrop version does as well.&amp;nbsp; In particular, for this location [field], the autocrop version is nearly perfect, whereas the original is just a mess.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-U-hxXcbPu9A/URz80-17tUI/AAAAAAAAAk4/CRzshzUn7XM/s1600/HackathonPresentation+-+28.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://4.bp.blogspot.com/-U-hxXcbPu9A/URz80-17tUI/AAAAAAAAAk4/CRzshzUn7XM/s320/HackathonPresentation+-+28.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
My conclusion is that we can extract labels fairly effectively by first doing a naive pass of OCR and looking at the results of that, and that the results of OCR over the cropped images are less horrible than those of running OCR over the raw images -- though still not great.&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-UsyyG_9wOd4/URz81aQX8aI/AAAAAAAAAlA/QyEpOtalwJU/s1600/HackathonPresentation+-+29.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="239" src="http://2.bp.blogspot.com/-UsyyG_9wOd4/URz81aQX8aI/AAAAAAAAAlA/QyEpOtalwJU/s320/HackathonPresentation+-+29.jpg" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;i&gt;[2013-02-15 update: See &lt;a href="http://manuscripttranscription.blogspot.com/2013/02/results-of-ocrocrop-approach-to.html"&gt;the results of this approach&lt;/a&gt; and &lt;a href="http://manuscripttranscription.blogspot.com/2013/02/idigbio-augmenting-ocr-hackathon.html"&gt;my write-up&lt;/a&gt; of the iDigBio Augmenting OCR Hackathon itself.]&lt;/i&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/lrLA7GEV2u4" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2013/02/improving-ocr-inputs-from-ocr-outputs.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/-Sk1_ud5Sm5k/URz8vupvm8I/AAAAAAAAAi8/-QvKR7DSmlw/s72-c/HackathonPresentation+-+01.jpg" height="72" width="72" /><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-5248077440103758647</guid><pubDate>Sat, 10 Nov 2012 14:57:00 +0000</pubDate><atom:updated>2013-02-26T19:37:12.672-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">presentations</category><category domain="http://www.blogger.com/atom/ns#">similar projects</category><category domain="http://www.blogger.com/atom/ns#">tei</category><title>What does it mean to "support TEI" for manuscript transcription?</title><description>This is a transcript of my talk at the &lt;a href="http://idhmc.tamu.edu/teiconference/"&gt;2012 TEI meeting&lt;/a&gt; at Texas A&amp;amp;M University, "What does it mean to 'support TEI' for manuscript transcription: a tool-maker's perspective."
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-O2VRdNayzDg/UJ5OwjBvK7I/AAAAAAAAAco/SCp3phizT-0/s1600/Slide1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-O2VRdNayzDg/UJ5OwjBvK7I/AAAAAAAAAco/SCp3phizT-0/s320/Slide1.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
You can download &lt;a href="http://dev.aspengrovefarm.com/benwbrum/talks/brumfield_tei_2012.mp3"&gt;an MP3 recording of the talk here.&lt;/a&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-omvgJTU41yQ/UJ5PGJ73teI/AAAAAAAAAfY/qLS7btk28m4/s1600/Slide3.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-omvgJTU41yQ/UJ5PGJ73teI/AAAAAAAAAfY/qLS7btk28m4/s320/Slide3.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Let's get started with a couple of definitions.&amp;nbsp; All the tools and the sites that I'm reviewing are cloud based, which means that I'm ruling out--perhaps arbitrarily--any projects that involve people doing offline edition and then publishing that on the web.&amp;nbsp; I'm only talking about online-based tools.&lt;br /&gt;
&lt;br /&gt;
So that's a very strict definition of clouds, and I'm going to have a very loose and squishy definition of crowds, in which I'm talking about any sort of tool that allows collaborative editing of manuscript material, and not just ones that are directed at amateurs.&amp;nbsp; That's important for a couple of reasons: one, because it gave me a sample size that was large enough to find out how people are using TEI, but--for another reason--because "amateurs" aren't really amateurs.&amp;nbsp; What we see with crowdsourcing projects is that amateurs become experts very quickly.&amp;nbsp; And given that your average user of any citizen science or historical crowdsourcing project is a woman over 50 who has at least a Master's degree, this isn't sort of the unwashed masses.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-1nydaD2u4cQ/UJ5POUiG9iI/AAAAAAAAAgw/7BbLxJM35CM/s1600/Slide4.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-1nydaD2u4cQ/UJ5POUiG9iI/AAAAAAAAAgw/7BbLxJM35CM/s320/Slide4.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Okay, so crowdsourced transcription has been going on for a while, and it's been happening in four different traditions that all developed this independently.&amp;nbsp; You have genealogists who are doing this, primarily with things like census records.&amp;nbsp; The 1940 census is the most prominent example: they have volunteers transcribing as many as &lt;a href="https://familysearch.org/blog/en/familysearch-indexers-leave-legacy-record-setting-event/"&gt;ten million records a day&lt;/a&gt;.&amp;nbsp; The natural sciences are doing something similar, particularly GalaxyZoo; the OldWeather people are looking at climate change data, where you have to look at old, handwritten records to figure out how the climate has changed, because you need to know how the climate used to be.&amp;nbsp; Then there are also some projects going on in the Open Source/Creative Commons world: the Wikisource people, particularly the German language Wikisource community.&amp;nbsp; And libraries, archives, and museums have jumped into this recently.&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-Np8kPvPzxx8/UJ5PVYopzQI/AAAAAAAAAhg/EPLzdB8npCc/s1600/Slide5.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-Np8kPvPzxx8/UJ5PVYopzQI/AAAAAAAAAhg/EPLzdB8npCc/s320/Slide5.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So here are a couple of examples from the citizen science world.&amp;nbsp; &lt;a href="http://oldweather.org/"&gt;OldWeather&lt;/a&gt; has a tool that allows people to record ship log book entries and weather observations.&amp;nbsp; As you can see, this is all field based -- this isn't quite an attempt to represent a document.&amp;nbsp; We'll get back to this in a minute.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-tas8CAeznME/UJ5PW4lJQkI/AAAAAAAAAho/xoAxO3KlJ5c/s1600/Slide6.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-tas8CAeznME/UJ5PW4lJQkI/AAAAAAAAAho/xoAxO3KlJ5c/s320/Slide6.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
The &lt;a href="http://www.pwrc.usgs.gov/bpp/index.cfm"&gt;North American Bird Phenology Program&lt;/a&gt; is transcribing old bird[-watching] observation cards from about a hundred years ago.&amp;nbsp; They're recording species names and all sorts of other things about this particular Grosbeak in 1938.&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-Ddv-PdJjbkg/UJ5PXr94baI/AAAAAAAAAhw/MdLSo4F6sx8/s1600/Slide7.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-Ddv-PdJjbkg/UJ5PXr94baI/AAAAAAAAAhw/MdLSo4F6sx8/s320/Slide7.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
All of these--and this is the majority of the crowdsourced transcription that's happening out there, millions of records--are record-based.&amp;nbsp; These are not document-based, they aren't page-based.&amp;nbsp; They're dealing with data that is fundamentally tabular -- those are their inputs.&amp;nbsp; Their outputs are databases that they want to be able to either search or analyze.&amp;nbsp; So we're producing nothing that anyone would ever want to print out.&lt;br /&gt;
&lt;br /&gt;
And another interesting thing about this is that these record-based transcription projects--the uses are understood in advance.&amp;nbsp; If you're building a genealogy index, you know that people are going to want to search for names and be able to see the results.&amp;nbsp; And that's &lt;b&gt;it&lt;/b&gt; -- you're not building something that allows someone to go off and do some other kind of analysis.&lt;br /&gt;
&lt;br /&gt;
Now what kind of mark-up are these record-based transcription projects using?&amp;nbsp; Well, it's kind of idiosyncratic, at best.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-8nI2K0aPjTE/UJ5PZgr0KXI/AAAAAAAAAh4/PpkViKamJCM/s1600/Slide8.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-8nI2K0aPjTE/UJ5PZgr0KXI/AAAAAAAAAh4/PpkViKamJCM/s320/Slide8.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Here's an example from my client FreeREG.&amp;nbsp; This is &lt;a href="http://www.freereg.org.uk/howto/transcribe.htm"&gt;a mark-up language&lt;/a&gt; that they developed about ten years ago for indicating unclear readings of manuscripts.&amp;nbsp; It's actually fairly sophisticated--it's based on the regular expression programming sub-language--but it's not anything that's informed by the TEI world.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-RJahUKLldFc/UJ5PbSP-JlI/AAAAAAAAAiA/yFJ7Xe8-jto/s1600/Slide9.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-RJahUKLldFc/UJ5PbSP-JlI/AAAAAAAAAiA/yFJ7Xe8-jto/s320/Slide9.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
On the other hand, here is the mark-up that the &lt;a href="http://menus.nypl.org/"&gt;New York Public Library&lt;/a&gt; is using.&amp;nbsp; Let me read this out to you: "Please type the text of the indicated dish exactly as it appears.&amp;nbsp; &lt;b&gt;Don't worry about accents.&lt;/b&gt;"&amp;nbsp; This is almost an anti-markup.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-k6Bv2IDIzLA/UJ5OxYXaqHI/AAAAAAAAAcw/Rl-eQ86cGcw/s1600/Slide10.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-k6Bv2IDIzLA/UJ5OxYXaqHI/AAAAAAAAAcw/Rl-eQ86cGcw/s320/Slide10.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So what about free-form transcription?&amp;nbsp; There's a lot of development of people doing free-form transcription.&amp;nbsp; You have Scripto out of CHNM.&amp;nbsp; You have a couple of different (perhaps competing) NARA initiatives.&amp;nbsp; Wikisource.&amp;nbsp; There's my own FromThePage.&amp;nbsp; What kind of mark-up are they doing?&amp;nbsp; Well, for the most part, none!&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-tOb-J4giltY/UJ5OydtNq4I/AAAAAAAAAc4/LpNUSx2TAm4/s1600/Slide11.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-tOb-J4giltY/UJ5OydtNq4I/AAAAAAAAAc4/LpNUSx2TAm4/s320/Slide11.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Here's Scripto--the &lt;a href="http://wardepartmentpapers.org/"&gt;Papers of the War Department&lt;/a&gt;-- and you type what you see, and that's what you get.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-T-CUlLSpUAs/UJ5O03iRxAI/AAAAAAAAAdA/xtGKcYLY_a8/s1600/Slide12.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-T-CUlLSpUAs/UJ5O03iRxAI/AAAAAAAAAdA/xtGKcYLY_a8/s320/Slide12.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Here is the &lt;a href="http://fr.wikisource.org/wiki/Proc%C3%A8s_verbal_de_visite_pastorale_de_Gatti%C3%A8res"&gt;French-language Wikisource&lt;/a&gt;, hosting materials from the Archives departmentales du Cantal&amp;nbsp; (who are &lt;a href="http://archives.cantal.fr/?id=actualites"&gt;doing some very cool things&lt;/a&gt; here).&amp;nbsp; But this is just typing things into a wiki and not even internally using wiki links.&amp;nbsp; This is almost pre-formed text -- it's pretty much plaintext. &amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-4_eOu9FLc8g/UJ5O2gxGRJI/AAAAAAAAAdI/aNDDsanQW_U/s1600/Slide13.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-4_eOu9FLc8g/UJ5O2gxGRJI/AAAAAAAAAdI/aNDDsanQW_U/s320/Slide13.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
My own project, &lt;a href="http://beta.fromthepage.com/display/display_page?ol=w_rw_p_pl&amp;amp;page_id=756"&gt;FromThePage&lt;/a&gt;. &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-bfPxZ9ES_II/UJ5O38TeWzI/AAAAAAAAAdQ/zr8QPU9JXFk/s1600/Slide14.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-bfPxZ9ES_II/UJ5O38TeWzI/AAAAAAAAAdQ/zr8QPU9JXFk/s320/Slide14.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
I'm internally using wiki-links, but really only for creating indexes and annotations, not for indicating...any of the power that you have with TEI.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-7vOt8th_GHw/UJ5O4tft4HI/AAAAAAAAAdY/nbqhWP_vtOA/s1600/Slide15.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-7vOt8th_GHw/UJ5O4tft4HI/AAAAAAAAAdY/nbqhWP_vtOA/s320/Slide15.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So if no one is using TEI, why is TEI important?&amp;nbsp; I think that TEI is important because &lt;b&gt;crowdsourced transcription projects are how the public is interacting with edition&lt;/b&gt;.&amp;nbsp; This is how people are learning what editing is, what the editing process is, and why and whether it's important.&amp;nbsp; And they're using tools that are developed by people like me.&amp;nbsp; Now how do people like me learn about edition?&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-TcaAmJ_ZF_s/UJ5O6b9We7I/AAAAAAAAAdg/ucIb1nE4IhI/s1600/Slide16.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-TcaAmJ_ZF_s/UJ5O6b9We7I/AAAAAAAAAdg/ucIb1nE4IhI/s320/Slide16.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
The answer is, by reading the TEI Guidelines.&amp;nbsp; The TEI Guidelines have an impact that goes far beyond people who are actually implementing TEI.&amp;nbsp; I started work on FromThePage in complete isolation in 2005.&amp;nbsp; By 2007, I was reading the TEI Guidelines.&amp;nbsp; I wasn't implementing TEI, but the questions that were asked--these notions of "here's how you expand abbreviations", "here's how you regularize things"--had a tremendous impact on me.&amp;nbsp; By contrast, the Guide to Documentary Editing--which is a wonderful book!--I only found out about in January of this year. &lt;br /&gt;
&lt;br /&gt;
TEI is online, it's concise, it's available.&amp;nbsp; And when I talk to people in the genealogy development world, they know about TEI. They've heard of it.&amp;nbsp; They have opinions.&amp;nbsp; They're not using it, but -- you people are making an impact on how the world does edition!&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-tMYP27OQLAM/UJ5O7NbEObI/AAAAAAAAAdo/Siu3iHrfmAI/s1600/Slide17.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-tMYP27OQLAM/UJ5O7NbEObI/AAAAAAAAAdo/Siu3iHrfmAI/s320/Slide17.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Okay, so if all of these people aren't using TEI, who &lt;b&gt;is &lt;/b&gt;doing it?&lt;br /&gt;
&lt;br /&gt;
I run a &lt;a href="http://tinyurl.com/TranscriptionToolGDoc"&gt;transcription tool directory&lt;/a&gt; that is itself crowdsourced.&amp;nbsp; It's been edited by 23 different people who've entered information about 27 different tools. Of those 27 tools, 7 are marked as "supporting TEI".&amp;nbsp; There's a little column, "does it support TEI?", seven of them say "Yes".&lt;br /&gt;
&lt;br /&gt;
Actually, that's not true.&amp;nbsp; Some of them say "yes", but some of those seven say "well, sort of".&amp;nbsp; So what does that mean?&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-XYF1GGS2MT0/UJ5O7-eKSaI/AAAAAAAAAdw/fgKDXR3S0G0/s1600/Slide18.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-XYF1GGS2MT0/UJ5O7-eKSaI/AAAAAAAAAdw/fgKDXR3S0G0/s320/Slide18.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
To find that out, I interviewed five of those seven projects.&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.transcribe-bentham.da.ulcc.ac.uk/td/Transcribe_Bentham"&gt;Transcribe Bentham&lt;/a&gt;.&amp;nbsp;&amp;nbsp;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://t-pen.org/TPEN/"&gt;T-PEN&lt;/a&gt; (which there's a poster session about tonight), which is a line-based system for medieval manuscripts.&amp;nbsp;&amp;nbsp;&lt;/li&gt;
&lt;li&gt;A customization of T-PEN, the &lt;a href="http://ccl.rch.uky.edu/"&gt;Carolingian Canon Law&lt;/a&gt; project, out of the University of Kentucky.&amp;nbsp;&amp;nbsp;&lt;/li&gt;
&lt;li&gt;Our own Hugh Cayless for the &lt;a href="http://papyri.info/"&gt;Papyrological Editor&lt;/a&gt;, which is dealing with papyri.&amp;nbsp;&amp;nbsp;&lt;/li&gt;
&lt;li&gt;And then &lt;a href="http://www.mom-wiki.uni-koeln.de/"&gt;MOM-CA&lt;/a&gt; is one of these "sort of"s.&amp;nbsp; You have two implementations of it.&amp;nbsp;&amp;nbsp;&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;One of them is the &lt;a href="http://www.vdu.uni-koeln.de/vdu/home"&gt;Virtuelles deutsches Urkundennetzwerk&lt;/a&gt;, which is a German charter collection.&amp;nbsp; It supports "TEI, sort-of" -- actually it supports CEI and EAD.&amp;nbsp;&amp;nbsp;&lt;/li&gt;
&lt;li&gt;But it's been customized for extensive TEI support for the &lt;a href="http://itineranova.be/in/home"&gt;Itinera Nova&lt;/a&gt; project which is out of the archive of Leuven, Belgium. &amp;nbsp; &amp;nbsp;&amp;nbsp; &lt;/li&gt;
&lt;/ul&gt;
&lt;/ul&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-r7sR-FHyT08/UJ5O8YRJHUI/AAAAAAAAAd4/FQmqqVY0MCc/s1600/Slide19.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-r7sR-FHyT08/UJ5O8YRJHUI/AAAAAAAAAd4/FQmqqVY0MCc/s320/Slide19.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
I'm going to talk about what I found out, but I'm going to emphasize Transcribe Bentham.&amp;nbsp; Not because it's better than the other tools, but because they actually ran their transcription project as an experiment.&amp;nbsp; They wanted to know, &lt;i&gt;can the public do TEI? Can the public handle it?&lt;/i&gt;&amp;nbsp; And they've &lt;a href="http://www.digitalhumanities.org/dhq/vol/6/2/000125/000125.html"&gt;published their results&lt;/a&gt;: they've conducted user surveys of &lt;i&gt;what was your experience using TEI?&lt;/i&gt;&amp;nbsp; Which makes it particularly useful for those of us who are trying to figure out how it's being used.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-YRbWmpy1Oes/UJ5O9tAwD3I/AAAAAAAAAeI/fJjyUyS2Mfs/s1600/Slide20.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-YRbWmpy1Oes/UJ5O9tAwD3I/AAAAAAAAAeI/fJjyUyS2Mfs/s320/Slide20.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Okay, so there's a lot of variation among these projects.&amp;nbsp; You've got a varied commitment to TEI.&amp;nbsp; Transcribe Bentham: &lt;i&gt;Yes, we're going to use TEI!&lt;/i&gt;&amp;nbsp; You see Melissa Terras here saying that &lt;i&gt;"it was untenable" that we'd ask for anything else.&amp;nbsp; These people know how to do it; why would we depart from that?&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
For T-PEN, James Ginther says: &lt;i&gt;Hey, I'm kind of skeptical.&amp;nbsp; We'll support any XSD you want to upload, if it happens to be TEI, that's okay.&lt;/i&gt;&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-kZwfUDgsqhs/UJ5O-V6q8tI/AAAAAAAAAeQ/RQEYFqTD2MM/s1600/Slide21.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-kZwfUDgsqhs/UJ5O-V6q8tI/AAAAAAAAAeQ/RQEYFqTD2MM/s320/Slide21.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Abigail Firey, who's using T-PEN, basically says: look, &lt;i&gt;it's probably necessary.&amp;nbsp; It's very useful.&amp;nbsp; It lets us develop these valuable intellectual perspectives on our text.&lt;/i&gt;&amp;nbsp; And she considered it &lt;i&gt;important that their text encoding was done within the community of practice&lt;/i&gt; represented by the people in this room.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-oskJ_tjKLbc/UJ5O-1c71EI/AAAAAAAAAeY/ayHL7zrpTOI/s1600/Slide22.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-oskJ_tjKLbc/UJ5O-1c71EI/AAAAAAAAAeY/ayHL7zrpTOI/s320/Slide22.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Okay, so more variation between these.&amp;nbsp; Where's the TEI located within these projects?&amp;nbsp; Where does it live?&amp;nbsp; I'm a developer; I'm interested in the application stack.&lt;br /&gt;
&lt;br /&gt;
It turns out that there's no agreement at all.&amp;nbsp; Transcribe Bentham has people entering TEI in person.&amp;nbsp; And then it's storing it off in a MediaWiki, using MediaWiki versioning, not actually putting [...] pages in one big TEI document.&lt;br /&gt;
&lt;br /&gt;
On the other hand, Itinera Nova is actually storing everything in an XRX-based XML database.&amp;nbsp; I mean, it is pure TEI on the back end.&amp;nbsp; But none of the volunteers using Itinera Nova actually are typing any angle brackets.&amp;nbsp; So we have a lot of variation here.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-Q4jOo1k_OjU/UJ5O_Q280cI/AAAAAAAAAeg/0AmC2ZjOGYc/s1600/Slide23.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-Q4jOo1k_OjU/UJ5O_Q280cI/AAAAAAAAAeg/0AmC2ZjOGYc/s320/Slide23.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
However, there was no variation when I asked people about encoding.&amp;nbsp; There is a perfectly common perception that is: &lt;i&gt;Encoding is hard!&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
And there are these great responses--that you can see both on the Transcribe Bentham blog and in their DHQuarterly paper that just came out, which I highly recommend--describing it as "too much markup", "unnecessarily complicated", "a hopeless nightmare", and &lt;i&gt;the entire transcription process is "a horror."&lt;/i&gt; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-Cm-Q-2jWmpk/UJ5PACJGTZI/AAAAAAAAAeo/zdP_Eb94MO0/s1600/Slide24.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-Cm-Q-2jWmpk/UJ5PACJGTZI/AAAAAAAAAeo/zdP_Eb94MO0/s320/Slide24.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
But, &lt;b&gt;lots of things are hard.&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
In my own experience with FromThePage, I have one user who has transcribed one thousand pages, but she does not like using any mark-up at all.&amp;nbsp; She's contributing!&amp;nbsp; She's contributing plaintext transcriptions, but I'm going back to add wikilinks.&amp;nbsp; So it's not about the angle brackets.&amp;nbsp; (Maybe square brackets have a problem too, I don't know.)&lt;br /&gt;
&lt;br /&gt;
And fundamentally, transcribing--reading old manuscripts--&lt;b&gt;is hard.&lt;/b&gt;&amp;nbsp; "Deciphering Bentham's hand took longer than encoding," for over half of the Bentham respondents.&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-wZWeulLZMrk/UJ5PAeXUt2I/AAAAAAAAAew/7t1ANw4h44A/s1600/Slide25.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-wZWeulLZMrk/UJ5PAeXUt2I/AAAAAAAAAew/7t1ANw4h44A/s320/Slide25.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So there's more commonality: everyone wants to make encoding easier.&amp;nbsp; How do we do that?&amp;nbsp; There's a couple of different approaches.&amp;nbsp; One approach--the most common approach--is using different kinds of buttons and menus to automate the insertion of tags.&amp;nbsp; Which gets around (primarily) the need for people to memorize tag names and attributes, and--God help us--close tags.&lt;br /&gt;
&lt;br /&gt;
So these are implemented--we've got buttons on T-PEN and CCL.&amp;nbsp; We've got buttons on the TEI Toolbar.&amp;nbsp; We've got menus on VdU and the Papyrological Editor.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-gXPNZyLZZ6U/UJ5PCQjMoGI/AAAAAAAAAe4/X0tSieF8icA/s1600/Slide26.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-gXPNZyLZZ6U/UJ5PCQjMoGI/AAAAAAAAAe4/X0tSieF8icA/s320/Slide26.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
And you can see them.&amp;nbsp; Here's a screenshot of Jeremy Bentham.&amp;nbsp; A couple of interesting things about this: it's very small, but we've got a toolbar at the top.&amp;nbsp; We've got TEI text: angle-bracket D.E.L.&amp;nbsp; Angle-bracket, slash, D.E.L.&amp;nbsp; So we're actually exposing the TEI to users in Transcribe Bentham, though we're providing them with some buttons.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-04n_WJ9H4Gw/UJ5PDzc_dqI/AAAAAAAAAfA/AFJ4q3c0r0Q/s1600/Slide27.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-04n_WJ9H4Gw/UJ5PDzc_dqI/AAAAAAAAAfA/AFJ4q3c0r0Q/s320/Slide27.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Those buttons represent a subset--I'll get to the selection of those tags later.&amp;nbsp; Here's a more detailed description of what they do.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-CKc1Q5p9bu8/UJ5PE4dL5DI/AAAAAAAAAfI/guF7517DdCI/s1600/Slide28.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-CKc1Q5p9bu8/UJ5PE4dL5DI/AAAAAAAAAfI/guF7517DdCI/s320/Slide28.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Here's what's going on with VdU.&amp;nbsp; Only in this case, they're not actually exposing the angle brackets to the user. They're replacing all of these in a pseudo-WYSIWYG that allows people to choose from a menu and select text that then gets tagged.&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-qXHkqKRtKtE/UJ5PFYMEonI/AAAAAAAAAfQ/ZsdRBX07R2Y/s1600/Slide29.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-qXHkqKRtKtE/UJ5PFYMEonI/AAAAAAAAAfQ/ZsdRBX07R2Y/s320/Slide29.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Okay -- limitations of the buttons.&amp;nbsp; There's a good limitation, which is that as users become more comfortable with TEI, they outgrow buttons.&amp;nbsp; And this is something that the people at Transcribe Bentham reported to me.&amp;nbsp; They're seeing a fair number of people just skip the buttons altogether and type angle brackets.&amp;nbsp; Remember: these are members of the public who have never met any of the Transcribe Bentham people.&lt;br /&gt;
&lt;br /&gt;
On the down side, users also ignore the buttons.&amp;nbsp; Again users ignoring encoding, but in this case we've got something that's a little bit worse.&amp;nbsp; Georg Vogeler is reporting something very interesting, which is that in a lot of cases, they were seeing users who were using print apparatus for doing this kind of work, and just ignoring the buttons -- going around them.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-Fs0igceVZNE/UJ5PGrPL1BI/AAAAAAAAAfg/w_KgMfKmncw/s1600/Slide30.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-Fs0igceVZNE/UJ5PGrPL1BI/AAAAAAAAAfg/w_KgMfKmncw/s320/Slide30.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So the problem with using print-style notations.&amp;nbsp; People are dealing with these print editions [notations] -- this can be a problem or it can be an opportunity.&amp;nbsp; Papyri.info is viewing it that way.&amp;nbsp; Itinera Nova is using it that way.&amp;nbsp; &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-DitUMLwVSas/UJ5PHkTAmtI/AAAAAAAAAfo/CWi46hBlu4c/s1600/Slide31.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-DitUMLwVSas/UJ5PHkTAmtI/AAAAAAAAAfo/CWi46hBlu4c/s320/Slide31.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Papyri.info, their front-end interface for most users is Leiden+, which is a standard for marking up papyri.&amp;nbsp; And, as you can see, users enter text in Leiden+, and that generates TEI.&amp;nbsp; (EpiDoc TEI, I believe.)&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-rr2z16FcALY/UJ5PIRJ7JAI/AAAAAAAAAfw/9C8JMjQqFi4/s1600/Slide32.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-rr2z16FcALY/UJ5PIRJ7JAI/AAAAAAAAAfw/9C8JMjQqFi4/s320/Slide32.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
This is the same kind of process that's done in Itinera Nova.&amp;nbsp; In that case, they're using for notation whatever it is that the Leuven archives uses for their mark-up. And they're doing the same kind of transposition [ed: translation] of replacing their notation with TEI tags before they save it.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-_qmYhlEHpl8/UJ5PKJud4_I/AAAAAAAAAf4/bSfQ4BIMnsw/s1600/Slide33.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-_qmYhlEHpl8/UJ5PKJud4_I/AAAAAAAAAf4/bSfQ4BIMnsw/s320/Slide33.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
And this is actually what users see as they're typing. They don't see the TEI tags -- we're hiding the angle brackets from them.&lt;br /&gt;
&lt;br /&gt;
So this is an alternative to buttons.&amp;nbsp; And in my opinion, it's not that bad an alternative.&amp;nbsp; &lt;br /&gt;
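A minimal sketch of that translation step, in Python: it assumes, purely for illustration, a notation in which struck-through text is wrapped in equals signs, and it is not Itinera Nova's or Papyri.info's actual code.&amp;nbsp; The function name notation_to_tei and the sample line are hypothetical; the output is built with the standard library's ElementTree, so the transcriber never types an angle bracket.&lt;br /&gt;

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical convention: text wrapped in equals signs was struck through.
STRUCK = re.compile(r"=(.+?)=")


def notation_to_tei(line: str) -> str:
    """Turn one line of (assumed) archive notation into a serialized TEI paragraph."""
    # With one capture group, split() alternates plain text and struck text.
    parts = STRUCK.split(line)
    paragraph = ET.Element("p")
    paragraph.text = parts[0]
    for i in range(1, len(parts), 2):
        deletion = ET.SubElement(paragraph, "del")  # TEI element for deleted text
        deletion.text = parts[i]
        deletion.tail = parts[i + 1]  # plain text that follows the deletion
    return ET.tostring(paragraph, encoding="unicode")


print(notation_to_tei("paid =ten= twelve shillings"))
```

A line with no equals signs comes back as a plain paragraph element with no deletions, so plaintext contributions still pass through.&lt;br /&gt;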
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-k3_eGKwPjyw/UJ5PK3DLftI/AAAAAAAAAgA/boiz__XYng8/s1600/Slide34.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-k3_eGKwPjyw/UJ5PK3DLftI/AAAAAAAAAgA/boiz__XYng8/s320/Slide34.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
This hasn't been a problem for the Bentham people, however.&amp;nbsp; It's a non-problem for them. And they are the most "crowdy", the most amateur-focused, and the most committed to a TEI interface.&lt;br /&gt;
&lt;br /&gt;
Tim Causer went through and reviewed all of this and said, you know, &lt;i&gt;it just doesn't happen.&lt;/i&gt;&amp;nbsp; People are not using any print notation at all.&amp;nbsp; They're using buttons.&amp;nbsp; They're using angle-brackets by hand.&amp;nbsp; They're not even using plaintext.&amp;nbsp; They're using TEI.&amp;nbsp; Their users are comfortable with TEI.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-bEejYsFJ4QI/UJ5PLUrrZ_I/AAAAAAAAAgI/lSxI8is9NNU/s1600/Slide35.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-bEejYsFJ4QI/UJ5PLUrrZ_I/AAAAAAAAAgI/lSxI8is9NNU/s320/Slide35.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So what accounts for the difference between the experience of the VdU and the Transcribe Bentham people?&amp;nbsp; I don't know.&amp;nbsp; I've got a couple of theories about what might be going on.&lt;br /&gt;
&lt;br /&gt;
One of them is really the corpus of texts we're working with.&amp;nbsp; If you're only dealing with papyrus fragments, and you're used to a well-established way of notating them--that's been around since 1935 in the case of Leiden+--well, it's kind of hard to break out of that.&amp;nbsp; On the other hand, there's not a single convention for print editions.&amp;nbsp; There's all sorts of ways of indicating additions and deletions for print editions of more modern texts.&amp;nbsp; So maybe it's a lack of a standard.&lt;br /&gt;
&lt;br /&gt;
Or, maybe it's who the users are.&amp;nbsp; Maybe scholars are stubborner, and amateurs are more tractable and don't have bad habits to break.&amp;nbsp; I don't know!&amp;nbsp; I don't know, but I'd be really interested in any other ideas.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-zaSReRRb98I/UJ5PMK4KfLI/AAAAAAAAAgQ/zsqqZ8dJVUk/s1600/Slide36.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-zaSReRRb98I/UJ5PMK4KfLI/AAAAAAAAAgQ/zsqqZ8dJVUk/s320/Slide36.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Okay, how do these projects choose the tags that they're dealing with?&amp;nbsp; We've got a very long quote, but I'm just going to read out a couple of little bits of them.&lt;br /&gt;
&lt;br /&gt;
Really, choosing a subset of tags is important.&amp;nbsp; Showing 67 buttons was not a good usability thing for T-PEN.&amp;nbsp; And in particular, what they ended up doing was getting rid of the larger, structural set of markup, and focusing just on sort of phrase-level markup.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-04n_WJ9H4Gw/UJ5PDzc_dqI/AAAAAAAAAfA/AFJ4q3c0r0Q/s1600/Slide27.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-04n_WJ9H4Gw/UJ5PDzc_dqI/AAAAAAAAAfA/AFJ4q3c0r0Q/s320/Slide27.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
This is also, I think, true if we go back a minute and look at Bentham.&amp;nbsp; Here, again, we're talking phrase-level tags.&amp;nbsp; We're not talking about anything beyond that.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-smy_qpbWRIo/UJ5PM1oVIQI/AAAAAAAAAgY/qbD6ivfPWLk/s1600/Slide37.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-smy_qpbWRIo/UJ5PM1oVIQI/AAAAAAAAAgY/qbD6ivfPWLk/s320/Slide37.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Justin Tonra said that it was actually really hard to pare down the number of tags for Transcribe Bentham.&amp;nbsp; He wanted to do more, but, you know, he's pleased with what they got.&amp;nbsp; They didn't want to "overcomplicate the user's job."&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-7fin9JfMwD4/UJ5PNS2f0nI/AAAAAAAAAgg/tFA3SQ00i0o/s1600/Slide38.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-7fin9JfMwD4/UJ5PNS2f0nI/AAAAAAAAAgg/tFA3SQ00i0o/s320/Slide38.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Richard Davis, also with Transcribe Bentham, had a great deal of experience dealing with editors for EAD and other XML.&amp;nbsp; And he said you're always dealing with this balance between usability and flexibility, and there's just not much way of getting around it.&amp;nbsp; It's going to be a compromise, no matter what.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-kSzVEDh1vWs/UJ5PN8fbY-I/AAAAAAAAAgo/3jSF-xdC1s8/s1600/Slide39.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-kSzVEDh1vWs/UJ5PN8fbY-I/AAAAAAAAAgo/3jSF-xdC1s8/s320/Slide39.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So what's the future for these projects that are using TEI for crowds?&amp;nbsp; Well, if getting people up to speed is hard, and if nobody reads the help--as Valerie Wallace at one time said about their absolutely intimidating help page for Transcribe Bentham (you should look at it -- it's amazing!)--then what are the alternatives for getting people up to speed?&lt;br /&gt;
&lt;br /&gt;
Georg Vogeler says that they are trying to come up with a way of teaching people how to use the tool and how to use the markup in almost a game-like scenario.&amp;nbsp; We're not talking about the kind of Whac-A-Mole things that we sometimes see, but really just sort of leading people through &lt;i&gt;Let's try this. Now let's try this. Now let's try this. Okay now you know how to deal with this [tool].&amp;nbsp; &lt;/i&gt;It's something that I think we're actually pretty familiar with from any other kinds of projects dealing with historic handwriting: people have to come up to speed.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-9B1O0cZOCeM/UJ5PO636smI/AAAAAAAAAg4/yfI8y1Mi2lI/s1600/Slide40.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-9B1O0cZOCeM/UJ5PO636smI/AAAAAAAAAg4/yfI8y1Mi2lI/s320/Slide40.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Another possibility is a WYSIWYG.&amp;nbsp; Tim Causer announced the idea of spending their new Mellon grant on building a WYSIWYG for Transcribe Bentham's TEI.&amp;nbsp; The &lt;a href="http://blogs.ucl.ac.uk/transcribe-bentham/2012/10/01/help-improve-transcribe-bentham/"&gt;blog entry&lt;/a&gt; is fascinating because he gets about seven user comments, some of which express a whole lot of skepticism that a WYSIWYG is going to be able to handle nested tagging in particular.&amp;nbsp; Other ones of which make comments about the whole XML system and its usability in vivid prose, which is very worth reading.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-wQSkfUmiMRk/UJ5PQ4iC73I/AAAAAAAAAhA/5SeUMBJs37I/s1600/Slide41.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-wQSkfUmiMRk/UJ5PQ4iC73I/AAAAAAAAAhA/5SeUMBJs37I/s320/Slide41.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
And maybe combinations of these.&amp;nbsp; So we have these intermediate notations -- Itinera Nova, for example, they're using this &lt;i&gt;let's begin a strike-through with an equals sign&lt;/i&gt; (which is apparently what they've been using at that archive for a while).&amp;nbsp; And the minute you type that equals sign in, you actually get a WYSIWYG strike-through that runs all the way through your transcript.&lt;br /&gt;
&lt;br /&gt;
That may be the future.&amp;nbsp; We'll see.&amp;nbsp; I think that we have a lot of room for exploring different ways for handling this.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-5utC13IwKbg/UJ5PRoi1HPI/AAAAAAAAAhI/DdEsIyBBqq0/s1600/Slide42.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-5utC13IwKbg/UJ5PRoi1HPI/AAAAAAAAAhI/DdEsIyBBqq0/s320/Slide42.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So let me wrap up and thank my interviewees.&lt;br /&gt;
&lt;br /&gt;
Transcribe Bentham: Melissa Terras, Justin Tonra, Tim Causer, Richard Davis.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-F1mrJAzDGmU/UJ5PSAdar7I/AAAAAAAAAhQ/8Sw0Yxa95tk/s1600/Slide43.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-F1mrJAzDGmU/UJ5PSAdar7I/AAAAAAAAAhQ/8Sw0Yxa95tk/s320/Slide43.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
T-PEN: James Ginther, Abigail Firey&lt;br /&gt;
Papyri.info: Hugh Cayless, Tom Elliott&lt;br /&gt;
MOM-CA: Georg Vogeler and Jochen Graf&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-sg2LwzzWtks/UJ5PStvrBgI/AAAAAAAAAhY/BTiXRx9t1Jo/s1600/Slide44.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-sg2LwzzWtks/UJ5PStvrBgI/AAAAAAAAAhY/BTiXRx9t1Jo/s320/Slide44.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;h4&gt;
Questions&lt;/h4&gt;
&lt;i&gt;[All questions will be paraphrased in the transcript due to sound quality, and are not to be regarded as direct quotations without verification via the audio.]&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Syd Bauman&lt;/b&gt;: &lt;i&gt;Of the systems which allow users to type tags free-hand, what percentage come out well-formed?&amp;nbsp;&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Me&lt;/b&gt;: The only one that presents free-hand [tagging] is Transcribe Bentham.  Tim [Causer] gets well-formed XML for most everything he gets.  There is no validation being performed by that wiki, but what he's getting is pretty good.  He says that the biggest challenge when he's post-processing documents is closing tags and mis-placed nesting.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Syd Bauman&lt;/b&gt;: &lt;i&gt;I'd be curious about the exact percentages.&lt;/i&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Me&lt;/b&gt;: Right.  I'd have to go back and look at my interview.  He said that it represents a pretty small percentage, like single digits of the submissions they get.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;John Unsworth&lt;/b&gt;: &lt;i&gt;Do any of the systems use keyboard short-cuts?&lt;/i&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Me&lt;/b&gt;: I know of none that use hot-keys.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;John Unsworth&lt;/b&gt;: &lt;i&gt;Do you think that would be more or less desirable than the systems you've described?&lt;/i&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Me&lt;/b&gt;: I really only see hot-keys as being desirable for projects that are using more recent and clearer documents.  Speed of data-entry from the keyboard perspective doesn't help much when you're having to stare and zoom and scroll on a document that is as dense and illegible as Bentham or Greek papyri.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Elena Pierazzo&lt;/b&gt; [very faint audio]: &lt;i&gt;In some cases it's hard to define which is the error: choosing the tags or reading the text.  I've been working with my students on Transcribe Bentham--they're all TEI-aware--and to be honest it was hard.  The difficulty was not the mark-up.  In a sense we do sometimes forget in these crowdsourcing projects, that the text itself is very hard, so probably adding a level of complexity to the task via the mark-up is very difficult.&lt;/i&gt;  
&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;I have all respect and sympathy for the people who stick to the ideal of doing TEI, which I commend entirely.  But in some cases, it may be that asking amateur people to do [the decipherment] and do the mark up is a pretty strong request, and makes a big assumption about what the people "out there" are capable of without formation.
&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Me&lt;/b&gt;: I'd agree with you.  However, there have been some studies on these users' ability to produce quality transcripts outside of the TEI world.... Old Weather did a great deal of research on that, and they found that individual users tended to submit correct transcripts 97% of the time.  They're doing blind triple-keying, so they're comparing people's transcripts against others.  [They found] that of 1000 different entries, typically on average 13 will be wrong.  Of those thirteen, three will be due to user error--so it does happen; I'm not saying people are perfect.  Three will be generally [ed: genuinely] illegible.  And the remaining seven will be due to the officer of the watch having written the wrong thing down and placing the ship in Afghanistan instead of in the Indian Ocean.  So there are errors everywhere.

[I mis-remembered the numbers here: actually it's 3 errors due to transcriber error, 10 genuinely illegible, and 3 due to error at time of inscription.]
&lt;br /&gt;
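The comparison step in blind triple-keying can be sketched roughly as follows.&amp;nbsp; This is not Old Weather's reconciliation code: it assumes a simple majority vote over whitespace- and case-normalized strings, and the function name reconcile and the sample entries are made up for illustration.&lt;br /&gt;

```python
from collections import Counter
from typing import Optional


def reconcile(entries: list[str]) -> Optional[str]:
    """Return the reading most transcribers agree on, or None without a majority."""
    if not entries:
        return None
    # Assumption: differences in case and spacing do not count as disagreement.
    normalized = [" ".join(entry.split()).lower() for entry in entries]
    reading, votes = Counter(normalized).most_common(1)[0]
    return reading if votes * 2 > len(entries) else None


# Three independent (made-up) keyings of the same log entry.
print(reconcile(["Lat 12.5 S", "lat 12.5 s", "Lat 12.6 S"]))
print(reconcile(["fog", "frog", "log"]))
```

When no reading wins a majority the function returns None, which a project might treat as a cue for human review.&lt;br /&gt;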
&lt;br /&gt;
&lt;b&gt;Lou Burnard&lt;/b&gt;: &lt;i&gt;The concept of error is a nuanced one.  I would like to counter-argue Elena's [point].  I think that one of the reasons that Bentham has been successful is precisely because it's difficult material.  Why do I think that?  Because if you are faced with something difficult, you need something powerful to express your understanding of it.  The problem with not using something as rich and semantically expressive as TEI when you're doing your transcription is that it doesn't exist!  All you can do is type in the words you think it might have been, and possibly put in some arbitrary code to say, "Well, I'm not sure about that."  Once you've mastered the semantics of the TEI markup--which doesn't actually take that long, if you're interested in it--now you can express yourself.  Now you can communicate in a [...] satisfactory way.  And I think that's why people like it.
&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Me&lt;/b&gt;: I have anecdotal, personal evidence to agree with you.  In my own system (that does not use TEI), I have had users who have transcribed several pages, and then they'd get to a table in some biologist's field notes, for example, and they stop.  And they say, "well, I don't know what to do here."  So they're done.
&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Lou Burnard&lt;/b&gt;: &lt;i&gt;The example you cite of the erroneous data in the source is a very good one, because if you've mastered TEI then you know how to express in markup: 'this is what it actually says but clearly he wasn't in Afghanistan.'  And that isn't the case in any other markup system I've ever heard of.&amp;nbsp;&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
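One way to encode that in TEI is a choice element pairing sic (what the source says) with corr (the editor's correction).&amp;nbsp; A minimal Python sketch that builds such a fragment with the standard library follows; the function name is hypothetical, and the corrected reading is only an example, not a claim about any real log entry.&lt;br /&gt;

```python
import xml.etree.ElementTree as ET


def sic_with_correction(as_written: str, correction: str) -> str:
    """Pair the source's reading (sic) with an editorial correction (corr)."""
    choice = ET.Element("choice")
    ET.SubElement(choice, "sic").text = as_written   # what the manuscript says
    ET.SubElement(choice, "corr").text = correction  # what the editor believes was meant
    return ET.tostring(choice, encoding="unicode")


print(sic_with_correction("Afghanistan", "Indian Ocean"))
```

Both readings survive in the transcript, so a later editor can still see what the officer of the watch actually wrote.&lt;br /&gt;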
&lt;i&gt;[I welcome corrections to my transcript or the contents of the talk itself at benwbrum@gmail.com or in the comments to this post.]&lt;/i&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/jQ_IaZkYN64" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2012/11/what-does-it-mean-to-support-tei-for.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/-O2VRdNayzDg/UJ5OwjBvK7I/AAAAAAAAAco/SCp3phizT-0/s72-c/Slide1.PNG" height="72" width="72" /><thr:total>5</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-5294053675706485674</guid><pubDate>Wed, 24 Oct 2012 15:10:00 +0000</pubDate><atom:updated>2013-02-26T19:37:45.902-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">similar projects</category><category domain="http://www.blogger.com/atom/ns#">interview</category><title>Interview with Ben Crowder on Unbindery</title><description>One of the pleasures of maintaining the &lt;a href="http://manuscripttranscription.blogspot.com/2012/04/crowdsourced-transcription-tool-list.html"&gt;crowdsourced transcription tool list&lt;/a&gt; is learning about systems I'd never heard about before.&amp;nbsp; One of these is &lt;a href="http://bencrowder.net/coding/unbindery"&gt;Unbindery&lt;/a&gt;, a tool being built by &lt;a href="http://bencrowder.net/"&gt;Ben Crowder&lt;/a&gt; for audio and manuscript transcription as well as OCR correction.&amp;nbsp; Ben was gracious enough to grant me an interview, even though he's concentrating on the &lt;a href="http://bencrowder.net/blog/category/unbindery/"&gt;final stretch of development work&lt;/a&gt; on Unbindery.&lt;i&gt; &lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-zqjlANzR6z0/UIqrtmify0I/AAAAAAAAAcQ/pPB75g1XWFM/s1600/unbindery.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="298" src="http://2.bp.blogspot.com/-zqjlANzR6z0/UIqrtmify0I/AAAAAAAAAcQ/pPB75g1XWFM/s400/unbindery.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;i&gt;First, let me wish you luck as you enter the final push on Unbindery.
What would you say is the most essential feature you have left to work
on?&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
Thanks! Probably private projects -- I've been looking forward to using Unbindery to transcribe my journals, but haven't wanted them to be open for just anyone to work on. I'm also very excited about chunking audio into small segments (I used to publish an online magazine where we primarily published interviews, and transcribing two hours of audio can be really daunting).&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;Tell us more about how Unbindery handles both audio transcription and
manuscript transcription.  Usually those tools are very different,
aren't they?&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
The audio transcription part started out as &lt;a href="http://bencrowder.net/coding/crosswrite"&gt;Crosswrite&lt;/a&gt;, a little proof-of-concept I threw together when I realized one day that JavaScript would let me control the playhead on an audio element, making it really easy to write a software version of a transcription foot pedal. I also wanted to start using Unbindery for family history purposes (transcribing audio interviews with my grandparents, mainly, and divvying up that workload among my siblings).&lt;br /&gt;
&lt;br /&gt;
So, to handle both audio transcription and page image transcription, Unbindery has a modular item type editor system. Each item type has its own set of code (HTML/CSS/JavaScript) that it loads when transcribing an item. For example, page images show an image and a text box, with some JavaScript to place a highlight line when you click on the image, whereas audio items replace the image with Crosswrite's audio element (and the keyboard controls for rewinding and fast forwarding the audio). It would be fairly trivial to add, say, an item type editor that lets the user mark up parts of the transcript with XML tags pulled from a database or web service somewhere. Or an editor for transcribing video. It's pretty flexible.&lt;br /&gt;
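A rough sketch of that kind of modular design, written in Python rather than the HTML/CSS/JavaScript that the item type editors are described as using.&amp;nbsp; The class, registry, and file names here are all hypothetical, not Unbindery's; the point is only that each item type maps to its own bundle of editor code, so adding a type means registering one more entry.&lt;br /&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemTypeEditor:
    """Front-end assets one item type loads (file names are hypothetical)."""
    template: str
    script: str


# Registry keyed by item type; each entry is that type's editor bundle.
EDITORS = {
    "page": ItemTypeEditor("page_editor.html", "highlight_line.js"),
    "audio": ItemTypeEditor("audio_editor.html", "playback_keys.js"),
}


def editor_for(item_type: str) -> ItemTypeEditor:
    """Look up the editor registered for an item type."""
    try:
        return EDITORS[item_type]
    except KeyError:
        raise ValueError(f"no editor registered for item type {item_type!r}") from None


# Supporting a new kind of item means registering one more editor.
EDITORS["video"] = ItemTypeEditor("video_editor.html", "video_keys.js")
print(editor_for("video").template)
```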
&lt;br /&gt;
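To make the idea concrete, here is a minimal sketch of how a modular item type editor registry might look. It is written in Python purely for illustration (Unbindery itself is PHP and JavaScript), and every name in it (ItemTypeEditor, register_editor, editor_for, and the asset file names) is a hypothetical stand-in rather than Unbindery's actual code.&lt;br /&gt;
```python
from dataclasses import dataclass, field


@dataclass
class ItemTypeEditor:
    """Assets one item type loads when an item is opened for transcription."""
    name: str
    template: str  # hypothetical HTML template file
    scripts: list = field(default_factory=list)
    styles: list = field(default_factory=list)


# Hypothetical registry: item type name -> editor definition.
EDITORS = {}


def register_editor(editor):
    """Add an editor so items of that type can find it."""
    EDITORS[editor.name] = editor


def editor_for(item):
    """Look up the editor for an item, based on its 'type' field."""
    try:
        return EDITORS[item["type"]]
    except KeyError:
        raise ValueError("no editor registered for item type: " + str(item.get("type")))


# Two editors mirroring the examples in the text (file names are assumptions).
register_editor(ItemTypeEditor("page", "page.html", ["highlight_line.js"], ["page.css"]))
register_editor(ItemTypeEditor("audio", "audio.html", ["playback_keys.js"], ["audio.css"]))

item = {"id": 7, "type": "audio"}
print(editor_for(item).scripts)
```
Under these assumptions, adding a video or XML-markup editor would only mean registering one more entry; nothing in the lookup code would need to change.&lt;br /&gt;
&lt;br /&gt;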
&lt;i&gt;How did you come up with the idea for Unbindery?&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
I had done some Project Gutenberg work back in 2002, and somewhere along the way I came across &lt;a href="http://pgdp.net/"&gt;Distributed Proofreaders&lt;/a&gt;, which basically does the same thing. A few years later, I'd recently gotten home from an LDS mission to Thailand and wanted to start a Thai branch of Project Gutenberg with one of my mission friends. He came up with the name Unbindery and I made some mockups, but nothing happened until 2010 when I launched my &lt;a href="http://mormontexts.org/"&gt;Mormon Texts Project&lt;/a&gt;. Manually sending batches of images and text for volunteers to proof was laborious at best, so I was motivated to finally write Unbindery. I threw together a prototype in a couple weeks and we've been using it for MTP ever since. I'm also nearing the end of a complete rewrite to make Unbindery more extensible and useful to other people. And because the original code was ugly and nasty and seriously embarrassing.&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;In my experience, the transcription tools that currently exist are
very much informed by the texts they were built to work with, with
some concentrating on OCR-correction, others on semantic indexing, and
others on mark-up of handwritten changes to the text.  How do you feel
like the Mormon Texts Project has shaped the features and focus of
Unbindery?&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
Mormon Texts Project has been entirely focused on correcting OCR for publication in nice, clean ebook editions, which is why we've gone with a plain old text box and not much more than that. (Especially considering that we were originally posting the books to Project Gutenberg, where our target output format was very plain text.)&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;What is your grand dream for Unbindery?  (Feel free to be sweeping
here and assume grateful, enthusiastic users and legions of cobbler's
elves to help with the code.)&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
To get men on Mars. No, really, I don't think my dreams for Unbindery are all that grand -- I'd be more than satisfied if it helps make transcription easier for users, whether working alone or in groups, and whether they're publishing ebooks or magazines or transcribing oral histories or journals or what have you.&lt;br /&gt;
&lt;br /&gt;
In an ideal world it would be wonderful if a small, dedicated group of coders were to adopt it and take care of it going forward. But I don't expect that. I'll get it to a state where I can publicly release it and people can use it, but other than bugfixes, I don't see myself doing much active development on Unbindery beyond that point. I know, I know, abandoning my project before it's even out the door makes me a horrible open source developer. But to be honest with you, I don't really even want to be an open source developer -- I'm far more interested in my other projects (like MTP) and I want to get back to doing those things. Unbindery is just a tool I needed, an itch I scratched because there wasn't anything out there that met my needs. People have expressed interest in using it so I'm putting it up on GitHub for free, but I don't see myself doing much with Unbindery after that. Sorry! This is the sad part of the interview.&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;What programming languages or technical frameworks do you work in?&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
Unbindery is PHP with JavaScript for the front end. I love JavaScript, but I'm only using PHP because of its ubiquity -- I'd much, much, much rather use Python. But it's a lot easier for people to get PHP apps running on cheap shared hosts, so there you have it.&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;It seems like you're putting a lot of effort into ease of deployment.
How do you see Unbindery being used?  Do you expect to offer hosting,
do you hope people install their own instances, or is there another
model you hope to follow?&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
I won't be offering hosting, so yes, I'm expecting people to install their own instances, and that's why I want it to be easy to install. (There may be some people who decide to offer hosting for it as well, and that's fine by me.)&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;How can people get involved with the project?&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
Coders: The code isn't quite ready for other people to hack on it yet, but it's getting a lot closer to that point. For now, coders can look at my &lt;a href="http://bencrowder.net/coding/unbindery"&gt;roadmap page&lt;/a&gt; to see what tasks need doing. (Also, it won't be long before I start adding issues to &lt;a href="http://github.com/bencrowder/unbindery/issues"&gt;GitHub&lt;/a&gt; so people can help squash bugs.)&lt;br /&gt;
&lt;br /&gt;
Other people: Once the core functionality is in place, just having people install it and test it would probably be the most helpful.&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;&lt;/i&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/Lb0_Jrua26g" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2012/10/interview-with-ben-crowder-on-unbindery.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/-zqjlANzR6z0/UIqrtmify0I/AAAAAAAAAcQ/pPB75g1XWFM/s72-c/unbindery.png" height="72" width="72" /><thr:total>1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-3262492348477104250</guid><pubDate>Thu, 18 Oct 2012 13:51:00 +0000</pubDate><atom:updated>2013-02-26T19:47:42.853-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">crowdsourcing</category><category domain="http://www.blogger.com/atom/ns#">paper</category><title>Jens Brokfeld's Thesis on Crowdsourced Transcription</title><description>Although the field of transcription tools has become increasingly popular over the last couple of years, most academic publications on the topic focus on a single project and the lessons that project can teach.&amp;nbsp; While those provide invaluable advice on how to run crowdsourcing projects, they do not lend much help to memory professionals trying to decide which tools to explore when they begin a new project.&amp;nbsp; Jens Brokfeld's thesis for his MLIS degree at Fachhochschule Potsdam is the most systematic, detailed, and thorough review of crowdsourced manuscript transcription tools to date.&lt;br /&gt;
&lt;br /&gt;
After a general review of crowdsourcing cultural heritage, Brokfeld reviews &lt;a href="http://rose-holley.blogspot.com/"&gt;Rose Holley's&lt;/a&gt; checklist for crowdsourcing projects and then expands upon the part of &lt;a href="http://manuscripttranscription.blogspot.com/2012/05/transcription-tools-at-tcdl2012.html"&gt;my own TCDL presentation&lt;/a&gt; which discussed criteria for selecting transcription tools, synthesizing it with published work on the subject.&amp;nbsp; He then defines his own test criteria for transcription tools, about which more below.&amp;nbsp; Then, informed by seventy responses to a bilingual survey of crowdsourced transcription users, Brokfeld evaluates six tools (FromThePage, Refine!, Wikisource, Scripto, T-PEN, and the Bentham Transcription Desk) with forty-two pages (pp. 40-82) devoted to tool-specific descriptions of the capabilities and gaps within each system.&amp;nbsp; This exploration is followed by an eighteen-page comparison of the tools against each other (pp. 83-100). The whole paper is very much worth your time, and can be downloaded at the "Masterarbeit.pdf" link here: &lt;a href="http://opus4.kobv.de/opus4-fhpotsdam/frontdoor/index/index/docId/331"&gt;"Evaluation von Editionswerkzeugen zur nutzergenerierten Transkription handschriftlicher Quellen"&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
It would be asking too much of my limited German to translate the extensive tool descriptions, but I think I should acknowledge that I found no errors in Brokfeld's description of my own tool, &lt;a href="http://fromthepage.com/"&gt;FromThePage&lt;/a&gt;, so I'm confident in his evaluation of the other five systems.&amp;nbsp; However, I feel like I ought to attempt to abstract and translate some of his criteria for evaluation, as well as his insightful analysis of each tool's suitability for a particular target group.&lt;br /&gt;
&lt;blockquote class="tr_bq"&gt;
&lt;b&gt;Chapter 5:&amp;nbsp; &lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;Prüfkriterien ("Test Criteria")&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;5.1 Accessibility (by which he means access to transcription data from different personal-computer-based clients)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;5.1.1 Browser Support&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;5.2 Findability&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;5.2.1 Interfaces (including support for such API protocols as OAI-PMH, but also including functionality to export transcripts in XML or to import facsimiles)&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt; &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;5.2.2 References to Standards (this includes support for normalization of personal and place names in the resulting editions)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;5.3 Longevity&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;5.3.1 License (is the tool released under an open-source license that addresses digital preservation concerns?)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;5.3.2 Encoding Format (TEI or something else?)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;5.3.3 Hosting&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;5.4 Intellectual Integrity (primarily concerned with support for annotations and explicit notation of editorial emendations)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;5.4.1 Text Markup&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;5.5 Usability (similar to "accessibility" in American usage)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;5.5.1 Transcription Mode (transcriber workflows)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;5.5.2 Presentation Mode (transcription display/navigation)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;5.5.3 Editorial Statistics (tracking edits made by individual users)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;5.5.4 User Management (how does the tool balance ease-of-use with preventing vandalism?)&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;I don't believe that I've seen many of these criteria used before, and would welcome a more complete translation.&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;His comparison based on target group is even more innovative.&amp;nbsp; Brokfeld recognizes that different transcription projects have different needs, and is the first scholar to define those target groups.&amp;nbsp; Chapter 7 of his thesis defines those groups as follows:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;blockquote class="tr_bq"&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;&lt;b&gt;Science&lt;/b&gt;:&amp;nbsp; The scientific community is characterized by concern over the richness of mark-up as well as a preference for customizability of the tool over simplicity of user interface. [Note: it is entirely possible that I mis-translated &lt;i&gt;Wissenschaft &lt;/i&gt;as "science" instead of "scholarship".]&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;&lt;b&gt;Family History&lt;/b&gt;: Usability and a simple transcription interface are paramount for family historians, but privacy concerns over personal data may play an important role in particular projects.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;&lt;b&gt;Archives&lt;/b&gt;: While archives attend to scholarly standards, their primary concern is for the transcription of extensive inventories of manuscripts -- for which shallow markup may be sufficient.&amp;nbsp; Archives are particularly concerned with support for standards.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;&lt;b&gt;Libraries&lt;/b&gt;: Libraries pay particular attention to bibliographical standards. They also may organize their online transcription projects by fonds, folders, and boxes.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;&lt;b&gt;Museums&lt;/b&gt;: In many cases museums possess handwritten sources which refer to their material collections.&amp;nbsp; As a result, their transcriptions need to be linked to the corresponding object.&lt;/span&gt;&lt;/span&gt;&lt;/blockquote&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;It's very difficult for me to summarize or extract Brokfeld's evaluation of the six different tools for five different target groups, since those comparisons are in tabular form with extensive prose explanations.&amp;nbsp; I encourage you to read the original, but I can provide a totally inadequate summary for the impatient:&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;FromThePage: Best for family history and libraries; worst for science.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;Refine!: Best for libraries, followed by archives; worst for family history.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;Wikisource: Best for libraries, archives and museums; worst for family history.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;Scripto: Best for museums, followed by archives and libraries; worst for family history and science.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;T-PEN: Best for science.&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="goog-gtc-unit" id="goog-gtc-unit-1465"&gt;&lt;span class="goog-gtc-translatable goog-gtc-unit-highlight" dir="ltr"&gt;Bentham Transcription Desk: Best for libraries, archives and museums.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;i&gt;Note: This is a summary of part of a 140-page German document translated by an amateur.&amp;nbsp; Consult the original before citing or making decisions based on the information here. Jens Brokfeld welcomes questions and comments (in English or German) through this webform: &lt;a href="http://opus4.kobv.de/opus4-fhpotsdam/frontdoor/mail/toauthor/docId/331"&gt;http://opus4.kobv.de/opus4-fhpotsdam/frontdoor/mail/toauthor/docId/331&lt;/a&gt;.&lt;/i&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/oIbG3zlurbU" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2012/10/jens-brokfelds-thesis-on-crowdsourced.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-4586527144037432123</guid><pubDate>Wed, 10 Oct 2012 14:30:00 +0000</pubDate><atom:updated>2013-02-26T19:46:53.791-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">presentations</category><category domain="http://www.blogger.com/atom/ns#">crowdsourcing</category><category domain="http://www.blogger.com/atom/ns#">videos</category><title>Webwise Reprise on Crowdsourcing</title><description>Back in June, the folks at IMLS and Heritage Preservation ran a webinar exploring the issues and tools discussed at the IMLS Webwise Crowdsourcing panel "&lt;b&gt;Sharing Public History Work: Crowdsourcing Data and Sources."&lt;/b&gt; &lt;br /&gt;
&lt;br /&gt;
After an introduction by Kevin Cherry and Kristen Laise,&amp;nbsp; &lt;a href="http://www.6floors.org/bracket/"&gt;Sharon Leon&lt;/a&gt;, who chaired the live panel, presented a wonderful overview of crowdsourcing cultural heritage and discussed the kinds of crowdsourcing projects that have been successful -- including, of course, the &lt;a href="http://wardepartmentpapers.org/"&gt;Papers of the War Department&lt;/a&gt; and &lt;a href="http://scripto.org/"&gt;Scripto&lt;/a&gt;, the transcription tool the Roy Rosenzweig Center for History and New Media developed from that project.&amp;nbsp; They then ran the &lt;a href="http://www.tvworldwide.com/events/webwise/120229/globe_show/default_go_archive.cfm?gsid=1971&amp;amp;type=flv&amp;amp;test=0&amp;amp;live=0"&gt;video &lt;/a&gt;of my own presentation, &lt;a href="http://manuscripttranscription.blogspot.com/2012/03/crowdsourcing-at-imls-webwise-2012.html"&gt;"Lessons from Small Crowdsourcing Projects"&lt;/a&gt;, followed by a live demo of &lt;a href="http://fromthepage.com/"&gt;FromThePage&lt;/a&gt;.&amp;nbsp; Perhaps the best part of the webinar, however, was the Q&amp;amp;A from people all over the country asking for details about how these kinds of projects work.&lt;br /&gt;
&lt;br /&gt;
The &lt;a href="http://www.connectingtocollections.org/2012-webwise-crowdsourcing/"&gt;recording of the webinar&lt;/a&gt; is online, and I encourage you to check it out.&amp;nbsp; (Here's a &lt;a href="http://squirrel.adobeconnect.com/p1wspy893ub/"&gt;direct link&lt;/a&gt;, if you have trouble.) I'm very grateful to IMLS and Heritage Preservation for their work in making this knowledge accessible so effectively.&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/u0iMVXAK3NI" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2012/10/webwise-reprise-on-crowdsourcing.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-1340285279127406001</guid><pubDate>Tue, 09 Oct 2012 15:49:00 +0000</pubDate><atom:updated>2013-02-26T19:48:29.562-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">fromthepage projects</category><title>Mosman 1914-1918 on FromThePage</title><description>The Mosman community in New South Wales is preparing for the centennial of World War One, and as part of this project they've launched &lt;a href="http://mosman1914-1918.net/"&gt;mosman1914-1918.net&lt;/a&gt;: "Doing our bit, Mosman 1914–1918".&amp;nbsp; The project describes itself as "an innovative online resource to collect and display information about the wartime experiences of local service people," and includes scan-a-thons, hack days, and build-a-thons.&lt;br /&gt;
&lt;br /&gt;
One of their efforts involves &lt;a href="http://fromthepage.com/collection/show?collection_id=12"&gt;transcription of a serviceman's diary with links to related names on local honor boards&lt;/a&gt;.&amp;nbsp; I'm delighted to report that they're hosting this project on &lt;a href="http://fromthepage.com/"&gt;FromThePage.com&lt;/a&gt;, and I look forward to working with and learning from the Mosman team.&lt;br /&gt;
&lt;br /&gt;
Read more about Allan Allsop's diary on the &lt;a href="http://mosman1914-1918.net/project/blog/help-us-transcribe-this-diary"&gt;Mosman 1914-1918 project blog&lt;/a&gt; and lend a hand transcribing!&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/FFVuqoIhlHE" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2012/10/mosman-1914-1918-on-fromthepage.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-8733131299847203621</guid><pubDate>Wed, 03 Oct 2012 14:13:00 +0000</pubDate><atom:updated>2013-02-26T19:54:44.444-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">indexing</category><category domain="http://www.blogger.com/atom/ns#">client projects</category><category domain="http://www.blogger.com/atom/ns#">structured transcription</category><title>Building a Structured Transcription Tool with FreeUKGen</title><description>I'm &lt;a href="http://manuscripttranscription.blogspot.com/2012/03/jumping-in-with-both-feet.html"&gt;currently&lt;/a&gt; working with FreeUKGen--the
charity behind the genealogy database &lt;a href="http://en.wikipedia.org/wiki/FreeBMD"&gt;FreeBMD&lt;/a&gt;--to build a general-purpose,
open-source tool for crowdsourced transcription of structured manuscript data into a searchable database.&lt;br /&gt;
&lt;br /&gt;
We're basing our system on the Scribe tool developed for the Citizen Science
Alliance for &lt;a href="http://whats-the-score.org/"&gt;What's the Score at the Bodleian&lt;/a&gt;,
which originated out of their experience building &lt;a href="http://oldweather.org/"&gt;OldWeather&lt;/a&gt; and other
citizen science sites.&lt;br /&gt;
&lt;br /&gt;
We are building the following systems:&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;A new tool for loading image sets into the Scribe system and attaching
them to data-entry templates.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;Modifications to the Scribe system to handle our volunteer organization's
workflow, plus some usability enhancements. &lt;/li&gt;
&lt;li&gt;A publicly-accessible search-and-display website to mine the database
created through data entry.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;A reporting, monitoring, and coordinating system for our volunteer
supervisors.&amp;nbsp;&lt;/li&gt;
&lt;/ol&gt;
We also plan to add support for geocoding during transcription and GIS
support within the search and display system.  Currently, initial
development of item 1 is mostly finished, and we are moving on to items 2 and 3 above.&lt;br /&gt;
&lt;br /&gt;
Although this tool is focused on support for parish registers and census
forms, we are intent on creating a general-purpose system for any
tabular/structured data.&amp;nbsp;&amp;nbsp; Scribe's data-entry templates are defined in its database, with the possibility to assign different templates to different images or sets of images.&amp;nbsp; As a result, we can use a simple template for a 1750 register of burials or a much more complex template for an 1881 census form.&amp;nbsp; Since each transcribed record is linked to the section of the page image it represents, we have the ability to display the facsimile version of a record alongside its transcript in a list of search results, or to get fancy and pre-populate a transcriber's form with frequently-repeated information like months or birthplaces.&lt;br /&gt;
&lt;br /&gt;
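As a rough illustration of that design, the Python sketch below shows how templates might be assigned to image sets, how a form could be pre-populated from the previous record, and how a record could carry the page region it was transcribed from. It is not Scribe's actual schema or code: the template names, image-set names, and the functions blank_form and make_record are all hypothetical.&lt;br /&gt;
```python
# Hypothetical templates: each lists the fields a transcriber fills in.
TEMPLATES = {
    "burials_1750": ["name", "burial_date"],
    "census_1881": ["name", "age", "occupation", "birthplace"],
}

# Hypothetical assignment of templates to image sets.
IMAGE_SET_TEMPLATES = {
    "parish_register_a": "burials_1750",
    "census_district_9": "census_1881",
}


def blank_form(image_set, previous_record=None, carry_over=()):
    """Build an empty entry form for an image set.

    Fields named in carry_over are pre-filled from the previous record,
    which models pre-populating frequently repeated values.
    """
    fields = TEMPLATES[IMAGE_SET_TEMPLATES[image_set]]
    form = {name: "" for name in fields}
    if previous_record:
        for name in carry_over:
            if name in form:
                form[name] = previous_record.get(name, "")
    return form


def make_record(image_id, region, values):
    """Attach transcribed values to the page region they were read from."""
    x, y, width, height = region
    return {
        "image_id": image_id,
        "region": {"x": x, "y": y, "width": width, "height": height},
        "values": dict(values),
    }


# Example with made-up data: carry the birthplace over to the next census row.
previous = {"name": "Ann Lee", "age": "34", "occupation": "weaver", "birthplace": "Leeds"}
form = blank_form("census_district_9", previous_record=previous, carry_over=("birthplace",))
record = make_record("image-001", (40, 310, 900, 28), form)
print(record["region"], record["values"]["birthplace"])
```
Because the region travels with each record in this sketch, a search result could crop the facsimile to that rectangle and show it next to the transcript.&lt;br /&gt;
&lt;br /&gt;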
Under the guidance of &lt;a href="http://en.wikipedia.org/wiki/Ben_Laurie"&gt;Ben Laurie&lt;/a&gt;, the trustee directing the project, we are committed to open source and open data.&amp;nbsp; We're releasing the &lt;a href="https://github.com/FreeUKGen"&gt;source code&lt;/a&gt; under an Apache license and planning to build API access to the full set of record data.&lt;br /&gt;
&lt;br /&gt;
We feel that the more the merrier in an open-source project, so we're looking for collaborators, whether they contribute
code, funding, or advice.&amp;nbsp; We are especially interested in collaborators from archives, libraries, and the genealogy world.&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/Zje3nqDtc4I" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2012/10/building-structured-transcription-tool.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-3761271554340522309</guid><pubDate>Wed, 03 Oct 2012 02:19:00 +0000</pubDate><atom:updated>2013-02-26T19:55:17.612-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">press</category><title>ReportersLab Reviews FromThePage</title><description>Tyler Dukes has written a concise introduction to the issues with handwritten material and &lt;a href="http://www.reporterslab.org/handwritten-records/"&gt;a lovely review of FromThePage at ReportersLab&lt;/a&gt;:&lt;br /&gt;
&lt;blockquote class="tr_bq"&gt;
Even when physical documents are converted into digital format, 
subtle inconsistencies in handwriting prove too much for optical 
character recognition software.&amp;nbsp;The best computer scientists have been 
able to do is apply various machine learning techniques, but most of 
these require a lot of training data — accurate transcriptions 
deciphered by humans and fed into an algorithm.
&lt;br /&gt;
&lt;br /&gt;
“Fundamentally, I don’t think that we’re going to see effective OCR 
for freeform cursive any time soon,” Brumfield said. “The big successes 
so far with machine recognition have been in domains in which there’s a 
really constrained possibilities for what is written down.”
&lt;br /&gt;
&lt;br /&gt;
That means entries like numbers. Dates. Zip codes. Get beyond that, and you’re out of luck.
&lt;/blockquote&gt;
I don't know much about the world of investigative journalism, but it wouldn't surprise me if it holds as many intriguing parallels and new challenges as I've discovered among natural science collections.&amp;nbsp;&amp;nbsp; Handwriting might still be the most interdisciplinary technology.&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/VbfJmZT5sWc" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2012/10/reporterslab-reviews-fromthepage.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-7630844857436057563</guid><pubDate>Mon, 24 Sep 2012 19:01:00 +0000</pubDate><atom:updated>2013-02-26T19:55:52.917-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">presentations</category><category domain="http://www.blogger.com/atom/ns#">crowdsourcing</category><title>Bilateral Digitization at Digital Frontiers 2012</title><description>This is a transcript of the talk I gave at &lt;a href="https://digitalfrontiers.unt.edu/schedule"&gt;Digital Frontiers 2012&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;Abstract: One of the ironies of the Internet age is that traditional standards
for accessibility have changed radically. Intelligent members of the
public refer to undigitized manuscripts held in a research library as
"locked away", even though anyone may study the well-cataloged,
well-preserved material in the library's reading room. By the
standard of 1992, institutionally-held manuscripts are far more
accessible to researchers than uncatalogued materials in private
collections -- especially when the term "private collections" includes
over-stuffed suburban filing cabinets or unopened boxes inherited from
the family archivist. In 2012, the democratization of digitization
technology may favor informal collections over institutional ones,
privileging online access over quality, completeness, preservation and
professionalism.&lt;/i&gt;&lt;br /&gt;
&lt;i&gt;&lt;br /&gt;&lt;/i&gt;
&lt;i&gt;Will the "cult of the amateur" destroy scholarly and archival
standards? Will crowdsourcing unlock a vast, previously invisible
archive of material scattered among the public for analysis by
scholars? How can we influence the headlong rush to digitize through
education and software design? This presentation will discuss the
possibilities and challenges of mass digitization for amateurs,
traditional scholars, libraries and archives, with a focus on
handwritten documents.&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-CzNmUZpKgUU/UGCE2xJkuSI/AAAAAAAAAXY/cYqDX8GYYKM/s1600/Slide1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-CzNmUZpKgUU/UGCE2xJkuSI/AAAAAAAAAXY/cYqDX8GYYKM/s320/Slide1.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
My presentation is on bilateral digitization: digitization done by
institutions and by individuals outside of institutions, and the wall
that's sort of in between institutions and individuals.
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-bmgmI0UqZ4w/UGCFC3JdYHI/AAAAAAAAAYw/v3dOw7t-W4M/s1600/Slide2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-bmgmI0UqZ4w/UGCFC3JdYHI/AAAAAAAAAYw/v3dOw7t-W4M/s320/Slide2.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
In 1823, a young man named Jeremiah White Graves moved to Pittsylvania
County, Virginia and started working as a clerk in a country store.
Also that year he started recording a diary and journal of his
experiences. He maintained this diary for the next fifty-five years, so
it covers his experience -- his rise to become a relatively prominent
landowner, tobacco farmer, and slaveholder. It covers the Civil War, it
covers Reconstruction and the aftermath. (This is an entry covering
Lee's surrender.)
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-yJl_ieDj62I/UGCFSSgYudI/AAAAAAAAAaI/bLrMxx84Zcc/s1600/Slide3.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-yJl_ieDj62I/UGCFSSgYudI/AAAAAAAAAaI/bLrMxx84Zcc/s320/Slide3.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
In addition to the diary, he kept account books that give you details of
plantation life that range from -- that you wouldn't otherwise see in
the diaries.

So for example, this is his daughter Fanny,

&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-gbXQ_Yn0O1g/UGCFcTq1kpI/AAAAAAAAAbQ/4VFW7XRN478/s1600/Slide4.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-gbXQ_Yn0O1g/UGCFcTq1kpI/AAAAAAAAAbQ/4VFW7XRN478/s320/Slide4.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
And this is a &lt;a href="http://archive.org/stream/Jeremiah_White_Graves_Diary_Volume_2_Book_01/JWGravesVol2Book01#page/n15/mode/1up"&gt;list of every single article of clothing&lt;/a&gt; that she took with her when she went off to a
boarding school for a semester.

&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-fIzcDhac15A/UGCFeIbVklI/AAAAAAAAAbY/MyuxdR4gPd4/s1600/Slide5.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-fIzcDhac15A/UGCFeIbVklI/AAAAAAAAAbY/MyuxdR4gPd4/s320/Slide5.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
Perhaps more interesting, this is a &lt;a href="http://archive.org/stream/Jeremiah_White_Graves_Diary_Volume_2_Book_01/JWGravesVol2Book01#page/n18/mode/1up"&gt;memorandum of cash payments&lt;/a&gt; that he
made to certain of his enslaved laborers for work on their customary
holidays -- another sort of interesting factor.

I got interested in this because I'm interested in the property that he
lived in. The house that he built is now in my family, and I was doing
some research on this. Since these account books include details of
construction of the house, I spent a lot of time looking for these
books. I've been looking for them for about the last ten years.

I got in contact with some of the descendants of Jeremiah White Graves
and found out through them that one of their ancestors had donated the
diaries to the Alderman Library at the University of Virginia. I looked
into getting them digitized and tried to get some collaboration [going]
with some of the descendants, and one of them in particular, Alan
Williams, was extremely helpful to me. But this was his reaction:

&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-pqcnve5SnZw/UGCFexUsnRI/AAAAAAAAAbg/KcArRPvwWrk/s1600/Slide6.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-pqcnve5SnZw/UGCFexUsnRI/AAAAAAAAAbg/KcArRPvwWrk/s320/Slide6.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
Okay. So we have diaries that are put in a library -- I believe one of
the top research libraries in the country -- and they are behind a wall.
They are locked away from him.
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-uENJ3EWeaEI/UGCFfLqRHHI/AAAAAAAAAbo/eurgo_PEYLQ/s1600/Slide7.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-uENJ3EWeaEI/UGCFfLqRHHI/AAAAAAAAAbo/eurgo_PEYLQ/s320/Slide7.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So let's talk about walls. From his perspective, the fact that these
diaries--these family manuscripts of his--are in the Alderman Library
means:&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;They're professionally conserved -- great!&amp;nbsp;&lt;/li&gt;
&lt;li&gt;They're publicly accessible, so anyone can walk in and look at them in
the Reading Room.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;They're cataloged, which would not be the case if they'd still been
sitting in his family.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;On the down side, they're a thousand miles away: they're in Virginia,
he's in Florida, I'm in Texas. We all want to look at these, but it's
awfully hard for people to get there if we don't have research budgets.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;We have to deal with reading room restrictions if we actually get there.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;Once we work on getting things digitized we have these
permission-to-publish requirements that we need to deal with, which pose some moral
challenges for someone whose family these diaries came from.&amp;nbsp;&lt;/li&gt;
&lt;li&gt;And we have the scanning fees: the cost of getting them scanned by the
excellent digitization department at the Alderman Library is a thousand
dollars. Which is not unreasonable, but it's still pretty costly.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-bu0cE2LdeDw/UGCFfve5X2I/AAAAAAAAAbw/1YVdRWKe4SI/s1600/Slide8.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-bu0cE2LdeDw/UGCFfve5X2I/AAAAAAAAAbw/1YVdRWKe4SI/s320/Slide8.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So here's a wall--a real, physical wall--between this institution and
the public. How do we get through walls? Everyone here is familiar
with digitization and collaboration. This is how we share things
nowadays. It's how we've been sharing things for the last fifteen
years, in fact. But, at least fifteen years ago, when we got started
doing digitization, we had shallow digitization. 
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-m8iOhXosdzU/UGCFgHaDBmI/AAAAAAAAAb4/pL31XWaU0L4/s1600/Slide9.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-m8iOhXosdzU/UGCFgHaDBmI/AAAAAAAAAb4/pL31XWaU0L4/s320/Slide9.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
The prevalent practice in institutions was "scan-and-dump": make some
scans, put them in a repository online.&lt;br /&gt;
&lt;br /&gt;
One of the problems with that is that you have very limited metadata.
The metadata is usually institutionally-oriented. No transcripts, in
particular -- nobody has time for this. And quite often, they're in
software platforms that are not crawlable by search engines.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-yjX0kew9a_4/UGCE3TFtZ1I/AAAAAAAAAXg/XtW-NnRoVLw/s1600/Slide10.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-yjX0kew9a_4/UGCE3TFtZ1I/AAAAAAAAAXg/XtW-NnRoVLw/s320/Slide10.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Now meanwhile, amateurs are digitizing things, and they're doing
something that's actually even worse! They &lt;i&gt;are&lt;/i&gt; producing full
transcripts, but they're not attaching them to any facsimiles. They're
not including any provenance information or information about where
their sources came from. Their editorial decisions about expanding
abbreviations or any other sorts of modernizations or things like that
-- they're invisible; none of those are documented.&lt;br /&gt;
&lt;br /&gt;
Worst of all, however, is that the way that these things are propagated
through the Internet is through cut-and-paste: so quite often from a
website to a newsgroup to emails, you can't even find the original
person who typed up whatever the source material was.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-aMuad2M6dIk/UGCE3xv6n3I/AAAAAAAAAXo/Etlfmks848s/s1600/Slide11.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-aMuad2M6dIk/UGCE3xv6n3I/AAAAAAAAAXo/Etlfmks848s/s320/Slide11.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So how do we get to deep digitization and solve both of these problems?
&lt;br /&gt;
&lt;br /&gt;
The challenges to institutions, in my opinion, come down to funding and
manpower. As we just mentioned, generally archives don't have a staff
of people ready to produce documentary editions and put them online.&lt;br /&gt;
&lt;br /&gt;
Outside institutions, the big challenge is standards; it is
expertise. You've got manpower, you've got willingness, but you've got
a lot of trouble making things work using the sorts of methodologies
that have come out of the scholarly world and have been developed over
the last hundred years.&lt;br /&gt;
&lt;br /&gt;
So how do we fix these challenges?
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-JjY6T9gbbrA/UGCE4UxH8pI/AAAAAAAAAXw/HI30mE_q9t0/s1600/Slide12.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-JjY6T9gbbrA/UGCE4UxH8pI/AAAAAAAAAXw/HI30mE_q9t0/s320/Slide12.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
One possible solution for institutions is crowdsourcing.&amp;nbsp; We've talked
about this this morning; we don't need to go into detail about what
crowdsourcing is, but I'd like to talk a little bit about who
participates in crowdsourcing projects and what kinds of things they can
do and what this says about [crowdsourcing projects]. I've got three
examples here. &lt;a href="http://oldweather.org/"&gt;OldWeather.org&lt;/a&gt; is a project from GalaxyZoo, the
Zooniverse/&lt;a href="http://www.citizensciencealliance.org/"&gt;Citizen Science Alliance&lt;/a&gt;. The &lt;a href="http://beta.fromthepage.com/ZenasMatthews"&gt;Zenas Matthews Diary&lt;/a&gt; was
something that I collaborated with the &lt;a href="http://www.southwestern.edu/live/news/6475-collaborative-transcription-project"&gt;Southwestern University Smith Library Special Collections&lt;/a&gt; on. And the &lt;a href="http://www.facebook.com/HarryRansomCenterFragments"&gt;Harry Ransom Center's Manuscript Fragments Project&lt;/a&gt;.
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-bITS5eV2T7E/UGCE6ItwGlI/AAAAAAAAAX4/crhXBO24c-8/s1600/Slide13.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-bITS5eV2T7E/UGCE6ItwGlI/AAAAAAAAAX4/crhXBO24c-8/s320/Slide13.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Okay, in Old Weather there are Royal Navy logbooks that record
temperature measurements every four hours: the midshipman of the watch
would come out on deck and record barometric pressure, wind speed, wind
direction and temperature. This is of incredible importance for climate
scientists because you cannot point a weather satellite at the south
Pacific in 1916. The problem is that it's all handwritten and you need
humans to transcribe this.
&lt;br /&gt;
&lt;br /&gt;
They launched this project three years ago, I believe, and they're done.
They've transcribed all the Royal Navy logs from the period essentially
around World War I -- all in triplicate. So blind triple keying every
record. And the results are pretty impressive. 
&lt;br /&gt;
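To make "blind triple keying" concrete, here is a minimal sketch of how three independent transcriptions of one logbook field might be reconciled. It assumes a simple majority vote; the talk does not describe Old Weather's actual reconciliation rule, and the function name reconcile is hypothetical.&lt;br /&gt;

```python
from collections import Counter


def reconcile(keyings):
    """Return the value most transcribers agree on, or None when no two agree.

    Assumption: a simple majority vote over independently keyed values.
    """
    # Normalise whitespace so "29.8 " and "29.8" count as the same reading.
    cleaned = [value.strip() for value in keyings]
    value, votes = Counter(cleaned).most_common(1)[0]
    # A single vote means nobody agreed; leave the entry for human review.
    if votes == 1:
        return None
    return value


print(reconcile(["29.8", "29.8", "29.3"]))  # the two matching keyings win
print(reconcile(["NW", "NE", "N"]))         # no agreement, so None
```

Returning None rather than guessing keeps disputed entries visible, so a person can look at the page image again.&lt;br /&gt;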
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-12rAFCL-Msk/UGCE6s3LbrI/AAAAAAAAAYA/fruLw0y1Lcc/s1600/Slide14.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-12rAFCL-Msk/UGCE6s3LbrI/AAAAAAAAAYA/fruLw0y1Lcc/s320/Slide14.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
Each individual volunteer's transcripts tend to be &lt;a href="http://blog.oldweather.org/2011/03/31/better-than-the-defence/"&gt;about 97% accurate&lt;/a&gt;.
For every thousand logbook entries, three entries are going to be wrong
because of volunteer error. But this compares pretty favorably with the
ten that are actually honestly illegible, or indeed the three that are
the result of the midshipman of the watch confusing north and south.
&lt;br /&gt;
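One way to see how a 97% individual accuracy could sit alongside roughly three bad entries per thousand is to assume that the three keyings of each entry are reconciled by majority vote and that volunteers err independently. Both are assumptions made for this sketch, not details given in the talk.&lt;br /&gt;

```python
def majority_error_rate(accuracy):
    """Chance that at least two of three independent keyings are wrong."""
    wrong = 1.0 - accuracy
    # Exactly two wrong (three ways to choose the one correct keying),
    # plus the case where all three are wrong.
    return 3 * wrong**2 * accuracy + wrong**3


rate = majority_error_rate(0.97)
print(round(rate * 1000, 1), "entries per thousand")  # roughly 2.6
```

This is an upper bound under those assumptions: two wrong keyings that disagree with each other would produce no majority at all, so the entry would be flagged rather than silently recorded as wrong.&lt;br /&gt;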
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-dV5Sb6ib2N4/UGCE7tngRjI/AAAAAAAAAYQ/t8FTZbKfHm0/s1600/Slide16.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-dV5Sb6ib2N4/UGCE7tngRjI/AAAAAAAAAYQ/t8FTZbKfHm0/s320/Slide16.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So in terms of participation, OldWeather has gotten transcribed &lt;a href="http://blog.oldweather.org/2012/07/23/one-million-six-hundred-thousand-new-observations/"&gt;more than 1.6 million weather observations&lt;/a&gt;--again, all triple-keyed--through
the efforts of sixteen thousand volunteers who've been transcribing
pages from a million pages of logs.
&lt;br /&gt;
&lt;br /&gt;
So what this means is that you have a mean contribution of one hundred
transcriptions per user. &lt;b&gt;But that statistic is worthless!&lt;/b&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-oemNHp65RWA/UGCE9Mgo_YI/AAAAAAAAAYY/_q5fxiTwuqc/s1600/Slide17.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-oemNHp65RWA/UGCE9Mgo_YI/AAAAAAAAAYY/_q5fxiTwuqc/s320/Slide17.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Because you don't have individual volunteers transcribing one hundred
things apiece. You don't have an even distribution. This is a &lt;a href="http://blog.oldweather.org/2012/09/05/theres-a-green-one-and-a-pink-one-and-a-blue-one-and-a-yellow-one/"&gt;color map of contributions per user&lt;/a&gt;. Each user has a square. The size of the
square represents the quantity of records that they transcribed. And
what you can see here is that of those 1.6 million records, fully a
tenth (in the left-hand column) were transcribed by only ten users. 
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-iS1wqdCLOF8/UGCE9_U2bmI/AAAAAAAAAYc/nGxCyfzzol4/s1600/Slide18.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-iS1wqdCLOF8/UGCE9_U2bmI/AAAAAAAAAYc/nGxCyfzzol4/s320/Slide18.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So we see this in other projects. This is a power-law distribution in
which most of the contributions are made by a handful of
"well-informed enthusiasts". I've &lt;a href="http://manuscripttranscription.blogspot.com/2012/03/crowdsourcing-at-imls-webwise-2012.html"&gt;talked elsewhere&lt;/a&gt; about how this is
true in small projects as well. What I'd like to talk about here is
some of the implications.
&lt;br /&gt;
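To see why the mean misleads under a power law, here is a small sketch using synthetic data. It spreads 1.6 million transcriptions over sixteen thousand users with a Zipf-style rule, in which the user at rank i gets a share proportional to 1/i. That rule is an assumption chosen for illustration, not a fit to the Old Weather data.&lt;br /&gt;

```python
from statistics import mean, median

USERS = 16_000
TOTAL = 1_600_000

# Synthetic Zipf-style weights: rank 1 contributes the most, rank 16,000 the least.
weights = [1.0 / rank for rank in range(1, USERS + 1)]
scale = TOTAL / sum(weights)
contributions = [scale * weight for weight in weights]

# Share of all transcriptions made by the ten most active users.
top_ten_share = sum(contributions[:10]) / TOTAL

print("mean per user:  ", round(mean(contributions), 1))
print("median per user:", round(median(contributions), 1))
print("top ten share:  ", round(top_ten_share, 3))
```

The mean comes out at one hundred by construction, yet the median user in this synthetic crowd contributes only about a fifth of that, and the ten most active users account for more than a quarter of the total. The exact proportions depend entirely on the assumed rule; the point is only that the mean describes almost nobody.&lt;br /&gt;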
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-oXhUifuPQo0/UGCFASFjJNI/AAAAAAAAAYo/Yvu42T9ogLI/s1600/Slide19.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-oXhUifuPQo0/UGCFASFjJNI/AAAAAAAAAYo/Yvu42T9ogLI/s320/Slide19.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
One of the implications is that very small projects can work: This is
the Zenas Matthews Diary, which was transcribed on FromThePage by one
single volunteer -- one well-informed enthusiast -- in fourteen days.
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-GQkLhtlbU0c/UGCFDTE3KGI/AAAAAAAAAY4/CE-yNiVWucM/s1600/Slide20.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-GQkLhtlbU0c/UGCFDTE3KGI/AAAAAAAAAY4/CE-yNiVWucM/s320/Slide20.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Before we had announced the project publicly he found it, transcribed
the entire 43-page diary from the Mexican-American War of a Texas
volunteer, went back and made two hundred and fifty revisions to those
pages, and added two dozen footnotes.
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-HKmf6-UH4b8/UGCFGW2oxhI/AAAAAAAAAZA/cDhgcVWKGUE/s1600/Slide21.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-HKmf6-UH4b8/UGCFGW2oxhI/AAAAAAAAAZA/cDhgcVWKGUE/s320/Slide21.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
This also has implications for the kinds of tasks you can ask volunteers
to do. This is the Harry Ransom Center Manuscript Fragments Project in
which the Ransom Center has a number of fragments of medieval
manuscripts that were reused in the bindings of later works, and they're
asking people to identify them so that perhaps they can reassemble them. 
&lt;br /&gt;
&lt;br /&gt;
So here's &lt;a href="http://www.flickr.com/photos/ransom_center_fragments/7646144356/in/photostream"&gt;a posting on Flickr&lt;/a&gt;. They're saying, "Please identify this in
the comments thread."
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-4LQIiyWOhqk/UGCFHfuAt2I/AAAAAAAAAZI/1Wv7XIxQ9dQ/s1600/Slide22.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-4LQIiyWOhqk/UGCFHfuAt2I/AAAAAAAAAZI/1Wv7XIxQ9dQ/s320/Slide22.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
And look: we've got people volunteering
transcriptions of exactly what this is: identifying, "Hey, this is the
&lt;i&gt;Digest of Justinian&lt;/i&gt;, oh, and this is where you can go find this." 
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-y9xCp2PA_10/UGCFI9PSnZI/AAAAAAAAAZQ/9Lhoy-ZfQZU/s1600/Slide23.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-y9xCp2PA_10/UGCFI9PSnZI/AAAAAAAAAZQ/9Lhoy-ZfQZU/s320/Slide23.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
This is true even for smaller, more difficult fragments. &lt;a href="http://www.flickr.com/photos/ransom_center_fragments/7882835530/in/photostream"&gt;Here&lt;/a&gt; we have
one user going through and identifying just the left hand fragment of
this chunk of manuscript that was used for binding.

&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-PiSySviWeFc/UGCFJTNMGsI/AAAAAAAAAZY/IKIrQApWHag/s1600/Slide24.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-PiSySviWeFc/UGCFJTNMGsI/AAAAAAAAAZY/IKIrQApWHag/s320/Slide24.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So crowdsourcing and deep digitization have a virtuous cycle, in my opinion. You go through and you try to engage volunteers to come do this kind of work. That generates deep digitization, which means that these resources are findable. And because they're findable, you can find more volunteers.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-uHs1PYgIQqI/UGCFKWFv29I/AAAAAAAAAZo/clcwaBr1is0/s1600/Slide26.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-uHs1PYgIQqI/UGCFKWFv29I/AAAAAAAAAZo/clcwaBr1is0/s320/Slide26.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
I've had this happen recently with a personal project, &lt;a href="http://beta.fromthepage.com/JuliaBrumfield?ol=s_sp_diaries"&gt;transcribing my great-great grandmother's diary&lt;/a&gt;. The current top volunteer on this is a
man named Nat Wooding. He's a retired data analyst from Halifax County,
Virginia. He's transcribed a hundred pages and indexed them in six
months. He has no relationship whatsoever to the diarist.&lt;br /&gt;
&lt;br /&gt;
But his great uncle was the postman who's mentioned in the diaries, and
once we had a few pages worth of transcripts done, he went online and
did a vanity search for "Nat Wooding", found the postman--&lt;a href="http://beta.fromthepage.com/display/search?article_id=7268"&gt;also named Nat Wooding&lt;/a&gt;--discovered that that was his great uncle and has become a
volunteer.
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-qsNhbsyTE3c/UGCFMZfNP7I/AAAAAAAAAZw/zU7zbCbR3Wk/s1600/Slide27.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-qsNhbsyTE3c/UGCFMZfNP7I/AAAAAAAAAZw/zU7zbCbR3Wk/s320/Slide27.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Here's the example: this is just a scan/facsimile. Google can't read this.

&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-fkv9XT2jzVU/UGCFQFnIfDI/AAAAAAAAAZ4/aaWbynuT0qc/s1600/Slide28.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-fkv9XT2jzVU/UGCFQFnIfDI/AAAAAAAAAZ4/aaWbynuT0qc/s320/Slide28.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Google can read this, and find Nat Wooding.
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-ye363cDL9Bg/UGCFQ0gr0vI/AAAAAAAAAaA/1xvd2eU5ojs/s1600/Slide29.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-ye363cDL9Bg/UGCFQ0gr0vI/AAAAAAAAAaA/1xvd2eU5ojs/s320/Slide29.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Now I'd like to turn to non-institutional digitization. I said
"bilateral" -- this means, what happens when the public initiates
digitization efforts? What are the challenges--I mentioned standards--and how can we fix those? And why is this important?

&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-eWHS2bosbrA/UGCFTYaDhBI/AAAAAAAAAaQ/11NnvUZiLIE/s1600/Slide30.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-eWHS2bosbrA/UGCFTYaDhBI/AAAAAAAAAaQ/11NnvUZiLIE/s320/Slide30.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Well, there is this--what I call the Invisible Archive, of privately
held materials throughout the country and indeed the world. And most of
it is not held by private collectors that are wealthy, like private art
collectors. They are someone's great aunt who has things stashed away
in filing cabinets in her basement. Or worse, they are the heirs of
that great aunt, who aren't interested and have them stuck in boxes in
their attic. We have primary sources here of non-notable subjects, that
are very hard to study because you can't get at them.&lt;br /&gt;
&lt;br /&gt;
But this is a problem that has been solved, outside of manuscripts.
It's been solved with photographs. It's been solved by Flickr.
Nowadays, if you want to find photographs of African-American girls from
the 1960s on tricycles, you can find them on Flickr. Twenty years ago,
this was something that was irretrievable. So Flickr is a good example,
and I'd like to use it to describe how we might be able to apply it to
other fields.


&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-obgDUJt3vaE/UGCFUUxctAI/AAAAAAAAAaY/tz3fEEGAdv0/s1600/Slide31.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-obgDUJt3vaE/UGCFUUxctAI/AAAAAAAAAaY/tz3fEEGAdv0/s320/Slide31.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So, in terms of solving the standards problem, amateur digitization has
a bad, bad reputation, as you can see &lt;a href="http://h-net.msu.edu/cgi-bin/logbrowse.pl?trx=vx&amp;amp;list=h-shear&amp;amp;month=1104&amp;amp;week=d&amp;amp;msg=l2ONl4O/8Ue726xY8RcF3Q&amp;amp;user=&amp;amp;pw="&gt;here&lt;/a&gt;.

&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-xo7-GcpzYvA/UGCFWSq394I/AAAAAAAAAag/zGc8tTEbkQc/s1600/Slide32.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-xo7-GcpzYvA/UGCFWSq394I/AAAAAAAAAag/zGc8tTEbkQc/s320/Slide32.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
And much of that bad
reputation is deserved, and isn't specific to digitization. This &lt;a href="http://h-net.msu.edu/cgi-bin/logbrowse.pl?trx=vx&amp;amp;list=h-oieahc&amp;amp;month=9603&amp;amp;week=b&amp;amp;msg=En68eokB6RWMfh5DnMN2%2Bw&amp;amp;user=&amp;amp;pw="&gt;has been a problem with print editions in the past&lt;/a&gt;, and it is a problem online
now. Frankly, scholars don't trust the materials because they're not up
to standard.

&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-aKeg-2g-Pgg/UGCFXH3RbcI/AAAAAAAAAao/c-1MwjuSB1U/s1600/Slide33.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-aKeg-2g-Pgg/UGCFXH3RbcI/AAAAAAAAAao/c-1MwjuSB1U/s320/Slide33.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
How do we solve this? Collaboration: we'd like to see more
people who are scholars, who are trained archivists,
who are trained librarians participate in some of these projects.

&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/--oqW-ez9DVQ/UGCFYkZCMdI/AAAAAAAAAaw/rDG9Jx-qrUY/s1600/Slide34.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/--oqW-ez9DVQ/UGCFYkZCMdI/AAAAAAAAAaw/rDG9Jx-qrUY/s320/Slide34.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
One of the ones I'm working with [is] digitizing these registers from the
Reformation up to the present. We're building this generalizable,
open-source, crowdsourced transcription tool and indexing tool for
structured data. We'd love to find archivists to tell us what to do,
what not to do, and to collaborate with us on this.
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-ORc28vnhf5w/UGCFZAYYlQI/AAAAAAAAAa4/z3MCiuMsgBM/s1600/Slide35.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-ORc28vnhf5w/UGCFZAYYlQI/AAAAAAAAAa4/z3MCiuMsgBM/s320/Slide35.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
Another solution is community. You don't go on Flickr just to share
your photos; you go on Flickr to learn to become a better photographer.
And I think that creating platforms and creating communities that can
come up with these standards and enforce them among themselves can
really help.
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-2RBS9ynlAIM/UGCFZjPWF5I/AAAAAAAAAbA/5w6vFU0iE14/s1600/Slide36.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-2RBS9ynlAIM/UGCFZjPWF5I/AAAAAAAAAbA/5w6vFU0iE14/s320/Slide36.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
The same thing is true with software platforms, if they actually prompt
users and say: "when you're uploading this image, tell us about the
provenance." "Maybe you might want to scan the frontispieces." "Maybe you'd like to tell us the history of ownership." 
&lt;br /&gt;
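As one illustration of what such prompting could look like, here is a minimal sketch. The Upload record, its field names and the prompts_for function are all hypothetical; they are not taken from FromThePage or any other existing platform.&lt;br /&gt;

```python
from dataclasses import dataclass


@dataclass
class Upload:
    """Hypothetical record for one uploaded manuscript image."""
    filename: str
    provenance: str = ""
    ownership_history: str = ""
    frontispiece_scanned: bool = False


def prompts_for(upload):
    """Return gentle nudges for whatever source information is still missing."""
    prompts = []
    if not upload.provenance.strip():
        prompts.append("Tell us about the provenance of this document.")
    if not upload.frontispiece_scanned:
        prompts.append("Maybe you might want to scan the frontispieces.")
    if not upload.ownership_history.strip():
        prompts.append("Maybe you'd like to tell us the history of ownership.")
    return prompts


# A bare upload with no source information triggers all three nudges.
for prompt in prompts_for(Upload(filename="diary_page_001.jpg")):
    print(prompt)
```

The prompts are suggestions rather than validation errors, so an uploader who knows nothing about a document's history can still share it.&lt;br /&gt;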
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-fad1XUkBJbA/UGCFZ0qXZoI/AAAAAAAAAbI/TJSzbwpd7F0/s1600/Slide37.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-fad1XUkBJbA/UGCFZ0qXZoI/AAAAAAAAAbI/TJSzbwpd7F0/s320/Slide37.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Those are the things that I think might get us there. I've just hit my
time limit, I think, so thanks a lot!&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;Ben Brumfield is a family historian and independent software engineer. For the
last seven years he has been developing &lt;a href="http://fromthepage.com/"&gt;FromThePage&lt;/a&gt;, an open source
manuscript transcription tool in use by libraries, museums, and family
historians. He is currently working with FreeUKGen to create an open source system for indexing images and transcribing structured, hand-written material.&lt;/i&gt;&lt;i&gt; Contact Ben at benwbrum@gmail.com.&lt;/i&gt;</description><link>http://manuscripttranscription.blogspot.com/2012/09/bilateral-digitization-at-digital.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/-CzNmUZpKgUU/UGCE2xJkuSI/AAAAAAAAAXY/cYqDX8GYYKM/s72-c/Slide1.PNG" height="72" width="72" /><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-4672035822854666763</guid><pubDate>Wed, 25 Jul 2012 16:53:00 +0000</pubDate><atom:updated>2013-02-26T19:56:39.069-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">presentations</category><category domain="http://www.blogger.com/atom/ns#">rails</category><title>RMagick Lightning Talk at Austin On Rails</title><description>This is a transcript of my five-minute lightning talk at the July meeting of &lt;a href="http://austinonrails.org/"&gt;Austin On Rails&lt;/a&gt;, the local Ruby on Rails user group. It highlights a few things I've learned from my work on &lt;a href="http://github.com/FreeUKGen/MyopicVicar"&gt;MyopicVicar&lt;/a&gt; and &lt;a href="https://github.com/benwbrum/autosplit"&gt;Autosplit&lt;/a&gt;.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-7ROXZ3p0Orw/UBAXE6-M6FI/AAAAAAAAASM/eaChLZEvOOo/s1600/Slide1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-7ROXZ3p0Orw/UBAXE6-M6FI/AAAAAAAAASM/eaChLZEvOOo/s320/Slide1.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Hi everyone, I'm Ben Brumfield and I've been a professional Ruby on 
Rails developer for three whole months, [applause] and I am here to talk
 about RMagick!&amp;nbsp; RMagick is a wrapper for ImageMagick, which is great 
for processing images.&amp;nbsp; These are the kinds of images that I process:&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-Iqtd7xp3Csc/UBAXuxHHMcI/AAAAAAAAATU/I9okMb-N_Mg/s1600/Slide2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-Iqtd7xp3Csc/UBAXuxHHMcI/AAAAAAAAATU/I9okMb-N_Mg/s320/Slide2.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
(I write software that helps people transcribe old handwriting.)&lt;br /&gt;
&lt;br /&gt;
Now
 is this a pretty image?&amp;nbsp; I don't know.&amp;nbsp; We need to do some things with 
this, and this is how RMagick has helped me solve some of my problems.&amp;nbsp; 
One of the things we need to do is to turn it into a thumbnail.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-B5lTK_Cd6pk/UBAXynu0W9I/AAAAAAAAATc/DoDIKDd6Rtg/s1600/Slide3.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-B5lTK_Cd6pk/UBAXynu0W9I/AAAAAAAAATc/DoDIKDd6Rtg/s320/Slide3.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
This is an example of RMagick used as a very simple wrapper for 
ImageMagick: All we're doing is creating an ImageList out of that file, 
we say &lt;a href="http://www.imagemagick.org/RMagick/doc/image3.html#thumbnail"&gt;image.thumbnail&lt;/a&gt; and pass it a scaling factor, and that will very 
quickly go through and transform that image. Write it to a file and 
we're done.&lt;br /&gt;
&lt;br /&gt;
So that's kind of basic.&amp;nbsp; RMagick does a 
lot of other basic kinds of things like this, like &lt;a href="http://www.imagemagick.org/RMagick/doc/image3.html#rotate"&gt;rotate &lt;/a&gt;or &lt;a href="http://www.imagemagick.org/RMagick/doc/image2.html#negate"&gt;negate&lt;/a&gt;--sometimes I get files that are negatives--things like that.&lt;br /&gt;
&lt;br /&gt;
What about the cool stuff?&amp;nbsp; Here's an example of something that I have to deal with.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-WV89yuhrg9s/UBAX0vbBpcI/AAAAAAAAATk/Yz9cLtxADpw/s1600/Slide4.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-WV89yuhrg9s/UBAX0vbBpcI/AAAAAAAAATk/Yz9cLtxADpw/s320/Slide4.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
This is a parish registry from 1810, which would be fine, but this is
 a scan of a bad microfilm of a parish registry from 1810 that is 
tilted.&amp;nbsp; Now my users need to draw rectangles around individual lines, 
and the tilt is going to really throw them off.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-zo9NUrh1GVQ/UBAX2eV-oqI/AAAAAAAAATs/LyGTc6yCOjk/s1600/Slide5.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-zo9NUrh1GVQ/UBAX2eV-oqI/AAAAAAAAATs/LyGTc6yCOjk/s320/Slide5.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
How badly is this tilted? If you're interested in seeing some RMagick
 code, here's a script that I wrote to draw a grid over that image or 
any other image: &lt;a href="http://tinyurl.com/GridifyGist"&gt;http://tinyurl.com/GridifyGist&lt;/a&gt; . And what you'll see 
is:&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-00zDbc8MXac/UBAX6VqPLrI/AAAAAAAAAT0/qFSrAT-B-lc/s1600/Slide6.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-00zDbc8MXac/UBAX6VqPLrI/AAAAAAAAAT0/qFSrAT-B-lc/s320/Slide6.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Man, it kind of sucks!&amp;nbsp; You draw a rectangle over that and you're going to get all kinds of weird stuff. &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-Xyj1OAs36d8/UBAX8tWalBI/AAAAAAAAAT8/PB6sQ3m6Omw/s1600/Slide7.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-Xyj1OAs36d8/UBAX8tWalBI/AAAAAAAAAT8/PB6sQ3m6Omw/s320/Slide7.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
Enter &lt;a href="http://www.imagemagick.org/RMagick/doc/image1.html#deskew"&gt;deskew&lt;/a&gt;.&amp;nbsp; So here's an example of ruby code, very simple:&lt;br /&gt;
&lt;br /&gt;
&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;
&lt;span style="font-size: x-small;"&gt;require 'RMagick' # Which is very idiosyncratically capital-R capital-M&lt;/span&gt;&lt;/div&gt;
&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;
&lt;span style="font-size: x-small;"&gt;s = Magick::ImageList.new('skew.jpg') # We get a new ImageList from skew.jpg&lt;/span&gt;&lt;/div&gt;
&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;
&lt;span style="font-size: x-small;"&gt;d = s.deskew # And poof!&amp;nbsp; Zap!&lt;/span&gt;&lt;/div&gt;
&lt;div style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;
&lt;span style="font-size: x-small;"&gt;d.write('deskew.jpg') # We write out deskew and get this:&lt;/span&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-_uHyaDUMQ5w/UBAYA_p84rI/AAAAAAAAAUE/LQ4epzm99JM/s1600/Slide8.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-_uHyaDUMQ5w/UBAYA_p84rI/AAAAAAAAAUE/LQ4epzm99JM/s320/Slide8.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
Now this looks kind of weird. The image looks off.&amp;nbsp; But the image's &lt;i&gt;contents &lt;/i&gt;are not off.&amp;nbsp; And if we throw the grid on it:&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-BD_SfqbzCBQ/UBAYEZ7OK5I/AAAAAAAAAUM/W88rTmW7pEU/s1600/Slide9.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-BD_SfqbzCBQ/UBAYEZ7OK5I/AAAAAAAAAUM/W88rTmW7pEU/s320/Slide9.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
You can see that what RMagick has done is that it's gone through and 
it's looked for lines inside the content of the image, and it's rotated 
the image in correspondence to where it thinks the lines go and where it
 thinks the orientation is.&lt;br /&gt;
&lt;br /&gt;
I think that's pretty slick.&lt;br /&gt;
&lt;br /&gt;
RMagick
 can do some other slick things, and this is where you get into the 
programming aspects of RMagick and why you'd use RMagick instead of just
 calling ImageMagick from the command line.&lt;br /&gt;
&lt;a href="http://2.bp.blogspot.com/-FspyNHoyblU/UBAXHn-DLfI/AAAAAAAAASU/fgMocT-A_bM/s1600/Slide10.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-FspyNHoyblU/UBAXHn-DLfI/AAAAAAAAASU/fgMocT-A_bM/s320/Slide10.PNG" width="320" /&gt;&lt;/a&gt;&lt;br /&gt;
Here what we're doing is extracting files from a PDF.&amp;nbsp; Lots of 
scanners produce images in PDF format, that is just using PDF as a 
container for a bunch of images. &lt;br /&gt;
&lt;br /&gt;
This goes through, 
creates an ImageList from the file, and then goes through for each image
 in the image list--actually more specifically for each page in that 
PDF--and it writes it out correctly.&amp;nbsp; (RMagick may not be the right 
thing to use for some files because it's going to get page by page 
images -- if you have PDFs that are composed of a bunch of images [on 
each page] then you're going to want to use some other tools.)&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-h9cAnKaGoB8/UBAXNVHG_mI/AAAAAAAAASc/hu3w75OadRg/s1600/Slide11.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-h9cAnKaGoB8/UBAXNVHG_mI/AAAAAAAAASc/hu3w75OadRg/s320/Slide11.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Okay.&amp;nbsp; So here's a file that I'm dealing with, and when I want to 
present it to my users I actually want to present them with a single 
page.&amp;nbsp; Is this file a single page? No -- this file is two pages, all 
scanned on a flatbed scanner.&amp;nbsp; (For those that are curious, it's a 
sixteenth-century Spanish legal document; I don't know what it says.)&amp;nbsp; 
But how do I find the spine?&amp;nbsp; How do I know where to split it?&amp;nbsp; It's 
easy enough to go through and do cropping, but you know, what do we do?&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-vWr5OVrfKtQ/UBAXO9p24jI/AAAAAAAAASk/DM_V4BXP0h8/s1600/Slide12.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-vWr5OVrfKtQ/UBAXO9p24jI/AAAAAAAAASk/DM_V4BXP0h8/s320/Slide12.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
So I came up with this idea: Let's look for vertical dark stripes.&amp;nbsp; 
Let's look for the darkest strip that is vertical in a deskewed version 
of this image and see if we can identify that.&amp;nbsp; So this is something we 
can do.&amp;nbsp; What I've done here is I've said-- this is the inside of a loop
 where I've said for each &lt;i&gt;x&lt;/i&gt; let's &lt;a href="http://www.imagemagick.org/RMagick/doc/image2.html#get_pixels"&gt;pull all the pixels out&lt;/a&gt;, and then come
 up with a total brightness for that image [stripe].&amp;nbsp; Then later on, I'm
 going to find the minimum brightness for those vertical stripes.&lt;br /&gt;
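&lt;br /&gt;
A rough sketch of that stripe search in plain Ruby, assuming the pixels have already been reduced to a grid of brightness numbers (in the talk they are pulled out of the image with RMagick). The sample grid and the names &lt;code&gt;column_brightness&lt;/code&gt; and &lt;code&gt;darkest_column&lt;/code&gt; are made up for illustration and are not the code on the slide:&lt;br /&gt;
&lt;pre&gt;
```ruby
# rows is an array of pixel rows; each row holds one brightness value
# per pixel (0 = black). Plain integers stand in for real pixel data
# so that the sketch runs without any gem installed.
def column_brightness(rows)
  # transpose turns rows into columns, so each sum is one vertical stripe
  rows.transpose.map { |column| column.sum }
end

# Returns the x position of the stripe with the lowest total brightness.
def darkest_column(rows)
  totals = column_brightness(rows)
  totals.each_with_index.min_by { |total, _x| total }.last
end

# A tiny made-up "scan": the dark gutter between the pages is column 2.
page = [
  [250, 240, 30, 245, 250],
  [248, 235, 20, 240, 251],
  [252, 238, 25, 242, 249]
]

puts darkest_column(page)
```
&lt;/pre&gt;
On a real scan the image should be deskewed first, as described above, so that the spine actually lines up with a single column of pixels.&lt;br /&gt;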
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-8DnZQ22qGEw/UBAXVwJaNRI/AAAAAAAAASs/ZMuqvQUJG58/s1600/Slide13.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-8DnZQ22qGEw/UBAXVwJaNRI/AAAAAAAAASs/ZMuqvQUJG58/s320/Slide13.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
If I do this on some of the files and indicate that stripe by a red line--which I hope you can see--it did pretty well on this!&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-k3Z4HK-9mU0/UBAXZULsp2I/AAAAAAAAAS0/2Pk5QsYWRVM/s1600/Slide14.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-k3Z4HK-9mU0/UBAXZULsp2I/AAAAAAAAAS0/2Pk5QsYWRVM/s320/Slide14.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&amp;nbsp;It does well on this awful scan of a microfilm, although the red line is hard to see at this resolution.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-PvTUYLWg86Q/UBAXgCzzFeI/AAAAAAAAAS8/X_AfU72WYsw/s1600/Slide15.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-PvTUYLWg86Q/UBAXgCzzFeI/AAAAAAAAAS8/X_AfU72WYsw/s320/Slide15.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&amp;nbsp;And wow, it does great on this piece!&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-fkAwXPOUdWE/UBAXjdxBTjI/AAAAAAAAATE/AXTEmTOQYAU/s1600/Slide16.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-fkAwXPOUdWE/UBAXjdxBTjI/AAAAAAAAATE/AXTEmTOQYAU/s320/Slide16.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
This is just an example of using RMagick to solve my problems.&amp;nbsp; After
 I've gotten the line I want to &lt;a href="http://www.imagemagick.org/RMagick/doc/image1.html#crop"&gt;crop &lt;/a&gt;it, so left page/right page.&lt;br /&gt;
&lt;br /&gt;
The
 only caveat that I'd give you about RMagick is that I find it 
necessary to call GC.start a lot -- at least I did in Ruby 1 6 [ed: 
1.8.6] -- because RMagick--I don't know, man--because it swaps out the 
garbage collector or something and you run out of memory really fast.&lt;br /&gt;
&lt;br /&gt;
RMagick: I love it!&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/9vuJEIU6ZeA" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2012/07/rmagick-lightning-talk-at-austin-on.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://3.bp.blogspot.com/-7ROXZ3p0Orw/UBAXE6-M6FI/AAAAAAAAASM/eaChLZEvOOo/s72-c/Slide1.PNG" height="72" width="72" /><thr:total>3</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-6542935397722814956</guid><pubDate>Thu, 31 May 2012 20:29:00 +0000</pubDate><atom:updated>2013-02-26T19:57:02.092-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">crowdsourcing</category><title>Survey on Crowdsourced Transcription Tools</title><description>Via Twitter and the &lt;a href="http://tei-l.970651.n3.nabble.com/Survey-on-Transcription-Tools-and-Digital-Editions-td4014814.html"&gt;TEI-L mailing list&lt;/a&gt;, I see that Jens Brokfeld, a graduate student in Potsdam, is &lt;a href="http://hsozkult.geschichte.hu-berlin.de/forum/type=nachrichten&amp;amp;id=1779?utm_source=dlvr.it&amp;amp;utm_medium=twitter"&gt;conducting a survey&lt;/a&gt; of projects using transcription tools for his thesis, “Creating Digital Editions with Crowdsourced Manuscript Transcription: A Tool Evaluation”.&amp;nbsp; I encourage readers of this blog to take the survey and help advance the field.&lt;br /&gt;
&lt;blockquote class="tr_bq"&gt;
In case that your organisation (archives, libraries, scientific institutions) works on projects for digital editions or may do so in the future, I would be very grateful for your answers to my survey questions. Please forward this e-mail to anyone working with digital editions and transcription tools.

The survey will take 15-20 minutes.&lt;br /&gt;
&lt;br /&gt;
&lt;div class="MsoNormal"&gt;
Please click on the following link to complete the survey: &lt;span lang="EN-US" style="mso-ansi-language: EN-US;"&gt;&lt;a class="moz-txt-link-freetext" href="https://www.surveymonkey.com/s/survey_transcription_tools" rel="nofollow" target="_top"&gt;https://www.surveymonkey.com/s/survey_transcription_tools&lt;/a&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/blockquote&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/u9kTR6PAjvM" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2012/05/survey-on-crowdsourced-transcription.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-8366263997716504002</guid><pubDate>Fri, 25 May 2012 15:45:00 +0000</pubDate><atom:updated>2013-02-26T20:01:52.659-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">presentations</category><category domain="http://www.blogger.com/atom/ns#">crowdsourcing</category><title>Transcription Tools at TCDL 2012</title><description>Yesterday I presented a guide to choosing software for crowdsourced manuscript transcription at the Texas Conference on Digital Libraries.  Here are the slides from that talk:&lt;br /&gt;
&lt;iframe allowfullscreen="true" frameborder="0" height="389" mozallowfullscreen="true" src="https://docs.google.com/presentation/embed?id=1Zlf3i5o2c-HNlhL-QfO2-9TmmCHhDPl0nwB6ZPzNzt0&amp;amp;start=false&amp;amp;loop=false&amp;amp;delayms=3000" webkitallowfullscreen="true" width="480"&gt;&lt;/iframe&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/t3CgzHurbck" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2012/05/transcription-tools-at-tcdl2012.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-4469496210545595842</guid><pubDate>Thu, 03 May 2012 02:24:00 +0000</pubDate><atom:updated>2013-02-26T20:02:52.128-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">client projects</category><title>Bending Regular Expressions to Express Uncertainty</title><description>&lt;i&gt;This post expands upon an email I sent to &lt;a href="http://graysoftinc.com/"&gt;James Edward Gray II&lt;/a&gt;, my mentor in regular expressions, whom I thank for his generosity.&lt;/i&gt; &lt;br /&gt;
&lt;br /&gt;
For the last few months--ever since I started conducting regular
expression workshops at THATCamps--I've been thinking about using
regular expressions to represent uncertainty.  The great power of the
regex is its ability to define context, boundaries, and precision over
what is essentially unknown: &lt;code&gt;/A[BC]/&lt;/code&gt; means "I'm looking for an &lt;code&gt;A&lt;/code&gt;,
followed by either a &lt;code&gt;B&lt;/code&gt; or a &lt;code&gt;C&lt;/code&gt; -- I don't know which, but if you
see either of those, that's the text."&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-XbsocUX3QPw/T6HeHwGhWAI/AAAAAAAAARY/IAEvYf3yUqg/s1600/hw1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="28" src="http://4.bp.blogspot.com/-XbsocUX3QPw/T6HeHwGhWAI/AAAAAAAAARY/IAEvYf3yUqg/s320/hw1.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
I build tools for transcribing handwritten text -- some of it very
old, some of it illegible, some of it damaged by fire or water.
Representing uncertainty is a common problem within that domain, and
it's important for a couple of reasons:&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Usually the text will be presented out of context -- the ASCII (or
Unicode) characters representing the underlying text will be separated
from the image containing the underlying text.&lt;/li&gt;
&lt;li&gt;Even when the images are available, the person doing the
transcription is far more skilled at deciphering a particular cursive
hand than their readers are likely to be.  Take someone trained in
16th century paleography, who has spent weeks working on a particular
author's handwriting -- their opinion on whether a scribble is an "f"
or an "s" is going to be worth more than that of a casual researcher
who encounters that handwriting in a single record of search results.
We need to pass that opinion from the expert to the reader.&lt;/li&gt;
&lt;/ol&gt;
&lt;br /&gt;
While otherwise rigorous, the methods professional documentary
editors use for recording uncertain readings are pretty shabby --
suited to print editions, they're often concise but imprecise: the
Emerson Journals display missing text via &lt;code&gt;||&amp;nbsp;.&amp;nbsp;.&amp;nbsp;.&amp;nbsp;||&lt;/code&gt;, with "three
dots representing one to five words; four dots, six to ten words; and
five dots, sixteen to thirty words".  Another example uses &lt;code&gt;[&amp;nbsp;.&amp;nbsp;.&amp;nbsp;.&amp;nbsp;]&lt;/code&gt;,
with each period representing an illegible letter.  There is no
convention for expressing "I'm sure this is either an 'a' or a 'u',
but I can't be certain which one." (Kline and Perdue, &lt;i&gt;A Guide to
Documentary Editing, Third Edition&lt;/i&gt;)&lt;br /&gt;
&lt;a href="http://4.bp.blogspot.com/-S4g9VltvUFc/T6HeVg77oXI/AAAAAAAAAR4/odrQAYJEk_E/s1600/mary_daughter_of_james_and_judith_half_jan_20.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="25" src="http://4.bp.blogspot.com/-S4g9VltvUFc/T6HeVg77oXI/AAAAAAAAAR4/odrQAYJEk_E/s320/mary_daughter_of_james_and_judith_half_jan_20.png" width="320" /&gt;&lt;/a&gt;&lt;br /&gt;
This is where the notation regular expressions use for their search
patterns comes in.  It seems like a perfect fit to record
&lt;code&gt;Br[au]mfield&lt;/code&gt; when the user can't tell whether a name is "Brumfield"
or "Bramfield" but is certain that it's not, say "Bremfield".  And
happily, FreeUKGen is doing exactly this -- they've created a
regex-inspired &lt;b&gt;Uncertain Character Format&lt;/b&gt; (UCF) for their volunteers to
use when they're not quite able to make out text:&lt;br /&gt;
&lt;blockquote&gt;
&lt;table border="1"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td valign="TOP"&gt;&lt;code&gt;_&lt;/code&gt;&amp;nbsp;(Underscore)&lt;/td&gt;
&lt;td&gt;A single uncertain character. It could be anything but is definitely one character. It can be repeated for each uncertain character.
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td valign="TOP"&gt;&lt;code&gt;*&lt;/code&gt; (Asterisk)&lt;/td&gt;
&lt;td&gt;Several adjacent uncertain characters. A single &lt;code&gt;*&lt;/code&gt; is used when there are 1 or more adjacent uncertain characters. It is not used immediately before or after a &lt;code&gt;_&lt;/code&gt; or another &lt;code&gt;*&lt;/code&gt;. 
&lt;br /&gt;
Note: If it is clear there is a space, then &lt;code&gt;* *&lt;/code&gt; is used to represent 2 words, neither of which can be read.
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td valign="TOP"&gt;&lt;code&gt;[&lt;i&gt;abc&lt;/i&gt;]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;A single character that could be any one of the contained characters and only those characters. There must be at least two characters between the brackets. &lt;br /&gt;
For example, &lt;code&gt;[79]&lt;/code&gt; would mean either a &lt;code&gt;7&lt;/code&gt; or a &lt;code&gt;9&lt;/code&gt;, whereas &lt;code&gt;[C_]&lt;/code&gt; would mean a &lt;code&gt;C&lt;/code&gt; or some other character.
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td valign="TOP"&gt;&lt;code&gt;{&lt;i&gt;min&lt;/i&gt;,&lt;i&gt;max&lt;/i&gt;}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Repeat count - the preceding character occurs somewhere between &lt;i&gt;min&lt;/i&gt; and &lt;i&gt;max&lt;/i&gt; times. &lt;i&gt;max&lt;/i&gt; may be omitted, meaning
 there is no upper limit. So &lt;code&gt;_{1,}&lt;/code&gt; would be equivalent to &lt;code&gt;*&lt;/code&gt;, and &lt;code&gt;_{0,1}&lt;/code&gt; means that it is unclear if there
 is any character.
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td valign="TOP"&gt;&lt;code&gt;?&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sometimes you will have the situation where all of the characters have been read but you remain uncertain of the word. In this case append a ? at the end of the word
e.g. RACHARD? The most frequent place where a ? is used is with transcription that have been donated from other systems and are being converted for entry into FreeREG.
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/blockquote&gt;
And their volunteers are actually using UCF a lot -- &lt;a href="http://www2.freebmd.org.uk/cgi/show-file.pl?user=hawkeye&amp;amp;file=1843B3B0190"&gt;here&lt;/a&gt;'s a file
with a minimal-but-effective example, and &lt;a href="http://www.freebmd.org.uk/UCFusage.html"&gt;here&lt;/a&gt;'s a list of files with a high incidence of records containing UCF.&lt;br /&gt;
&lt;br /&gt;
One of the few big differences between UCF notation and regular expression notation is the use of underscore (&lt;code&gt;_&lt;/code&gt;) for "I'm not sure what this character is". In a sense, this is equivalent to the regex &lt;code&gt;.&lt;/code&gt; character, but in practice that's not how it's used. A pattern like &lt;code&gt;/(?:i|.)/&lt;/code&gt; makes no sense in regular expressions: "either &lt;code&gt;'i'&lt;/code&gt; or any character" is turned into the logical set of &lt;code&gt;all characters&lt;/code&gt; plus the set of &lt;code&gt;{&amp;nbsp;'i'&amp;nbsp;}&lt;/code&gt;, the union of which is the same as &lt;code&gt;the set of all characters&lt;/code&gt;.&amp;nbsp; As a result, in regular expressions the &lt;code&gt;'i'&lt;/code&gt; is redundant in &lt;code&gt;(?:i|.)&lt;/code&gt;.&amp;nbsp; (Inside brackets a regex &lt;code&gt;.&lt;/code&gt; is just a literal period, so &lt;code&gt;/[i.]/&lt;/code&gt; can't even express the idea.)&amp;nbsp; However, that's not how they use &lt;code&gt;_&lt;/code&gt; in UCF.&amp;nbsp; &lt;code&gt;[i_]&lt;/code&gt; means "I think this character is an &lt;code&gt;'i'&lt;/code&gt;, but I'm not really sure."&amp;nbsp; That statement is not the same thing as "I don't know what this character is" -- not at all!&amp;nbsp; &lt;br /&gt;
&lt;br /&gt;
So, cool!  We've got a notation for describing uncertainty inspired by regular expressions.  Problem solved, right?  Well, not quite.  While FreeUKGen's
UCF represents uncertain readings successfully, I think, there are
still a couple of issues to iron out.&lt;br /&gt;
&lt;br /&gt;
The first one of these is displaying the data -- I won't go too far
into this, as UI is not really my forte, but it seems like we might be
able to represent notations of the form "[a_]" by using a different
font weight.  I have no idea what we'll do about a "Br[au]mfield",
though.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-8V5KH22uv9s/T6HeQex9POI/AAAAAAAAARg/WBnTwdQ_bmo/s1600/isabella_daughter_of_henry_osborne_and_elizabeth_elener.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="51" src="http://2.bp.blogspot.com/-8V5KH22uv9s/T6HeQex9POI/AAAAAAAAARg/WBnTwdQ_bmo/s320/isabella_daughter_of_henry_osborne_and_elizabeth_elener.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
The second issue is in searching the data.  No problem, right?
Regular expressions are designed for searching!  Well, sort of.  In
this case, we expect end users (primarily genealogy researchers) to be typing in
precise search strings like "Brumfield" which they expect to match
against the regular expression &lt;code&gt;/Br[au]mfield/&lt;/code&gt;.  This wouldn't be a
problem if we only had a handful of records -- we'd convert the UCF in
each transcription into its equivalent regular expression, then iterate through each
record, matching it against the user-entered search string.
Unfortunately this approach might take a while on a database
containing hundreds of millions of records.&lt;br /&gt;
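&lt;br /&gt;
A minimal sketch of that convert-and-iterate approach, assuming the UCF rules quoted in the table above. &lt;code&gt;ucf_to_regexp&lt;/code&gt; and &lt;code&gt;search&lt;/code&gt; are hypothetical names rather than FreeUKGen code, and treating &lt;code&gt;[C_]&lt;/code&gt; as "any single character" and matching case-insensitively are assumptions:&lt;br /&gt;
&lt;pre&gt;
```ruby
# Translate a UCF string into an anchored, case-insensitive Regexp.
def ucf_to_regexp(ucf)
  body = ucf.sub(/\?\z/, '') # a trailing ? doubts the whole word; drop it
  tokens = body.scan(/\[[^\]]*\]|\{\d+,\d*\}|./)
  source = tokens.map do |token|
    if token.length != 1 and token.start_with?('[')
      inner = token[1..-2]
      # [C_] reads "C or some other character", so anything can match
      inner.include?('_') ? '.' : "[#{Regexp.escape(inner)}]"
    elsif token.match?(/\A\{\d+,\d*\}\z/)
      token # repeat counts use the same notation in both systems
    elsif token == '_'
      '.'   # exactly one unknown character
    elsif token == '*'
      '.+'  # one or more unknown characters
    else
      Regexp.escape(token)
    end
  end.join
  Regexp.new("\\A#{source}\\z", Regexp::IGNORECASE)
end

# The brute-force search: test the query against every record in turn.
def search(records, query)
  records.select { |ucf| ucf_to_regexp(ucf).match?(query) }
end

records = ['Br[au]mfield', '[J_]ane', 'Sm_th', 'Jones']
p search(records, 'Brumfield')
```
&lt;/pre&gt;
Each search compiles and tests one regular expression per record, which is fine for a handful of records and nothing more.&lt;br /&gt;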
&lt;br /&gt;
The problem with searching regular expressions is that you can't index
them.  So far as I'm aware, it's theoretically impossible to shove a
working finite state machine into a B+-tree.  What you can do, however,
is index permutations of UCF -- if a first name is &lt;code&gt;[_J]ane&lt;/code&gt;, you can
at least index &lt;code&gt;Jane&lt;/code&gt; so that a search for "Jane" will find that
record.  You can also permute &lt;code&gt;/Br[au]mfield/&lt;/code&gt; into both &lt;code&gt;"Brumfield"&lt;/code&gt; and
&lt;code&gt;"Bramfield"&lt;/code&gt; and index them each, so that a search on either string
will find the &lt;code&gt;/Br[au]mfield/&lt;/code&gt; record.  This is an incomplete solution, in
that its results will differ from the aforementioned, logically
correct approach of applying each regex against the search string.
However, it might be just adequate for the most common cases.&lt;br /&gt;
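&lt;br /&gt;
The permutation idea can be sketched as follows, with a plain Ruby hash standing in for the database index. &lt;code&gt;permutations&lt;/code&gt; and &lt;code&gt;build_index&lt;/code&gt; are hypothetical helpers, and leaving out any record that contains &lt;code&gt;_&lt;/code&gt;, &lt;code&gt;*&lt;/code&gt; or a repeat count outside brackets is an assumption made to keep the example short:&lt;br /&gt;
&lt;pre&gt;
```ruby
# Expand bracketed alternatives into every concrete spelling.
# Returns [] when the UCF cannot be enumerated (wildcards, repeat counts).
def permutations(ucf)
  tokens = ucf.sub(/\?\z/, '').scan(/\[[^\]]*\]|./)
  choices = tokens.map do |token|
    # a bracket group offers its letters; the "not sure" underscore is dropped
    token.length == 1 ? [token] : token[1..-2].delete('_').chars
  end
  return [] if choices.empty?
  return [] if choices.any? { |options| options.empty? || options.any? { |ch| '_*{}'.include?(ch) } }
  first, *rest = choices
  first.product(*rest).map { |letters| letters.join }
end

# Map each lower-cased spelling to the UCF records it came from.
def build_index(records)
  index = Hash.new { |hash, key| hash[key] = [] }
  records.each do |ucf|
    permutations(ucf).each { |spelling| index[spelling.downcase].push(ucf) }
  end
  index
end

index = build_index(['Br[au]mfield', '[_J]ane', 'Sm*th'])
p index.fetch('brumfield', [])
p index.fetch('jane', [])
```
&lt;/pre&gt;
A lookup is now a single probe, but &lt;code&gt;Sm*th&lt;/code&gt; never enters the index and a search for "Zane" no longer finds &lt;code&gt;[_J]ane&lt;/code&gt;, which is exactly where the results part ways with the regex-per-record approach.&lt;br /&gt;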
&lt;br /&gt;
After writing this and reading James's response, I started thinking more about my options.&amp;nbsp; One of these is a parallelized brute-force approach.&amp;nbsp; Why &lt;i&gt;can't&lt;/i&gt; I match each regex in the database against a search string?&amp;nbsp; After all, we're talking about fewer than a billion records, and asking &lt;code&gt;does X match Y&lt;/code&gt; is the sort of thing that is easily parallelized.&amp;nbsp; &lt;i&gt;O brave new world, that has such infrastructure!&lt;/i&gt;  I'm hesitant to go down this path, but I may be missing something -- perhaps some Hadoopy, Erlangy, Map-Reducey algorithm is cheap, easy, and presents the simplest solution to the problem?&amp;nbsp; Any other option is really an approximation to "correct", so it would be a shame to rule this out because of my own lack of experience.&lt;br /&gt;
&lt;br /&gt;
Another approach might be to categorize each kind of UCF expression.  Based on my limited research so far, it appears that the majority of the UCF in the existing transcripts falls into either the "completely unknown" category of &lt;code&gt;"*"&lt;/code&gt; for an entire field, or the nuanced "I think this is a &lt;code&gt;J&lt;/code&gt; but I'm not sure" category represented by &lt;code&gt;[J_]&lt;/code&gt;.&amp;nbsp; We will likely have to handle the former gingerly no matter what we do -- if a surname is totally illegible in the manuscript, the search engine will have to rely on other fields.&amp;nbsp; The latter expression could be approximated by &lt;code&gt;"J"&lt;/code&gt;, which would match precise searches well provided the transcriber actually has the greatest possible expertise.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-ty1jTHawjsk/T6HeU2jU5LI/AAAAAAAAARw/IlwMlNXOVSQ/s1600/margaret_mayton_d_of_thomas_1_october.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="18" src="http://1.bp.blogspot.com/-ty1jTHawjsk/T6HeU2jU5LI/AAAAAAAAARw/IlwMlNXOVSQ/s320/margaret_mayton_d_of_thomas_1_october.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Moving into what I expect are rarer cases, expressions like &lt;code&gt;/Br[au]mfield/&lt;/code&gt; would work well with the permutation treatment I outlined above. If the system already supports &lt;code&gt;begins-with&lt;/code&gt; and &lt;code&gt;ends-with&lt;/code&gt; searches, we should be able to index &lt;code&gt;"Brumf*"&lt;/code&gt; as well as &lt;code&gt;"*field"&lt;/code&gt;.&amp;nbsp; In fact, we might even be able to index a single, infixed wildcard like &lt;code&gt;"Br*ld"&lt;/code&gt; by doing both &lt;code&gt;begins-with&lt;/code&gt; and &lt;code&gt;ends-with&lt;/code&gt; searches with a combination of cleverness and hackery.&amp;nbsp; This leaves some smaller number of true regular expression-equivalent UCF-encoded records like "Ca_{1,2}s*d" to deal with.&amp;nbsp; It's possible that this represents such a small sample that the system could actually apply each record containing an irreducible regex to every search, whether via big parallelization or a long loop.&lt;br /&gt;
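&lt;br /&gt;
One way to picture that triage, as a sketch only: the category names, the &lt;code&gt;classify_ucf&lt;/code&gt; and &lt;code&gt;infix_match?&lt;/code&gt; helpers, and the decision to call any mix of brackets and &lt;code&gt;*&lt;/code&gt; irreducible are all assumptions for illustration, not an existing FreeUKGen design:&lt;br /&gt;
&lt;pre&gt;
```ruby
GROUP = /\[[^\]]*\]/

# Sort a UCF string into the search strategy it could be served by.
def classify_ucf(ucf)
  ucf = ucf.sub(/\?\z/, '') # whole-word doubt does not change the spelling
  return :unknown if ucf == '*'
  groups = ucf.scan(GROUP)
  rest = ucf.gsub(GROUP, '')
  stars = rest.count('*')
  return :irreducible if rest.match?(/[_{}?\[\]]/) || !stars.between?(0, 1)
  if stars == 1
    return :irreducible unless groups.empty?
    return :ends_with if ucf.start_with?('*')
    return :begins_with if ucf.end_with?('*')
    return :infix
  end
  return :plain if groups.empty?
  groups.any? { |group| group.include?('_') } ? :best_guess : :permutable
end

# A single infixed wildcard is a begins-with test plus an ends-with test,
# with at least one character left over for the * itself.
def infix_match?(ucf, query)
  prefix, suffix = ucf.split('*', 2)
  middle = query.length - prefix.length - suffix.length
  middle.positive? and query.start_with?(prefix) and query.end_with?(suffix)
end

['Jane', '*', '[J_]ane', 'Br[au]mfield', 'Brumf*', '*field', 'Br*ld', 'Ca_{1,2}s*d'].each do |ucf|
  puts "#{ucf}: #{classify_ucf(ucf)}"
end
puts infix_match?('Br*ld', 'Brumfield')
```
&lt;/pre&gt;
Only the records that land in the irreducible bucket would then need the regex-per-record treatment.&lt;br /&gt;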
&lt;br /&gt;
Yet another approach is one that I understand to be deployed on the FreeUKGen databases now -- lossy comparison that reduces UCF to searchable data.&amp;nbsp; For example, the venerable Soundex algorithm begins by stripping non-alphabetic data from a record, converting &lt;code&gt;/[J_]ane/&lt;/code&gt; to &lt;code&gt;"Jane"&lt;/code&gt;.&amp;nbsp; The uncertainty recorded by the transcriber fades into the larger fog of the search algorithm.&amp;nbsp; I'm just as uncomfortable with this methodology as I am with the permutation of &lt;code&gt;/[J_]ane/&lt;/code&gt; to &lt;code&gt;"Jane"&lt;/code&gt; I described above.&amp;nbsp; I suspect that my discomfort is due to simply not knowing what the correct behavior is when a user is searching on &lt;code&gt;/[J_]ane/&lt;/code&gt; -- I know that "Jane" should match, but am not entirely sure whether "Zane" should match the record.&lt;br /&gt;
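&lt;br /&gt;
To make the loss concrete, here is a sketch of the textbook American Soundex. It is not necessarily the variant that FreeUKGen runs, and the &lt;code&gt;soundex&lt;/code&gt; method name is just a label for this example:&lt;br /&gt;
&lt;pre&gt;
```ruby
# Letter groups that share a Soundex digit (1 through 6, in this order).
CODES = {}
%w[BFPV CGJKQSXZ DT L MN R].each_with_index do |letters, i|
  letters.each_char { |ch| CODES[ch] = (i + 1).to_s }
end

def soundex(text)
  # Stripping non-letters is the step that silently discards UCF markup.
  letters = text.upcase.gsub(/[^A-Z]/, '')
  return '' if letters.empty?
  digits = []
  previous = CODES[letters[0]]
  letters[1..-1].each_char do |ch|
    code = CODES[ch]
    digits.push(code) if code and code != previous
    # H and W do not separate letters that share a code; vowels do
    previous = code unless 'HW'.include?(ch)
  end
  (letters[0] + digits.join).ljust(4, '0')[0, 4]
end

puts soundex('[J_]ane')
puts soundex('Jane')
puts soundex('Zane')
```
&lt;/pre&gt;
The first two calls return the same code, so the transcriber's doubt is gone by the time the comparison happens, while "Zane" gets a different code and can no longer match the record.&lt;br /&gt;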
&lt;br /&gt;
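The lossy reduction can be sketched in a few lines of Python.&amp;nbsp; This is a plain implementation of the standard American Soundex rules written for illustration, not the code running on any FreeUKGen database; the function name is my own.&amp;nbsp; The first step discards everything that is not a letter, which is exactly where the transcriber's recorded uncertainty disappears.&lt;br /&gt;

```python
import re

# Standard Soundex consonant groups; vowels, h, w and y carry no code.
_CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        _CODES[letter] = str(digit)


def soundex(raw):
    # Lossy step: drop every non-letter, so "[J_]ane" becomes "jane".
    name = re.sub(r"[^a-z]", "", raw.lower())
    if not name:
        return ""
    encoded = name[0].upper()
    previous = _CODES.get(name[0], "")
    for letter in name[1:]:
        code = _CODES.get(letter, "")
        if code and code != previous:
            encoded += code
        if letter not in "hw":
            # Vowels reset the previous code; h and w do not.
            previous = code
    # Pad with zeros and keep one letter plus three digits.
    return (encoded + "000")[:4]
```

Under this sketch, both "Jane" and "[J_]ane" reduce to J500, while "Zane" reduces to Z500, so a Soundex comparison alone would not match "Zane" against the uncertain record.&lt;br /&gt;
&lt;br /&gt;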
Perhaps the right approach is a hybrid -- use tricks with database indexing for the majority of cases--which don't involve any UCF at all--provide Soundex and Metaphone in transparent ways, and shove the irreducible regular expressions into a spot where they can be processed cheaply.&amp;nbsp; But I really don't know, and I don't anticipate knowing for months yet.&amp;nbsp; Of course, if you happen to have done this before, I'd love to know how.&amp;nbsp; I'm heading to bed, expecting dreams which revolve around &lt;code&gt;gem&amp;nbsp;install&amp;nbsp;index-fsm&lt;/code&gt;.&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/wMSWgZjh0_I" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2012/05/this-post-expands-upon-email-i-sent-to.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://4.bp.blogspot.com/-XbsocUX3QPw/T6HeHwGhWAI/AAAAAAAAARY/IAEvYf3yUqg/s72-c/hw1.png" height="72" width="72" /><thr:total>3</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-5330155657656507142</guid><pubDate>Wed, 11 Apr 2012 10:34:00 +0000</pubDate><atom:updated>2013-02-26T20:03:30.527-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">crowdsourcing</category><category domain="http://www.blogger.com/atom/ns#">similar projects</category><title>Crowdsourced Transcription Tool List</title><description>When I first started this blog, I spent a lot of time writing detailed reviews of different transcription projects.&amp;nbsp; This has become difficult as my available time shrinks and the number of crowdsourcing projects grows.&amp;nbsp; So when Kate Bowers &lt;a href="http://forums.archivists.org/read/messages?id=77267"&gt;posted&lt;/a&gt; to the Society of American Archivists mailing list asking for a directory of transcription tools, I figured it was time to take 
a different approach.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;&lt;a href="http://tinyurl.com/TranscriptionToolGDoc"&gt;http://tinyurl.com/TranscriptionToolGDoc&lt;/a&gt;&lt;/pre&gt;
&lt;pre&gt;&amp;nbsp;&lt;/pre&gt;
The link above is a Google Documents spreadsheet listing different tools and the features I thought were relevant.  It's been updated several times over the last few weeks, and I'm pleased to see that it's expanded to include a score of technologies.  I hope it's useful.&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/_e_mPm1DT2Y" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2012/04/crowdsourced-transcription-tool-list.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-6719879724761370193</guid><pubDate>Wed, 04 Apr 2012 17:47:00 +0000</pubDate><atom:updated>2013-02-26T20:04:19.871-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">crowdsourcing</category><category domain="http://www.blogger.com/atom/ns#">mediawiki</category><category domain="http://www.blogger.com/atom/ns#">similar projects</category><title>French Departmental Archive on Wikisource</title><description>While the transcription world buzzes with news of the release of the 1940 US census and the crowdsourced transcription projects that surround it, I'd like to draw your attention to a blog post published last week on La Tribune des Archives: &lt;a href="http://latribunedesarchives.blogspot.com/2012/02/edition-collaborative-de-manuscrits-sur.html"&gt;"Edition collaborative de manuscrits sur Wikisource : 1er retour d'expérience"&lt;/a&gt;.&amp;nbsp; The post covers the efforts of the &lt;a href="http://www.cg06.fr/fr/decouvrir-les-am/decouverte-du-patrimoine/les-archives-departementales/les-archives-departementales/"&gt;archives&lt;/a&gt; of the department of &lt;a href="http://en.wikipedia.org/wiki/Alpes-Maritimes"&gt;Alpes-Maritimes&lt;/a&gt; to transcribe 17th- and 18th-century records of episcopal visits to the &lt;i&gt;communes&lt;/i&gt; in the diocese.&amp;nbsp; These 
records are rich sources on local history, but "readers struggle over the chicken-scratch, and the collection is too large to be edited by a single person."&amp;nbsp; The archive has used Wikisource.fr to transcribe these manuscripts with great success, so I'd like to quote and translate extensive portions of their post.&lt;br /&gt;
&lt;blockquote class="tr_bq"&gt;
&lt;h3&gt;
Why Wikisource?&lt;/h3&gt;
&lt;b&gt;It's already there&lt;/b&gt;! (No software to create, maintain, administer, no specs -- just a strong will and a core of 2-6 people).&lt;br /&gt;
&lt;b&gt;It offers features designed for manuscript editions requiring more than one editor.&lt;/b&gt;&lt;br /&gt;
Particularly useful functions (aside from the collaborative aspect):&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;Side-by-side display of facsimile and transcription&lt;/li&gt;
&lt;li&gt;Workflow indicating whether a page is transcribed, corrected, or validated by two administrators.&lt;/li&gt;
&lt;li&gt;The visualization is very practical for motivating the community of transcribers.&lt;/li&gt;
&lt;li&gt;Version history control and the ability to comment or discuss difficult issues.&lt;/li&gt;
&lt;li&gt;Wikisource's high Google page rank.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;br /&gt;
The article goes on to describe the factors they weighed when choosing material for the project (accessibility of the script and local interest, among others) and how they got started (the standard GLAMWiki approach), then moves on to the community management aspects I find so fascinating:&lt;br /&gt;
&lt;blockquote class="tr_bq"&gt;
&lt;h3&gt;

&lt;b&gt;How do you motivate your paleographers?&lt;/b&gt;&lt;/h3&gt;
&amp;nbsp;In our experience, transcribers are essentially former university students and internally-trained archivists who want to extend their education (either by making further progress or by avoiding becoming rusty). &lt;br /&gt;
&lt;b&gt;Work times and rest times clearly defined in advance.&lt;/b&gt;&lt;br /&gt;
A regular, fixed-date schedule defined in advance (for example, one month: upload on the 15th and correction every last day of the month) helps the group to make progress and to break up its efforts with relaxation periods (for the eyes, the editors, and the correctors) and lets everyone have rapid feedback (new pages are in fact corrected practically every night). &lt;br /&gt;
&lt;h3&gt;
Findings on the behavior of "students" on Wikisource&lt;/h3&gt;
The first exercises attracted the kind support of Wikisource regulars and administrators&amp;nbsp; (Adrienne Alix, SereinWMfr, 
Pyb, Hsarrazin), a few new registered paleographers (Cavalié, 
LINCK, Braxmeyer, Gustave) and some anonymous IPs.&amp;nbsp; One or two correctors can easily suffice to keep track of the work of 5-10 "students".&amp;nbsp; Contrary to homework done in class, the "students" apply themselves regularly to the task, and the size and number of contributions do not increase on the night before the deadline.&lt;br /&gt;
Writings dating from before 1660 receive fewer volunteers but could very well serve as university exercises graded online (at the rate of one page per student).&lt;/blockquote&gt;
For more on the archive's efforts (including their similar outreach on Flickr), take a look at the &lt;a href="http://www.cg06.fr/fr/decouvrir-les-am/decouverte-du-patrimoine/les-archives-departementales/actualite/actualite/"&gt;departmental archive news page&lt;/a&gt;.&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/am5lMuUHAb0" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2012/04/french-departmental-archive-on.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-2873205472651007898</guid><pubDate>Sat, 17 Mar 2012 22:15:00 +0000</pubDate><atom:updated>2013-02-26T20:52:06.487-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">presentations</category><category domain="http://www.blogger.com/atom/ns#">crowdsourcing</category><title>Crowdsourcing at IMLS WebWise 2012</title><description>The &lt;b&gt;&lt;a href="http://www.tvworldwide.com/events/webwise/120229/globe_show/default_go_archive.cfm?gsid=1971&amp;amp;type=flv&amp;amp;test=0&amp;amp;live=0"&gt;video of the crowdsourcing panel&lt;/a&gt;&lt;/b&gt; at IMLS WebWise is online, so I thought I'd post my talk.&amp;nbsp; Like anyone who's created a transcript of their own unscripted remarks, I recommend watching the video. (My bit starts at 6:00, though all the speakers were excellent. Full-screening will hide the slides.)&amp;nbsp; Nevertheless, I've added hyperlinks to the transcript and interpolated the slides with my comments below. &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-0PITR2uRYww/T2TblXDOvwI/AAAAAAAAALQ/hEIOLBd__IU/s1600/shrunk__40_Slide1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-0PITR2uRYww/T2TblXDOvwI/AAAAAAAAALQ/hEIOLBd__IU/s320/shrunk__40_Slide1.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Okay.  I'd like to talk about some of the lessons that have come out of 
my collaborations with small crowdsourcing projects.  We hear a lot about 
these large projects like &lt;a href="http://www.galaxyzoo.org/"&gt;GalaxyZoo&lt;/a&gt;, like &lt;a href="http://www.ucl.ac.uk/transcribe-bentham/"&gt;Transcribe Bentham&lt;/a&gt;. What can 
small institutions and small projects do, and do the rules that seem to apply to large projects also apply to them? &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-8Xsu-OZTGEU/T2TblpfM_9I/AAAAAAAAALY/p6_CjFWmnE0/s1600/shrunk__40_Slide2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-8Xsu-OZTGEU/T2TblpfM_9I/AAAAAAAAALY/p6_CjFWmnE0/s320/shrunk__40_Slide2.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So there are three projects that I'm drawing from here in this 
experience.  The first one I'm going to talk about in a little bit is 
one that was run by &lt;a href="http://www.bpoc.org/"&gt;Balboa Park Online Collaborative&lt;/a&gt;.  It's a &lt;a href="http://fromthepage.bpoc.org/collection/show?collection_id=1&amp;amp;ol=l_hd_c_link"&gt;Klauber Field Notes Transcription Project&lt;/a&gt;--the field 
notes of &lt;a href="http://en.wikipedia.org/wiki/Laurence_Monroe_Klauber"&gt;Laurence M. Klauber&lt;/a&gt;, who was the nation's foremost authority on
rattlesnakes.  These are field notes that he kept from 1923 through 
1967.  This is done by the San Diego Natural History Museum and is run 
by our own Perian Sully who is out there in the room somewhere.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-9YLNqDL6z1U/T2TbmmkqmTI/AAAAAAAAALg/gcu7Pj7f4I4/s1600/shrunk__40_Slide3.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-9YLNqDL6z1U/T2TbmmkqmTI/AAAAAAAAALg/gcu7Pj7f4I4/s320/shrunk__40_Slide3.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
The next project I want to talk about is the &lt;a href="http://beta.fromthepage.com/display/read_work?ol=c_work_list_work&amp;amp;work_id=18"&gt;Diary of Zenas Matthews&lt;/a&gt;.  
Zenas Matthews was a volunteer from Texas who served in the American 
forces in the US-Mexican War of 1846 and this diary is kept by 
&lt;a href="http://www.southwestern.edu/live/news/6475-collaborative-transcription-project"&gt;Southwestern University&lt;/a&gt;. This had 
been digitized for a previous researcher and is small, but Southwestern 
itself is also quite small.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-BCO0O1kUMFs/T2TbnAlgRSI/AAAAAAAAALo/8dRY2WSmHUc/s1600/shrunk__40_Slide4.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-BCO0O1kUMFs/T2TbnAlgRSI/AAAAAAAAALo/8dRY2WSmHUc/s320/shrunk__40_Slide4.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
The third project I want to talk about is actually the origin of the 
software, which is the &lt;a href="http://beta.fromthepage.com/JuliaBrumfield?ol=s_sp_diaries"&gt;Julia Brumfield Diaries&lt;/a&gt;.  If the name looks 
familiar, it's because she's my great-great grandmother.  This project 
was the impetus for me to develop &lt;a href="http://beta.fromthepage.com/?ol=l_hd_logo"&gt;this tool&lt;/a&gt; for crowdsourced 
transcription.&lt;br /&gt;
&lt;br /&gt;
So all of these projects, what they have in common is that we're talking 
about page counts that are in the thousands and volunteer counts that 
are numbered in the dozens at best.  So these are not &lt;a href="http://www.facebook.com/familysearchindexing?v=app_7146470109"&gt;FamilySearch Indexing&lt;/a&gt;, where you can rely on hundreds of thousands of volunteers and 
large networks.&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-VRT0p0gCJlA/T2TbnCkZ2eI/AAAAAAAAALw/67noASMSrY0/s1600/shrunk__40_Slide5.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-VRT0p0gCJlA/T2TbnCkZ2eI/AAAAAAAAALw/67noASMSrY0/s320/shrunk__40_Slide5.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So who participates in large projects and who participates in small 
projects?  One thing that I think is really interesting about 
crowdsourcing and these other sorts of participatory online communities 
is that the ratio of contributions to users follows what's called a 
power-law distribution.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-Ip721oTtgiw/T2TbnTShp3I/AAAAAAAAAL0/EgsnMD112wM/s1600/shrunk__40_Slide6.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-Ip721oTtgiw/T2TbnTShp3I/AAAAAAAAAL0/EgsnMD112wM/s320/shrunk__40_Slide6.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
If you look here, we see--and most famously this is Wikipedia--and you 
see a chart of the number of users on Wikipedia ranked by their 
contributions.  And what you see is that 90% of the edits made 
to Wikipedia are done by 10% of the users. &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-qDN9hz8oLWI/T2TbnlkAAoI/AAAAAAAAAMA/ZREqoPLpkAY/s1600/shrunk__40_Slide7.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-qDN9hz8oLWI/T2TbnlkAAoI/AAAAAAAAAMA/ZREqoPLpkAY/s320/shrunk__40_Slide7.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
If we look at other crowdsourced projects: &lt;a href="http://www.pwrc.usgs.gov/bpp/Charts2.cfm"&gt;this&lt;/a&gt; is the North American 
Bird Phenology Program out of Patuxent &lt;strike&gt;Bay&lt;/strike&gt; Research Center [ed: actually &lt;a href="http://www.pwrc.usgs.gov/bpp/"&gt;Patuxent Wildlife Research Center&lt;/a&gt;], and this is 
a project in which volunteers are transcribing ornithology 
records--basically bird-watching records--that were sent in from the 
&lt;strike&gt;1870s through the 1950s&lt;/strike&gt; [ed: 1880s-1970s], entering them into a database where they can be 
mined for climate change [data].  What's interesting about this to me at 
least is that--and this has been a phenomenally successful project: 
they've got 560,000 cards transcribed all by volunteers, but 
StellaW@Maine here has transcribed 126,000 of them, which is 22% of 
them.  Now, CharlotteC@Maryland is close behind her (so go, local team!) 
but again you see the same kind of curve.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-KujTykDLMYw/T2TboDH0LvI/AAAAAAAAAMI/tJ2e0QyWmOk/s1600/shrunk__40_Slide8.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-KujTykDLMYw/T2TboDH0LvI/AAAAAAAAAMI/tJ2e0QyWmOk/s320/shrunk__40_Slide8.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&amp;nbsp;If we look at another relatively large project, the Transcribe Bentham 
project: this isn't a graph, but if you look at the numbers here, you 
see the same kind of thing.  You see Diane with 78,000 points, you see 
Ben Pokowski with 51,000 points.  You see this curve sort of taper down 
into more of a long tail.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-Q2rRtUmA2Wg/T2TboZ0Gr4I/AAAAAAAAAMQ/XyDgKPLkxSo/s1600/shrunk__40_Slide9.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-Q2rRtUmA2Wg/T2TboZ0Gr4I/AAAAAAAAAMQ/XyDgKPLkxSo/s320/shrunk__40_Slide9.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&amp;nbsp;So what about the small projects?&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-xeBXuCjng-I/T2Tc_dRKXDI/AAAAAAAAAMY/FWVvimy180g/s1600/shrunk__40_Slide10.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-xeBXuCjng-I/T2Tc_dRKXDI/AAAAAAAAAMY/FWVvimy180g/s320/shrunk__40_Slide10.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Well, let's look at the Klauber 
diaries.  This is the top ten transcribers of the field notes of 
Laurence Klauber.  And if you look at the numbers here--again in this 
case it's not quite as pronounced because I think the previous leader 
has dropped out and other people have overtaken him--but you see the 
same kind of distribution.  This is not a linear progression; this is more of a power-law distribution.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-WTfLxYz19Zw/T2Tc_mr5ugI/AAAAAAAAAMg/CfDefkg4KkI/s1600/shrunk__40_Slide11.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-WTfLxYz19Zw/T2Tc_mr5ugI/AAAAAAAAAMg/CfDefkg4KkI/s320/shrunk__40_Slide11.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
If you look at an even smaller project--now, mind you this is a project 
that is really only of interest to members of my family and elderly 
neighbors of the diarist--but look: We've got Linda Tucker who has 
transcribed 713 of these pages followed by me and a few other people.  
But again, you have this power law that the majority of the work is being done by a very small group of people.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-9lSBL3BAkTg/T2Tc_zYUM8I/AAAAAAAAAMo/FdgKXGxQe7I/s1600/shrunk__40_Slide12.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-9lSBL3BAkTg/T2Tc_zYUM8I/AAAAAAAAAMo/FdgKXGxQe7I/s320/shrunk__40_Slide12.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Okay, what's going on really?  What does this mean and why does it 
matter?  The thing that I think this gets to, the reason that I think 
that this is important, is for a couple of reasons. &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-si6J9IjKNDw/T2TdALeVQiI/AAAAAAAAAMw/qZiK3_KyUD8/s1600/shrunk__40_Slide13.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-si6J9IjKNDw/T2TdALeVQiI/AAAAAAAAAMw/qZiK3_KyUD8/s320/shrunk__40_Slide13.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
One is that this kind of behavior addresses one of the main objections 
to crowdsourcing.  Now there are a lot of valid objections to 
crowdsourcing; I think that there are also a few invalid objections and 
one of them is essentially the idea that members of the public cannot 
participate in scholarly projects because my next door neighbor is 
neither capable nor interested in participating in scholarly projects.  
And we see this all over the place.  I mean, here's a few example 
quotes--and I'm not going to read them out.  I believe that this 
objection (which I have heard a number of times; I mean we see some 
examples right here) is a non sequitur. And I believe that the power-law
distribution proves that it's a non sequitur.  Really, I saw this most 
egregiously framed by a scholar who was passionately--just absolutely 
decrying--the idea that classical music fans would be able to 
competently translate from German into English because, he said, "After 
all, 40% of South Carolina voted for Newt Gingrich."  &lt;i&gt;Okay.&lt;/i&gt;&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-UmB-4UKFAik/T2TdAayK7NI/AAAAAAAAAM4/sCxjeLsej40/s1600/shrunk__40_Slide14.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-UmB-4UKFAik/T2TdAayK7NI/AAAAAAAAAM4/sCxjeLsej40/s320/shrunk__40_Slide14.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
All right, so what's going on is I think best &lt;a href="http://magistraetmater.blog.co.uk/2010/08/17/what-can-the-vulgus-do-crowd-sourcing-for-medievalists-9195007/"&gt;summed up by Rachel Stone&lt;/a&gt;, 
and what she essentially said is that crowdsourcing isn't getting the 
sort of random distribution from the crowd.  Crowdsourcing is getting a 
number of "well-informed enthusiasts."&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-qRbp2NHsg7Y/T2TdAkIDYCI/AAAAAAAAANA/-oEaDl9dKPE/s1600/shrunk__40_Slide15.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-qRbp2NHsg7Y/T2TdAkIDYCI/AAAAAAAAANA/-oEaDl9dKPE/s320/shrunk__40_Slide15.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So where do we find well-informed enthusiasts to do this work and to do 
it well?  Big projects have an advantage, right?  They have marketing 
budgets.  They have press coverage.  They have an existing user base.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-3gzXavI2rss/T2TdBAsjgzI/AAAAAAAAANI/TGENASUr1W4/s1600/shrunk__40_Slide16.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-3gzXavI2rss/T2TdBAsjgzI/AAAAAAAAANI/TGENASUr1W4/s320/shrunk__40_Slide16.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
If you ask the people at the Transcribe Bentham project how did they get 
their users, they'll say "Well, you know that New York Times article 
really helped."&amp;nbsp; That's cool!  All right.  &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-TOslnQ9H4ow/T2TdBbh0UHI/AAAAAAAAANQ/Q4imV8gt9ps/s1600/shrunk__40_Slide17.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-TOslnQ9H4ow/T2TdBbh0UHI/AAAAAAAAANQ/Q4imV8gt9ps/s320/shrunk__40_Slide17.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
The GalaxyZoo people--Citizen Science 
Alliance--yesterday, 24 hours ago, announced a new project, &lt;a href="http://setilive.org/"&gt;SETILive&lt;/a&gt;.  
Now what this does is it pulls in live data from the SETI satellites[sic: actually telescope], 
and in those 24 hours--I took this screenshot; I actually skipped lunch 
to get this one screenshot because I knew that it would pass 10,000 
people participating with 80,000 of these classifications.  And it would 
have been higher, except last night the telescope got covered by cloud 
cover.  So they dropped from getting 30 to 40 contributions per second 
to having to show sort of archival data and getting only 10 
contributions per second.  Well, they can do this because they have an 
existing base of active volunteers that numbers around 600,000.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-KY2UF6faYsE/T2TdBpOrkLI/AAAAAAAAANY/7VHLNBXcHsA/s1600/shrunk__40_Slide18.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-KY2UF6faYsE/T2TdBpOrkLI/AAAAAAAAANY/7VHLNBXcHsA/s320/shrunk__40_Slide18.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So how do WE do that?  How do we find well-informed enthusiasts?  This 
is something that &lt;a href="http://www.southwestern.edu/library/personnel.html"&gt;Kathryn Stallard and Anne Veerkamp-Andersen&lt;/a&gt; at 
Southwestern University Special Collections and I discussed a lot when 
we were trying to launch the Zenas Matthews Diary.  We said, "Well, we 
don't have any budget at all."  Kathryn said, "Well, let's talk about 
local archival newsletters.  Let's post to H-Net lists."  I was in favor 
of looking at online communities of people who might be doing Matthews 
genealogy or the military history war-gamers who have discussion forums 
on the Mexican War.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-Tk-GMpsrN4E/T2TdCJHWX-I/AAAAAAAAANg/xgXrDGr8jaI/s1600/shrunk__40_Slide19.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-Tk-GMpsrN4E/T2TdCJHWX-I/AAAAAAAAANg/xgXrDGr8jaI/s320/shrunk__40_Slide19.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
While we're arguing about this, Kathryn gets an email from a patron 
saying, "Hey, I'm a member of an organization.  We see that you have 
this document.  It relates to the Battle of San Jacinto and the Texas 
Revolution of 1836.  Can you send this to us?"  &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-Zo8aDESJJds/T2TfFLYokHI/AAAAAAAAANo/_fO5GSWRWzE/s1600/shrunk__40_Slide20.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-Zo8aDESJJds/T2TfFLYokHI/AAAAAAAAANo/_fO5GSWRWzE/s320/shrunk__40_Slide20.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
She responds saying, 
"Hey, 1846, great.  Check out this diary we just put online.  I think 
that's what you're talking about."&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-TP-9Gsbzs4c/T2TfFvTVi5I/AAAAAAAAANw/77Ft0VyMRfM/s1600/shrunk__40_Slide21.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-TP-9Gsbzs4c/T2TfFvTVi5I/AAAAAAAAANw/77Ft0VyMRfM/s320/shrunk__40_Slide21.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
Well that wasn't actually what he 
was talking about, but he responds and says, "Yeah, okay, I'll check 
that out, but can you please give me the document I want."  They get it 
back to him and we returned to our discussion of "Okay, what do we need 
to do to roll this out? We're going to start working on the information 
architecture.  We're going to work on the UI.  We're going to work on 
help screens."&amp;nbsp; And while we're having this conversation, Mr. Patrick checks it out.  
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-wB8Z0M_3hGY/T2TfGPQc6dI/AAAAAAAAAN4/HYy1v5Dof6c/s1600/shrunk__40_Slide22.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-wB8Z0M_3hGY/T2TfGPQc6dI/AAAAAAAAAN4/HYy1v5Dof6c/s320/shrunk__40_Slide22.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
And Scott Patrick starts transcribing. &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-nI2liOjzikI/T2TfGvbMyaI/AAAAAAAAAOA/X5xyS_DBTKc/s1600/shrunk__40_Slide23.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-nI2liOjzikI/T2TfGvbMyaI/AAAAAAAAAOA/X5xyS_DBTKc/s320/shrunk__40_Slide23.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
And he starts transcribing some 
more.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-fg0ZTqnpntQ/T2TfG39-h1I/AAAAAAAAAOI/a0uAR98js8g/s1600/shrunk__40_Slide24.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-fg0ZTqnpntQ/T2TfG39-h1I/AAAAAAAAAOI/a0uAR98js8g/s320/shrunk__40_Slide24.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&amp;nbsp; And he continues transcribing.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-LVk8X4cEKGY/T2TfHXFoMfI/AAAAAAAAAOQ/Fju73PLJ4TY/s1600/shrunk__40_Slide25.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-LVk8X4cEKGY/T2TfHXFoMfI/AAAAAAAAAOQ/Fju73PLJ4TY/s320/shrunk__40_Slide25.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&amp;nbsp;And at this point, we're talking 
about working on the wording of the help screens, the wording of our 
announcement trying to attract volunteers, and &lt;b&gt;this is page 43 of the 
43-page diary&lt;/b&gt;!&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-Ds4wtj_b06w/T2TfH70tpPI/AAAAAAAAAOY/kdFqBvJ7DRg/s1600/shrunk__40_Slide26.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-Ds4wtj_b06w/T2TfH70tpPI/AAAAAAAAAOY/kdFqBvJ7DRg/s320/shrunk__40_Slide26.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
And while we're discussing this, he goes back and he 
starts adding footnotes.  Look at this: he's identifying the people who 
are in this, saying, "Hey, this guy who is mentioned is -- here's what 
his later life was.  This other guy--hey, he's my first cousin, by the 
way, but he also left the governorship of the State of Texas to fight in 
this war."&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-WrkXQi_4vtU/T2TfIO44vUI/AAAAAAAAAOg/tW66ldkloh4/s1600/shrunk__40_Slide27.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-WrkXQi_4vtU/T2TfIO44vUI/AAAAAAAAAOg/tW66ldkloh4/s320/shrunk__40_Slide27.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
He sees--and believe me, in the actual original diary, 
Piloncillo is not spelled Piloncillo. I mean it is a -- Zenas Matthews 
does &lt;i&gt;not&lt;/i&gt; know Spanish, right?  He identifies this! He identifies and
looks up works that are mentioned here. &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-e2GD1tiZHm4/T2TfIX-hrgI/AAAAAAAAAOo/otumgjg2McM/s1600/shrunk__40_Slide28.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-e2GD1tiZHm4/T2TfIX-hrgI/AAAAAAAAAOo/otumgjg2McM/s320/shrunk__40_Slide28.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
So wow!  All right!  We got our well-informed enthusiast!  In 14 days, 
he transcribed the diary, and he didn't do just one pass.  I mean as he 
got familiar with the hand, he goes back and revises the earlier 
transcriptions.  He kind of figures out who's involved.  He asks other 
members of his heritage organization what this is.  He adds two dozen 
footnotes.&lt;br /&gt;
&lt;br /&gt;
What just happened?  What was that about?  Who is this guy?  Well, Scott 
Patrick is a retired petroleum worker who got interested in his family 
history, and then got interested in local history, and then got 
interested in heritage organizations.  And &lt;i&gt;he is our ideal "well-informed 
enthusiast"&lt;/i&gt;.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-FUNccbVrvAo/T2TfIn3p6KI/AAAAAAAAAOw/iB5M1HZ1v8g/s1600/shrunk__40_Slide29.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://4.bp.blogspot.com/-FUNccbVrvAo/T2TfIn3p6KI/AAAAAAAAAOw/iB5M1HZ1v8g/s320/shrunk__40_Slide29.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So how did we find him?  The project isn't public yet, right?  Our 
challenge now is rephrasing our public announcement.  We're now looking 
for volunteers to ... something that adequately describes what's left to 
do.  Well, let's go back and take a look at this original letter, 
right?  What we did is, we responded to an inquiry from a patron--and 
not an in-person patron: this is someone who lives 200 miles away from 
Georgetown, Texas.&lt;br /&gt;
&lt;br /&gt;
What you have when someone is coming in and asking about material is, if 
you think about this in terms of target marketing--this is a target-rich 
environment.  Here is someone who is interested.  He's online.  He's 
researching this particular subject.  He is not an existing patron.  He 
has no prior relationship with Southwestern University Libraries, but 
"Hey, while we answer your request, you might check this thing out 
that's in this related field."  That seems to have worked in this one 
case.  Hopefully, we'll get some more experience with future projects.&lt;br /&gt;
&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-r3v04JagnUY/T2TpitKcW0I/AAAAAAAAAO4/WpS3ZJW_S_E/s1600/shrunk__40_Slide30.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-r3v04JagnUY/T2TpitKcW0I/AAAAAAAAAO4/WpS3ZJW_S_E/s320/shrunk__40_Slide30.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Okay, so how do we motivate volunteers?  More importantly, how do we 
avoid de-motivating them? &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-G9P3y80DOc8/T2Tpi8ezq4I/AAAAAAAAAPA/yMnEG8Fv8p8/s1600/shrunk__40_Slide31.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-G9P3y80DOc8/T2Tpi8ezq4I/AAAAAAAAAPA/yMnEG8Fv8p8/s320/shrunk__40_Slide31.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Big projects, a lot of times they have a lot 
of interesting game-like features.  Some of them actually are games.  
You have leader boards, you have badges, you have ways of making the 
experience more immersive.  &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-PTMubfFS5ZU/T2TpjZkc35I/AAAAAAAAAPI/GDLE6nfLakw/s1600/shrunk__40_Slide32.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-PTMubfFS5ZU/T2TpjZkc35I/AAAAAAAAAPI/GDLE6nfLakw/s320/shrunk__40_Slide32.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;a href="http://www.oldweather.org/"&gt;OldWeather&lt;/a&gt;, which is run by GalaxyZoo, will 
plot your ship on a Google map as you transcribe the latitude and 
longitude elements from the log books.&lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-JE65nIn3eu0/T2TpjntcSLI/AAAAAAAAAPQ/eMrNO6PPeEI/s1600/shrunk__40_Slide33.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-JE65nIn3eu0/T2TpjntcSLI/AAAAAAAAAPQ/eMrNO6PPeEI/s320/shrunk__40_Slide33.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
The National Library of Finland 
has partnered with Microtask to actually create a &lt;a href="http://www.digitalkoot.fi/en/splash"&gt;crowdsourcing game of Whac-A-Mole&lt;/a&gt;.  So this is crowdsourcing taken to the extreme. &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-NHn6PcpdrPc/T2Tpj6vibfI/AAAAAAAAAPY/fEWV1MMKXYc/s1600/shrunk__40_Slide34.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-NHn6PcpdrPc/T2Tpj6vibfI/AAAAAAAAAPY/fEWV1MMKXYc/s320/shrunk__40_Slide34.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
But there's a peril here, and the peril is that all of these things are 
extrinsic motivators. &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-0P4kusuqMEM/T2TpmoPEXtI/AAAAAAAAAPg/paXql4Yj1MI/s1600/shrunk__40_Slide35.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-0P4kusuqMEM/T2TpmoPEXtI/AAAAAAAAAPg/paXql4Yj1MI/s320/shrunk__40_Slide35.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
And we ran into this with the Klauber diaries. 
Perian came to me and said, "Hey, let's come up with a stats page, 
because we want to track where the diaries are at."  So we come up with 
the stats page -- pretty basic, here's where some of these are at.  &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-wVXxBM27WwY/T2Tpm1rm5sI/AAAAAAAAAPo/eRGMkGvVdl4/s1600/shrunk__40_Slide36.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-wVXxBM27WwY/T2Tpm1rm5sI/AAAAAAAAAPo/eRGMkGvVdl4/s320/shrunk__40_Slide36.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
And 
hey, while we're at it, let's mine our data.  We can come up with a 
couple of top-10 lists.  So we come up with the top-ten list of 
transcribers and a top-ten list of editors, because that's the data I 
have. &lt;br /&gt;
&lt;br /&gt;
Well remember, the whole point of this exercise is to index these 
diaries so that we can find the mentions of these individual species in 
the original manuscripts.  Do you see indexing on here anywhere?  
Neither did our volunteers, and the minute this went up, the volunteers 
who previously had been transcribing and indexing every single page &lt;b&gt;stopped indexing completely&lt;/b&gt;.  They weren't being measured on it.  We weren't saying that we rewarded them for it, so they stopped.  &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-oeXgX5PAb4Q/T2TpnGm0dBI/AAAAAAAAAPw/U5QdsFJinvc/s1600/shrunk__40_Slide37.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://1.bp.blogspot.com/-oeXgX5PAb4Q/T2TpnGm0dBI/AAAAAAAAAPw/U5QdsFJinvc/s320/shrunk__40_Slide37.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
Needless 
to say, our next big-rush change was a top-ten indexers list. &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://2.bp.blogspot.com/-BtdYSB5b_q8/T2Tpna6dIdI/AAAAAAAAAP4/NdCK4jV1btg/s1600/shrunk__40_Slide38.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://2.bp.blogspot.com/-BtdYSB5b_q8/T2Tpna6dIdI/AAAAAAAAAP4/NdCK4jV1btg/s320/shrunk__40_Slide38.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So this gets to the "crowding-out" theory of motivation, and the expert on 
this is a researcher in the UK named &lt;a href="http://80gb.wordpress.com/"&gt;Alexandra Eveleigh&lt;/a&gt;.  Her point is 
that if you're going to design any kind of extrinsic motivation, you 
have to make sure that it promotes the actual contributory behavior, and 
this is something that applies, I believe, to small projects as well as 
large projects. &lt;br /&gt;
&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://3.bp.blogspot.com/-Uvm8tnK9Fu8/T2Tpnb4ixPI/AAAAAAAAAQA/lEXdcyYirtk/s1600/shrunk__40_Slide39.PNG" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="240" src="http://3.bp.blogspot.com/-Uvm8tnK9Fu8/T2Tpnb4ixPI/AAAAAAAAAQA/lEXdcyYirtk/s320/shrunk__40_Slide39.PNG" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;
So I have 13 seconds left, so thank you, and I'll just end on that note.&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/Cv4zbjhIe-A" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2012/03/crowdsourcing-at-imls-webwise-2012.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/-0PITR2uRYww/T2TblXDOvwI/AAAAAAAAALQ/hEIOLBd__IU/s72-c/shrunk__40_Slide1.PNG" height="72" width="72" /><thr:total>2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-780156368356435216</guid><pubDate>Thu, 08 Mar 2012 16:03:00 +0000</pubDate><atom:updated>2013-05-07T10:58:38.331-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">business plan</category><title>Jumping In With Both Feet</title><description>&lt;br /&gt;
Although I didn't know it at the time, since I began work on FromThePage in 2005 I've had one toe in the digital humanities community.&amp;nbsp; I've worked on &lt;a href="http://fromthepage.com/"&gt;FromThePage&lt;/a&gt; and I've blogged about crowdsourced manuscript transcription.&amp;nbsp; I've met some smart, friendly people doing fascinating things and I've even taught some of them the magic of regular expressions.&amp;nbsp; But I've always tried to squeeze this work into my "spare time" -- the interstices in the daily life of an involved father and a professional software engineer working a demanding but rewarding job.&amp;nbsp; As the demands of vocation and avocation increase; as disparate duties begin to compete with each other; as new babies come into my home while new technologies come into my workplace and new requests for FromThePage arrive in my inbox, the &lt;a href="http://manuscripttranscription.blogspot.com/2007/06/money-current-situation.html"&gt;basement inventor model&lt;/a&gt; becomes increasingly untenable.&amp;nbsp; The numbers don't lie: I've only checked in code on &lt;a href="https://github.com/benwbrum/fromthepage/commits/master"&gt;four days&lt;/a&gt; during the last six months.&lt;br /&gt;
&lt;br /&gt;
In January I was offered an incredible opportunity.&amp;nbsp; Chris Lintott invited me to the Adler Planetarium to meet the Citizen Science Alliance's dev team.&amp;nbsp; This talented, generous team of astronomer-developers gave me a behind-the-scenes tour of their &lt;a href="http://github.com/zooniverse/scribe"&gt;Scribe&lt;/a&gt; tool--early versions of which powered &lt;a href="http://oldweather.org/"&gt;OldWeather.org&lt;/a&gt;--and I was blown away.&amp;nbsp; I don't think I've ever been so excited about a technology, and my mind raced with ideas for projects using it.&amp;nbsp; Serendipitously, two days later I received email from &lt;a href="http://en.wikipedia.org/wiki/Ben_Laurie"&gt;Ben Laurie&lt;/a&gt; asking if I'd like to implement Scribe for the &lt;a href="http://www.freereg.org.uk/"&gt;FreeREG&lt;/a&gt; project, a part of the FreeBMD genealogy charity that is transcribing parish registers recording the baptisms, marriages, and burials in England and Wales from 1538 to 1835.&amp;nbsp; All development would be released open source, and all data would be as open as possible.&amp;nbsp; It's a dream project for someone with my interests; there was no way I could pass this up.&lt;br /&gt;
&lt;br /&gt;
So as of March 18 I'm starting a new career as an independent digital history developer.&amp;nbsp; It is heartbreaking to leave my friends at &lt;a href="http://www.convio.com/"&gt;Convio&lt;/a&gt; after nearly a dozen years, but I'm delighted with the possibilities my new autonomy offers. I hope to specialize in projects relating to crowdsourcing and/or manuscript transcription, but to be honest I'm not sure where this path will lead.&amp;nbsp;&amp;nbsp; Of course I plan to devote more time to FromThePage -- this year should finally see the publish-on-demand integration I've always been wishing for, as well as a few other features people have requested.&amp;nbsp; If you've got a project that seems appropriate--whether it involves genealogy or herpetology, agricultural history or textile history--drop me a line.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/lE3fbCwcQ-A" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2012/03/jumping-in-with-both-feet.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-2008175320731167710</guid><pubDate>Mon, 05 Mar 2012 15:21:00 +0000</pubDate><atom:updated>2012-03-07T11:59:10.839-06:00</atom:updated><title>Quality Control for Crowdsourced Transcription</title><description>Whenever I talk about crowdsourced transcription--actually whenever I talk about crowdsourced &lt;i&gt;anything&lt;/i&gt;--the first question people ask is about accuracy.  Nobody trusts the public to add to an institution's data/meta-data, nor especially to correct it.  However, quality control over data entry is a well-explored problem, and while I'm not familiar with the literature from industry regarding commercial approaches, I'd like to offer the systems I've seen implemented in the kinds of volunteer transcription projects I follow.&lt;i&gt;  (Note: the terminology is my own, and may be non-standard.)&lt;/i&gt;&lt;br /&gt;
&lt;ol type="A"&gt;
&lt;li&gt;&lt;b&gt;Single-track methods&lt;/b&gt; (mainly employed with large, prosy text that is difficult to compare against independent transcriptions of the same text). In these methods, all changes and corrections are made to a single transcription which originated with a volunteer and is modified thereafter.&amp;nbsp; There is no parallel/alternate transcription to compare against.

&lt;ol type="1"&gt;
&lt;li&gt;&lt;b&gt;Open-ended community revision:&lt;/b&gt;  This is the method that Wikipedia uses, and it's the strategy I've followed in &lt;a href="http://fromthepage.com/"&gt;FromThePage&lt;/a&gt;.  In this method, users may continue to change the text of a transcription forever.  Because all changes are logged--with a pointer of some sort to the user who logged them--vandalism or edits which are not in good faith may be reverted to a known-good state easily.  This is in keeping with the digital humanities principle of "no final version."  In my own projects, I've seen edits made to a transcription two years after the initial version, and those changes were indeed correct.  (Who knew that "&lt;a href="http://fromthepage.com/display/display_page?ol=w_rw_p_pl&amp;amp;page_id=383"&gt;drugget&lt;/a&gt;" was a coarse fabric used for covering tobacco plant-beds?)  Furthermore, I believe that there is no reason other than the cost of implementation why any of the methods below which operate from the "final version" mind-set should not allow error reports against their "published" form. &lt;/li&gt;
&lt;li&gt;&lt;b&gt;Fixed-term community revision:&lt;/b&gt;  Early versions of both &lt;a href="http://www.ucl.ac.uk/Bentham-Project/transcribe_bentham"&gt;Transcribe Bentham&lt;/a&gt; and &lt;a href="http://scripto.org/"&gt;Scripto&lt;/a&gt; followed this model, and while I'm not sure if either of them still does, it does seem to appeal to traditional documentary editing projects that are incorporating crowdsourcing as a valuable initial input to a project while wishing to retain ultimate control over the "final version".  In this model, wiki-like systems are used to gather the initial data, with periodic review by experts.  Once a transcription reaches an acceptable status (deemed so by the experts), it is locked to further community edits and the transcription is "published" to a more traditional medium like a CMS or a print edition.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Community-controlled revision work-flows:&lt;/b&gt; This model is a cross between the two above-mentioned methods.  Like fixed-term revision, it embraces the concept of a "final version", after which the text may not be modified.  Unlike fixed-term revision, there are no experts involved here -- rather the tool itself forces a text to &lt;a href="http://en.wikisource.org/wiki/Help:Proofread#Page_status"&gt;go through an edit/review/proofread/reject-approve workflow&lt;/a&gt; by the community, after which the version is locked against future edits.  As far as I'm aware, this is only implemented by the ProofreadPage plugin to MediaWiki that has been used by Wikisource for the past few years, but it seems quite effective.  &lt;/li&gt;
&lt;li&gt;&lt;b&gt;Transcription with "known-bad" insertions before proofreading:&lt;/b&gt; This is a two-phase process, which to my knowledge has only been tried by the Written Rummage project as described in &lt;a href="http://journal.code4lib.org/articles/6004"&gt;Code4Lib issue 15&lt;/a&gt;.  In the first phase, an initial transcription is solicited from the crowd (which in their case is a Mechanical Turk workforce willing to transcribe 19th-century diaries for around eight cents per page).  In the second phase, the crowd is asked to review the initial transcription against the original image, proof-reading and correcting the first transcription.  In order to make sure that a review is effective, however, extra words/characters are added to the data before it is presented to the proof-reader, and the location within the text of these known-bad insertions is recorded.  The resulting corrected transcription is then programmatically searched for the bad data which had been inserted, and if it has been removed the system assumes that any other errors have also been removed -- or at least that a good-faith effort has been made to proofread and correct the transcript.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Single-keying with expert review:&lt;/b&gt;  In this methodology, once a single volunteer contribution is made, it is reviewed by an expert and either approved or rejected.  The expert is necessarily authorized in some sense -- in the case of the &lt;a href="http://blogs.loc.gov/digitalpreservation/2011/12/crowdsourcing-the-civil-war-insights-interview-with-nicole-saylor/"&gt;UIowa Civil War Diaries&lt;/a&gt;, the review is done by the library staff member processing the mailto form contribution, while in the case of &lt;a href="http://www.freereg.org.uk/"&gt;FreeREG&lt;/a&gt; the expert is a "syndicate manager" -- a particular kind of volunteer within the FreeBMD charity. (FreeREG may be unique in using a single-track method for small, structured records; however, it demands more paleographic and linguistic expertise from its volunteers than any other project I'm aware of.)  If a transcription is rejected, it may be either returned to the submitter for correction or corrected by the expert and published in corrected form.&lt;/li&gt;
&lt;/ol&gt;
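The "known-bad" insertion check in method 4 can be sketched in a few lines.  This is a minimal illustration in Python, not the Written Rummage implementation: the function names, the nonsense decoy words, and the sample sentence are all hypothetical, and for brevity it finds decoys by matching the token rather than by the recorded position described above.&lt;br /&gt;
&lt;pre&gt;
```python
import random


def insert_decoys(words, decoys, rng):
    """Return a copy of words with each decoy inserted at a random position."""
    seeded = list(words)
    for decoy in decoys:
        seeded.insert(rng.randrange(len(seeded) + 1), decoy)
    return seeded


def surviving_decoys(corrected_words, decoys):
    """List the decoys that the proofreader failed to remove."""
    remaining = set(corrected_words)
    return [d for d in decoys if d in remaining]


# Hypothetical data: a first-pass transcription and two nonsense decoys
# that cannot occur in the real text.
first_pass = "went to town and bought a bolt of drugget".split()
decoys = ["zzqx", "vvkj"]
seeded = insert_decoys(first_pass, decoys, random.Random(0))

# A careful proofreader deletes both decoys; a careless one returns
# the seeded text untouched.
careful = [w for w in seeded if w not in decoys]
print(surviving_decoys(careful, decoys))  # []
print(surviving_decoys(seeded, decoys))   # ['zzqx', 'vvkj']
```
&lt;/pre&gt;
An empty result is only evidence of a good-faith review, not proof that every genuine error was caught, which is the same caveat the method itself carries.&lt;br /&gt;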
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Multi-track methods&lt;/b&gt; (mainly employed with easily-isolated, structured records like census entries or ship's log books).  In all of these cases, the same image is presented to different users to be transcribed from scratch.  The data thus collected is compared programmatically on the assumption that two correct transcriptions will agree with each other and may be assumed to be valid.  If the two transcriptions disagree with each other, however, one of them must be in error, so some kind of programmatic or human expert intervention is needed.  It should be noted that all of these methodologies are technically "blind" n-way keying, as the volunteers are unaware of each other's contributions and do not know whether they are interpreting the data for the first time or contributing a duplicate entry.
&lt;ol start="6" type="1"&gt;
&lt;li&gt;&lt;b&gt;Triple-keying with voting:&lt;/b&gt; This is the method that the Zooniverse &lt;a href="http://oldweather.org/"&gt;OldWeather&lt;/a&gt; team uses.  Originally the OldWeather team collected the same information in ten different independent tracks, entered by users who were unaware of each other's contributions: blind, ten-way keying.  The assumption was that the majority reading would be the correct one, so essentially this is a voting system.  After some analysis it was determined that the quality of three-way keying was indistinguishable from that of ten-way keying, so the system was modified to a less-skeptical algorithm, saving volunteer effort.  If I understand correctly, the same kind of voting methodology is used by ReCAPTCHA for its OCR correction, which &lt;a href="http://musicmachinery.com/2009/04/27/moot-wins-time-inc-loses/"&gt;allowed its exploitation by 4chan&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Double-keying with expert reconciliation:&lt;/b&gt; In this system, the same entry is shown to two different volunteers, and if their submissions do not agree it is passed to an expert for reconciliation.  This requires a second level of correction software capable of displaying the original image along with both submitted transcriptions.  If I recall my fellow panelist David Klevan's WebWise presentation correctly, this system is used by the Holocaust Museum for one of their crowdsourcing projects.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Double-keying with emergent community-expert reconciliation:&lt;/b&gt; This method is almost identical to the previous one, with one important exception.  The experts who reconcile divergent transcriptions are themselves volunteers -- volunteers who have been promoted from transcribers to reconcilers through an algorithm.  If a user has submitted a certain (large) number of transcriptions, and if those transcriptions have either 1) matched their counterpart's submission, or 2) been deemed correct by the reconciler when they are in conflict with their counterpart's transcription, then the user is automatically promoted. After promotion, they are able to choose their volunteer activity from either the queue of images to be transcribed or the queue of conflicting transcriptions to be reconciled.  This is the system used by FamilySearch Indexing, and its emergent nature makes it a particularly scalable solution for quality control.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Double-keying with N-keyed run-off votes:&lt;/b&gt;  Nobody actually does this that I'm aware of, but I think it might be cost-effective.  If the initial set of two volunteer submissions don't agree, rather than submit the argument to an expert, re-queue the transcription to new volunteers.  I'm not sure what the right number is here -- perhaps only a single tie-breaker vote, but perhaps three new volunteers to provide an overwhelming consensus against the original readings.  If this is indecisive, why not re-submit the transcription again to an even larger group?  Obviously this requires some limits, or else the whole thing could spiral into an infinite loop in which your entire pool of volunteers are arguing with each other about the reading of a single entry that is truly indecipherable. However, I think it has some promise as it may have the same scalability benefits as the previous method without needing the complex promotion algorithm or the reconciliation UI.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
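The voting idea shared by the multi-track methods above can be sketched as follows.  This is a minimal Python illustration under my own assumptions, not code from OldWeather, FamilySearch Indexing, or any other project named here: the function names, the normalization rule, the strict-majority threshold, and the sample readings are all hypothetical.&lt;br /&gt;
&lt;pre&gt;
```python
from collections import Counter


def normalize(reading):
    """Collapse case and whitespace so trivial differences are not disagreements."""
    return " ".join(reading.split()).lower()


def reconcile(readings, min_share=0.5):
    """Return the winning reading, or None if none has a strict majority.

    None means the entry still needs attention: an expert, a promoted
    volunteer, or another round of keying by new volunteers.
    """
    if not readings:
        return None
    counts = Counter(normalize(r) for r in readings)
    reading, votes = counts.most_common(1)[0]
    return reading if votes / len(readings) > min_share else None


# Double-keying: agreement is accepted, disagreement is escalated.
print(reconcile(["12 N", "12 n"]))          # 12 n
print(reconcile(["12 N", "13 N"]))          # None
# A run-off: one more volunteer breaks the tie.
print(reconcile(["12 N", "13 N", "12 N"]))  # 12 n
```
&lt;/pre&gt;
What happens when the result is None is what separates the methods: hand the entry to an expert (method 7), hand it to a promoted volunteer (method 8), or re-queue it to more volunteers up to some fixed limit (method 9).&lt;br /&gt;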
&lt;b&gt;Caveats:&lt;/b&gt;  Some things are simply not knowable.  It is hard to evaluate the effectiveness of quality control seriously without taking into account the possibility that volunteer contributors may be correct and experts may be wrong, or, more importantly, that some images are simply illegible regardless of the paleographic expertise of the transcriber.  The Zooniverse team is now exploring ways for volunteers to correct errors made not by transcribers but rather by the midshipmen of the watch who recorded the original entries a century ago. They realize that a mistaken "E" for "W" in a longitude record may be more amenable to correction than a truly illegible entry.  Not all errors are made by the "crowd", after all.&lt;br /&gt;
&lt;br /&gt;
&lt;i&gt;Much of this list is based on observation of working sites and extrapolation, rather than any inside information.  I welcome corrections and additions in the comments or at benwbrum@gmail.com.&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
[Update 2012-03-07: Folks from Transcribe Bentham informed me on Twitter that "In general, at the moment most transcripts are worked on by one volunteer, checked and then locked. Vols seem to prefer working on fresh MSS to part transcribed." and "For the record, &lt;a href="https://twitter.com/#%21/TranscriBentham" rel="nofollow"&gt;@TranscriBentham&lt;/a&gt; does still use 'Fixed-term community revision'. There are weekly updates on the blog."&amp;nbsp; Thanks, Tim and Justin!]&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/kRq7VPwFLa8" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2012/03/quality-control-for-crowdsourced.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-7240165035433445371</guid><pubDate>Sun, 04 Mar 2012 22:30:00 +0000</pubDate><atom:updated>2012-03-04T16:30:16.763-06:00</atom:updated><title>We Get Press!</title><description>Crowdsourced transcription projects--and FromThePage in particular--have gotten some really nice press in the last few weeks.&lt;br /&gt;
&lt;br /&gt;
Konrad Lawson posted an excellent review of Scripto and FromThePage on the ProfHacker blog at The Chronicle of Higher Education: &lt;a href="http://chronicle.com/blogs/profhacker/crowdsourcing-transcription-fromthepage-and-scripto/38028"&gt;Crowdsourcing Transcription: FromThePage and Scripto&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
Francine Diep wrote a great article on the phenomenon at Innovation News Daily: &lt;a href="http://www.innovationnewsdaily.com/846-volunteer-transcribers-put-millions-pages-online.html"&gt;Volunteer Transcribers Put Millions of Pages Online&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
Ellen Davis's article on Southwestern's transcription of Zenas Matthews's 1846 Mexican War Diary is especially notable because it includes an interview with Scott Patrick, the volunteer who has done such a spectacular job:&amp;nbsp;&lt;a href="http://www.southwestern.edu/live/news/6475-collaborative-transcription-project"&gt;Collaborative Transcription Project&lt;/a&gt;. &lt;br /&gt;
&amp;nbsp;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/zR3Qm_-MTpk" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2012/03/we-get-press.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>0</thr:total></item></channel></rss>
