<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:georss="http://www.georss.org/georss" xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr="http://purl.org/syndication/thread/1.0" version="2.0"><channel><atom:id>tag:blogger.com,1999:blog-5730930067468816440</atom:id><lastBuildDate>Sun, 22 Jan 2012 09:10:13 +0000</lastBuildDate><category>acquisition</category><category>images</category><category>paper</category><category>mediawiki</category><category>blogland</category><category>names</category><category>inspirations</category><category>requirements sxsw</category><category>programming</category><category>business plan</category><category>deployment</category><category>thatcamp features</category><category>open source</category><category>feature plan</category><category>rails</category><category>licensing</category><category>similar projects</category><category>features</category><category>nabpp</category><category>podcasts</category><category>open access</category><category>requirements</category><category>risks</category><category>crowdsourcing</category><category>progress</category><category>subject links</category><category>velehanden</category><category>money</category><title>Collaborative Manuscript Transcription</title><description /><link>http://manuscripttranscription.blogspot.com/</link><managingEditor>noreply@blogger.com (Ben W. 
Brumfield)</managingEditor><generator>Blogger</generator><openSearch:totalResults>95</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/CollaborativeManuscriptTranscription" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="collaborativemanuscripttranscription" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-3145478643209703039</guid><pubDate>Thu, 19 Jan 2012 03:53:00 +0000</pubDate><atom:updated>2012-01-19T08:31:38.964-06:00</atom:updated><title>A Developer Goes to AHA2012</title><description>Last Sunday I returned from the &lt;a href="http://historians.org/annual/2012/index.cfm"&gt;2012 meeting&lt;/a&gt; of the American Historical Association.&amp;nbsp; Although I have attended my share of conferences and unconferences--from Lone Star Ruby Con to Dreamforce and Texas State Historical Association to Museum Computer Network--I'd never attended one of the big mid-year academic conferences before.&amp;nbsp; The experience was strange but fruitful, and I hope I'll be able to attend again.&lt;br /&gt;
&lt;br /&gt;
Let me start with my superficial impressions.&amp;nbsp; First, historians dress much better than developers do, though they really don't hold a candle to the art gallery folks.&amp;nbsp; They also are a more reactive audience, although there is very little back-channel conversation on Twitter -- in fact I was informed that &lt;a href="http://stumblingpast.wordpress.com/2012/01/13/nearly-there-experiencing-a-conference-online/#comment-314"&gt;typing on laptops would be considered rude&lt;/a&gt;! Finally, they are pretty introverted -- more likely to strike up a conversation with a stranger than your average Rubyist, but not by much.&lt;br /&gt;
&lt;br /&gt;
The conference itself is a bit warped by the fact that many of the attendees are there for the sole purpose of conducting job interviews.&amp;nbsp; This apparently involves a days-long series of thirty-to-ninety-minute interviews designed to figure out which candidates to invite to campus for an on-site interview -- a grueling process for the interviewers and an expensive one for the interviewees.&amp;nbsp; (The analogous activity in the software world is the phone screen, in which a hiring manager discusses experience and skill-set with a candidate.&amp;nbsp; Over the phone.)&amp;nbsp; If most attendees are interviewing, they aren't actually participating in the conference -- I was told that around 12,000 people register, but only 5,000 attend.&amp;nbsp; This gives AHA a kind of Potemkin village flavor, and it's not unusual to see a panel lecturing to a nearly empty room.&amp;nbsp; In fact, the last session I attended had five speakers on the podium and only three people in the audience.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, AHA2012 and the associated THATCamp were tremendously productive for me.&amp;nbsp; There were several opportunities for collaboration, so while I didn't find my dream partner for &lt;a href="http://fromthepage.com/"&gt;FromThePage&lt;/a&gt;--that institution with a staff of front-end experts and a burning need for transcription software--I did have some really good conversations.&amp;nbsp; I've been trying to add better support for letters to FromThePage, and &lt;a href="http://jeanbauer.com/index.html"&gt;Jean Bauer&lt;/a&gt; gave me a detailed walk-through of the &lt;a href="http://projectquincy.rubyforge.org/"&gt;Project Quincy&lt;/a&gt; data model for correspondence.&amp;nbsp; A lot of people were interested in starting their own crowdsourcing projects, and we've been swapping emails since.&amp;nbsp; Most importantly, while I was in town I met with the development team behind &lt;a href="https://github.com/zooniverse/Scribe"&gt;Scribe&lt;/a&gt; and &lt;a href="https://github.com/zooniverse/Talk"&gt;Talk&lt;/a&gt;, the open-source tools that power Citizen Science Alliance projects like &lt;a href="http://www.oldweather.org/"&gt;OldWeather&lt;/a&gt; and &lt;a href="http://ancientlives.org/"&gt;AncientLives&lt;/a&gt;. I'll be posting about that separately.&lt;br /&gt;
&lt;br /&gt;
One of the things that impressed me most about the history world was the potential there is for a programmer to make a big impact. The graduate student I roomed with was an expert with regular expressions--his texts were in Arabic, so the RTL/LTR mix required him to close his eyes as he composed his patterns--but he had no experience with elementary scripting.&amp;nbsp; In one two-hour hack session, we were able to split a three-hundred-thousand-line medieval biographical dictionary into twenty thousand small files representing individual entries.&amp;nbsp; With a couple more hours' work, we'd have been able to extract dates, places, names, and other data from these files.&amp;nbsp; It is a delight for a software engineer to work in a domain where such minimal effort can make such a difference: most of our work deals with obscure edge cases of hard/boring problems, so removing months of tedious manual labor with an hour's worth of programming is incredibly rewarding.&lt;br /&gt;
&lt;br /&gt;
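A sketch of what such a split might look like in Python (not the script from that hack session): it assumes, purely for illustration, that every entry begins on a line starting with a number and a dash, and the names &lt;code&gt;ENTRY_START&lt;/code&gt;, &lt;code&gt;split_entries&lt;/code&gt; and &lt;code&gt;write_entries&lt;/code&gt; are hypothetical.&lt;br /&gt;
&lt;pre&gt;
```python
import re
from pathlib import Path

# Hypothetical marker (an assumption, not the real dictionary's format):
# an entry starts on a line that begins with a number and a dash, e.g. "12 - ".
ENTRY_START = re.compile(r"^\d+ - ", re.MULTILINE)


def split_entries(text):
    """Return the entries found in text, each beginning at a marker line."""
    starts = [match.start() for match in ENTRY_START.finditer(text)]
    if not starts:
        return []
    # Each entry runs from its own marker to the next marker (or end of text).
    ends = starts[1:] + [len(text)]
    return [text[start:end].strip() for start, end in zip(starts, ends)]


def write_entries(entries, out_dir):
    """Write each entry to its own numbered file and return the file count."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for number, entry in enumerate(entries, start=1):
        (out / f"entry_{number:05d}.txt").write_text(entry, encoding="utf-8")
    return len(entries)
```
&lt;/pre&gt;
Anything before the first marker (a preface, say) is dropped, which may or may not be what a given text needs.&lt;br /&gt;
&lt;br /&gt;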
&lt;a href="http://cliotropic.org/blog/2011/02/aha-2012-crowdsourcing-history/"&gt;Crowdsourcing History: Collaborative Transcription and Archives&lt;/a&gt;, the panel I presented at, seemed to go well.&amp;nbsp; Moderator Shane Landrum invited the audience to give 3-minute presentations on their own crowdsourcing projects after the presenters finished their 8-minute talks, then he opened the floor for questions.&amp;nbsp; Although I was skeptical about this format, it worked very well indeed.&amp;nbsp; In particular, the Q/A period was blessedly free of the self-promoters who plague events like South by Southwest.&amp;nbsp; Perhaps this can be attributed to the novel format or perhaps it was due to the inherent civility of academic historians -- all I know is that it succeeded.&amp;nbsp; I felt very fortunate to be among the panelists, who were a Who's Who of manuscript transcription tools, although a couple prominent projects were not represented because they were too recent to be included in the proposal.&amp;nbsp; Because the context was already set by my fellow panelists and because the time was so constrained, I decided to concentrate my own talk on one feature of FromThePage: subject indexing through wiki-links.&amp;nbsp; An abbreviated recap of the presentation is embedded below:&lt;br /&gt;
&lt;br /&gt;
&lt;iframe allowfullscreen="" frameborder="0" height="315" src="http://www.youtube.com/embed/tN8SiC6uWgk" width="420"&gt;&lt;/iframe&gt;
&lt;br /&gt;
&lt;br /&gt;
On the whole, I think I'd like to go back to the AHA meeting.  The conversations and collaborations made the trip worth the expense, and it was gratifying to finally meet the people behind the big transcription projects face-to-face.&amp;nbsp; I even managed to learn some fascinating stuff about American history.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-3145478643209703039?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/QN0N8PmH_g0" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2012/01/developer-goes-to-aha2012.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://img.youtube.com/vi/tN8SiC6uWgk/default.jpg" height="72" width="72" /><thr:total>2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-1608346670982498746</guid><pubDate>Wed, 07 Dec 2011 11:52:00 +0000</pubDate><atom:updated>2011-12-07T09:18:12.261-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">mediawiki</category><category domain="http://www.blogger.com/atom/ns#">blogland</category><title>Developments in Wikisource/ProofreadPage for Transcription</title><description>Last year I &lt;a href="http://manuscripttranscription.blogspot.com/2009/07/wikisource-for-manuscript-transcription.html"&gt;reviewed&lt;/a&gt; Wikisource as a platform for manuscript transcription projects, concluding that the ProofreadPage plug-in was quite versatile, but that unfortunately the en.wikisource.org policy prohibiting any text not already published on paper ruled out its use for manuscripts.&lt;br /&gt;
&lt;br /&gt;
I'm pleased to report that this policy has been softened.  About a month ago, NARA started to &lt;a href="http://outreach.wikimedia.org/wiki/GLAM/Model_projects/Improving_the_quality_of_OCR"&gt;partner&lt;/a&gt; with the Wikimedia Foundation to host material—including manuscripts—on Wikisource.&amp;nbsp; While I was at MCN, I discussed this with &lt;a href="http://twitter.com/#%21/filbertkm"&gt;Katie Filbert&lt;/a&gt;, the president of Wikimedia DC, who set me straight.&amp;nbsp; Wikisource is now very interested in partnering with institutions to host manuscripts of importance, but it is still not a place for ordinary people to upload great-grandpa's journal from World War I.&lt;br /&gt;
&lt;br /&gt;
Once you host a project on Wikisource, what do you do with it?&amp;nbsp; Andie, Rob and Gaurav over at the blog &lt;a href="http://soyouthinkyoucandigitize.wordpress.com/"&gt;So You Think You Can Digitize?&lt;/a&gt;—and it's worth your time to read at least the last six posts—have been writing on exactly that subject.&amp;nbsp; Their &lt;a href="http://soyouthinkyoucandigitize.wordpress.com/2011/12/05/field-note-challenge-part-2-veni-vidi-wiki/"&gt;most recent post&lt;/a&gt; describes their experience with &lt;a href="http://en.wikisource.org/wiki/Field_Notes_of_Junius_Henderson/Notebook_1"&gt;Junius Henderson's Field Notes&lt;/a&gt;, and although it concentrates on their success flushing out more Henderson material and recounts how they dealt with the Wikisource software, I'd like to concentrate on a detail:&lt;br /&gt;
&lt;blockquote class="tr_bq"&gt;
What we currently want is a no-cost, minimal effort system that will 
make scans AND transcriptions AND annotations available, and that can 
facilitate text mining of the transcriptions. &amp;nbsp;Do we have that in 
WikiSource? &amp;nbsp;We will see. &amp;nbsp;More on annotations to follow in our next 
post but some &lt;a href="http://en.wikipedia.org/wiki/Father_to_a_Sister_of_Thought"&gt;father to a sister of some thoughts&lt;/a&gt; are already percolating and &lt;a href="http://en.wikisource.org/wiki/Page%3AField_Notes_of_Junius_Henderson%2C_Notebook_1.djvu/7"&gt;we have even implemented some rudimentary examples&lt;/a&gt;.&lt;/blockquote&gt;
This is really exciting stuff.&amp;nbsp; They're experimenting with wiki mark-up of the transcriptions&amp;nbsp; with the goal of annotation and text-mining.&amp;nbsp; I tried to do this back in 2005, but abandoned the effort because I never could figure out how to clearly differentiate MediaWiki articles about subjects (i.e. annotations) from articles that presented manuscript pages and their transcribed text. &amp;nbsp; The lack of wiki-linking was also one of my criticisms &lt;a href="http://de.wikisource.org/wiki/Wikisource:Skriptorium/Archiv/2010/Oktober#Feedback_aus_Texas"&gt;most taken to heart&lt;/a&gt; by the German Wikisource community last October.&lt;br /&gt;
&lt;br /&gt;
So how is the mark-up working out?&amp;nbsp; Gaurav and the team have addressed the differentiation issue by using cross-wiki links, a standard way of linking from an article on one Wikimedia project to another.&amp;nbsp; So the text "English sparrows" in the transcription is annotated &lt;code&gt;[[:w:Passer domesticus|English sparrows]]&lt;/code&gt;, which is wiki-speak for &lt;i&gt;Link the text "English sparrows" to the Wikipedia article "Passer domesticus"&lt;/i&gt;. Wikipedia's redirects then send the browser off to the article "&lt;a href="http://en.wikipedia.org/wiki/Passer_domesticus"&gt;House Sparrow&lt;/a&gt;".&lt;br /&gt;
&lt;br /&gt;
So far so good.&amp;nbsp; The only complaint I can make is that—so far as I can tell—cross-wiki links don't appear in the "What links here" screen tool on Wikipedia, neither for &lt;a href="http://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Passer_domesticus&amp;amp;limit=500"&gt;Passer domesticus&lt;/a&gt;, nor for &lt;a href="http://en.wikipedia.org/wiki/Special:WhatLinksHere/House_Sparrow"&gt;House Sparrow&lt;/a&gt;.&amp;nbsp; This means that the annotation can't provide an indexing function, so that users can't see all the &lt;a href="http://beta.fromthepage.com/article/show?article_id=8273&amp;amp;ol=blog"&gt;pages that reference possums&lt;/a&gt;, nor &lt;a href="http://beta.fromthepage.com/display/read_all_works?article_id=8273"&gt;read a selection of those pages&lt;/a&gt;.&amp;nbsp; I'm not sure that the cross-wiki link data isn't tracked, however — just that I can't see it in the UI.&amp;nbsp; Tantalizingly, cross-wiki links are tracked when images or other files are included in multiple locations: see the &lt;a href="http://en.wikipedia.org/wiki/File:House_Sparrow_mar08.jpg#globalusage"&gt;"Global file usage" section of the sparrow image&lt;/a&gt;, for example.&amp;nbsp; Perhaps there is an API somewhere that the Henderson Field Note project could use to mine this data, or perhaps they could move their link targets from Wikipedia articles to some intermediary in a different Wikisource namespace.&lt;br /&gt;
&lt;br /&gt;
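One way to recover an indexing function without relying on Wikipedia's UI might be to build the index from the transcription mark-up itself.&amp;nbsp; The Python sketch below assumes the wikitext of each page has already been fetched into a dictionary keyed by page name; the &lt;code&gt;pages&lt;/code&gt; structure and the &lt;code&gt;index_subjects&lt;/code&gt; function are hypothetical and are not part of MediaWiki or the Henderson project.&lt;br /&gt;
&lt;pre&gt;
```python
import re
from collections import defaultdict

# Matches cross-wiki links of the form [[:w:Target|label]] or [[:w:Target]].
# Group 1 is the Wikipedia article title; the label after "|" is optional.
WIKIPEDIA_LINK = re.compile(r"\[\[:w:([^\]|]+)(?:\|([^\]]*))?\]\]")


def index_subjects(pages):
    """Map each linked Wikipedia title to the sorted page names that link to it.

    pages is assumed to be a dict of page name to raw wikitext (hypothetical
    input; how the wikitext is obtained is outside this sketch).
    """
    index = defaultdict(set)
    for page_name, wikitext in pages.items():
        for match in WIKIPEDIA_LINK.finditer(wikitext):
            index[match.group(1).strip()].add(page_name)
    return {title: sorted(names) for title, names in index.items()}
```
&lt;/pre&gt;
Because the index is keyed on the link target rather than the visible label, "English sparrows" and "sparrows" both land under "Passer domesticus".&lt;br /&gt;
&lt;br /&gt;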
Regardless, the direction Wikisource is moving should make it an excellent option for institutions looking to host documentary transcription projects and experiment with crowdsourcing without running their own servers.&amp;nbsp; I can't wait to see what happens once Andie, Rob, and Gaurav start experimenting with PediaPress!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-1608346670982498746?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/y8hWKnDf2Wo" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2011/12/developments-in-wikisourceproofreadpage.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>10</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-4469924496860764576</guid><pubDate>Fri, 18 Nov 2011 12:53:00 +0000</pubDate><atom:updated>2011-11-18T10:23:11.221-06:00</atom:updated><title>Crowdsourcing Transcription at MCN 2011</title><description>&lt;div style="width:425px" id="__ss_10220731"&gt;&lt;strong style="display:block;margin:12px 0 4px"&gt;&lt;a href="http://www.slideshare.net/benwbrum/mcn2011-crowdsourcing-transcription" title="MCN2011 Crowdsourcing Transcription"&gt;MCN2011 Crowdsourcing Transcription&lt;/a&gt;&lt;/strong&gt;&lt;object id="__sse10220731" width="425" height="355"&gt;&lt;param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=crowdsourcingtranscription-111118101914-phpapp02&amp;stripped_title=mcn2011-crowdsourcing-transcription&amp;userName=benwbrum" /&gt;&lt;param name="allowFullScreen" value="true"/&gt;&lt;param name="allowScriptAccess" value="always"/&gt;&lt;param name="wmode" value="transparent"/&gt;&lt;embed name="__sse10220731" 
src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=crowdsourcingtranscription-111118101914-phpapp02&amp;stripped_title=mcn2011-crowdsourcing-transcription&amp;userName=benwbrum" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" wmode="transparent" width="425" height="355"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;div style="padding:5px 0 12px"&gt;View more &lt;a href="http://www.slideshare.net/"&gt;presentations&lt;/a&gt; from &lt;a href="http://www.slideshare.net/benwbrum"&gt;benwbrum&lt;/a&gt;.&lt;/div&gt;&lt;/div&gt;
These are links to the papers, websites, and systems mentioned in my presentation at the Museum Computer Network 2011 conference.&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;Can't OCR Cursive&amp;nbsp;&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://fht.byu.edu/prev_workshops/workshop06/slides/5-DougKennard.pdf"&gt;"Toward Searchable Indexes for Handwritten Documents&lt;/a&gt;", Kennard and Barrett&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;Genealogy &lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.facebook.com/familysearchindexing"&gt;FamilySearch Indexing Facebook Page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;BYU &lt;a href="http://journals.byu.edu/"&gt;Historic Journals Project&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;Natural Science&amp;nbsp;&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.ala.org.au/"&gt;Atlas of Living Australia&lt;/a&gt; Australian Museum Cicada Expedition&lt;/li&gt;
&lt;li&gt; Botanical Society of the British Isles &lt;a href="http://herbariaunited.org/atHome/"&gt;Herbaria@Home&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;Open Source/Creative Commons&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://manuscripttranscription.blogspot.com/2009/07/wikisource-for-manuscript-transcription.html"&gt;Wikisource for Manuscript Transcription&lt;/a&gt; (my review)&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.opengenalliance.org/"&gt;Open Genealogy Alliance&lt;/a&gt; (UK) &lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;Libraries&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://menus.nypl.org/"&gt;What's on the Menu?&lt;/a&gt; New York Public Library&lt;/li&gt;
&lt;li&gt;University of Iowa Libraries &lt;a href="http://digital.lib.uiowa.edu/cwd/index.php"&gt;Civil War Diaries and Letters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.flickr.com/photos/statelibrarync/sets/72157627124710723/with/6347782489/"&gt;North Carolina Family History Transcription Project&lt;/a&gt;, Government and Heritage Library, State Library of North Carolina&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;Archives&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;Stadsarchief Amsterdam's &lt;a href="http://militieregisters.nl/"&gt;Militieregisters.nl&lt;/a&gt; and &lt;a href="http://velehanden.nl/"&gt;Velehanden.nl&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://prov.versi.edu.au/"&gt;Public Records Office Victoria&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;Museums&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;San Diego Natural History Museum's &lt;a href="http://fromthepage.bpoc.org/"&gt;Laurence Klauber Field Notes &lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;Why?&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;North American Bird Phenology Program &lt;a href="http://www.pwrc.usgs.gov/bpp/SatisfactionSurveyReport2010/Satisfaction_Survey_2010.BK.html"&gt;User Satisfaction Survey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.oldweather.org/"&gt;OldWeather&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.digitalkoot.fi/en"&gt;DigitalKoot&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;North American Bird Phenology Program &lt;a href="http://www.pwrc.usgs.gov/bpp/Charts2.cfm"&gt;Top 50 Transcribers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;North American Bird Phenology Program &lt;a href="http://www.pwrc.usgs.gov/bpp/Newsletters/E-Newsletter_April_2011.pdf"&gt;April Newsletter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Code4Lib Journal, &lt;a href="http://journal.code4lib.org/articles/6004"&gt;"Using Amazon Mechanical Turk to Transcribe Historical Handwritten Documents&lt;/a&gt;"&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;How will you use the transcription?&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.transcribe-bentham.da.ulcc.ac.uk/td/Transcribe_Bentham"&gt;Transcribe Bentham&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Julia Brumfield Diaries &lt;a href="http://beta.fromthepage.com/display/read_all_works?article_id=52"&gt;view of references to Irvin Harvey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://wardepartmentpapers.org/scripto/?documentId=2886"&gt;Papers of the War Department&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;li&gt;Tools&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt;Wikisource/ProofreadPage (see review above)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/benwbrum/fromthepage/wiki"&gt;FromThePage&lt;/a&gt; on GitHub&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/chnm/scripto"&gt;Scripto &lt;/a&gt;on GitHub&lt;/li&gt;
&lt;li&gt;&lt;a href="http://code.google.com/p/tb-transcription-desk/"&gt;Bentham Transcription Desk&lt;/a&gt; on Google Code&lt;/li&gt;
&lt;li&gt;Zooniverse &lt;a href="https://github.com/zooniverse/Scribe"&gt;Scribe&lt;/a&gt; on GitHub&lt;/li&gt;
&lt;li&gt;&lt;a href="http://code.google.com/p/openscribe/"&gt;OpenScribe &lt;/a&gt;on Google Code&lt;/li&gt;
&lt;/ul&gt;
&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-4469924496860764576?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/w20vhnnYxT4" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2011/11/crowdsourcing-transcription-at-mcn-2011.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-1578032716306547699</guid><pubDate>Fri, 05 Aug 2011 12:52:00 +0000</pubDate><atom:updated>2011-08-05T08:09:47.613-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">mediawiki</category><category domain="http://www.blogger.com/atom/ns#">blogland</category><title>Programmers: Wikisource Needs You!</title><description>Wikisource is powered by a MediaWiki extension which allows page images to be displayed beside the wiki editing form.  This extension also handles editorial workflow by allowing pages, chapters, and books to be marked as unedited, partially edited, in need of review, or finished.  It's a fine system, and while the policy of the English language Wikisource community prevents it from being used for manuscript transcription, there are &lt;a href="http://manuscripttranscription.blogspot.com/2009/07/wikisource-for-manuscript-transcription.html"&gt;active manuscript projects&lt;/a&gt; using the software in other communities.&lt;br /&gt;&lt;br /&gt;Yesterday, &lt;a href="http://hexmode.com/"&gt;Mark Hershberger&lt;/a&gt; wrote this in a comment: &lt;span style="font-style: italic;"&gt;For what its worth the extension used by WikiSource, ProofreadPage, now needs a maintainer.  
I posted about this here: &lt;/span&gt;&lt;a style="font-style: italic;" href="http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/54831" target="_blank"&gt;http://thread.gmane.org/gmane.&lt;wbr&gt;science.linguistics.wikipedia.&lt;wbr&gt;technical/54831&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;While I'm sorry to hear it, this is an excellent opportunity for someone with MediaWiki skills to do some real good.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-1578032716306547699?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/Nd1A_VeklSs" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2011/08/programmers-wikisource-needs-you.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-1150009875047882375</guid><pubDate>Tue, 26 Jul 2011 12:44:00 +0000</pubDate><atom:updated>2011-07-26T08:29:15.416-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">similar projects</category><title>Can a Closed Crowdsourcing Project Succeed?</title><description>Last night, the Zooniverse folks announced their latest venture: &lt;a href="http://www.ancientlives.org/"&gt;Ancient Lives&lt;/a&gt;, which &lt;a href="http://www.ox.ac.uk/media/news_stories/2011/110726.html"&gt;invites&lt;/a&gt; the public to help analyze the &lt;a href="http://en.wikipedia.org/wiki/Oxyrhynchus_Papyri"&gt;Oxyrhynchus Papyri&lt;/a&gt;.  
The transcription tool meets the high standards we now expect from the team who designed &lt;a href="http://www.oldweather.org/"&gt;Old Weather&lt;/a&gt;, but the project immediately stirred some controversy because of its terms of use:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/-sBaMbDlCbnk/Ti66-oKZIcI/AAAAAAAAAEQ/jj8PMboAkG0/s1600/sgillies_tweet.jpg"&gt;&lt;img style="cursor: pointer; width: 400px; height: 189px;" src="http://4.bp.blogspot.com/-sBaMbDlCbnk/Ti66-oKZIcI/AAAAAAAAAEQ/jj8PMboAkG0/s400/sgillies_tweet.jpg" alt="" id="BLOGGER_PHOTO_ID_5633645768982733250" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Sean is referring to this section of the &lt;a href="http://www.ancientlives.org/copyright"&gt;copyright statement&lt;/a&gt; (technically, not a terms of use), which is re-displayed from the tutorial:&lt;br /&gt;&lt;blockquote&gt;Images may not be copied or offloaded, and the images and their texts  may not be published. All digital images of the Oxyrhynchus Papyri are ©   Imaging Papyri Project, University of Oxford. The papyri themselves  are owned by the Egypt Exploration Society, London. All rights reserved. &lt;/blockquote&gt;Future use of the transcriptions may be hinted at a bit on the About page:&lt;br /&gt;&lt;blockquote&gt;The papyri belong to the Egypt Exploration Society and their texts will  eventually be published and numbered in Society's Greco-Roman Memoirs  series in the volumes entitled &lt;em&gt;The Oxyrhynchus Papyri.&lt;/em&gt;&lt;/blockquote&gt;&lt;em&gt;&lt;/em&gt;It should be noted that the closed nature of the project is likely a side-effect of UK copyright law, not a policy decision by the Zooniverse team.  In the US, a scan or transcription of a public domain work is also public domain and not subject to copyright.  
In the UK, however, scanning an image creates a copyright in the scan, so the up-stream providers automatically are able to restrict down-stream use of public domain materials.  In the case of federated digitization projects this can create a situation like that of the Old Bailey Online, where different pieces of a seemingly-seamless digital database are &lt;a href="http://www.oldbaileyonline.org/static/Legal-info.jsp#copyright"&gt;owned by entirely different institutions&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I will be very interested to see how the Ancient Lives project fares compared to GalaxyZoo's other successes.   If the transcriptions are posted and accessible on their own site, users may not care about the legal ownership of the results of their labor.  They've already had &lt;a href="http://twitter.com/#%21/ancientlives/status/95815291237441536"&gt;100,000 characters transcribed&lt;/a&gt;, so perhaps these concerns are irrelevant for most volunteers.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-1150009875047882375?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/iYbYLoE_CHI" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2011/07/can-closed-crowdsourcing-project.html</link><author>noreply@blogger.com (Ben W. 
Brumfield)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://4.bp.blogspot.com/-sBaMbDlCbnk/Ti66-oKZIcI/AAAAAAAAAEQ/jj8PMboAkG0/s72-c/sgillies_tweet.jpg" height="72" width="72" /><thr:total>5</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-893363630030046082</guid><pubDate>Wed, 20 Jul 2011 12:05:00 +0000</pubDate><atom:updated>2011-07-20T07:43:34.276-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">crowdsourcing</category><title>Crowdsourcing and Variant Digital Editions</title><description>Writing at the JISC Digitization Blog, Alastair Dunning warns of "&lt;a href="http://digitisation.jiscinvolve.org/wp/2011/07/18/crowdsourcing-and-variant-digital-editions-some-troubles-ahead/"&gt;problems with crowdsourcing having the ability to create multiple editions&lt;/a&gt;."&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;For example, the much-lauded Early English Books Online (&lt;strong&gt;EEBO&lt;/strong&gt;) and Eighteenth Century Collections Online (&lt;strong&gt;ECCO&lt;/strong&gt;) are &lt;strong&gt;now beginning to appear on many different digital platforms&lt;/strong&gt;. &lt;/p&gt; &lt;p&gt;ProQuest currently hold a licence that allows users to search over the &lt;a href="http://eebo.chadwyck.com/"&gt;entire EEBO corpus&lt;/a&gt;, while &lt;a href="http://gale.cengage.co.uk/product-highlights/history/eighteenth-century-collections-online.aspx"&gt;Gale-Cengage own the rights to ECCO&lt;/a&gt;. &lt;/p&gt; &lt;p&gt;Meanwhile, JISC Collections are planning to release a platform entitled &lt;a href="http://www.jisc-collections.ac.uk/jiscecollections/jischistoricbooks/"&gt;JISC Historic Books&lt;/a&gt;, which makes licenced versions of EEBO and ECCO available to UK Higher Education users. 
&lt;/p&gt; &lt;p&gt;And finally, the Universities of Michigan and Oxford are heading the &lt;a href="http://www.lib.umich.edu/tcp/"&gt;Text Creation Partnership&lt;/a&gt; (TCP), which is methodically working its way through releasing full-text versions of &lt;a href="http://quod.lib.umich.edu/e/eebogroup/"&gt;EEBO&lt;/a&gt;, &lt;a href="http://quod.lib.umich.edu/e/ecco/"&gt;ECCO &lt;/a&gt;and other resources. These versions are available online, and are also being harvested out to sites like &lt;a href="http://www.18thconnect.org/"&gt;18th Century Connect&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;So this gives us four entry points into ECCO – and it’s not inconceivable that there could be more in the future.&lt;/p&gt; &lt;p&gt;What’s more, there have been some initial discussions about  introducing crowdsourcing techniques to some of these licensed versions;  allowing permitted users to transcribe and interpret the original  historical documents. But of course this crowdsourcing would happen on  different platforms with different communities, who may interpret and  transcribe the documents in different way. &lt;strong&gt;This could lead to  the tricky problem of different digital versions of the corpus. Rather  than there being one EEBO, several EEBOs exist&lt;/strong&gt;.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;&lt;/p&gt;Variant editions are indeed a worrisome prospect, but I don't think that they're unique to projects created through crowdsourcing.  In fact, I think that the mechanism of producing crowdsourced editions actually reduces the possibility for variants to emerge.  Dunning and I corresponded briefly over Twitter, then I wrote this comment to the JISC Digitization blog.  
Since that blog seems to be choking on the mark-up, I'll post my reply here:&lt;br /&gt;&lt;blockquote&gt;&lt;i&gt;benwbrum&lt;/i&gt; Reading @alastairdunning's post connecting&lt;br /&gt;crowdsourcing to variant editions: bit.ly/raVuzo Feel like Wikipedia&lt;br /&gt;solved this years ago.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;benwbrum&lt;/i&gt; If you don't publish (i.e. copy) a "final" edition of a crowdsourced transcription, you won't have variant "final" versions.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;benwbrum&lt;/i&gt; The wiki model allows linking to a particular version of an article. I expanded this to the whole work: &lt;a href="http://bit.ly/p7hfWR"&gt;link&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;alastairdunning&lt;/i&gt; But does that work with multiple providers offering restricted access to the same corpus sitting on different platforms?&lt;br /&gt;&lt;br /&gt;&lt;i&gt;alastairdunning&lt;/i&gt; ie, Wikipedia can trace variants cause it's all on the same platform; but there are multiple copies of EEBO in different places&lt;br /&gt;&lt;br /&gt;&lt;i&gt;benwbrum&lt;/i&gt; I'd argue the problem is the multiple platforms, not the crowdsourcing.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;alastairdunning&lt;/i&gt; Yes, you're right. Tho crowdsourcing considerably amplifies the problem as the versions are likely to diverge more quickly&lt;br /&gt;&lt;br /&gt;&lt;i&gt;benwbrum&lt;/i&gt; You're assuming multiple platforms for both reading and editing the text? That could happen, akin to a code fork.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;benwbrum&lt;/i&gt; Also, why would a crowd sourced edition be restricted? I don't think that model would work.&lt;/blockquote&gt;I'd like to explore this a bit more.  
I think that variant editions are less likely in a crowdsourced project than in a traditional edition, but efforts to treat crowdsourced editions in a traditional manner can indeed result in the situation you warn against.&lt;br /&gt;&lt;br /&gt;When we're talking about crowdsourced editions, we're usually talking about user-generated content that is produced in collaboration with an editor or community manager.  Without exception, this requires some significant technical infrastructure -- a wiki platform for transcribing free-form text or an even more specialized tool for transcribing structured data like census records or menus.  For most projects, the resulting edition is hosted on that same platform -- the Bentham wiki which displays the transcriptions for scholars to read and analyze is the same tool that volunteers use to create the transcriptions.  This kind of monolithic platform does not lend itself to the kind of divergence you describe: copies of the edition are always dated as soon as they are separated from the production platform, and making a full copy of the production platform requires a major rift among the editors and volunteer community.  These kinds of rifts can happen--in my world of software development, the equivalent phenomenon is a code fork--but they're very rare.&lt;br /&gt;&lt;br /&gt;But what about projects which don't run on a monolithic platform? There are a few transcription projects in which editing is done via a wiki (Scripto) or webform (UIowa) but the transcriptions are posted to a content management system.  
There is indeed potential for the "published" version on the CMS to drift from the "working" version on the editing platform, but in my opinion the problem lies not in crowdsourcing, but in the attempt to impose a traditional publishing model onto a participatory project by inserting editorial review in the wrong place:&lt;br /&gt;&lt;br /&gt;Imagine a correspondence transcription project in which volunteers make their edits on a wiki but the transcriptions are hosted on a CMS. One model I've seen often involves editors taking the transcriptions from the wiki system, reviewing and editing them, then publishing the final versions on the CMS.  This is a tempting work-flow -- it makes sense to most of us both because the writer/editor/reader roles are clearly defined and because the act of copying the transcription to the CMS seems analogous to publishing a text.  Unfortunately, this model fosters divergence between the "published" edition and the working copy as volunteers continue to make changes to the transcriptions on the wiki, sometimes ignoring changes made by the reviewer, sometimes correcting text regardless of whether a letter has been pushed to the CMS.  The alternative model has reviewers make their edits within the wiki system itself, with content pushed to the CMS automatically.  In this model, the wiki is the system-of-record; the working copy is the official version.  Since the CMS simply reflects the production platform, it does not diverge from it.  
The difficulty lies in abandoning the idea of a final version.&lt;br /&gt;&lt;br /&gt;It's not at all clear to me how EEBO or ECCO are examples of crowdsourcing, rather than traditional restricted-access databases created and distributed through traditional means, so I'm not sure that they're good examples.</description><link>http://manuscripttranscription.blogspot.com/2011/07/crowdsourcing-and-variant-digital.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>5</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-3450215786199639872</guid><pubDate>Wed, 16 Feb 2011 01:39:00 +0000</pubDate><atom:updated>2011-02-15T22:02:39.226-06:00</atom:updated><title>My Goals for FromThePage</title><description>A couple of recent interactions have made me realize that I've never publicly articulated my goals in developing FromThePage.  Like anyone else managing a multi-year project, my objectives have shifted over time.  However, there are three main themes of my work developing web-based software for transcribing handwritten documents:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Transcribing and publishing family diaries.&lt;/span&gt;  FromThePage was developed to allow me and my immediate family to collaboratively transcribe the diaries of Julia Craddock Brumfield (fl. 1915-1936), my great-great grandmother.  This objective has drifted over time to include the diaries of Jeremiah White Graves (fl. 1823-1878) as well, but despite that addition the effort is on track to achieve this original goal.  
Since the website was announced to my extended family in early 2009, Linda Tucker—a cousin whom I'd never met—has transcribed every single page I've put online, then located and scanned three more diaries that were presumed lost.  The only software development work remaining to complete this goal is the integration of the tool with a publish-on-demand service so that we may distribute the diaries to family members without Internet access.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Creating generally useful transcription software.&lt;/span&gt;  As I developed FromThePage, I quickly realized that the tool would be useful to anyone transcribing, indexing and annotating handwritten material.  It seemed a waste to pour effort into a tool that was only accessible to me, so a new goal arose of converting FromThePage into a viable multi-user software project.  This has been a more difficult endeavor, but in 2010 I released FromThePage under the AGPL, and it's been adopted with great enthusiasm by the Balboa Park Online Collaborative for transcription projects at the San Diego Natural History Museum.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Providing access to privately-held manuscripts.&lt;/span&gt;  The vision behind FromThePage is to generalize my own efforts digitizing family diaries across the broader public.  There is what I call an &lt;span style="font-style: italic;"&gt;invisible archive&lt;/span&gt;--an enormous collection of primary documents that is distributed among the filing cabinets, cedar chests, and attics of the nation's nostalgic great aunts, genealogists, and family historians.  This archive is inaccessible to all but the most closely connected family and neighbors of the documents' owners — indeed it's most often not merely inaccessible but entirely unknown.  
When effort is put into researching this archival material, it's done by amateurs like myself, and more often than not the results are naïve works of historical interpretation: rather than editing and annotating a Civil War diary, the researcher draws from it to create yet another Lost Cause narrative.  I would love for FromThePage to transform this situation, channeling amateur efforts into digitizing and sharing irreplaceable primary material with researchers and family members alike.  This has proven a far greater challenge than my proximate or intermediate goals: technically speaking, the processing and hosting of page scans has been costly and difficult, while my efforts to recruit from the family history community have met with little success.  Nevertheless I remain hopeful that events like this month's RootsTech conference will build the same online network among family researchers that THATCamp has among professional digital humanists.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;</description><link>http://manuscripttranscription.blogspot.com/2011/02/my-goals-for-fromthepage.html</link><author>noreply@blogger.com (Ben W. 
Brumfield)</author><thr:total>1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-6691934409623254185</guid><pubDate>Wed, 02 Feb 2011 20:17:00 +0000</pubDate><atom:updated>2011-02-02T14:21:56.382-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">progress</category><category domain="http://www.blogger.com/atom/ns#">similar projects</category><category domain="http://www.blogger.com/atom/ns#">blogland</category><title>2010: The Year of Crowdsourcing Transcription</title><description>2010 was the year that collaborative manuscript transcription finally caught on.&lt;br /&gt;&lt;br /&gt;Back when I started work on FromThePage in 2005, I got the same response from most people I told about the project: "Why don't you just use OCR software?"  To say that the challenges of digitizing handwritten material were poorly understood might be inaccurate—after all the TEI standard included an entire &lt;a href="http://www.tei-c.org/release/doc/tei-p5-doc/en/html/MS.html"&gt;chapter on manuscripts&lt;/a&gt;—but there were no tools in use designed for the purpose.  Five years later, half a dozen web-based transcription projects are in progress and new projects may choose from several existing tools to host their own.  Crowdsourced transcription was even &lt;a href="http://www.nytimes.com/2010/12/28/books/28transcribe.html?pagewanted=1&amp;amp;_r=1"&gt;written up in the New York Times&lt;/a&gt;!&lt;br /&gt;&lt;br /&gt;I'm going to review the field as I see it, then make some predictions for 2011.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ongoing Structured Transcription Projects&lt;/span&gt;&lt;br /&gt;By far the most successful transcription project is &lt;span style="font-weight: bold;"&gt;FamilySearch Indexing&lt;/span&gt;.  
In 2010, &lt;a href="http://www.facebook.com/familysearchindexing#%21/familysearchindexing?v=app_6009294086"&gt;185,900,667&lt;/a&gt; records were transcribed from manuscript census forms, parish registers, tithe lists, and &lt;a href="http://indexing.familysearch.org/projects/current_projects.jsf"&gt;other sources&lt;/a&gt; world-wide.  This brings the total up to 437,795,000 records double-keyed and reconciled by more than &lt;a href="http://us1.campaign-archive.com/?u=6a13a38a955e01499f3215f48&amp;amp;id=b74d835fa5"&gt;four hundred thousand&lt;/a&gt; volunteers — itself an awe-inspiring number with an &lt;a href="http://ancestryinsider.blogspot.com/2009/07/familysearch-support.html"&gt;equally impressive support structure&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;October saw the launch of &lt;a href="http://www.oldweather.org/"&gt;&lt;span style="font-weight: bold;"&gt;OldWeather&lt;/span&gt;&lt;/a&gt;, a project in which &lt;span style="font-weight: bold;"&gt;GalaxyZoo&lt;/span&gt; applied its crowdsourcing technology to transcription of Royal Navy ship's logs from WWI.  As I write, volunteers have transcribed an astonishing &lt;span class="bottom"&gt;308169&lt;/span&gt; pages of logs — many of which include multiple records.  I hope to do a more detailed review of the software soon, but for now let me note how elegantly the software uses the data itself to engage volunteers, so that transcribers can see the motion of "their ship" on a map as they enter dates, latitudes and longitudes.  
This leverages the immersive nature of transcription as an incentive, projecting users deep within history.&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-weight: bold;"&gt;North American &lt;/span&gt;&lt;a style="font-weight: bold;" href="http://www.pwrc.usgs.gov/bpp/index.cfm"&gt;Bird Phenology Program&lt;/a&gt; transcribed nearly 160,000 species sighting cards between December 2009 and 2010 and maintained their reputation as a model for crowdsourcing projects by publishing the &lt;a href="http://manuscripttranscription.blogspot.com/2010/12/nabpp-transcription-user-survey-results.html"&gt;first user satisfaction survey for a transcription tool&lt;/a&gt;.  Interestingly the program seems to have followed a growth pattern a bit similar to Wikipedia's, as the cards transcribed rose from &lt;span class="text4"&gt;&lt;a href="http://www.pwrc.usgs.gov/bpp/Newsletters/DecemberNewsletter09.html"&gt;203,967&lt;/a&gt; to &lt;/span&gt;&lt;a href="http://www.pwrc.usgs.gov/bpp/Newsletters/Newsletter_December_2010.pdf"&gt;362,996&lt;/a&gt; while the number of volunteers only increased from &lt;span class="text4"&gt;1,666 to 2,204 (32% vs 78%)  — indicating that a core of passionate volunteers remain the most active contributors.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I've only recently discovered &lt;a href="http://demogen.arch.be/register.php?lang=fr"&gt;&lt;span style="font-weight: bold;"&gt;Demogen&lt;/span&gt;&lt;/a&gt;, a project operated by the &lt;span style="font-weight: bold;"&gt;Belgium Rijksarchief&lt;/span&gt; to enlist the public to index handwritten death notices.  Although most of the documentation is in Flemish, the &lt;a href="http://demogen.arch.be/demogenvisu_handleiding_fr_v1_1_0_0.pdf"&gt;Windows-based transcription software&lt;/a&gt; will also operate in French.  I've had trouble finding statistics on how many of the record sets have been completed (a set comprising a score of pages with half a dozen personal records per page).  
By my crude estimate, the 4000ish sets are approximately 63% indexed — say a total of 300,000 records to date.  I'd like to write a more detailed review of Demogen/Visu and would welcome any pointers to project status and community support pages.&lt;br /&gt;&lt;br /&gt;Ancestry.com's &lt;span style="font-weight: bold;"&gt;World Archives Project&lt;/span&gt; has been operating since 2008, but I've been unable to find any statistics on the total number of records transcribed.  The project allows volunteers to index personal information from a fairly &lt;a href="http://www.ancestry.com/wiki/index.php?title=Category:World_Archives_Project"&gt;heterogeneous assortment&lt;/a&gt; of records scanned from microfilm.  Each set of records has its own &lt;a href="http://www.ancestry.com/wiki/index.php?title=World_Archives_Project:_London%2c_England%2c_School_Admissions_and_Discharges%2c_1841-1911"&gt;project page&lt;/a&gt; with &lt;a href="http://www.ancestry.com/wiki/index.php?title=World_Archives_Project:_Langenstein_Zwieberge_Concentration_Camp_Inmate_Cards,_April_1944_to_April_1945"&gt;help and statistics&lt;/a&gt;.  
The keying &lt;a href="http://c.ancestry.com/affiliate/Knowledgebase/Guides/Ancestry/WorldArchives_GettingStartedGuide.pdf"&gt;software&lt;/a&gt; is a Windows-based application free for download by any Ancestry.com registered user, while support is provided through &lt;a href="http://boards.ancestry.com/wap.usrecordsother/227/mb.ashx"&gt;discussion boards&lt;/a&gt; and a wiki.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Ongoing Free-form Transcription Projects&lt;/span&gt;&lt;br /&gt;While &lt;a href="http://manuscripttranscription.blogspot.com/2009/07/wikisource-for-manuscript-transcription.html"&gt;I've written&lt;/a&gt; about &lt;span style="font-weight: bold;"&gt;Wikisource&lt;/span&gt; and its &lt;a href="http://en.wikisource.org/wiki/Help:Side_by_side_image_view_for_proofreading"&gt;&lt;span style="font-weight: bold;"&gt;ProofreadPage&lt;/span&gt;&lt;/a&gt; plug-in before, it remains very much worth following.  &lt;a href="http://toolserver.org/%7Ethomasv/statistics.php"&gt;More than two hundred thousand&lt;/a&gt; scanned pages have been proofread, had problems reconciled, and been reviewed out of 1.3 million scanned pages.  Only a tiny percentage of those are handwritten, but that's still a few thousand pages, making it the most popular automated free-form transcription tool.&lt;br /&gt;&lt;br /&gt;This blog was started to track my own work developing &lt;span style="font-weight: bold;"&gt;FromThePage&lt;/span&gt; to transcribe &lt;a href="http://beta.fromthepage.com/collection/show?collection_id=1&amp;amp;ol=blog"&gt;&lt;span style="font-weight: bold;"&gt;Julia Brumfield's diaries&lt;/span&gt;&lt;/a&gt;.  As I type, &lt;a href="http://beta.fromthepage.com/"&gt;beta.fromthepage.com&lt;/a&gt; hosts 1503 transcribed pages—of which 988 are indexed and annotated—and volunteers are now waiting on me to prepare and upload more page images.  
&lt;a href="http://manuscripttranscription.blogspot.com/2011/01/progress-report-github-archiveorg.html"&gt;Major developments in 2010&lt;/a&gt; included the &lt;a href="https://github.com/benwbrum/fromthepage/wiki"&gt;release&lt;/a&gt; of FromThePage on GitHub under a Free software license and installation of the software by the Balboa Park Online Collaborative for transcription projects by their member institutions.&lt;br /&gt;&lt;br /&gt;Probably the biggest news this year was &lt;span style="font-weight: bold;"&gt;TranscribeBentham&lt;/span&gt;, a project at &lt;span style="font-weight: bold;"&gt;University College London&lt;/span&gt; to crowdsource the transcription of Jeremy Bentham's papers.  This involved the development of &lt;a href="http://www.transcribe-bentham.da.ulcc.ac.uk/td/Transcribe_Bentham"&gt;Transcription Desk&lt;/a&gt;, a MediaWiki-based tool which is slated to be released under an open-source license.  The team of volunteers had transcribed 737 pages of very difficult handwriting when I last consulted the &lt;a href="http://www.transcribe-bentham.da.ulcc.ac.uk/td/Benthamometer"&gt;Benthamometer&lt;/a&gt;.  
The Bentham team has done more than any other transcription project to publicize the field -- explaining their work on their &lt;a href="http://www.ucl.ac.uk/transcribe-bentham/"&gt;blog&lt;/a&gt;, reaching out through the media (including articles in the &lt;a href="http://chronicle.com/blogs/wiredcampus/crowdsourcing-project-hopes-to-make-short-work-of-transcribing-bentham/26829"&gt;Chronicle of Higher Education&lt;/a&gt; and the New York Times), and even highlighting other transcription projects on &lt;a href="http://melissaterras.blogspot.com/2010/03/crowdsourcing-manuscript-material.html"&gt;Melissa Terras's blog&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Halted Transcription Projects&lt;/span&gt;&lt;br /&gt;The &lt;span style="font-weight: bold;"&gt;Historic Journals project&lt;/span&gt; is a &lt;a href="http://journals.byu.edu/"&gt;fascinating tool&lt;/a&gt; for indexing—and optionally transcribing—privately-held diaries and journals.  It's run by Doug Kennard at Brigham Young University, and you can read about his vision in &lt;a href="http://students.cs.byu.edu/%7Ekennard/kennard_jcdl2009_author_vers.pdf"&gt;this FHT09 paper&lt;/a&gt;.  Technically, I found a couple of aspects of the project to be particularly innovative.  First, the software integrates with ContentDM to display manuscript page images from that system within its own context.  Second, the tool is tightly integrated with FamilySearch, the LDS Church's database of genealogical material.  It uses the FamilySearch API to perform searches for personal or place names, and can then use the FamilySearch IDs to uniquely identify subjects mentioned within the texts.  
Unfortunately, because the &lt;a href="https://devnet.familysearch.org/docs/api-overview"&gt;FamilySearch API&lt;/a&gt; is currently limited to LDS members, development on Historic Journals has been temporarily halted.&lt;br /&gt;&lt;br /&gt;Begun as a desktop application in 1998, the &lt;span style="font-weight: bold;"&gt;&lt;a href="http://www.uscript.org/"&gt;uScript&lt;/a&gt; Transcription Assistant&lt;/span&gt; is the longest-running program in the field.  Recently ported over to modern web-based technologies, the system is similar to &lt;a href="http://manuscripttranscription.blogspot.com/2009/06/interview-with-hugh-cayless.html"&gt;Img2XML&lt;/a&gt; and T-PEN in that it links individual transcribed words to the corresponding images within the scanned page.  Although the system is not in use and &lt;a href="https://sourceforge.wpi.edu/sf/wiki/do/viewPage/projects.transcription_mqp/wiki/HomePage"&gt;the source-code&lt;/a&gt; is not accessible outside WPI, you can read papers describing it by WPI  &lt;a href="http://bit.ly/goNvuO"&gt;students in 2003&lt;/a&gt; or &lt;a href="http://bit.ly/hV8tw1"&gt;in 2005&lt;/a&gt; by Fabio Carrera (the faculty member leading the project).   Unfortunately, &lt;a href="http://www.fabiocarrera.com/project/uscript/passed-over-by-google/"&gt;according to Carrera's blog&lt;/a&gt; work on the project has stopped for lack of funding.&lt;br /&gt;&lt;br /&gt;According to the New York Times article, there was an attempt to crowdsource the &lt;a style="font-weight: bold;" href="http://www.papersofabrahamlincoln.org/" title="The Web site of the project."&gt;Papers of Abraham Lincoln&lt;/a&gt;.  
The article quotes project director Daniel Stowell explaining that nonacademic transcribers "produced so many errors and  gaps in the papers that 'we were spending more time and money correcting  them as creating them from scratch.'" The &lt;a href="http://isda.ncsa.uiuc.edu/lpapers/index.php?page_ID=236867&amp;amp;num=01-02-03"&gt;prototype&lt;/a&gt; transcription tool (created by NCSA at UIUC) has been abandoned.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Upcoming Transcription Projects&lt;br /&gt;&lt;/span&gt;The Center for History and New Media at George Mason University is developing a transcription tool called &lt;span style="font-weight: bold;"&gt;Scripto&lt;/span&gt; based on MediaWiki and architected around integration with an external CMS for hosting page images.  The initial transcription project will be their &lt;a style="font-weight: bold;" href="http://wardepartmentpapers.org/"&gt;Papers of the War Department&lt;/a&gt; site, but connector scripts for other content management systems are under development.  Scripto is being developed in a particularly open manner, with the source code available for immediate inspection and download on &lt;a href="https://github.com/chnm/Scripto"&gt;GitHub&lt;/a&gt; and a &lt;a href="http://scripto.org/"&gt;project blog&lt;/a&gt; covering the tool's progress.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://digital-editor.blogspot.com/"&gt;&lt;span style="font-weight: bold;"&gt;T-PEN&lt;/span&gt;&lt;/a&gt; is a tool under development by Saint Louis University to enable line-by-line transcription and paleographic annotation.  It's focused on medieval manuscripts, and automatically identifies the lines of text within a scanned page — even if that page is divided into columns.  The team integrated crowdsourcing into their development process by challenging the public to test and give feedback on their line identification algorithm, gathering perhaps a thousand ratings in a two-week period.  
There's no word on whether T-PEN will be released under a free license. I should also mention that they've got the &lt;a href="http://digital-editor.blogspot.com/2010/11/t-pens-image.html"&gt;best logo&lt;/a&gt; of any transcription tool.&lt;br /&gt;&lt;br /&gt;I covered &lt;span style="font-weight: bold;"&gt;Militieregisters.nl&lt;/span&gt; at length &lt;a href="http://manuscripttranscription.blogspot.com/2010/12/militieregistersnl-and-velehandennl.html"&gt;below&lt;/a&gt;, but the most recent news is that a vendor &lt;a href="http://militieregisters.nl/?p=406"&gt;has been picked&lt;/a&gt; to develop the &lt;span style="font-weight: bold;"&gt;VeleHanden&lt;/span&gt; transcription tool.  I would not be at all surprised if 2011 saw the deployment of that system.&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-weight: bold;"&gt;Balboa Park Online Collaborative&lt;/span&gt; is going into collaborative transcription in a big way with the &lt;a style="font-weight: bold;" href="http://fromthepage.bpoc.org/collection/show?collection_id=1"&gt;Field Notes of Laurence Klauber&lt;/a&gt;&lt;span style="text-decoration: underline;"&gt; &lt;/span&gt;for the San Diego Natural History Museum.  They've picked my own FromThePage to host their transcriptions, and have been driving a lot of the development on that system since October through their enthusiastic feature requests, bug reports, and funding.  Future transcription projects are in the early planning stages, but we're trying to complete features suggested by the Klauber material first.&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-weight: bold;"&gt;University of Iowa Libraries&lt;/span&gt; plan to crowdsource transcription of their &lt;a style="font-weight: bold;" href="http://digital.lib.uiowa.edu/diaries/index.php"&gt;Historic Iowa Children's Diaries&lt;/a&gt;.  
There is no word on the technology they plan to use.&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-weight: bold;"&gt;Getty Research Institute&lt;/span&gt; plans to crowdsource transcription of&lt;span style="font-weight: bold;"&gt; &lt;/span&gt;&lt;a style="font-weight: bold;" href="http://blogs.getty.edu/iris/a-look-inside-j-paul-getty-newly-digitized-diaries/"&gt;J. Paul Getty's diaries&lt;/a&gt;.  This project also appears to be in the very early stages of planning, with no technology chosen.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://thatcampcanberra.org/2010/08/invisible-australians-%E2%80%93-crowdsourcing-and-community-building/"&gt;&lt;span style="font-weight: bold;"&gt;Invisible Australians&lt;/span&gt;&lt;/a&gt; is a digitization project by &lt;span style="font-weight: bold;"&gt;Kate Bagnall&lt;/span&gt; and &lt;span style="font-weight: bold;"&gt;Tim Sherratt&lt;/span&gt; to explore the lives of Australians subjected to the White Australia policy through their extensive records.  While it's still in the planning stages (with only a &lt;a href="http://invisibleaustralians.org/"&gt;set&lt;/a&gt; of &lt;a href="http://discontents.com.au/tag/invisibleaustralians"&gt;project&lt;/a&gt; &lt;a href="http://chineseaustralia.org/?tag=invisible-australians"&gt;blogs&lt;/a&gt; and a &lt;a href="http://www.zotero.org/groups/invisible_australians/items"&gt;Zotero library&lt;/a&gt; publicly visible), the heterogeneity of the source material makes it one of the most ambitious documentary transcription projects I've seen.  Some of the data is traditionally structured (like &lt;a href="http://discontents.com.au/wp-content/uploads/2010/09/Charles-Allen-1909-CEDT-front.jpg"&gt;government forms&lt;/a&gt;), some free-form (like &lt;a href="http://discontents.com.au/wp-content/uploads/2010/09/gum-letter1.jpg"&gt;letters&lt;/a&gt;), and there are photographs and even hand-prints to present alongside the transcription!  
Invisible Australians will be a fascinating project to follow in 2011.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Obscure Transcription Projects&lt;/span&gt;&lt;br /&gt;Because the field is so fragmented, there are a number of projects I follow that are not entirely automated, not entirely public, not entirely collaborative, moribund or awaiting development.  In fact, some projects have so little written about them online that they're almost mysterious.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Commenters to a &lt;a href="http://rogueclassicism.com/2010/03/28/galaxy-zoo-meets-the-oxyrhynchus-papyri/"&gt;blog post at Rogue Classicism &lt;/a&gt;are discussing &lt;a href="http://apaclassics.org/index.php/placement_service/past_years/C53#Adler%20Planetarium"&gt;this APA job posting&lt;/a&gt; for a Classicist to help develop a new GalaxyZoo project transcribing the &lt;a style="font-weight: bold;" href="http://en.wikipedia.org/wiki/Oxyrhynchus_Papyri"&gt;Oxyrhynchus Papyri&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;Some cryptic comments on blog posts covering TranscribeBentham point to &lt;a href="http://www.fadedpage.com/"&gt;&lt;span style="font-weight: bold;"&gt;FadedPage&lt;/span&gt;&lt;/a&gt;, which appears to be a tool similar to Project Gutenberg's Distributed Proofreaders.  
Further investigation has yielded no instances of it being used for handwritten material.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;A blog called &lt;a href="http://thenewcursive.blogspot.com/"&gt;On the Written, the Digital, and the Transcription&lt;/a&gt; tracks development of &lt;span style="font-weight: bold;"&gt;WrittenRummage&lt;/span&gt;, which was apparently a crowdsourced transcription tool that sought to leverage Amazon's Mechanical Turk.&lt;/li&gt;&lt;li&gt;&lt;a style="font-weight: bold;" href="http://geneaknowhow.net/project-papier-digitaal.htm"&gt;Van Papier Naar Digitaal&lt;/a&gt; is a project by Hans den Braber and Herman de Wit in which volunteers photograph or scan handwritten material then send the images to Hans.  Hans reviews them and puts them on the website as a PDF, where Herman publicizes them to transcription volunteers.  Those volunteers download the PDF and use Jacob Boerema's desktop-based  &lt;a href="http://www.jacobboerema.nl/Transcript/Freeware.htm"&gt;Transcript software&lt;/a&gt; to transcribe the records, which are then linked from &lt;a href="http://geneaknowhow.net/digi/bronnen.html"&gt;Digitale Bronbewerkinge Nederland en België&lt;/a&gt;.  With my limited Dutch it is hard for me to evaluate how much has been completed, but in the years that the program has been running its results seem to have been pretty impressive.&lt;/li&gt;&lt;li&gt;BYU's &lt;span style="font-weight: bold;"&gt;Immigrant Ancestors Project&lt;/span&gt; was &lt;a href="http://books.google.com/books?id=OgTFjwSqdWwC&amp;amp;lpg=PR10&amp;amp;ots=jOI_phaiB5&amp;amp;dq=immigrant%20ancestors%20project&amp;amp;pg=PR9#v=onepage&amp;amp;q=immigrant%20ancestors%20project&amp;amp;f=false"&gt;begun in 1996&lt;/a&gt; as a survey of German archival holdings, then was expanded into a crowdsourced indexing project.  
A &lt;a href="http://fht.byu.edu/prev_workshops/workshop09/papers/2-1-Witmer.pdf"&gt;2009 article by Mark Witmer&lt;/a&gt; predicts the imminent roll-out of a new version of the indexing software, but the &lt;a href="http://immigrants.byu.edu/users/new"&gt;project website&lt;/a&gt; looks quite stale and says that it's no longer accepting volunteers.&lt;/li&gt;&lt;li&gt;In November, &lt;a href="http://groups.google.com/group/islandora/browse_thread/thread/ab9677242b8d16ef?pli=1"&gt;a Google Groups post&lt;/a&gt; highlighted the use of &lt;span style="font-weight: bold;"&gt;Islandora&lt;/span&gt; for side-by-side presentation of a page image and a TEI editor for transcription.  However I haven't found any examples of its use for manuscript material.&lt;/li&gt;&lt;li&gt;&lt;a style="font-weight: bold;" href="http://wiktenauer.com/wiki/Main_Page"&gt;Wiktenauer&lt;/a&gt; is a MediaWiki installation for fans of western martial arts.  It hosts &lt;a href="http://wiktenauer.com/wiki/Category:Transcription"&gt;several projects&lt;/a&gt; transcribing and translating &lt;a href="http://wiktenauer.com/wiki/Martin_H%C3%BCndsfelder"&gt;medieval manuals of fighting and swordsmanship&lt;/a&gt;, although I haven't yet figured out whether they're automating the transcription.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Melissa Terras' manuscript transcription blog post mentioned a Drupal-based tool called &lt;span style="font-weight: bold;"&gt;OpenScribe&lt;/span&gt;, built by the &lt;a style="font-weight: bold;" href="http://www.nzetc.org/"&gt;New Zealand Electronic Text Centre&lt;/a&gt;.  However, the &lt;a href="http://code.google.com/p/openscribe/"&gt;Google Code site&lt;/a&gt; doesn't show any updates since mid-2009, so I'm not sure how active the project is.  
This project is particularly difficult to research because "OpenScribe" is also the name chosen for an audio transcription tool hosted on SourceForge as well as a commercial scanning station.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;I welcome any corrections or updates on these projects.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Predictions for 2011&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Emerging Community&lt;/span&gt;&lt;br /&gt;Nearly all of the transcription projects I've discussed were begun in isolation, unaware of previous work towards transcription tools.  While I expect this fragmented situation to continue--in fact I've seen isolated proposals as recently as &lt;a href="http://www.hastac.org/blogs/shawnwmoore/couple-new-project-ideas"&gt;Shawn Moore's October 12 HASTAC post&lt;/a&gt;--it should lessen a bit as toolmakers and project managers enter into dialogue with each other on &lt;a href="http://digitalhumanities.org/answers/topic/collaborative-software-for-transcribing-digital-images-of-handwritten-documents"&gt;comment threads&lt;/a&gt;, &lt;a href="https://docs.google.com/document/d/1CbXW-KuuGsfRvpdu8ZdmAUudO4zNVGhRMOx8SdB1yVc/edit?hl=en&amp;amp;authkey=CK6a5f0L#"&gt;conference panels&lt;/a&gt; or &lt;a href="https://github.com/"&gt;GitHub&lt;/a&gt;.  Tentative steps were made towards overcoming linguistic division in 2010, with Dutch archivists covering TranscribeBentham and a scattered bit of bloggy conversation between Dutch, German, English and American participants.  
The publicity given to projects like OldWeather, Scripto, and TranscribeBentham can only help this community form.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;No Single Tool&lt;/span&gt;&lt;br /&gt;We will not see the development of a single tool that supports transcription of both structured and free-form manuscripts, nor both paleographic and semantic annotation in 2011.  The field is too young and fragmented -- most toolmakers have enough work providing the basic functionality required by their own manuscripts.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;New Client-side Editors&lt;/span&gt;&lt;br /&gt;Although I don't foresee convergence of server tools, there is already some exciting work being done towards Javascript-based editors for TEI, the mark-up language that informs most manuscript annotation.  &lt;a href="https://github.com/jdickie/TEILiteEditor"&gt;TEILiteEditor&lt;/a&gt; is an open-source WYSIWYG for editing TEI, while &lt;a href="https://github.com/dougreside/RaiseXML"&gt;RaiseXML&lt;/a&gt; is an open-source editor for manipulating TEI tags directly.  Both projects have seen a lot of activity over the past few weeks, and it's easy to imagine a future in which many different transcription tools support the same user-facing editor.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;External Integration&lt;/span&gt;&lt;br /&gt;2010 already saw strides being made towards integration with external CMSs, with BYU's Historic Journals serving page images from ContentDM and FromThePage serving page images from the Internet Archive.  Scripto is apparently designed entirely around CMS integration, as it does not host images itself and is architected to support connectors for many different content management systems.  
I feel that this will be a big theme of transcription tool development in 2011, with new support for feeding transcriptions and text annotations back to external CMSs.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Outreach/Volunteer Motivation&lt;/span&gt;&lt;br /&gt;We're learning that a key to success in crowdsourcing projects is recruiting volunteers.  I think that 2011 will see a lot of attention paid to identifying and enlisting existing communities interested in the subject matter for a transcription project.  In addition to finding volunteers, projects will better understand volunteer motivation and the trade-offs between game-like systems that encourage participation through score cards and points on the one hand, and immersive systems that enhance the volunteers' engagement with the text on the other.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Taxonomy&lt;/span&gt;&lt;br /&gt;As the number of transcription projects multiplies, I think that we will be able to start generalizing from the unique needs of each collection of manuscript material to form a sort of taxonomy of transcription projects.  In the list above, I've separated the projects indexing structured data like militia rolls from those dealing with free-form text like diaries or letters.  I think that in 2011 we'll be able to classify projects by their paleographic requirements, the kinds of analysis that will be performed on the transcribed texts, the quantity of non-textual images that must be incorporated into the transcription presentation, and other dimensions.  
It's possible that the existing tools will specialize in a few of these areas, providing support for needs similar to those of their original project so that a sort of decision tree could guide new projects toward the appropriate tool for their manuscript material.&lt;br /&gt;&lt;br /&gt;2011 is going to be a great year!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-6691934409623254185?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/6cz_zPjERJQ" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2011/02/2010-year-of-crowdsourcing.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>8</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-7655543714991713580</guid><pubDate>Tue, 04 Jan 2011 11:47:00 +0000</pubDate><atom:updated>2011-01-04T06:27:49.093-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">progress</category><title>Progress Report: GitHub, Archive.org Integration, and General Availability</title><description>2010 saw big changes in FromThePage. &lt;br /&gt;&lt;ul&gt;&lt;li&gt;The &lt;a href="http://www.bpoc.org/"&gt;Balboa Park Online Collaborative&lt;/a&gt; started using FromThePage to transcribe the field notes of herpetologist &lt;a href="http://en.wikipedia.org/wiki/Laurence_Klauber"&gt;Laurence Klauber&lt;/a&gt;.  Perian Sully, Rich Cherry, and all the other folks there have been fantastic to work with: full of enthusiasm and new ideas for the system while patient with the bugs that we've discovered.  
This is the first institution to install FromThePage, and their needs have driven a lot of development since October, including&lt;/li&gt;&lt;li&gt;Internet Archive integration: As you can see on the &lt;a href="http://fromthepage.bpoc.org/display/display_page?ol=d_act_page&amp;amp;page_id=7340"&gt;Klauber site&lt;/a&gt;, FromThePage now integrates directly with books hosted on the Internet Archive.  This means that FromThePage gets to use the BookReader (in modified form) with its spiffy zoom and pan capabilities while delegating the expensive work of image hosting to Archive.org.  It also reduces duplication of data and may enhance findability of the transcriptions.  Best of all, the tedious process of uploading, assembling, and titling page images can be skipped, as FromThePage now &lt;a href="https://github.com/benwbrum/fromthepage/wiki/Importing-Works-from-the-Internet-Archive"&gt;imports the book structure and even the OCRed page titles from Archive.org derivative files&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;As you can see from that last link, I've &lt;a href="https://github.com/benwbrum/fromthepage/wiki"&gt;transferred FromThePage over to GitHub&lt;/a&gt;, released it under the Affero GPL, and created some extensive documentation on the wiki.  So FromThePage is officially Free software, available for immediate use.&lt;/li&gt;&lt;/ul&gt;If you're interested in hosting a transcription project on FromThePage, drop me a line at benwbrum@gmail.com and I'll help you get started.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-7655543714991713580?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/IgQH63fAuyY" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2011/01/progress-report-github-archiveorg.html</link><author>noreply@blogger.com (Ben W. 
Brumfield)</author><thr:total>2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-5496137817866140129</guid><pubDate>Thu, 30 Dec 2010 15:34:00 +0000</pubDate><atom:updated>2010-12-30T09:50:01.539-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">similar projects</category><category domain="http://www.blogger.com/atom/ns#">blogland</category><title>Two New Diary Transcription Projects</title><description>The last few weeks have seen the announcement of two new transcription projects.  I'm particularly excited about them because--like &lt;a href="http://beta.fromthepage.com/"&gt;FromThePage&lt;/a&gt;--their manuscripts are diaries and they plan to open the transcription tasks to the public.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://blog.lib.uiowa.edu/dls/2010/11/22/dear-diary/"&gt;&lt;span style="font-weight: bold;"&gt;Dear Diary&lt;/span&gt;&lt;/a&gt; announces the digitization of University of Iowa's &lt;a href="http://digital.lib.uiowa.edu/diaries/index.php"&gt;Historic Iowa Children's Diaries:&lt;/a&gt;&lt;br /&gt;&lt;blockquote&gt;We have a deep love for historic diaries as well, and we’re currently hard at work developing a site that will allow the public to help enhance our collections through “crowdsourcing” or collaborative transcription of diaries and other manuscript materials. Stay tuned!&lt;br /&gt;&lt;/blockquote&gt;&lt;a href="http://blogs.getty.edu/iris/a-look-inside-j-paul-getty-newly-digitized-diaries/"&gt;&lt;b&gt;A Look Inside J. Paul Getty’s Newly Digitized Diaries&lt;/b&gt;&lt;/a&gt; describes J. 
Paul Getty's diaries and their contents, and mentions in passing that&lt;blockquote&gt;We will soon launch a website that will invite your participation to perform the transcriptions (via crowdsourcing), thus rendering the diaries keyword-searchable and dramatically improving their accessibility.&lt;/blockquote&gt;I will be very interested to follow these projects and see which transcription systems they use or develop.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-5496137817866140129?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/wltyYept0mQ" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2010/12/two-new-diary-transcription-projects.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-1572185923055399160</guid><pubDate>Wed, 15 Dec 2010 14:10:00 +0000</pubDate><atom:updated>2010-12-15T09:07:06.861-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">nabpp</category><category domain="http://www.blogger.com/atom/ns#">similar projects</category><title>NABPP Transcription User Survey Results</title><description>The &lt;a href="http://manuscripttranscription.blogspot.com/2009/05/review-usgs-north-american-bird.html"&gt;ever-innovative &lt;/a&gt;North American Bird Phenology Program has just released the &lt;a href="http://www.pwrc.usgs.gov/bpp/SatisfactionSurveyReport2010/Satisfaction_Survey_2010.BK.html"&gt;results of their user satisfaction survey&lt;/a&gt;.  
In addition to offering kudos to the NABPP for pleasing its users--who had transcribed 346,786 species observation cards &lt;a href="http://www.pwrc.usgs.gov/bpp/Newsletters/Newsletter_November_2010.pdf"&gt;as of November&lt;/a&gt;--I'd like to highlight some of the survey results -- after all, this is the first user survey for a transcription tool that I'm aware of.&lt;br /&gt;&lt;br /&gt;This chart shows answers to the question "What inspires you to volunteer?":&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.pwrc.usgs.gov/bpp/SatisfactionSurveyReport2010/Slide5.GIF"&gt;&lt;img src="http://www.pwrc.usgs.gov/bpp/SatisfactionSurveyReport2010/Slide5.GIF" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Users overwhelmingly cite the importance of the program and their love of nature.  The importance response obviously speaks to the "save the world" aspect of the NABPP's mission tracking climate change.  But how does a love of nature inspire someone to sit in front of a computer and decipher &lt;a href="http://www.pwrc.usgs.gov/bpp/Charts2.cfm"&gt;seventy thousand&lt;/a&gt; pieces of old handwriting?  Based on my experience with the NABPP (and indeed with &lt;a href="http://beta.fromthepage.com/collection/show?collection_id=1&amp;amp;ol=blog"&gt;Julia Brumfield's diaries&lt;/a&gt;), I think the answer is simple: transcription is an immersive activity.  It is rare that we read more deeply than when we transcribe a text, so the transcriber is transported to a different place and time in the same way as a reader lost in a novel or a player in a good video game.&lt;br /&gt;&lt;br /&gt;While the whole survey is worth reading, one other response stood out for me.  Answering the open-ended "how can we improve" question, one respondent requested &lt;span style="font-style: italic;"&gt;examples of properly transcribed "difficult" cards on the transcription page&lt;/span&gt;.  
I know that I as a tool-maker tend to concentrate on providing users help using the software I develop.  However, transcription tools need to provide users with non-technical guidance on paleographic challenges, unusual formatting, and other issues that may be unique to the material being transcribed.  I'm not entirely sure how to accomplish this as a developer, other than by facilitating communication among transcribers, editors, and project coordinators.&lt;br /&gt;&lt;br /&gt;Let me conclude by offering the NABPP my congratulations on their success and my thanks for their willingness to share these results with the rest of us.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-1572185923055399160?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/Ql5s6xReAxQ" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2010/12/nabpp-transcription-user-survey-results.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-2368398586986076777</guid><pubDate>Fri, 10 Dec 2010 12:16:00 +0000</pubDate><atom:updated>2010-12-10T09:46:24.189-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">velehanden</category><category domain="http://www.blogger.com/atom/ns#">similar projects</category><category domain="http://www.blogger.com/atom/ns#">blogland</category><title>Militieregisters.nl and Velehanden.nl</title><description>&lt;a href="http://militieregisters.nl/" target="_blank"&gt;Militieregisters.nl&lt;/a&gt; is a new transcription project organized by the City Archive Amsterdam that plans to use crowdsourcing to index militia registers from several Dutch archives.  
It's quite ambitious, and there are a number of innovative features about the project I'd like to address.  However, I haven't seen any English-language coverage of the project so I'll try to translate and summarize it as best as my limited Dutch and Google's imperfect algorithms allow before offering my own commentary.&lt;br /&gt;&lt;div id=":xf"&gt;&lt;br /&gt;&lt;i&gt;With the project "many hands make light work", Stadsarchief Amsterdam will make all militia records searchable online through crowdsourcing -- not just inventories, but indexes.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://militieregisters.nl/?page_id=166" target="_blank"&gt;About&lt;/a&gt;&lt;br /&gt;&lt;blockquote&gt;To research how archives and online users &lt;span&gt;can work together to improve access to the  archives, the &lt;/span&gt;&lt;span&gt;&lt;span style="direction: ltr; text-align: left;"&gt;Stadsarchief Amsterdam has set up the "Many Hands" project.  With this project, we want to create a platform where all Dutch archives can offer their scans to be indexed and where all archival users can &lt;/span&gt;&lt;/span&gt;contribute in exchange for fun, contacts, information, honor, scanned goods, and whatever else we can think of.&lt;br /&gt;&lt;br /&gt;To ask the whole Netherlands to index, we must start with archives that are important to the whole Netherlands.  As the first pilot, we have chosen the Militia Registers, but there will soon be more archival files to be indexed so that everyone can choose something within his interest and skill-level. 
&lt;/blockquote&gt;&lt;span&gt;&lt;/span&gt;&lt;p&gt;&lt;a href="http://militieregisters.nl/?page_id=45" target="_blank"&gt;&lt;span&gt;All Militia Registers Online&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;&lt;span&gt;Militia registers contain the records of all boys who were entered for conscription into military service during almost the entire 19th and part of the 20th centuries.  These records were created in the entire Netherlands and are kept in many national and municipal archives.&lt;br /&gt;&lt;/span&gt; &lt;/p&gt;&lt;p&gt;&lt;span&gt;&lt;span style="direction: ltr; text-align: left;"&gt;&lt;/span&gt;&lt;b&gt; &lt;/b&gt;&lt;/span&gt;&lt;span&gt;The militia records are eminently suitable for large-scale digitization.  The records consist of printed sheets.  This uniformity makes scanning easy and thus inexpensive.  More importantly, this resource is interesting for anyone with Dutch ancestry.  Therefore we expect many volunteers to lend a hand to help unlock this wonderful resource, and that the online indexes will eventually attract many visitors.&lt;br /&gt;&lt;/span&gt; &lt;/p&gt;But the first step is to scan the records.  Soon the scanning of approximately one million pages will begin as a start.  The more records we have digitized, the cheaper the scanning becomes, and the more attractive the indexing project becomes to volunteers.  The Stadsarchief therefore calls upon all Dutch archival institutions to join!&lt;/blockquote&gt;&lt;a href="http://militieregisters.nl/?page_id=273" target="_blank"&gt;FAQ&lt;/a&gt;&lt;blockquote&gt;&lt;ul&gt;&lt;li&gt;&lt;span&gt;&lt;b&gt;At our institution, online scans are provided for free.  
Why should people pay for scans&lt;/b&gt;&lt;/span&gt;&lt;span&gt;&lt;b&gt;?&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span&gt;Revenues from sales of scans over the two years duration of the project are a part of the financing of the project.&lt;/span&gt; &lt;span&gt;The budget is based on the  rates as used in the City Archives: € 0.50 to € 0.25 per scan, depending  on the number of scans that someone buys.&lt;/span&gt;&lt;span&gt;  We ask the institutions that participate throughout the project do not sell &lt;/span&gt;&lt;span&gt;their own scans &lt;/span&gt;&lt;span&gt;or make them available for free.&lt;/span&gt; &lt;span&gt;After the completion of the project, each institution may follow its own policy for providing the scans.&lt;/span&gt;&lt;/li&gt;&lt;li&gt; &lt;span&gt;&lt;span style="direction: ltr; text-align: left;"&gt;&lt;/span&gt;&lt;b&gt;If we participate, who is the owner of the scans and index data?&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span&gt;After production and payment, t&lt;/span&gt;&lt;span&gt;he scans will be delivered immediately to the institution which provided the militia records.&lt;/span&gt; &lt;span&gt;The index information will also be supplied to the institutions &lt;/span&gt;&lt;span&gt;after completion of the project&lt;/span&gt;&lt;span&gt;.&lt;/span&gt; &lt;span&gt;The institution remains the owner, but &lt;/span&gt;&lt;span&gt;during the project period of approximately two years &lt;/span&gt;&lt;span&gt;the material may not be used  outside of the project.&lt;/span&gt;&lt;/li&gt;&lt;li&gt; &lt;span&gt;&lt;span style="direction: ltr; text-align: left;"&gt;&lt;/span&gt;&lt;b&gt;What are the&lt;/b&gt;&lt;/span&gt;&lt;span&gt;&lt;b&gt; financial risks&lt;/b&gt;&lt;/span&gt;&lt;span&gt;&lt;b&gt; for participating&lt;/b&gt;&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;&lt;b&gt; archives&lt;/b&gt;&lt;/span&gt;&lt;span&gt;&lt;b&gt;?&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span&gt;Participants pay only for their scans: the actual costs 
and preparation of the scanning process.&lt;/span&gt; &lt;span&gt;The development and deployment of the index tool, volunteer  recruitment and two years maintenance of the website from the project has been&lt;/span&gt;&lt;span&gt; funded by &lt;/span&gt;&lt;span&gt;grants and contributions by &lt;/span&gt;&lt;span&gt;&lt;span style="direction: ltr; text-align: left;"&gt;Stadsarchief  Amsterdam&lt;/span&gt;&lt;/span&gt;&lt;span&gt;.&lt;/span&gt; &lt;span&gt;There are no financial surprises.&lt;/span&gt;&lt;/li&gt;&lt;li&gt; &lt;span&gt;&lt;span style="direction: ltr; text-align: left;"&gt;&lt;/span&gt;&lt;b&gt;What does the schedule for the project look like?&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&lt;span style="direction: ltr; text-align: left;"&gt;&lt;/span&gt;On July 12 and September 13 we are organizing meetings with potential participants to answer your questions.&lt;/span&gt;&lt;span&gt;&lt;span style="direction: ltr; text-align: left;"&gt;&lt;/span&gt; Until October 1, 2010, participants will sign up to  participate in the project, in order for the scanning to start on that day.&lt;/span&gt; &lt;span&gt;The tender process runs about 2 months, so&lt;/span&gt;&lt;span&gt; a supplier can be contracted&lt;/span&gt;&lt;span&gt; in 2010.&lt;/span&gt;&lt;span&gt;&lt;span style="direction: ltr; text-align: left;"&gt;&lt;/span&gt; In January 2011 we will start scanning, volunteers can begin&lt;/span&gt;&lt;span&gt; indexing &lt;/span&gt;&lt;span&gt;in the spring&lt;/span&gt;&lt;span&gt;.&lt;/span&gt; &lt;span&gt;The sister site &lt;a href="http://www.velehanden.nl/" target="_blank"&gt;www.velehanden.nl&lt;/a&gt;--where the indexing will take place--will continue online for at least one year.&lt;/span&gt; &lt;/li&gt;&lt;li&gt; &lt;span&gt;&lt;span style="direction: ltr; text-align: left;"&gt;&lt;b&gt;Will the indexing tool be developed as Open Source software&lt;/b&gt;&lt;/span&gt;&lt;b&gt;?&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&lt;span 
style="direction: ltr; text-align: left;"&gt;&lt;/span&gt;It is currently impossible to say whether the indexing tool&lt;/span&gt;&lt;span&gt; will be developed&lt;/span&gt;&lt;span&gt; via/as open source software.&lt;/span&gt; &lt;span&gt;&lt;span style="direction: ltr; text-align: left;"&gt;&lt;/span&gt;Of primary importance is finding the most cost-effective solution and that the software performs well and is user-friendly.&lt;/span&gt; &lt;span&gt;The only hard requirement is the use of open standards for the import and export of metadata, so that vendor independence is guaranteed.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/blockquote&gt;&lt;a href="http://militieregisters.nl/wp-content/uploads/2010/10/RFP-VeleHanden.nl_.pdf" target="_blank"&gt;RFP (Warning: very loose translation!)&lt;/a&gt;&lt;br /&gt;&lt;blockquote&gt;Below are some ideas SAA has formulated regarding the functionality and sustainability of VeleHanden.nl:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Facilities for importing and managing scans, and for exporting data in XML format.&lt;/li&gt;&lt;li&gt;Scan viewer with advanced features.&lt;/li&gt;&lt;li&gt;Functionality to simultaneously run multiple projects for indexing, transcription, and translation of scans.&lt;/li&gt;&lt;li&gt;Features for organizing and managing data from volunteer groups and for selectively enabling features for participants and volunteer coordinators.&lt;/li&gt;&lt;li&gt;Features for communication between archival staff and volunteers, as well as for volunteers to provide support to each other.&lt;/li&gt;&lt;li&gt;Automated features for control of the data produced.&lt;/li&gt;&lt;li&gt;Rewards system (material and immaterial) for volunteers.&lt;/li&gt;&lt;li&gt;Many volunteers may work in parallel to process scans quickly and effectively.&lt;/li&gt;&lt;li&gt;Facilities to search, view and share scans online.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/blockquote&gt;Other Dutch bloggers have covered the unique approach that Stadsarchief 
Amsterdam envisions for volunteer motivation and project support:  &lt;span style="font-weight: bold;"&gt;Militieregisters.nl users who want to download scans may pay for them either in cash or in labor, by indexing &lt;span style="font-style: italic;"&gt;N&lt;/span&gt; scanned pages&lt;/span&gt;.  Christian van der Ven's blog post &lt;a href="http://www.digitalearchivaris.nl/2010/07/crowdsourcen-rond-militieregisters.html"&gt;Crowdsourcen rond militieregisters&lt;/a&gt; and the associated comment thread discuss this intensely and are worth reading in full.  Here's a loosely translated excerpt:&lt;br /&gt;&lt;blockquote&gt;&lt;span&gt;The project assumes that it cannot allow the volunteer to indicate whether he wants to index Zeeland or Groningen.&lt;/span&gt;  &lt;span&gt;It is--in the words of  the project leader--the Orange feeling, to see if the rural people can volunteer  and not just concentrate on their own location.&lt;/span&gt; &lt;span&gt;Indexing people from their own village?&lt;/span&gt; &lt;span&gt; Please, not that!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;&lt;span class="google-src-text" style="direction: ltr; text-align: left;"&gt;&lt;/span&gt;Well  &lt;/span&gt;&lt;span&gt;since the last World Cup &lt;/span&gt;&lt;span&gt;I'm feeling Orange again, but overall experience and research in archives teaches that all country people are  more interested in the history of themselves, their own ancestors, their  homes and the surrounding area.&lt;/span&gt; &lt;span&gt;The closer [the data], the more motivation to do something.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;&lt;span class="google-src-text" style="direction: ltr; text-align: left;"&gt;&lt;/span&gt;And if the purpose of this project is to build an indexing tool, to scan registers, and then to obtain indexes through crowdsourcing  as quickly as possible, it seems to me that the public should be given what it wants: local resources if desired.&lt;/span&gt; &lt;span&gt;&lt;span 
class="google-src-text" style="direction: ltr; text-align: left;"&gt;&lt;/span&gt;What I suggest is a choice menu: do you want records from your source environment?&lt;/span&gt; &lt;span&gt;&lt;span class="google-src-text" style="direction: ltr; text-align: left;"&gt;&lt;/span&gt;Do you want them maybe only from a certain period?&lt;/span&gt; &lt;span&gt;Or do you want them filtered by time and place?&lt;/span&gt; &lt;span&gt;That kind of choice will trigger as many people as possible to participate, I think.&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;&lt;span style="font-weight: bold;"&gt;My Observations:&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The pay-or-transcribe approach for acquiring scans is really innovative.  Offering people alternatives for supporting the project is a great way of serving the varied constituencies  that compose genealogical researchers, allowing cash-poor, time-rich users (like retirees) an easy way to access the project.&lt;/li&gt;&lt;li&gt;Although I have no experience in the subject, I suspect that this federated approach to digitization--taking structurally-similar material from regional archives and scanning/hosting it centrally--has a lot of possibilities.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Christian's criticism is quite valid, and drives right to the paradox of motivation in crowdsourcing: do you strive for breadth using external incentives like scoreboards and free recognition, or do you strive for depth and cultivate passionate users through internal incentives like deep engagement with the source material?  The trade-offs involved in volunteer motivation are a fascinating topic, and I hope to do a whole post on it soon.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;One potential flaw is that it will be very hard to charge to view the scans when transcribers must be able to see the scans to do their indexing.  
I gather that the randomization in VeleHanden will address this.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The maximum budget described in the RFP is 150,000 euros.  As a real-life software developer, it's hard for me to see how this would pay for building a transcription tool, index  database, scan import tool, scan CMS, search database and (since they expect to sell the  searched scans) eCommerce.  And that  includes running servers too!&lt;/li&gt;&lt;li&gt;This is yet another project that's transcribing structured data from tabular sources, which would benefit from the &lt;a href="http://manuscripttranscription.blogspot.com/2007/05/familysearch-indexer-review.html"&gt;FamilySearch Indexer&lt;/a&gt;, if only it were open-source (or even for sale).&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-2368398586986076777?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/C7FHcPkB_ls" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2010/12/militieregistersnl-and-velehandennl.html</link><author>noreply@blogger.com (Ben W. 
Brumfield)</author><thr:total>5</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-8865994768597798254</guid><pubDate>Sat, 04 Dec 2010 17:32:00 +0000</pubDate><atom:updated>2010-12-04T12:33:05.854-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">mediawiki</category><category domain="http://www.blogger.com/atom/ns#">similar projects</category><category domain="http://www.blogger.com/atom/ns#">rails</category><title>Errata</title><description>Two important errata in previous posts are worth highlighting:&lt;br /&gt;&lt;h3&gt;Wikisource for Manuscript Transcription&lt;/h3&gt;Commenters on my most recent post, &lt;a href="http://manuscripttranscription.blogspot.com/2009/07/wikisource-for-manuscript-transcription.html"&gt;"Wikisource for Manuscript Transcription"&lt;/a&gt;, have pointed out that the Wikisource rule prohibiting the addition of unpublished works--thereby almost entirely prohibiting manuscript transcription projects--is specific to the language domains.  The English and French language Wikisource projects enforce the prohibition, but the German and Hebrew language Wikisource projects do not.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://en.wikipedia.org/wiki/User:Dovi"&gt;Dovi&lt;/a&gt;, a Wikipedia editor whom I used to bump into on Ancient Near Eastern language articles, points to his work on the Wikisource edition of Arukh Hashulchan, highlighting &lt;a href="http://he.wikisource.org/wiki/SIMANIM"&gt;this example of &lt;i&gt;simanim&lt;/i&gt;&lt;/a&gt;.  Sadly, my Hebrew isn't up to the task of commenting on this effort. &lt;br /&gt;&lt;br /&gt;I was delighted to see that the post also inspired &lt;a href="http://de.wikisource.org/wiki/Wikisource:Skriptorium/Archiv/2010/Oktober#Feedback_aus_Texas"&gt;a bit of commentary&lt;/a&gt; on the German-language Wikisource Skriptorium.  
In particular, WikiAnika seemed to agree with my criticism of flat transcription: &lt;i&gt;Man scheint in gewisser Weise noch an der Printform „zu kleben“. Oder um einen Dritten zu zitieren: „WS ist eine Sackgasse – man findet vielleicht zu einem interessanten Text hin, aber man kommt nicht mehr weiter.“&lt;/i&gt;&lt;br /&gt;&lt;h3&gt;A Short Introduction to &lt;code&gt;before_filter&lt;/code&gt;&lt;/h3&gt;In my &lt;a href="http://manuscripttranscription.blogspot.com/2007/06/rails-short-introduction-to.html"&gt;2007 post on Rails filters&lt;/a&gt;, I mentioned using filters to authenticate users:&lt;blockquote&gt;Filters are called filters because they return a Boolean, and if that return value is false, the action is never called. You can use the the &lt;code&gt;logged_in?&lt;/code&gt; method of &lt;code&gt;:acts_as_authenticated&lt;/code&gt; to prohibit access to non-users — just add &lt;code&gt;before_filter :logged_in?&lt;/code&gt; to your controller class and you're set!&lt;/blockquote&gt;Thanks to some changes made in Rails 2.0, this is not just wrong but &lt;b&gt;dangerously wrong&lt;/b&gt;.  
Rails now ignores filter return values, allowing unauthorized access to your actions if you followed my advice.&lt;br /&gt;&lt;br /&gt;Because Rails no longer pays any attention to the return values of controller filters, I've had to replace all my &lt;code&gt;return &lt;i&gt;condition&lt;/i&gt;&lt;/code&gt; statements with &lt;code&gt;redirect_to &lt;i&gt;somewhere&lt;/i&gt; unless &lt;i&gt;condition&lt;/i&gt;&lt;/code&gt;, since a filter that redirects or renders halts the request before the action runs.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-8865994768597798254?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/iqwImWLLrNw" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2010/12/errata.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-3860588161432735050</guid><pubDate>Wed, 21 Jul 2010 13:17:00 +0000</pubDate><atom:updated>2010-07-21T09:53:57.380-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">mediawiki</category><category domain="http://www.blogger.com/atom/ns#">similar projects</category><title>Wikisource for Manuscript Transcription</title><description>Of the crowdsourcing projects that have real users doing manuscript transcription, one of the largest is an offshoot of &lt;span class="il"&gt;Wikisource&lt;/span&gt;.  &lt;a href="http://www.mediawiki.org/wiki/Extension:Proofread_Page"&gt;ProofreadPage&lt;/a&gt; was an extension to MediaWiki &lt;a href="http://fr.wikisource.org/wiki/Aide:Affichage_par_pages"&gt;created around 2006&lt;/a&gt; on the French-language &lt;span class="il"&gt;Wikisource&lt;/span&gt; as a &lt;span class="il"&gt;Wikisource&lt;/span&gt;/Internet Archive replacement for Project Gutenberg's Distributed Proofreaders.  
They were taking &lt;a href="http://www.mediawiki.org/wiki/Manual:How_to_use_DjVu_with_MediaWiki"&gt;DjVu&lt;/a&gt; files from the Internet Archive and using them as sources (via OCR and correction) for &lt;span class="il"&gt;Wikisource&lt;/span&gt; pages.  This spread to the other &lt;span class="il"&gt;Wikisource&lt;/span&gt; sites around 2008, radically changing the way &lt;span class="il"&gt;Wikisource&lt;/span&gt; worked.  More recently the German &lt;span class="il"&gt;Wikisource&lt;/span&gt; has started using ProofreadPage for &lt;a href="http://de.wikisource.org/wiki/Friedrich_Accum_an_Johann_Gottlob_Nathusius_%2822._August_1822%29"&gt;letters&lt;/a&gt;, &lt;a href="http://archiv.twoday.net/stories/6030643/"&gt;pamphlets, and broadsheets&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_UZtU3sSV0MY/TEcEr1ISitI/AAAAAAAAACY/45kOBReGxaw/s1600/typesetting_screenshot.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: right; cursor: pointer; width: 400px; height: 149px;" src="http://3.bp.blogspot.com/_UZtU3sSV0MY/TEcEr1ISitI/AAAAAAAAACY/45kOBReGxaw/s400/typesetting_screenshot.jpg" alt="" id="BLOGGER_PHOTO_ID_5496367021271714514" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The best example of ProofreadPage for handwriting is Winkler's &lt;a href="http://bit.ly/b4H8ZF"&gt;Remarks on the Russian Campaign 1812-1813&lt;/a&gt;.  First, the presentation is lovely.  They've dealt with a typographically difficult text and are presenting alternate typefaces, illustrations, and even marginalia in a clear way in the transcription.  The page numbers link to the images of the pages, and they've come up with transcription conventions which are clearly presented at the top of the text.  
This is impressive for a volunteer-driven edition!&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_UZtU3sSV0MY/TEcIS-KLPfI/AAAAAAAAACw/Vpdfg3v5nr8/s1600/paragraph_screenshot.jpg"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 218px; height: 164px;" src="http://4.bp.blogspot.com/_UZtU3sSV0MY/TEcIS-KLPfI/AAAAAAAAACw/Vpdfg3v5nr8/s400/paragraph_screenshot.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5496370992245325298" /&gt;&lt;/a&gt;Technically, the Winkler example illustrates ProofreadPage's solution to a difficult problem: how to organize and display pages, sections, and works in the appropriate context.  This is not an issue that I've encountered with &lt;a href="http://beta.fromthepage.com/?ol=blog"&gt;FromThePage&lt;/a&gt;—the &lt;a href="http://beta.fromthepage.com/display/read_work?ol=blog&amp;amp;work_id=6"&gt;Julia Brumfield Diaries&lt;/a&gt; are organized with only &lt;a href="http://beta.fromthepage.com/display/display_page?ol=blog&amp;amp;page_id=1127"&gt;one entry per page&lt;/a&gt;—but I've worried about it since XML is so poorly suited to represent overlapping markup.  When viewing Winkler as a work, paragraphs span multiple manuscript pages but are aggregated seamlessly into the text: search for "sind den Sommer" and you'll find a paragraph with a page-break in the middle of it, indicated by the hyperlink "[23]".  Clicking on &lt;a href="http://de.wikisource.org/wiki/Seite:Tagebuch_Russlandfeldzug_0033.jpg"&gt;the page in which that paragraph begins&lt;/a&gt; shows the page and page image in isolation, along with footnotes about the page source and page-specific information about the status of the transcription.  
This is accomplished by programmatically stitching pages together into the work display while excluding page-specific markup via a &lt;code&gt;noinclude&lt;/code&gt; tag.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_UZtU3sSV0MY/TEcFjfDqtgI/AAAAAAAAACo/CIYrQuBRIq8/s1600/page_screenshot.jpg"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 400px; height: 357px;" src="http://1.bp.blogspot.com/_UZtU3sSV0MY/TEcFjfDqtgI/AAAAAAAAACo/CIYrQuBRIq8/s400/page_screenshot.jpg" alt="" id="BLOGGER_PHOTO_ID_5496367977419421186" border="0" /&gt;&lt;/a&gt;But the transcription of Winkler also highlights some weaknesses I see in ProofreadPage.  All annotation is done via footnotes which&amp;mdash;although they are embedded within the source&amp;mdash;are a far cry from the kind of markup we're used to with TEI or indeed HTML.  In fact, aside from the footnotes and page numbers, there are no hyperlinks in the displayed pages at all.  The inadequacies of this for someone who wants extensive text markup are highlighted by &lt;a href="http://bit.ly/bEhAxe"&gt;this personal name index page&lt;/a&gt; — it's a hand-compiled index!  Had the tool (or its users) relied on in-text markup, such an index could be compiled by mining the markup.  Of course, the reason I'm critical here is that FromThePage was inspired by the possibilities offered by using wiki-links within text to annotate, analyze, edit and index, and I've been delighted by the results.&lt;br /&gt;&lt;br /&gt;When I originally researched ProofreadPage, one question perplexed me: why aren't more manuscripts being transcribed on Wikisource?    A lot has happened since I last participated in the &lt;span class="il"&gt;Wikisource&lt;/span&gt; community in 2004, especially within the realm of formalized rules.  
There is now a rule on the English, French, and German &lt;span class="il"&gt;Wikisource&lt;/span&gt; sites banning unpublished work. Apparently the goal was to discourage self-promoters from using the site for their own novels or crackpot theories, and it's pretty drastic.  The English-language version specifies that sources must have been previously published on paper, and the French site has added "Ne publiez que des documents qui ont été déjà publiés ailleurs, sur papier" ("Publish only documents that have already been published elsewhere, on paper") to the edit form itself!  It is a rare manuscript indeed that has already been published in a print form which may be OCRed but which is worth transcribing from handwriting anyway.  As a result, I suspect that we're not likely to see much attention paid to transcription proper within the ProofreadPage code, short of a successful non-&lt;span class="il"&gt;Wikisource&lt;/span&gt; MediaWiki/ProofreadPage project.&lt;br /&gt;&lt;br /&gt;Aside from FromThePage (which is accepting new transcription projects!), ProofreadPage/MediaWiki is my favorite transcription tool.  Its origins outside the English-language community and Wikisource community policy have obscured its utility for transcribing manuscripts, which is why I think it's been overlooked. It's got a lot of &lt;a href="http://wikisource.org/wiki/Wikisource:ProofreadPage_Statistics"&gt;momentum&lt;/a&gt; behind it, and while it is still centered around OCR, I feel like it will work for many needs.  
Best of all, it's open-source, so you can start a transcription project by &lt;a href="http://iphylo.blogspot.com/2010/03/setting-up-local-wikisource.html"&gt;setting up your own private wikisource instance&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Thanks to &lt;/span&gt;&lt;a style="font-style: italic;" href="http://de.wikipedia.org/wiki/Klaus_Graf_%28Historiker%29"&gt;Klaus Graf&lt;/a&gt;&lt;span style="font-style: italic;"&gt; at &lt;/span&gt;&lt;a style="font-style: italic;" href="http://archiv.twoday.net/"&gt;Archivalia&lt;/a&gt;&lt;span style="font-style: italic;"&gt; for much of the information in this article.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-3860588161432735050?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/Ly7cT2gUwiM" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2009/07/wikisource-for-manuscript-transcription.html</link><author>noreply@blogger.com (Ben W. 
Brumfield)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://3.bp.blogspot.com/_UZtU3sSV0MY/TEcEr1ISitI/AAAAAAAAACY/45kOBReGxaw/s72-c/typesetting_screenshot.jpg" height="72" width="72" /><thr:total>7</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-375728219310974241</guid><pubDate>Wed, 16 Jun 2010 02:30:00 +0000</pubDate><atom:updated>2010-06-16T19:34:08.605-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">features</category><title>Facebook versus Twitter for Crowdsourcing Document Transcription</title><description>Last week, I posted &lt;a href="http://beta.fromthepage.com/display/read_work?ol=blog&amp;amp;work_id=11"&gt;a new document&lt;/a&gt; to FromThePage and inadvertently conducted a little experiment on how to publicize crowdsourcing.  Sometime in the early 1970s, my uncle was on a hunting trip when he was driven into an old, abandoned building by a thunderstorm.  While waiting for the weather to moderate, he found a few old documents -- two envelopes bearing Confederate stamps, one of which contained a letter.  I photographed this letter and uploaded it to FromThePage while on vacation last week, and that's where the experiment begins.&lt;br /&gt;&lt;br /&gt;I use both Facebook and Twitter, but post entirely different material to them.  My 347 Facebook friends include most of my high school classmates, several of my friends and classmates from college, much of my extended family, and a few of my friends here in Austin.  I mostly post status updates about my personal life, only occasionally sharing links to the Julia Brumfield diaries when I read an especially moving passage.  My 344 Twitter followers are almost entirely people I've met at or through conferences like THATCamp08 or THATCamp Austin.  They consist of academics, librarians, archivists, and programmers -- mostly ones who identify in some way with the "digital humanities" label.  
I usually tweet about technical/theoretical issues I've encountered in my own DH work.  At least a few of my followers even run their own transcription software projects.  Given the overlap between interesting content and FromThePage development, I decided to post news of the East Civil War letters to both systems.&lt;br /&gt;&lt;br /&gt;My initial tweet/status--posted while I was still cropping the images--got similar responses from both systems.  Two people on Twitter and five on Facebook replied, helping me resolve the letter's year to 1862.  Here's what I posted on Facebook:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_UZtU3sSV0MY/TBhEvqMUkvI/AAAAAAAAAB4/l7vwJRMQ97M/s1600/fb_date.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 400px; height: 155px;" src="http://1.bp.blogspot.com/_UZtU3sSV0MY/TBhEvqMUkvI/AAAAAAAAAB4/l7vwJRMQ97M/s400/fb_date.jpg" alt="" id="BLOGGER_PHOTO_ID_5483208131894088434" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;And here's the post on Twitter:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_UZtU3sSV0MY/TBhEg3tYGxI/AAAAAAAAABw/f-VmgIw6ywg/s1600/tw_date.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 400px; height: 52px;" src="http://2.bp.blogspot.com/_UZtU3sSV0MY/TBhEg3tYGxI/AAAAAAAAABw/f-VmgIw6ywg/s400/tw_date.jpg" alt="" id="BLOGGER_PHOTO_ID_5483207877824355090" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;After I got one of the envelopes created in FromThePage, I tested the images out by posting again to Facebook.  
This update got no response.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_UZtU3sSV0MY/TBhF8fSYaFI/AAAAAAAAACA/OFoWPa4I7V4/s1600/fb_envelope.jpg"&gt;&lt;img style="cursor: pointer; width: 400px; height: 153px;" src="http://4.bp.blogspot.com/_UZtU3sSV0MY/TBhF8fSYaFI/AAAAAAAAACA/OFoWPa4I7V4/s400/fb_envelope.jpg" alt="" id="BLOGGER_PHOTO_ID_5483209451816642642" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The next day, I uploaded the second envelope and letter, then posted to Twitter and Facebook while packing for our return trip.&lt;br /&gt;&lt;br /&gt;Facebook:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_UZtU3sSV0MY/TBhHKNlaZqI/AAAAAAAAACQ/arOehDpjbYk/s1600/fb_letter.jpg"&gt;&lt;img style="cursor: pointer; width: 400px; height: 195px;" src="http://1.bp.blogspot.com/_UZtU3sSV0MY/TBhHKNlaZqI/AAAAAAAAACQ/arOehDpjbYk/s400/fb_letter.jpg" alt="" id="BLOGGER_PHOTO_ID_5483210787094423202" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Twitter:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_UZtU3sSV0MY/TBhHFBDZ1FI/AAAAAAAAACI/ApKrYOrQr8w/s1600/tw_letter.jpg"&gt;&lt;img style="cursor: pointer; width: 400px; height: 53px;" src="http://4.bp.blogspot.com/_UZtU3sSV0MY/TBhHFBDZ1FI/AAAAAAAAACI/ApKrYOrQr8w/s400/tw_letter.jpg" alt="" id="BLOGGER_PHOTO_ID_5483210697831208018" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;This time the contrast in responses was striking.  I got 3 click-throughs from Twitter in the first three days, and I'm not entirely sure that one of those wasn't me clicking the bit.ly link by accident.  While my statistics aren't as good for Facebook click-throughs, there were at least 6 I could identify.  More important, however, was the transcription activity -- which is the point of my crowdsourcing project, after all.  
Within 3 hours of posting the link on Facebook, one very-occasional user had contributed a transcription, and I'd gotten two personal emails requesting reminders of login credentials from other people who wanted to help with the letter.&lt;br /&gt;&lt;br /&gt;What accounts for this difference?  One possibility is that the archivists and humanists who comprise my Twitter followership are less likely to get excited about a previously-unpublished Civil War letter -- after all, many of them have their own stacks of unpublished material to transcribe.  Another possibility is that the envelope link I posted on Facebook increased people's anticipation and engagement.  However, I suspect that the most important difference is that the Facebook link itself was more compelling due to the inclusion of an image of the manuscript page.  Images are just more compelling than dry text, and Facebook's thumbnail service draws potential volunteers in.&lt;br /&gt;&lt;br /&gt;My conclusion is that it's worth the effort to build an easy Facebook sharing mechanism into any document crowdsourcing tool, especially if that mechanism provides the scanned document as the image thumbnail.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-375728219310974241?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/7cD1uJakqN4" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2010/06/facebook-versus-twitter-for.html</link><author>noreply@blogger.com (Ben W. 
Brumfield)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/_UZtU3sSV0MY/TBhEvqMUkvI/AAAAAAAAAB4/l7vwJRMQ97M/s72-c/fb_date.jpg" height="72" width="72" /><thr:total>2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-3649272688692015396</guid><pubDate>Wed, 03 Mar 2010 01:27:00 +0000</pubDate><atom:updated>2010-03-02T22:18:50.060-06:00</atom:updated><title>Feature plan for 2010</title><description>Three years ago, I laid out &lt;a href="http://manuscripttranscription.blogspot.com/2007/04/maiden-voyage-plan.html"&gt;a plan&lt;/a&gt; for getting &lt;a href="http://beta.fromthepage.com/?ol=blog"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;FromThePage&lt;/span&gt;&lt;/a&gt; to general availability.  Since that time, I completed most of the &lt;a href="http://manuscripttranscription.blogspot.com/2008/01/feature-triage-for-v10.html"&gt;features&lt;/a&gt; I thought necessary, gained some &lt;a href="http://manuscripttranscription.blogspot.com/2009/02/progress-report-eight-months-after.html"&gt;dedicated users&lt;/a&gt;, and saw the software used to transcribe and annotate over a thousand pages of &lt;a href="http://beta.fromthepage.com/collection/show?collection_id=1&amp;amp;ol=blog"&gt;Julia &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;Brumfield's&lt;/span&gt; diaries&lt;/a&gt;.  
However, most of the second half of 2009 was spent using the product in my editorial capacity and developing &lt;a href="http://manuscripttranscription.blogspot.com/2009/04/feature-full-text-searcharticle-link.html"&gt;features&lt;/a&gt; to &lt;a href="http://manuscripttranscription.blogspot.com/2009/03/feature-editorial-toolkit.html"&gt;support&lt;/a&gt; that effort, rather than moving &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;FromThePage&lt;/span&gt; out of its status as a limited beta.&lt;br /&gt;&lt;br /&gt;Here's what my top priorities are for 2010:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;Release &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;FromThePage&lt;/span&gt; as Free/Open Source Software.&lt;/span&gt;&lt;br /&gt;I've written before about &lt;a href="http://manuscripttranscription.blogspot.com/2007/06/money-current-situation.html"&gt;my position as a hobbyist-developer&lt;/a&gt; and my desire not to see my code become &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;abandonware&lt;/span&gt; if I am no longer able to maintain it.  Open Source seems to be the obvious solution, and despite &lt;a href="http://manuscripttranscription.blogspot.com/2009/05/open-source-vs-open-access.html"&gt;my concerns&lt;/a&gt; about Open Access use, I am taking &lt;a href="http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=1845"&gt;some good advice&lt;/a&gt; and publishing the source code under the &lt;a href="http://www.fsf.org/licensing/licenses/agpl-3.0.html"&gt;GNU &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;Affero&lt;/span&gt; &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;GPL&lt;/span&gt;&lt;/a&gt;.  
I've taken some steps to do this, including migrating towards a &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;GitHub&lt;/span&gt; repository, but hope to wait until I've made a few test drives and developed some installation instructions before I announce the release.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;Complete &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_8"&gt;PDF&lt;/span&gt; Generation and Publish-on-Demand Integration.&lt;/span&gt;&lt;br /&gt;While I think that releasing &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_9"&gt;FromThePage&lt;/span&gt; is the best way forward for the software itself, it doesn't get me any closer to accomplishing my original objective for the project -- sharing Julia &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_10"&gt;Brumfield's&lt;/span&gt; diaries with her family.  It's &lt;a href="http://manuscripttranscription.blogspot.com/2008/04/who-do-i-build-for.html"&gt;hard to balance these goals&lt;/a&gt;, but at this point I think that my efforts after the F/OSS release should be directed towards printable and print-on-demand formats for the diary transcriptions.  
I've built &lt;a href="http://manuscripttranscription.blogspot.com/2007/09/progress-report-printing.html"&gt;one proof-of-concept&lt;/a&gt; and I've settled on an output technology (&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_11"&gt;RTeX&lt;/span&gt;), so all that remains is writing the code to do it.&lt;br /&gt;&lt;br /&gt;I'll still work on other features as they occur to me, I'm still editing Julia's 1921 diary, and I'm still looking for other transcription projects to host, but these two goals will be the main focus of my development this year.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-3649272688692015396?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/pEEDS3P7BAM" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2010/03/feature-plan-for-2010.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-8223375904554374944</guid><pubDate>Tue, 22 Dec 2009 02:50:00 +0000</pubDate><atom:updated>2009-12-22T10:41:03.453-06:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">features</category><title>Feature: Related Pages</title><description>I've been thinking a lot about page-to-subject links lately as I edit and annotate &lt;a href="http://beta.fromthepage.com/display/read_work?ol=blog&amp;amp;work_id=6"&gt;Julia &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;Brumfield's&lt;/span&gt;&lt;/a&gt; 1921 diary.  
While I've been able to exploit the &lt;a href="http://manuscripttranscription.blogspot.com/2007/06/progress-report-article-links.html"&gt;links data structure&lt;/a&gt; in &lt;a href="http://manuscripttranscription.blogspot.com/2009/03/feature-editorial-toolkit.html"&gt;editing&lt;/a&gt;, printing, &lt;a href="http://manuscripttranscription.blogspot.com/2007/07/feature-subect-graphs.html"&gt;analyzing&lt;/a&gt; and &lt;a href="http://manuscripttranscription.blogspot.com/2007/04/feature-article-links.html"&gt;displaying &lt;/a&gt;the texts, I really haven't viewed it as a way to navigate from one manuscript page to another.  In fact, the linkages I've made between pages have been pretty boring -- next/previous page links and a table of contents are the limit.  I'm using the page-to-subject links to connect subjects to each other, so why not pages?&lt;br /&gt;&lt;br /&gt;The obvious answer is that the subjects which page A would have most in common with page B are the same ones it would have in common with nearly every other page in the collection.  In the corpus I'm working with, the diarist mentions her son and daughter-in-law in 95% of pages, for the simple reason that she lives with them.  
If I choose two pages at random, I find that &lt;a href="http://beta.fromthepage.com/display/display_page?ol=blog&amp;amp;page_id=1197"&gt;March 12, 1921&lt;/a&gt; and &lt;a href="http://beta.fromthepage.com/display/display_page?page_id=977&amp;amp;ol=blog"&gt;August 12, 1919&lt;/a&gt; both contain &lt;a href="http://beta.fromthepage.com/article/show?article_id=4&amp;amp;ol=blog"&gt;Ben&lt;/a&gt; and &lt;a href="http://beta.fromthepage.com/article/show?article_id=141&amp;amp;ol=blog"&gt;Jim&lt;/a&gt; doing agricultural work, &lt;a href="http://beta.fromthepage.com/article/show?article_id=1627&amp;amp;ol=blog"&gt;Josie&lt;/a&gt; doing domestic work, and Julia's near-daily visit to &lt;a href="http://beta.fromthepage.com/article/show?article_id=5&amp;amp;ol=blog"&gt;Marvin's&lt;/a&gt;.  The two pages are connected through those four subjects (as well as this similarly-disappointing "&lt;a href="http://beta.fromthepage.com/article/show?article_id=28&amp;amp;ol=blog"&gt;dinner&lt;/a&gt;"), but not in a way that is at all meaningful.  So I decided that a page-to-page relatedness tool couldn't be built from the page-to-subject link data.&lt;br /&gt;&lt;br /&gt;All that changed two weeks ago, when I was editing the 1921 diary and came across the mention of a "&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;musick&lt;/span&gt; box".  In trying to figure out whether or not Julia was referring to a phonograph by the term, I discovered that the string "&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;musick&lt;/span&gt; box" &lt;a href="http://beta.fromthepage.com/display/search?search_string=%22musick+box%22&amp;amp;collection_id=1&amp;amp;commit=Search&amp;amp;ol=blog"&gt;occurred only two times&lt;/a&gt;: when the phonograph was ordered and the first time Julia heard it played.  Each one of these mentions shed so much light on the other that I was forced to re-evaluate how pages are connected through subjects.  
In particular, I was reminded of the "you and one other" recommendations that &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;LibraryThing&lt;/span&gt; offers.  This is a feature that finds other users with whom you share an obscure book.  In this case, obscurity is defined as the book &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_4"&gt;occurring&lt;/span&gt; only twice in the system: once in your library, once in the other user's.&lt;br /&gt;&lt;br /&gt;This would be a relatively easy feature to implement in &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;FromThePage&lt;/span&gt;.  When displaying a page, perform this algorithm:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;For each subject link in the page, calculate how many times it is referenced within the collection, then&lt;/li&gt;&lt;li&gt;Sort those subjects by reference count, lowest first, and&lt;/li&gt;&lt;li&gt;Take the 3 or 4 subject links with the lowest reference count, and&lt;/li&gt;&lt;li&gt;Display the other pages which link to those subjects.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;For a really useful experience, I'd want to display keyword-in-context, showing a few words to explain the context in which that other occurrence of "&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;musick&lt;/span&gt; box" appears.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-8223375904554374944?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/n2sjRc7uDFs" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2009/12/feature-related-pages.html</link><author>noreply@blogger.com (Ben W. 
Brumfield)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-2063278121989274895</guid><pubDate>Fri, 26 Jun 2009 12:02:00 +0000</pubDate><atom:updated>2009-06-26T08:29:24.340-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">features</category><title>Connecting With Readers</title><description>While editing and annotating &lt;a href="http://beta.fromthepage.com/display/read_work?ol=blog&amp;amp;work_id=3"&gt;Julia Brumfield's 1919 diary&lt;/a&gt;, I've tried to do research on the people who appear there.  Who was Josie Carr's &lt;a href="http://beta.fromthepage.com/display/read_all_works?article_id=7666&amp;amp;ol=blog"&gt;sister&lt;/a&gt;?  Sites like FindAGrave.com can help, but the results may still be ambiguous: there are two &lt;a href="http://beta.fromthepage.com/display/read_all_works?article_id=336&amp;amp;ol=blog"&gt;Alice Woodings&lt;/a&gt; buried in the area, and either could be a match.&lt;br /&gt;&lt;br /&gt;These questions could be resolved pretty easily through oral interviews -- most of the families mentioned in the diaries are still in the area, and a month spent knocking on doors could probably flesh out the networks of kinship I need for a complete annotation.  However, that's really not time I have, and I can't imagine cold-calling strangers to ask nosy questions about their families -- I'm a computer programmer, after all.&lt;br /&gt;&lt;br /&gt;It turns out that there might be an easier way.  After Sara installed Google Analytics on &lt;a href="http://beta.fromthepage.com/"&gt;FromThePage&lt;/a&gt;, I've been looking at referral log reports that show how people got to the site.  
Here's the keyword report for June, showing what keywords people were searching for when they found FromThePage:&lt;br /&gt;&lt;br /&gt;&lt;table border="1" cellpadding="1" cellspacing="0"&gt;&lt;col span="4"&gt;  &lt;tbody&gt;&lt;tr&gt;   &lt;td&gt;Keyword&lt;/td&gt;   &lt;td&gt;Visits&lt;/td&gt;   &lt;td&gt;Pages/Visit&lt;/td&gt;   &lt;td&gt;Avg. Time on Site&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 12.75pt;" height="17"&gt;   &lt;td style="height: 12.75pt;" height="17"&gt;"tup walker"&lt;/td&gt;   &lt;td num="" align="right"&gt;21&lt;/td&gt;   &lt;td num="12.4761904761904" align="right"&gt;12.47619&lt;/td&gt;   &lt;td num="992.85714285714198" align="right"&gt;992.8571&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 12.75pt;" height="17"&gt;   &lt;td style="height: 12.75pt;" height="17"&gt;"letcher craddock"&lt;/td&gt;   &lt;td num="" align="right"&gt;7&lt;/td&gt;   &lt;td num="12.4285714285714" align="right"&gt;12.42857&lt;/td&gt;   &lt;td num="890.142857142857" align="right"&gt;890.1429&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 12.75pt;" height="17"&gt;   &lt;td style="height: 12.75pt;" height="17"&gt;julia craddock brumfield&lt;/td&gt;   &lt;td num="" align="right"&gt;3&lt;/td&gt;   &lt;td num="" align="right"&gt;28&lt;/td&gt;   &lt;td num="624.33333333333303" align="right"&gt;624.3333&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 12.75pt;" height="17"&gt;   &lt;td style="height: 12.75pt;" height="17"&gt;juliacraddockbrumfield&lt;/td&gt;   &lt;td num="" align="right"&gt;3&lt;/td&gt;   &lt;td num="74.3333333333333" align="right"&gt;74.33333&lt;/td&gt;   &lt;td num="" align="right"&gt;1385&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 12.75pt;" height="17"&gt;   &lt;td style="height: 12.75pt;" height="17"&gt;"edwin mayhew"&lt;/td&gt;   &lt;td num="" align="right"&gt;2&lt;/td&gt;   &lt;td num="" align="right"&gt;7&lt;/td&gt;   &lt;td num="" align="right"&gt;117.5&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 12.75pt;" height="17"&gt;   &lt;td 
style="height: 12.75pt;" height="17"&gt;"eva mae smith"&lt;/td&gt;   &lt;td num="" align="right"&gt;2&lt;/td&gt;   &lt;td num="" align="right"&gt;4&lt;/td&gt;   &lt;td num="" align="right"&gt;76.5&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 12.75pt;" height="17"&gt;   &lt;td style="height: 12.75pt;" height="17"&gt;"josie carr" virginia&lt;/td&gt;   &lt;td num="" align="right"&gt;2&lt;/td&gt;   &lt;td num="" align="right"&gt;6.5&lt;/td&gt;   &lt;td num="" align="right"&gt;117&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 12.75pt;" height="17"&gt;   &lt;td style="height: 12.75pt;" height="17"&gt;1918 candy&lt;/td&gt;   &lt;td num="" align="right"&gt;2&lt;/td&gt;   &lt;td num="" align="right"&gt;4&lt;/td&gt;   &lt;td num="" align="right"&gt;40&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 12.75pt;" height="17"&gt;   &lt;td style="height: 12.75pt;" height="17"&gt;clack stone hubbard&lt;/td&gt;   &lt;td num="" align="right"&gt;2&lt;/td&gt;   &lt;td num="" align="right"&gt;55.5&lt;/td&gt;   &lt;td num="" align="right"&gt;1146&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;These website visitors are fellow researchers, trying to track down the same people that I am.  I've got them on my website, they're engaged -- sometimes deeply so -- with the texts and the subjects, but I don't know who they are, and they haven't contacted me.  Here are a couple of ideas that might help:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Add an introductory HTML block to the collection homepage.  This would allow collection editors to explain their project, solicit help, and provide whatever contact information they like.&lt;/li&gt;&lt;li&gt;Add a 'contact us' footer to be displayed on every page of the collection, whether the user is viewing a subject article, reading a work, or viewing a manuscript page.  Since people are finding the site via search engines, they're navigating directly to pages deep within a collection.  
We need to display 'about this project', 'contact us', or 'please help' messages on those pages.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;One idea I think would not work is to build comment boxes or a 'contact us' form.  I'm trying to establish a personal connection to these researchers, since in addition to asking "who is Alice Wooding", I'd like to locate other diaries or hunt down other information about local history.  This is really best handled through email, where the barriers to participation are low.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-2063278121989274895?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/PGaAxJ1wINw" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2009/06/connecting-with-readers.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-5162856937829426064</guid><pubDate>Wed, 03 Jun 2009 02:23:00 +0000</pubDate><atom:updated>2009-06-02T22:09:33.537-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">similar projects</category><title>Interview with Hugh Cayless</title><description>One of the neatest things to happen in the world of transcription technology this year was the award of an NEH &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;ODH&lt;/span&gt; &lt;a href="http://www.neh.gov/grants/guidelines/digitalhumanitiesstartup.html"&gt;Digital Humanities Start-Up Grant&lt;/a&gt; to &lt;a href="http://www.neh.gov/news/archive/pdf/Awards_09Mar_Pt3_NCtoWI.pdf"&gt;"Image to XML"&lt;/a&gt;, a project exploring image-based transcription at the line and word level.  
According to &lt;a href="http://www.lib.unc.edu/spotlight/2009/docsouth_grant.html"&gt;a press release from &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;UNC&lt;/span&gt;&lt;/a&gt;, this will fund development of "a product that will allow librarians to digitally trace handwriting in an original document, encode the tracings in a language known as Scalable Vector Graphics, and then link the tracings at the line or even word level to files containing transcribed texts and annotations."  This is based on the work of &lt;a href="http://www.unc.edu/%7Ehcayless/"&gt;Hugh &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;Cayless&lt;/span&gt;&lt;/a&gt; in developing &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;Img&lt;/span&gt;2XML, which he has described in a &lt;a href="http://www.unc.edu/%7Ehcayless/img2xml/presentation.html"&gt;presentation to &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;Balisage&lt;/span&gt;&lt;/a&gt;, demonstrated at &lt;a href="http://www.unc.edu/%7Ehcayless/img2xml/viewer.html"&gt;this static demo&lt;/a&gt;, and shared at this &lt;a href="http://github.com/hcayless/img2xml/tree/master"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;github&lt;/span&gt; repository&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Hugh was kind enough to answer my questions about the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;Img&lt;/span&gt;2XML project and has allowed me to publish his responses here in interview form:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt; First, let me congratulate you on &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;img&lt;/span&gt;2&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_8"&gt;xml's&lt;/span&gt; award of a Digital Humanities Start-Up Grant. What was that experience like?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Thanks!  
I've been involved in writing grant proposals before, and sat on an NEH review panel a couple of years ago.  But this was the first time I've been the primary writer of a grant.  Start-Up grants (understandably) are less work than the larger programs, but it was still a pretty intensive process.  My colleague at &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_9"&gt;UNC&lt;/span&gt;, Natasha Smith, and I worked right down to the wire on it. At research institutions like &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_10"&gt;UNC&lt;/span&gt;, the hard part is not the writing of the proposal, but working through the submission and budgeting process with the sponsored research office.  That's the part I really couldn't have done in time without help.&lt;br /&gt;&lt;br /&gt;The writing part was relatively straightforward.  I sent a draft to Jason Rhody, who's one of the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_11"&gt;ODH&lt;/span&gt; program officers, and he gave us some very helpful feedback.  NEH does tell you this, but it is absolutely vital that you talk to a program officer before submitting.  They are a great resource because they know the process from the inside.  Jason gave us great feedback, which helped me refine and focus the narrative.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;What's the relationship between &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_12"&gt;img&lt;/span&gt;2&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_13"&gt;xml&lt;/span&gt; and the other e-text projects you've worked on in the past?  How did the idea come about?&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;At &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_14"&gt;Docsouth&lt;/span&gt;, they've been publishing page images and transcriptions for years, so mechanisms for doing that had been on my mind.  
I did some research on generating structural visualizations of documents using &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_15"&gt;SVG&lt;/span&gt; a few years ago, and presented a paper on it at the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_16"&gt;ACH&lt;/span&gt; conference in Victoria, so I'd some experience with it.  There was also a project I worked on while I was at Lulu where I used &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_17"&gt;Inkscape&lt;/span&gt; to produce a vector version of a bitmap image for calendars, so I knew it was possible.  When I first had the idea, I went looking for tools that could create an &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_18"&gt;SVG&lt;/span&gt; tracing of text on a page, and found &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_19"&gt;potrace&lt;/span&gt; (which is embedded in &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_20"&gt;Inkscape&lt;/span&gt;, in fact).  I found that you can produce really nice tracings of text, especially if you do some &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_21"&gt;pre&lt;/span&gt;-processing to make sure the text is distinct.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt; What kind of &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_22"&gt;pre&lt;/span&gt;-processing was necessary?  Was it all manual, or do&lt;/span&gt;&lt;span style="font-style: italic;"&gt; you think the tracing step could be automated?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;It varies.  The big issue so far has been sorting out how to distinguish text from background (since &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_23"&gt;potrace&lt;/span&gt; converts the image to black and white before running its tracing algorithm), particularly with materials like papyrus, which is quite dark.  
If you can eliminate the background color by subtracting it from the image, then you don't have to worry so much about picking a white/black &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_24"&gt;cutover&lt;/span&gt; point--the defaults will work.  So far it's been manual.  One of the agendas of the grant is to figure out how much of this can be automated, or at least streamlined.  For example, if you have a book with pages of similar background color, and you wanted to eliminate that background as part of &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_25"&gt;pre&lt;/span&gt;-processing, it should be possible to figure out the color range you want to get rid of once, and do it for every page image.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;I've read your &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_26"&gt;Balisage&lt;/span&gt; presentation and played around with the viewer demonstration.  It looks like &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_27"&gt;img&lt;/span&gt;2&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_28"&gt;xml&lt;/span&gt; was in proof-of-concept stage back in mid 2008.  Where does the software stand now, and how far do you hope to take it?&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;It hasn't progressed much beyond that stage yet.  The whole point of the grant was to open up some bandwidth to develop the tooling further, and implement it on a real-world project.   
We'll be using it to develop a web presentation of the diary of a 19&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_29"&gt;th&lt;/span&gt; century Carolina student, James &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_30"&gt;Dusenbery&lt;/span&gt;, some excerpts from which can be found on Documenting the American South at &lt;a href="http://docsouth.unc.edu/true/mss04-04/mss04-04.html" target="_blank"&gt;http://docsouth.unc.edu/true/&lt;wbr&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_31"&gt;mss&lt;/span&gt;04-04/&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_32"&gt;mss&lt;/span&gt;04-04.html&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;This has all been complicated a bit by the fact that I left &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_33"&gt;UNC&lt;/span&gt; for NYU in February, so we have to sort out how I'm going to work on it, but it sounds like we'll be able to work something out.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;It seems to me that you can automate generating the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_34"&gt;SVG's&lt;/span&gt; pretty easily.  In the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_35"&gt;Dusenbery&lt;/span&gt; project, you're working with a pretty small set of pages and a traditional (i.e. institutionally-backed) structure for managing transcription.  How well suited do you think &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_36"&gt;img&lt;/span&gt;2&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_37"&gt;xml&lt;/span&gt; is to larger, bulk digitization projects like the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_38"&gt;FamilySearch&lt;/span&gt; Indexer efforts to digitize US census records?  Would the format require substantial software to manipulate the transcription/image links?&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;It might.  
&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_39"&gt;Dusenbery&lt;/span&gt; gives us a very constrained playground, in which we're pretty sure we can be successful.  So one prong of attack in the project is to do something end-to-end and figure out what that takes.  The other part of the project will be much more open-ended and will involve experimenting with a wide range of materials.  I'd like to figure out what it would take to work with lots of different types of manuscripts, with different &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_40"&gt;workflows&lt;/span&gt;.  If the method looks useful, then I hope we'll be able to do follow-on work to address some of these issues.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;I'm fascinated by the way you've cross-linked lines of text in a transcription to lines of handwritten text in an &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_41"&gt;SVG&lt;/span&gt; image.  One of the features I've wanted for my own project was the ability to embed a piece of an image as an attribute for the transcribed text -- perhaps illustrating an unclear tag with the unclear handwriting itself.  How would &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_42"&gt;SVG&lt;/span&gt; make this kind of linking easier?&lt;br /&gt;&lt;/i&gt;&lt;br /&gt;This is exactly the kind of functionality I want to enable.  If you can get close to the actual written text in a &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_43"&gt;referenceable&lt;/span&gt; way then all kinds of manipulations like this become feasible.  The NEH grant will give us the chance to experiment with this kind of thing in various ways.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Will you be blogging your explorations?  What is the best way for those interested in following its development to stay informed?&lt;/span&gt;&lt;br /&gt; &lt;br /&gt;Absolutely.  
I'm trying to work out the best way to do this, but I'd like to have as much of the project happen out in the open as possible.  Certainly the code will be regularly pushed to the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_44"&gt;github&lt;/span&gt; &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_45"&gt;repo&lt;/span&gt;, and I'll either write about it there, or on my blog (&lt;a href="http://philomousos.blogspot.com/" target="_blank"&gt;http://philomousos.blogspot.&lt;wbr&gt;com&lt;/a&gt;), or both.  I'll probably twitter about it too (&lt;a href="http://twiter.com/hcayless"&gt;@&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_46"&gt;hcayless&lt;/span&gt;&lt;/a&gt;).  I expect to start work this week...&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Many thanks to Hugh &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_47"&gt;Cayless&lt;/span&gt; for spending the time on this interview.  We're all wishing him and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_48"&gt;img&lt;/span&gt;2&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_49"&gt;xml&lt;/span&gt; the best of luck!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-5162856937829426064?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/UJs85L-Fmrk" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2009/06/interview-with-hugh-cayless.html</link><author>noreply@blogger.com (Ben W. 
Brumfield)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-4470841472464496466</guid><pubDate>Sun, 17 May 2009 22:46:00 +0000</pubDate><atom:updated>2009-05-17T22:36:26.727-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">similar projects</category><title>Review: USGS North American Bird Phenology Program</title><description>Who knew you could track climate change through crowdsourced transcription?  The smart folks at the U. S. Geological Survey, that's who!&lt;br /&gt;&lt;br /&gt;The USGS North American Bird Phenology program encouraged volunteers to submit bird sightings across North America from the 1880s through the 1970s.  These cards are now being transcribed into a database for analysis of migratory pattern changes and what they imply about climate change.&lt;br /&gt;&lt;br /&gt;There's a really nice &lt;a href="http://www.desertusa.com/desertblog/?p=5518"&gt;DesertUSA NewsBlog article&lt;/a&gt; that covers the background of the project:&lt;br /&gt;&lt;blockquote&gt;The cards record more than a century of information about bird migration, a veritable treasure trove for climate-change researchers because they will help them unravel the effects of climate change on bird behavior, said Jessica Zelt, coordinator of the North American Bird Phenology Program at the USGS Patuxent Wildlife Research Center.    
&lt;p&gt;That is — once the cards are transcribed and put into a scientific database.&lt;/p&gt;   &lt;p&gt;And that’s where citizens across the country come in - the program needs help from birders and others across the nation to transcribe those cards into usable scientific information.&lt;/p&gt; &lt;/blockquote&gt;  CNN also &lt;a href="http://www.cnn.com/2009/TECH/science/03/26/pp.bird.usgs/index.html"&gt;interviewed a few of the volunteers&lt;/a&gt;:&lt;br /&gt;&lt;blockquote&gt;Bird enthusiast and star volunteer Stella Walsh, a 62-year-old retiree, pecks away at her keyboard for about four hours each day. She has already transcribed more than 2,000 entries from her apartment in Yarmouth, Maine.&lt;p&gt; "It's a lot more fun fondling feathers, but, the whole point is to learn about the data and be able to do something with it that is going to have an impact," Walsh said.&lt;/p&gt;&lt;/blockquote&gt;Let's talk about the software behind this effort.&lt;br /&gt;&lt;br /&gt;The NABPP is fortunate to have a limited problem domain.  A great deal of standardization was imposed on the manuscript sources themselves by the original organizers, so that, for example, each card describes only a single species and location.  In addition, the questions the modern researchers are asking of the corpus also limit the problem domain: nobody's going to be doing analysis of spelling variations between the cards.  It's important to point out that this narrow scope exists in spite of wide variation in format between the index cards.  Some are handwritten on pre-printed cards, some are type-written on blank cards, and some are entirely freeform.  
Nevertheless, they all describe species sightings in a regular format.&lt;br /&gt;&lt;br /&gt;Because of this limited scope, the developers were (probably) able to build a traditional database and data-entry form, with specialized attributes for species, location, date, or other common fields that could be generalized from the corpus and the needs of the project.  That meant custom-building an application specifically for the NABPP, which seems like a lot of work, but it does not require building the kind of Swiss Army Knife that medieval manuscript transcription requires.  This presents an interesting parallel with other semi-standardized, hand-written historical documents like military muster rolls or pension applications.&lt;br /&gt;&lt;br /&gt;One of the really neat possibilities of subject-specific transcription software is that you can combine training users on the software with training them on difficult handwriting, or variations in the text. NABPP has put together &lt;a href="http://www.pwrc.usgs.gov/bpp/training/Phenology_controller.swf"&gt;a screencast&lt;/a&gt; for this, which walks users through transcribing a few cards from different periods, written in different formats.  This screencast explains how to use the software, but it also explains more traditional editorial issues like what the transcription conventions are, or how to process different formats of manuscript material.&lt;br /&gt;&lt;br /&gt;This is only one of the innovative ways the NABPP deals with its volunteers.  I received a newsletter by email shortly after volunteering, announcing their progress to date (70K cards transcribed) and some changes in the most recent version of the software.  This included some potentially-embarrassing details that a less confident organization might not have mentioned, but which really do a lot to build goodwill with volunteers.  Users may get used to workarounds for annoying bugs, but in my experience they still remember them and are thrilled when those bugs are finally fixed.   
So when the newsletter announces that "The Backspace key no longer causes the previous page to be loaded", I know that they're making some of their volunteers very happy.&lt;br /&gt;&lt;br /&gt;In addition to the newsletter, the project also posts &lt;a href="http://www.pwrc.usgs.gov/bpp/DataAndStats.cfm"&gt;statistics on the transcription&lt;/a&gt; project, broken down both by volunteer and by bird.  The top-ten list gives the game-like feedback you'd want in a project like this, although I'd be hesitant to foster competition in a less individually-oriented project.  They're also posting preliminary analyses of the data, including the &lt;a href="http://www.pwrc.usgs.gov/bpp/JFMaps/bars_1920s.jpg"&gt;phenology of barn swallows&lt;/a&gt;, mapped by location and date of first sighting, and broken down by decade. &lt;br /&gt;&lt;br /&gt;Congratulations to the North American Bird Phenology Program for making crowdsourced transcription a reality!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-4470841472464496466?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/WrNUmWql6JI" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2009/05/review-usgs-north-american-bird.html</link><author>noreply@blogger.com (Ben W. 
Brumfield)</author><thr:total>1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-5815257425488359885</guid><pubDate>Sat, 02 May 2009 18:21:00 +0000</pubDate><atom:updated>2009-05-02T15:00:34.014-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">open source</category><category domain="http://www.blogger.com/atom/ns#">licensing</category><category domain="http://www.blogger.com/atom/ns#">money</category><category domain="http://www.blogger.com/atom/ns#">open access</category><title>Open Source vs. Open Access</title><description>I've reached a point in my development project at which I'd like to go ahead and release &lt;a href="http://beta.fromthepage.com/?ol=blog"&gt;FromThePage&lt;/a&gt; as Open Source.  There are now only two things holding me back.  I'd really like to find a project willing to work together with me to fix any deployment problems, rather than posting my source code on GitHub and leaving users to fend for themselves.  The other problem is a more serious issue that highlights what I think is a conflict between Open Access and Open Source Software.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Open Source/Free Software and Rights of Use&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Most of the attention paid to Open Source software focuses on the user's right to modify the software to suit their needs and to redistribute that (or derivative) code.  However, there is a different, more basic right conferred by Free and Open source licenses: the user's right to use the software for whatever purpose they wish.  
The &lt;a href="http://fsfe.org/documents/freesoftware.en.html"&gt;Free Software Definition&lt;/a&gt; lists "Freedom 0" as:&lt;br /&gt;&lt;ul&gt;&lt;li class="indent"&gt; &lt;b&gt;The freedom to run the program, for any purpose.&lt;/b&gt;   &lt;p class="indent"&gt; &lt;em&gt;Placing restrictions on the use of Free Software, such       as time ("30 days trial period", "license expires January 1st, 2004")       purpose ("permission granted for research and non-commercial       use", "may not be used for benchmarking") or       geographic area ("must not be used in country X") makes a program       non-free.&lt;/em&gt;&lt;/p&gt; &lt;/li&gt;&lt;/ul&gt;Meanwhile, the &lt;a href="http://opensource.org/docs/osd"&gt;Open Source Definition&lt;/a&gt;'s sixth criterion is:&lt;br /&gt;&lt;blockquote&gt;&lt;span style="font-weight: bold;"&gt;6. No Discrimination Against Fields of Endeavor&lt;/span&gt;  &lt;p&gt; The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research. &lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;&lt;/p&gt;Traditionally this has not been a problem for non-commercial software developers like me.  Once you decide not to charge for the editor, game, or compiler you've written, who cares how it's used?&lt;br /&gt;&lt;br /&gt;However, if your motivation in writing software is to encourage people to share their data, as mine certainly is, then restrictions on use start to sound pretty attractive.  I'd love for someone to run FromThePage as a commercial service, hosting the software and guiding users through posting their manuscripts online.  It's a valuable service, and is worth paying for.  
However, I want the resulting transcriptions to be freely accessible on the web, so that we all get to read the documents that have been sitting in the basements and file folders of family archivists around the world.&lt;br /&gt;&lt;br /&gt;Unfortunately, if you investigate the current big commercial repositories of this sort of data, you'll find that their pricing/access model is the opposite of what I describe.  Both &lt;a href="http://www.footnote.com/choose-a-plan/"&gt;Footnote.com&lt;/a&gt; and &lt;a href="http://www.ancestry.com/subscribe/signup.aspx?offerid=0%3A7858%3A0&amp;amp;SourceId=&amp;amp;TargetId="&gt;Ancestry.com&lt;/a&gt; allow free hosting of member data, but both lock browsing of that data behind a registration wall.  Even if registration is free, that hurdle may doom the user-created content to be inaccessible, unfindable or irrelevant to the general public.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Open Access&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://en.wikipedia.org/wiki/Open_access_%28publishing%29"&gt;open access movement&lt;/a&gt; has defined this problem with regard to scholarly literature, and I see no reason why their call should not be applied to historical primary sources like the 19th/20th century manuscripts FromThePage is designed to host.  
Here's the &lt;a href="http://www.soros.org/openaccess/read.shtml"&gt;Budapest Open Access Initiative's&lt;/a&gt; definition:&lt;br /&gt;&lt;blockquote&gt;By "open access" to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.&lt;br /&gt;&lt;/blockquote&gt;Both the Budapest and Berlin definitions go on to talk about copyright quite a bit; however, since the documents I'm hosting are already out of copyright, I don't really think those parts are relevant.  What I do have control over is my own copyright interest in the FromThePage software, and the ability to specify whatever kind of copyleft license I want.&lt;br /&gt;&lt;br /&gt;My quandary is this: none of the existing Free or Open Source licenses allow me to require that FromThePage be used in conformance with Open Access.  Obviously, that's because adding such a restriction -- requiring users of FromThePage not to charge for people reading the documents hosted on or produced through the software -- violates the basic principles of Free Software and Open Source.  So where do I find such a license?&lt;br /&gt;&lt;br /&gt;Have other Open Access developers run into such a problem?  Should I hire a lawyer to write me a &lt;span style="font-style: italic;"&gt;sui generis&lt;/span&gt; license for FromThePage?  
Or should I just get over the fear that someone, somewhere will be making money off my software by charging people to read the documents I want them to share?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-5815257425488359885?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/7WUhrOBpdAQ" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2009/05/open-source-vs-open-access.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>10</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-7328178099646953700</guid><pubDate>Sun, 05 Apr 2009 18:54:00 +0000</pubDate><atom:updated>2009-04-05T14:45:16.151-05:00</atom:updated><title>Feature: Full Text Search/Article Link integration</title><description>In the last couple of weeks, I've implemented most of the features &lt;a href="http://manuscripttranscription.blogspot.com/2009/03/feature-editorial-toolkit.html"&gt;in the editorial toolkit&lt;/a&gt;.  &lt;a href="http://manuscripttranscription.blogspot.com/2007/06/feaure-user-roles.html"&gt;Scribes &lt;/a&gt;can identify unannotated pages from the table of contents, readers can peruse all pages in a collection &lt;a href="http://manuscripttranscription.blogspot.com/2007/04/feature-article-links.html"&gt;linked to a subject&lt;/a&gt;, and users can perform a full text search.&lt;br /&gt;&lt;br /&gt;I'd like to describe the full text search in some detail, since there are some really interesting things you can do with the interplay between searching and linking.  
I also have a few unresolved questions to explore.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Basic Search&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There are a lot of technologies for searching, so my first task was research.  I decided on the simple &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/fulltext-boolean.html"&gt;MySQL fulltext search&lt;/a&gt; over SOLR, Sphinx, and &lt;code&gt;acts_as_ferret&lt;/code&gt; because all the data I wanted to search was located within the &lt;code&gt;PAGES&lt;/code&gt; table.  As a result, this only required a migration script, a text input, and a new controller action to implement.  You can see the result on the &lt;a href="http://beta.fromthepage.com/collection/show?collection_id=1&amp;amp;ol=blog"&gt;right hand side of the collection homepage&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Article-based Search&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Once basic search was working, I could start integrating the search capability with subject indexes.  Since each subject link contains the wording in the original text that was used to link to a subject, that wording can be used to seed a text search.  This allows an editor to double-check pages in a collection to see if any references to a subject have been missed.&lt;br /&gt;&lt;br /&gt;For example, &lt;a href="http://beta.fromthepage.com/article/show?article_id=7349&amp;amp;ol=blog"&gt;Evelyn Brumfield&lt;/a&gt; is a grandchild who is mentioned fairly often in Julia's diaries.  Julia spells her name variously as "Evylin", "Evelyn", and "Evylin Brumfield".  
So a link from the article page &lt;a href="http://beta.fromthepage.com/display/search?article_id=7349&amp;amp;ol=blog"&gt;performs a full text search&lt;/a&gt; for &lt;span style="font-style: italic;"&gt;"Evylin Brumfield" OR Evelyn OR Evylin&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;While this is interesting, it doesn't directly address the editor's need to find references they might have missed.  Since we're able to see all the fulltext matches for Evelyn Brumfield, and since we can see &lt;a href="http://beta.fromthepage.com/display/read_all_works?article_id=7349&amp;amp;ol=blog"&gt;all pages that link to the Evelyn Brumfield subject&lt;/a&gt;, why not subtract the second set from the first?  An additional link on the subject page searches for precisely this set: all references to &lt;span style="font-style: italic;"&gt;Evelyn Brumfield&lt;/span&gt; within the text that are not on pages linked to the Evelyn Brumfield subject.&lt;br /&gt;&lt;br /&gt;At the time of writing this blog post, the &lt;a href="http://beta.fromthepage.com/display/search?article_id=7349&amp;amp;unlinked_only=true&amp;amp;ol=blog"&gt;results of such a search&lt;/a&gt; are pretty interesting.  The first two pages in the results matched the first name in "Evylin Edmons", in pages that are already linked to the &lt;span style="font-style: italic;"&gt;Evelyn Edmonds&lt;/span&gt; subject.  Matched pages 4-7 appear to be references to Evelyn Brumfield in pages that have not been annotated at all.  But we hit pay dirt with page number 3: it's a page that was transcribed and annotated very early during the transcription project, containing a reference to Evelyn Brumfield that should be linked to that subject but is not.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Questions&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I originally intended to add links to search for each individual phrase linked to a subject.  
However, I'm still not sure this would be useful -- what value would separate, pre-populated searches for "Evelyn", "Evylin", and "Evylin Brumfield" add?&lt;br /&gt;&lt;br /&gt;A more serious question is what exactly I should be searching on.  I adopted a simple approach of searching the annotated XML text for each page.  However, this means that subject name expansions will match a search, even if the words don't appear in the text.  A search for "Brumfield" will return pages in which Julia never wrote Brumfield, merely because they link to "John", which is expanded to "John Brumfield".  This is not a literal text search, and might astonish users.  On the other hand, would a user searching for "Evelyn" expect to see the &lt;span style="font-style: italic;"&gt;Evelyns&lt;/span&gt; in the text, even though they had been spelled "Evylin"?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-7328178099646953700?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/NhnTV41EV0Y" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2009/04/feature-full-text-searcharticle-link.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-5922330684169260923</guid><pubDate>Tue, 24 Mar 2009 02:34:00 +0000</pubDate><atom:updated>2009-03-23T22:52:44.536-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">features</category><title>Feature: Mechanical Turk Integration</title><description>At last week's Austin On Rails SXSW party, my friend and compatriot &lt;a href="http://niblets.wordpress.com/"&gt;Steve Odom&lt;/a&gt; gave me a really neat feature idea.  
"Why don't you integrate with &lt;a href="https://www.mturk.com/mturk/welcome"&gt;Amazon's Mechanical Turk&lt;/a&gt;?" he asked.  This is an intriguing notion, and while it's not on my own road map, it would be pretty easy to modify FromThePage to support that.  Here's what I'd do to use FromThePage on a more traditional transcription project, with an experienced documentary editor at the head and funding for transcription work:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Page Forks:&lt;/span&gt; I assume that the editor using Mechanical Turk would want double keyed transcriptions to maintain quality, so the application needs to present the same, untranscribed page to multiple people. In the software world, when a project splits, we call this &lt;span style="font-style: italic;"&gt;forking&lt;/span&gt;, and I think that the analogy applies here.  This feature needs to be able to track an entirely separate edit history for the different forks of a page.  This means a new attribute on the master page record describing whether more than one fork exists, and a separate edit history for each fork of a page that's created.  There's no reason to limit these transcriptions to only two forks, even if that's the most common use case, so I'd want to provide a URL that will automatically create a new fork for a new transcriber to work in.  The Amazon HIT (Human Intelligence Task) would have a link to that URL, so the transcriber need never track which fork they're working in, or even be aware of the double keying.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Reconciling Page Forks:&lt;/span&gt; After a page has been transcribed more than one time, the application needs to allow the editor to reconcile the transcriptions.  This would involve a screen displaying the most recent version of two transcriptions alongside the scanned page image. 
Likely there's a decent Rails plugin already for displaying code diffs, so I could leverage that to highlight differences between the two transcriptions.  A fourth pane would allow the editor to paste the reconciled transcription into the master page object.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Publishing MTurk HITs:&lt;/span&gt; Since each page is an independent work unit, it should be possible to automatically convert an untranscribed work into MTurk HITs, with a work item for each page.  I don't know enough about how MTurk works, but I assume that the editor would need to enter their Amazon account credentials to have the application create and post the HITs.  The app also needs to prevent the same user from re-transcribing the same page in multiple forks.&lt;br /&gt;&lt;br /&gt;In all, it doesn't sound like more than a month or two's worth of work, even performed part-time.  This isn't a need I have for the Julia Brumfield diaries, so I don't anticipate building this any time soon.  Nevertheless, it's fun to speculate.  Thanks, &lt;a href="http://beta.fromthepage.com/collection/show?collection_id=1&amp;ol=blog"&gt;Steve&lt;/a&gt;!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-5922330684169260923?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/qpbmUvA1y9o" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2009/03/feature-mechanical-turk-integration.html</link><author>noreply@blogger.com (Ben W. 
Brumfield)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-1602213974288520772</guid><pubDate>Wed, 18 Mar 2009 17:21:00 +0000</pubDate><atom:updated>2009-03-18T19:57:36.001-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">progress</category><category domain="http://www.blogger.com/atom/ns#">features</category><title>Progress Report: Page Thumbnails and Sensitive Tags</title><description>As anyone reading this blog through &lt;a href="http://manuscripttranscription.blogspot.com/"&gt;the Blogspot website&lt;/a&gt; knows, visual design is not one of my strengths.  One of the challenges that users have with FromThePage is navigation.  It's not apparent from the &lt;a href="http://beta.fromthepage.com/display/display_page?page_id=859&amp;amp;ol=blog"&gt;single-page screen&lt;/a&gt; that clicking on a work title will show you a list of pages.  It's even less obvious from the multi-page work reading screen that the page images are accessible at all on the website. &lt;br /&gt;&lt;br /&gt;Last week, I implemented a suggestion I'd received from my friend &lt;a href="http://www.dmdesigninc.com/"&gt;Dave McClinton&lt;/a&gt;.  The &lt;a href="http://beta.fromthepage.com/display/read_work?ol=blog&amp;amp;page=22&amp;amp;work_id=3"&gt;work reading screen&lt;/a&gt; now includes a thumbnail image of each manuscript page beside the transcription of that page.  The thumbnail is a clickable link to the full screen view of the page and its transcription.  This should certainly improve the site's navigability, and I think it also increases FromThePage's visual appeal.&lt;br /&gt;&lt;br /&gt;I tried a different approach to processing the images from the one I'd used before.  
For transcribable page images, I modified the images offline &lt;a href="http://manuscripttranscription.blogspot.com/2007/05/progress-zoom.html"&gt;through a batch process&lt;/a&gt;, then transferred them to the application, which serves them statically.  The only dynamic image processing the FromThePage software did for end-users was involved in &lt;a href="http://manuscripttranscription.blogspot.com/2007/05/feature-zoom.html"&gt;zoom&lt;/a&gt;.  This time, I added a hook to the image link code, so that if a thumbnail was requested by the browser, the application would generate it on the fly.  This turned out to be no harder to code than a batch process, and the deployment was far easier.  I haven't seen a single broken thumbnail image yet, so it looks like it's fairly robust, too.&lt;br /&gt;&lt;br /&gt;The other new feature I added last week was support for &lt;a href="http://manuscripttranscription.blogspot.com/2007/06/feature-sensitive-tags.html"&gt;sensitive tags&lt;/a&gt;.  The support is still fairly primitive -- enclose text with &amp;lt;sensitive&amp;gt; and it will only be displayed to users authorized to transcribe the work -- but it gets the job done and solves some issues that had come up with Julia Brumfield's 1919 diary.  Happily, this took less than 10 minutes to implement.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-1602213974288520772?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/rE4r9JQg-EI" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2009/03/progress-report-page-thumbnails-and.html</link><author>noreply@blogger.com (Ben W. 
Brumfield)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-5730930067468816440.post-2199732653970825472</guid><pubDate>Mon, 16 Mar 2009 02:34:00 +0000</pubDate><atom:updated>2009-03-15T22:30:14.393-05:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">requirements</category><category domain="http://www.blogger.com/atom/ns#">features</category><category domain="http://www.blogger.com/atom/ns#">subject links</category><title>Feature: Editorial Toolkit</title><description>I'm pleased to report that my cousin Linda Tucker has finished transcribing &lt;a href="http://beta.fromthepage.com/display/read_work?ol=blog&amp;amp;work_id=3"&gt;the 1919 diary&lt;/a&gt;.  I've been trying my best to keep up with her speed, but she's able to transcribe two pages in the amount of time it takes me to edit and annotate a single, simple page.  If the editing work requires more extensive research, or (worse) reveals the need to re-do several previous pages, there is really no contest.  In the course of this intensive editing, I've come up with a few ideas for new features, as well as a few observations on existing features.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Show All Pages Mentioning a Subject&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Currently, the &lt;a href="http://manuscripttranscription.blogspot.com/2007/04/feature-article-links.html"&gt;article page for each subject&lt;/a&gt; shows a list of the pages on which the subject is mentioned.  This is pretty useful, but it really doesn't serve the purposes of the reader or editor who wants to read every mention of that subject, in context.  
In particular, after adding links to 300 diary pages, I realized that "Paul" might be either &lt;a href="http://beta.fromthepage.com/article/show?article_id=246"&gt;Paul Bennett&lt;/a&gt;, Julia's 20-year-old grandson who is making a crop on the farm, or &lt;a href="http://beta.fromthepage.com/article/show?article_id=325&amp;amp;ol=blog"&gt;Paul Smith&lt;/a&gt;, Julia's 7-year-old grandson who lives a mile away from her and visits frequently.  Determining which Paul was which was pretty easy from the context, but navigating the application to each of those 100-odd pages took several hours.&lt;br /&gt;&lt;br /&gt;Based on this experience, I intend to add a new way of filtering the multi-page view, which would display all the transcriptions of all pages that mention a subject.  I've already partially developed this as a way to filter the pages within a work, but I really need to 1) see mentions across works, and 2) make this accessible from the subject article page.  I am embarrassed to admit that the &lt;a href="http://beta.fromthepage.com/display/read_work?article_id=246&amp;amp;ol=blog&amp;amp;work_id=2"&gt;existing work-filtering feature&lt;/a&gt; is so hard to find that I'd forgotten it even existed.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Autolink&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://manuscripttranscription.blogspot.com/2007/06/progress-report-article-links.html"&gt;Autolink feature&lt;/a&gt; has proven invaluable.  I originally developed it to save myself the bother of typing &lt;code&gt;[[Benjamin Franklin Brumfield, Sr.|Ben]]&lt;/code&gt; every time Julia mentioned "Ben".  However, it's proven especially useful as a way of maintaining editorial consistency.  If I decide that "bathing babies" is worth an index entry on one page, I may not remember that decision 100 pages later.  
However, if Autolink suggests &lt;code&gt;[[bathing babies]]&lt;/code&gt; when it sees the string "bathed the baby", I'll be reminded of that.  It doesn't catch every instance, but for subjects that tend to cluster (like occurrences of newborns), it really helps out.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Full Text Search&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Currently there is no text search feature.  Implementing one would be pretty straightforward, but in addition to that I'd like to hook in the Autolink suggester.  In particular, I'd like to scan through pages I've already edited to see if I missed mentions of indexed subjects.  This would be especially helpful when I decide that a subject is noteworthy halfway through editing a work.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Unannotated Page List&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This is more a matter of workflow management, but I really don't have a good way to find out which pages have been transcribed but not edited or linked.  It's really hard to figure out where to resume my editing.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;[Update: While this blog post was in draft, I added a status indicator to the table of contents screen to flag pages with transcriptions but no subject links.]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Dual Subject Graphs/Searches&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Identifying names is especially difficult when the only evidence is the text itself.  In some cases I've been able to use &lt;a href="http://manuscripttranscription.blogspot.com/2007/07/feature-subect-graphs.html"&gt;subject graphs&lt;/a&gt; to search for &lt;a href="http://beta.fromthepage.com/article/graph?category_ids%5B%5D=18&amp;amp;min_rank=1&amp;amp;article_id=295"&gt;relationships between unknown and identified people&lt;/a&gt;.  
This might be much easier if I could filter either my subject graphs or the page display to see all occurrences of subjects X and Y on the same page.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Research Credits&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Now that the Julia Brumfield Diaries are public, suggestions, corrections, and research are pouring in.  My aunt has telephoned old-timers to ask what "rebulking tobacco" refers to.  A great-uncle has emailed with definitions of more terms, and I've had other conversations via email and telephone identifying some of the people mentioned in the text.  To my horror, I find that I've got no way to attribute any of this information to those sources.  At minimum, I need a large, HTML acknowledgments field at the collection level.  Ideally, I'd figure out an easy-to-use way to attribute article comments to individual sources.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5730930067468816440-2199732653970825472?l=manuscripttranscription.blogspot.com' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/CollaborativeManuscriptTranscription/~4/gcXTqbfbdsI" height="1" width="1"/&gt;</description><link>http://manuscripttranscription.blogspot.com/2009/03/feature-editorial-toolkit.html</link><author>noreply@blogger.com (Ben W. Brumfield)</author><thr:total>0</thr:total></item></channel></rss>

