<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Stuporglue.org » Digitization</title>
	
	<link>http://stuporglue.org</link>
	<description>My gardening, programming and other DIY exploits</description>
	<lastBuildDate>Mon, 17 May 2010 04:52:48 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/StuporglueorgDigitization" /><feedburner:info uri="stuporglueorgdigitization" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>TakOCR</title>
		<link>http://feedproxy.google.com/~r/StuporglueorgDigitization/~3/jo8GUxaTA-A/</link>
		<comments>http://stuporglue.org/takocr/#comments</comments>
		<pubDate>Sun, 14 Mar 2010 16:11:54 +0000</pubDate>
		<dc:creator>stuporglue</dc:creator>
				<category><![CDATA[Digitization]]></category>
		<category><![CDATA[OCR]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[ocropus]]></category>
		<category><![CDATA[OSX]]></category>
		<category><![CDATA[tako]]></category>
		<category><![CDATA[takocr]]></category>

		<guid isPermaLink="false">http://stuporglue.org/?p=62</guid>
		<description><![CDATA[
TakOCR : Easy OCR for Mac
Tako : Japanese for Octopus
OCRopus : Great Open Source OCR project
TakOCR is a project to fill a need I had. I needed a GUI to an OCR engine for my dad. He&#8217;s not really the compile-it-and-use-the-command-line type of guy. He is, however, a Mac-using guy, so here are the results for your enjoyment.
Latest downloads
TakOCR.pkg version 1 md5: a7a620e1bbef92c454764c42ce1b4b8e
All packages, sources, uninstaller, etc.
NOTICE:
TakOCR is no longer supported. If the existing program works for you, great! If it does not work, I hope you find something else that does.
If someone wants...<a href="http://stuporglue.org/takocr/">Read the Rest</a>]]></description>
			<content:encoded><![CDATA[<!-- google_ad_section_start --><div id="takotime">
<h1>TakOCR : Easy OCR for Mac</h1>
<p><em>Tako</em> : Japanese for Octopus<br />
<em>OCRopus</em> : Great Open Source OCR project</p>
<p>TakOCR is a project to fill a need I had. I needed a GUI to an OCR engine for my dad. He&#8217;s not really the compile-it-and-use-the-command-line type of guy. He is, however, a Mac-using guy, so here are the results for your enjoyment.</p>
<h2>Latest downloads</h2>
<p><a href="/tako/downloads/v1/TakOCR.pkg">TakOCR.pkg version 1</a> <em>md5: a7a620e1bbef92c454764c42ce1b4b8e</em><br />
<a href="/tako/downloads/v1/">All packages, sources, uninstaller, etc.</a></p>
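<p>If you want to confirm that the download arrived intact, you can compare its checksum with the md5 value listed above. The following is a minimal sketch and not part of TakOCR: the function names <code>file_md5</code> and <code>verify_md5</code> are made up for illustration, and it assumes that either the <code>md5</code> command (Mac OS X) or <code>md5sum</code> (most Linux systems) is installed.</p>

```shell
#!/bin/bash
# Print the MD5 digest of a file, using whichever tool is installed:
# "md5 -q" on Mac OS X, "md5sum" on most Linux systems.
file_md5() {
    if command -v md5 >/dev/null 2>&1; then
        md5 -q "$1"
    else
        md5sum "$1" | cut -d ' ' -f 1
    fi
}

# Compare a file's digest with an expected value (hypothetical helper).
# Returns 0 on a match and 1 otherwise.
verify_md5() {
    local actual
    actual=$(file_md5 "$1")
    if [ "$actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 (got $actual)"
        return 1
    fi
}
```

<p>For example, <code>verify_md5 TakOCR.pkg a7a620e1bbef92c454764c42ce1b4b8e</code> should print an OK line if the package matches the checksum above.</p>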
<p><span style="color: #ff0000;">NOTICE:</span></p>
<p><span style="color: #ff0000;">TakOCR is no longer supported. If the existing program works for you, great! If it does not work, I hope you find something else that does.</span></p>
<p><span style="color: #ff0000;">If someone wants to give me a Mac with the latest version of OS X, I would be happy to update this software. :-)</span></p>
<h2>Usage</h2>
<p>Run the installer program, then just drop images onto the TakOCR application. The OCRed output will be displayed in a window that pops up.</p>
<p>You will need to quit TakOCR before dropping more images onto it.</p>
<h2>What&#8217;s Included, Copyrights</h2>
<p>TakOCR is really just a bundle of OCRopus, ImageMagick, Ghostscript and a little wrapper application to tie it all together. ImageMagick and Ghostscript let you OCR PDFs, TIFFs, JPEGs, and many more formats.</p>
<p>The wrapper script is just a little Ruby program made into a droplet application with the help of Platypus.</p>
<p>All of the software included is available under Open Source compatible licenses. You may download the sources at the link above and read the individual packages&#8217; licenses if you wish. The software included is: ImageMagick, iulib, libjpeg, leptonlib, libpng, ocropus, OpenFST, tesseract, libtiff, zlib, ghostscript.</p>
<p>TakOCR itself and the script behind the scenes are both placed in the Public Domain.</p>
</div>
<!-- google_ad_section_end -->
]]></content:encoded>
			<wfw:commentRss>http://stuporglue.org/takocr/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://stuporglue.org/takocr/</feedburner:origLink></item>
		<item>
		<title>OCRShow</title>
		<link>http://feedproxy.google.com/~r/StuporglueorgDigitization/~3/rB0qXja3QrQ/</link>
		<comments>http://stuporglue.org/ocrshow/#comments</comments>
		<pubDate>Sun, 14 Mar 2010 16:07:20 +0000</pubDate>
		<dc:creator>stuporglue</dc:creator>
				<category><![CDATA[Digitization]]></category>
		<category><![CDATA[Genealogy]]></category>
		<category><![CDATA[OCR]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[ocroshow]]></category>

		<guid isPermaLink="false">http://stuporglue.org/?p=59</guid>
		<description><![CDATA[OCRShow is a single PHP file you can put in a folder with scanned images and their OCRed text to quickly create a website that is indexable by search engines and easily navigable by humans.
Features

Display the digitized text and the source image on the same page
Search Engine Friendly URLs
Next/Previous page links
Easy to navigate for human visitors
Easily search the book
Just a single PHP file
Automatically writes .htaccess file if none exists
Easy installation

Download and Install OCRShow
Get It Here
To install OCRShow, you will need to create three types of files for each page you have scanned:

An image file (e.g. page_0001.png)
A smaller version of the...<a href="http://stuporglue.org/ocrshow/">Read the Rest</a>]]></description>
			<content:encoded><![CDATA[<!-- google_ad_section_start --><p>OCRShow is a single PHP file you can put in a folder with scanned images and their OCRed text to quickly create a website that is indexable by search engines and easily navigable by humans.</p>
<h2>Features</h2>
<ul>
<li>Display the digitized text and the source image on the same page</li>
<li>Search Engine Friendly URLs</li>
<li>Next/Previous page links</li>
<li>Easy to navigate for human visitors</li>
<li>Easily search the book</li>
<li>Just a single PHP file</li>
<li>Automatically writes .htaccess file if none exists</li>
<li>Easy installation</li>
</ul>
<h2>Download and Install OCRShow</h2>
<p><a href="../downloads/ocrshow.txt">Get It Here</a></p>
<p>To install OCRShow, you will need to create three types of files for each page you have scanned:</p>
<ol>
<li>An image file (e.g. page_0001.png)</li>
<li>A smaller version of the image file (e.g. small_page_0001.png)</li>
<li>A plain text copy of what the image says, named after the image file with .html appended (e.g. page_0001.png.html)</li>
</ol>
<p>OCRShow expects the &#8216;small_&#8217; prefix and the .html file extension. If you want something different, you will need to edit the code yourself.</p>
<p>Upload the small_, image, and .html files to their own directory. Rename the downloaded PHP file to index.php and upload it to the same directory. The first time you visit that directory, a .htaccess file will be created. It is needed to allow search engine friendly URLs.</p>
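<p>Before uploading, it can help to confirm that every page has all three files. The following is a minimal sketch and not part of OCRShow: <code>check_ocrshow_files</code> is a made-up name, and the sketch assumes the full-size images are .png files that follow the naming described above.</p>

```shell
#!/bin/bash
# List any companion files OCRShow would expect but that are missing.
# Assumes full-size images are .png files and previews use the small_ prefix.
check_ocrshow_files() {
    local dir=$1 img base missing=0
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue                 # the directory has no png files
        base=$(basename "$img")
        case $base in small_*) continue ;; esac   # previews are checked via their originals
        if [ ! -e "$dir/small_$base" ]; then
            echo "missing preview: small_$base"
            missing=1
        fi
        if [ ! -e "$dir/$base.html" ]; then
            echo "missing text: $base.html"
            missing=1
        fi
    done
    return $missing
}
```

<p>Running <code>check_ocrshow_files .</code> in the scan directory prints one line per missing file and returns a non-zero status if anything is missing.</p>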
<ul>
<li>OCRShow requires PHP, mod_rewrite and .htaccess to work.</li>
<li>Search functionality requires grep, sed and sort as well as the ability to run commands with backticks. Most hosting providers will have these.</li>
</ul>
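<p>To give an idea of why those three tools are enough for a simple search, here is a rough sketch of a text search over the .html files. It is an illustration only, not the code OCRShow actually runs: the function name <code>search_pages</code> and the output format are assumptions.</p>

```shell
#!/bin/bash
# Print the pages whose OCR text mentions a term, most matching lines first.
# search_pages is a hypothetical name; OCRShow's own search code may differ.
search_pages() {
    local term=$1 dir=${2:-.}
    # grep -c prints "file:count" for every file; -H always includes the file name,
    # -i ignores case, -F treats the term as a literal string.
    grep -cHiF -- "$term" "$dir"/*.html \
        | sed -e '/:0$/d' -e 's/\.html:/ /' \
        | sort -k2,2nr -k1,1
}
```

<p>Here grep counts the matching lines in each page, sed drops pages with no matches and strips the .html extension so that the image name remains, and sort puts the pages with the most matching lines first.</p>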
<h2>Why Would I Need This?</h2>
<p>In many cases there is only one copy of a genealogy book. Perhaps it is a personal history or a book of remembrance. Physically the book can only be in one place at one time. Thanks to the proliferation of scanners, you can easily change that book into a series of images. The images can easily be shared with people you know. What about people you don&#8217;t know? Say a common descendant who is also researching your great-great-great-great-grandpa? In order to let people like that find your scanned images, you need to get information about the scans into the search engines. Search engines can&#8217;t read images (yet!), so you need to provide text for the images. The easiest way to convert a typed document into text is with Optical Character Recognition software, such as Tesseract.</p>
<p>Search engines follow links when they are building their database, so you need to provide links between the different images and text. OCRShow is a very easy way to create those links and a very easy way to share your scanned documents. You just upload the documents and OCRShow and you&#8217;re done!</p>
<p>I am in the process of scanning several hundred pages of genealogy books so that they will be available digitally to the rest of my family. I decided that it would be good if Google could find them too so that other people who may be researching my ancestors will be able to find the information and hopefully we can help each other out.</p>
<h2>What&#8217;s Your Process?</h2>
<p>I am using Ubuntu 8.10 Linux and an Epson Stylus CX3810 all-in-one printer/scanner. The software xsane will automatically keep incrementing a number at the end of a file name, so I start the book at filename_0001.tif and just keep clicking &#8216;Scan&#8217;. I scan in straight black and white at 300 dpi, which keeps the file size down and makes it easier for the OCR software to figure out where the letters are.</p>
<p>Once the whole book or document is scanned, I run a short shell script:</p>
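<pre>		#!/bin/bash
		# OCR each scan; tesseract writes page.tif.txt for the input page.tif
		for i in *.tif; do tesseract "$i" "$i"; done
		# Convert each tif to a browser-friendly png (named page.tif.png for now)
		for i in *.tif; do convert "$i" "$i.png"; done
		# Make a smaller preview, 1000 pixels tall, with the small_ prefix
		for i in *.png; do convert -resize x1000 "$i" "small_$i"; done
		# Rename page.tif.txt to page.png.html, the name OCRShow looks for
		rename 's/\.tif\.txt$/.png.html/' *.txt
		# Drop the leftover .tif from the png names
		rename 's/\.tif\.png$/.png/' *.png
		rm *.tif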
</pre>
<p><a href="http://code.google.com/p/tesseract-ocr/" target="_BLANK">Tesseract</a> is an OCR engine developed over 10 years ago by HP, then donated to the open source world. It converts a tif file into a text file. It is not formatting aware, so columns, pedigree charts, etc. all disappear. In my case this is OK, because I am really just creating text so Google can find the images. I expect that human users will read the text on the images. The convert commands change the tifs into pngs, because most browsers can&#8217;t display tif files. I also create a smaller image so that the user can quickly view the page. Both png and tif are lossless, so we can delete the tif files at the end.</p>
<p>I know that software for this process exists for Windows and Macintosh systems too, but I am not familiar with what your options are. If you have recommendations, especially free recommendations, please let me know and I will post them here.</p>
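<p>To make the end result of the renaming concrete, the small sketch below prints the three file names that one scanned tif should end up as, following the naming OCRShow expects. The function name <code>ocrshow_names</code> is made up for this illustration and is not part of the script above.</p>

```shell
#!/bin/bash
# Print the three file names OCRShow expects for one scanned tif.
# ocrshow_names is a hypothetical helper, used here only to show the naming.
ocrshow_names() {
    local tif=$1
    local stem=${tif%.tif}     # page_0001.tif becomes page_0001
    echo "$stem.png"           # full-size image
    echo "small_$stem.png"     # preview image
    echo "$stem.png.html"      # OCR text
}
```

<p>Calling <code>ocrshow_names page_0001.tif</code> prints page_0001.png, small_page_0001.png and page_0001.png.html, one per line.</p>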
<!-- google_ad_section_end -->
]]></content:encoded>
			<wfw:commentRss>http://stuporglue.org/ocrshow/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://stuporglue.org/ocrshow/</feedburner:origLink></item>
	</channel>
</rss>
