Book People Archive

Re: Where to put scraped Google Book Search OCRs

From: Lars Aronsson <lars@[redacted]>
Subject: Re: Where to put scraped Google Book Search OCRs
Date: Thu, 16 Feb 2006 05:23:45 +0100

James Weiler wrote:

> But I guess it's moot. Quite a fray arose here around the concept of 
> "scraping" Google's scans. But nobody has responded to me and said they 
> actually want to do OCR on their scrapes, let alone post the results. I'll 
> just go at it solo, I guess.

Any book in a Scandinavian language - or generally pertaining to 
Scandinavia - will be accepted at Project Runeberg (runeberg.org), 
provided there are no copyright issues.  We want the page images, 
which we publish as the first step.  As the second step, anybody 
is free to download the images, run OCR, and upload the resulting 
raw text.  Third, anybody can proofread the text, a page at a 
time.  Fourth, when the entire text has been proofread, the 
resulting e-text can be posted to Project Gutenberg or reused for 
other purposes.  But we never remove the page images from our 
website, nor the ability to correct the last remaining OCR error.

For questions on how to help, write to editors@[redacted]

From this description of Project Runeberg's process, you can 
conclude that I too find Project Gutenberg insufficient in some 
respects, albeit for other reasons than yours.  But my way to deal 
with this frustration has never been to complain, but to get to 
work.  And when you compare Michael Hart's chimpanzee with my 
orangutan, you find that they share 98% of their DNA.

One book that we took from Google is http://runeberg.org/jvskola/ 
and you can see that we didn't even wash away the "Google Print" 
text in the page margins: http://runeberg.org/jvskola/0075.html

(This is a Swedish translation of a German textbook on railways, 
published in 1857.)

-- 
  Lars Aronsson (lars@[redacted])
  Project Runeberg - free Nordic literature - http://runeberg.org/