Re: Where to put scraped Google Book Search OCRs
- From: Lars Aronsson <lars@[redacted]>
- Subject: Re: Where to put scraped Google Book Search OCRs
- Date: Thu, 16 Feb 2006 05:23:45 +0100
James Weiler wrote:
> But I guess it's moot. Quite a fray arose here around the concept of
> "scraping" Google's scans. But nobody has responded to me and said they
> actually want to do OCR on their scrapes, let alone post the results. I'll
> just go at it solo, I guess.
Any book in a Scandinavian language - or generally pertaining to
Scandinavia - will be accepted at Project Runeberg (runeberg.org),
provided there are no copyright issues. We want the page images,
which we publish as the first step. As the second step, anybody
is free to download the images, run OCR, and upload the resulting
raw text. Third, anybody can proofread the text, a page at a
time. Fourth, when the entire text has been proofread, the
resulting e-text can be posted to Project Gutenberg or reused for
other purposes. But we never remove the page images from our
website, nor the ability to proofread the last remaining OCR error.
For questions on how to help, write to editors@[redacted]
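For anyone who prefers to see the four steps above as code, here is a
minimal sketch in Python. It is only an illustration of the workflow,
not Project Runeberg's actual software: the names PageState, Page, and
Book are hypothetical, and it assumes each page simply carries its
image name, its current text, and a state.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class PageState(Enum):
    IMAGE_ONLY = 1  # step 1: page image published, no text yet
    RAW_OCR = 2     # step 2: raw OCR text uploaded
    PROOFREAD = 3   # step 3: a volunteer has proofread the page


@dataclass
class Page:
    image_name: str  # hypothetical field; the image is kept for good
    text: str = ""
    state: PageState = PageState.IMAGE_ONLY

    def upload_ocr(self, raw_text: str) -> None:
        self.text = raw_text
        self.state = PageState.RAW_OCR

    def proofread(self, corrected_text: str) -> None:
        # A page can be proofread again at any later time,
        # but not before some OCR text exists.
        if self.state is PageState.IMAGE_ONLY:
            raise ValueError("no OCR text to proofread yet")
        self.text = corrected_text
        self.state = PageState.PROOFREAD


@dataclass
class Book:
    title: str
    pages: List[Page] = field(default_factory=list)

    def ready_for_export(self) -> bool:
        # step 4: the e-text is reusable once every page is proofread
        return bool(self.pages) and all(
            p.state is PageState.PROOFREAD for p in self.pages
        )

    def export_text(self) -> str:
        if not self.ready_for_export():
            raise ValueError("some pages are not proofread yet")
        return "\n".join(p.text for p in self.pages)
```

Note that exporting does not touch the pages: the image names and the
ability to call proofread() again remain, which matches the point that
the images and the proofreading facility are never removed.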
From this description of Project Runeberg's process, you can
conclude that I too find Project Gutenberg insufficient in some
respects, albeit for other reasons than yours. But my way to deal
with this frustration has never been to complain, but to get to
work. And when you compare Michael Hart's chimpanzee with my
orangutan, you find that they share 98% of their DNA.
One book that we took from Google is http://runeberg.org/jvskola/
and you can see that we didn't even wash away the "Google Print"
text in the page margins: http://runeberg.org/jvskola/0075.html
(This is a Swedish translation of a German textbook on railways,
published in 1857.)
--
Lars Aronsson (lars@[redacted])
Project Runeberg - free Nordic literature - http://runeberg.org/