Book People Archive

Re: feedback to umichigan on "books and culture", part 1



perry said:
>    We are not performing any OCR on the materials scanned by Google.
>    Google provides us with both page images and OCR text files for each page.

then you need to point out to google the errors of their ways.
(in fact, you need to tell them to come here and talk to _me_...)

and -- as i noted -- google's own version of the text seems to have
noticed hyphens, as end-of-line hyphenates are correctly rejoined.
neither do they seem to have lost their em-dashes and quotemarks.

i certainly hope google ain't stiffing you with bad text on purpose.
i've been a staunch defender of google all along, but such a type of
bad-faith action would cause me to change my stance immediately.


>    We put the OCR text files in our repository 
>    without any changes or post-processing.

when you put garbage in, people get garbage out.

it's possible to do automatic corrections and fix
a _flock_ of errors.   even if your auto-corrections
don't fix an error, you're no worse off than before.
heck, even if your auto-correction routines should
_cause_new_errors_, you're still probably better off,
all things considered, assuming they work correctly
more often than they work incorrectly (a safe bet)...
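
as a rough illustration only (a minimal sketch, not anything google or
umichigan is known to run), here is what one such auto-correction could
look like: rejoining words split by an end-of-line hyphen. the function
name and the regex rule are assumptions made up for this example.

```python
import re

# hypothetical rule, assumed for illustration: a lowercase letter, a hyphen,
# a line break, then a lowercase letter is treated as one word broken
# across two lines. real compounds split at a line end (e.g. "well-" /
# "known") will be joined wrongly, which is the "new errors" case.
_EOL_HYPHEN = re.compile(r"([a-z])-\n[ \t]*([a-z])")


def rejoin_hyphenates(text: str) -> str:
    """Remove the end-of-line hyphen and line break inside a split word."""
    return _EOL_HYPHEN.sub(r"\1\2", text)


sample = "the correc-\ntion routine fixes hyphen-\nates at line ends.\n"
print(rejoin_hyphenates(sample))
```

a rule like this leaves text it does not recognize untouched, so where
it fails to fix an error the text is no worse off than before.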

but hey, find us those missing characters first, ok?
as long as we have those, _we_ will correct the text.

by "we", i mean _the_general_public_.   when i stated
that you should make this text available to the public
because public-domain books _belong_ to the public,
i didn't mean to imply the responsibilities are one-way,
and you're here to serve us.   it's a two-way street, perry,
and we are willing and able to give you something back.
we appreciate your gift to us, and we'd like to reciprocate.
so if you need help making your text correct, we'll help you!

-bowerbird