Re: the good, the bad, and the ugly
- From: "Nick Hodson" <nicholashodson@[redacted]>
- Subject: Re: the good, the bad, and the ugly
- Date: Sat, 23 Dec 2006 10:53:40 -0000
Thank you, Bowerbird, for your kind words. In many cases the steps forward I
have taken for several years now have been due to your suggestions and/or
criticisms, so you have been a great help.
It is true that a good PDF, meaning a good set of scans, can aid the OCR
process. About six months ago I developed a "beheading" and "befooting"
process that just leaves the body text on a tiff. This makes a great
difference to the speed with which I can start listening to an e-book,
usually even before I have done any other work on the text. I think there
are about 24 filters that a book has to pass through before I consider it
ready. But even then there will be a few errors that went untrapped. I am
happy if it seems that the words are better than 99.95% accurate, and the
punctuation better than that.
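
To put the 99.95% figure in concrete terms, here is a minimal sketch of the arithmetic. The function name and the 100,000-word book length are illustrative assumptions, not figures from the message above.

```python
def max_word_errors(word_count, accuracy_percent=99.95):
    """Roughly how many wrong words a given accuracy target allows."""
    # Convert the accuracy percentage into an error rate, e.g. 0.0005.
    error_rate = (100.0 - accuracy_percent) / 100.0
    # Round to the nearest whole word.
    return round(word_count * error_rate)


# Hypothetical book of 100,000 words checked against the 99.95% target.
print(max_word_errors(100_000))
```

At that target, about five words in every ten thousand may still be wrong after all the filters have run.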
The agent that finally deskews the tiffs is FineReader 8. While this does a
better job than version 7 did, it is far from perfect. It is better first
to deskew anything over one degree out of true using Irfanview. After lots
of experimenting my initial scans are now done as gray-scale PNGs at 300
dpi. (Bowerbird was right here, too.) Nearly all my scanning is done on my
Plustek book scanner, so the lines and pages have a good chance of being
straight. If the book is a bit tatty, I guillotine it into a stack of
separate pages, and run it through my Kodak I40. This produces the best
results, but it is a shame to cut up a book that is in good condition.
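
The one-degree rule can be sketched as a simple triage step. This assumes the skew angle of each scan has already been measured by some other tool; the function name, file names, and angles are hypothetical, and this is not how Irfanview or FineReader work internally.

```python
def split_by_skew(angles, limit_degrees=1.0):
    """Split pages into those needing manual deskew and the rest.

    angles maps a page file name to its measured skew in degrees
    (positive or negative). Pages more than limit_degrees out of true
    go to the manual list; the others are left for the OCR package.
    """
    manual, automatic = [], []
    for page, angle in sorted(angles.items()):
        if abs(angle) > limit_degrees:
            manual.append(page)
        else:
            automatic.append(page)
    return manual, automatic


# Hypothetical measurements for three scans.
measured = {"p001.png": 0.3, "p002.png": -1.4, "p003.png": 1.0}
manual, automatic = split_by_skew(measured)
print("deskew by hand first:", manual)
print("leave to the OCR package:", automatic)
```

A page at exactly one degree is left to the OCR package here, since the message speaks of anything "over" one degree.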
The software for cropping is easy to use, and gets the majority of the pages
right. I move the scans that seem egregious to a separate folder, and use
Irfanview to sort out the problem. I am not surprised if I see 30 to 60 such
pages in a book that need this treatment, but it does not take long. The
first run of the PDF is used to look for any blemishes on the pages, and
Paint Shop Pro (a very old version) is used to clean them up. Thirty such
blemished pages is not unusual. The final run of the PDF is used to create a
PDF with clean, straight, well-cropped tiffs, and with the Logical Page
Numbers that Bowerbird was arguing for a few months ago. For the Logical Page
Numbers I use GPStill, which I collaborated with Frank Siegert in Germany to
produce.
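
As an illustration of what logical page numbering involves, the sketch below maps the physical position of each page in a PDF to the label printed in the book. It assumes a common layout, with front matter numbered in lower-case roman numerals followed by a body numbered from 1. The function names are hypothetical and this is not GPStill's interface.

```python
def to_roman(n):
    """Return n (1 or greater) as a lower-case roman numeral."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def logical_labels(total_pages, front_matter_pages):
    """List the logical label for each physical page, in order."""
    labels = []
    for physical in range(1, total_pages + 1):
        if physical <= front_matter_pages:
            # Front matter: i, ii, iii, ...
            labels.append(to_roman(physical))
        else:
            # Body text restarts at 1 after the front matter.
            labels.append(str(physical - front_matter_pages))
    return labels


# Hypothetical book: 4 pages of front matter, then the body.
print(logical_labels(6, 4))
```

With labels like these attached, a reader who asks the viewer for page 1 lands on the first page of the body rather than on the title page.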
Slainte mhath;
Nick Hodson, Athelstane e-Books, London, England, United Kingdom
----- Original Message -----
From: <Bowerbird@[redacted]>
To: <bookpeople@[redacted]>; <Bowerbird@[redacted]>
Sent: Friday, December 22, 2006 6:05 PM
Subject: [BP] the good, the bad, and the ugly
> it is perhaps fitting that nicholas would send a message on his
> 400th book on the same day i was ending my umichigan series.
>
> the workflow nicholas has developed can be described as _good_,
> while google's scans are _bad_, and the umichigan text is _ugly_...
>
> i encourage you to download one of the .pdfs that nicholas has
> created which are composed of the scans he's made of a book.
[...]