Book People Archive

Re: the good, the bad, and the ugly



Thank you, Bowerbird, for your kind words. In many cases the steps forward I 
have taken for several years now have been due to your suggestions and/or 
criticisms, so you have been a great help.

It is true that a good PDF, meaning a good set of scans, can aid the OCR 
process. About six months ago I developed a "beheading" and "befooting" 
process that just leaves the body text on a tiff. This makes a great 
difference to the speed with which I can start listening to an e-book, 
usually even before I have done any other work on the text. I think there 
are about 24 filters that a book has to pass through before I consider it 
ready. But even then there will be a few errors that went untrapped. I am 
happy if it seems that the words are better than 99.95% accurate, and the 
punctuation better than that.
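
To put that accuracy figure in perspective, here is a minimal sketch of the arithmetic. The 100,000-word book length and the function name are assumptions chosen only for illustration, not measurements from any particular e-book.

```python
def expected_word_errors(word_count: int, accuracy: float) -> float:
    """Return the expected number of wrong words at a given per-word accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return word_count * (1.0 - accuracy)


# Hypothetical book length (an assumption, for illustration only).
errors = expected_word_errors(100_000, 0.9995)
print(f"about {round(errors)} wrong words remaining")
```

On those assumed numbers, 99.95% accuracy still leaves roughly 50 wrong words in the book, which is why a few untrapped errors are to be expected.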

The agent that finally deskews the tiffs is FineReader 8. While this does a 
better job than version 7 did, it is far from perfect. It is better first 
to deskew anything over one degree out of true using Irfanview. After lots 
of experimenting, my initial scans are now done as gray-scale PNGs at 300 
dpi. (Bowerbird was right here, too.) Nearly all my scanning is done on my 
Plustek book scanner, so the lines and pages have a good chance of being 
straight. If the book is a bit tatty, I guillotine it into a stack of 
separate pages, and run it through my Kodak I40. This produces the best 
results, but it is a shame to cut up a book that is in good condition.

The software for cropping is easy to use, and gets the majority of the pages 
right. I move the scans that seem egregious to a separate folder, and use 
Irfanview to sort out the problem. I am not surprised if I see 30 to 60 
pages in a book that need this treatment, but it does not take long. The 
first run of the PDF is used to look for any blemishes on the pages, and 
Paint Shop Pro (a very old version) is used to clean them up. Thirty such 
blemished pages is not unusual. The final run is used to create a 
PDF with clean, straight, well-cropped tiffs, and with the Logical Page 
Numbers that Bowerbird was arguing for a few months ago. For the Logical Page 
Numbers I use GPStill, which I collaborated with Frank Siegert in Germany to 
produce.
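
As a rough illustration of what Logical Page Numbers mean, the sketch below maps physical PDF page positions to the labels printed in the book. It is not a description of how GPStill works; the function names and the convention of roman numerals for front matter followed by arabic numbers for the body are assumptions made for the example.

```python
def to_roman(n: int) -> str:
    """Convert a positive integer to a lowercase roman numeral."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    parts = []
    for value, symbol in numerals:
        while n >= value:
            parts.append(symbol)
            n -= value
    return "".join(parts)


def logical_labels(total_pages: int, front_matter_pages: int) -> list[str]:
    """Label front matter i, ii, iii... and the body 1, 2, 3... (assumed scheme)."""
    labels = []
    for physical in range(1, total_pages + 1):
        if physical <= front_matter_pages:
            labels.append(to_roman(physical))
        else:
            labels.append(str(physical - front_matter_pages))
    return labels


# Hypothetical book: 4 pages of front matter, then the body text.
print(logical_labels(6, 4))
```

With labels like these, the page number shown by the PDF reader matches the number printed on the scanned page, rather than the page's position in the file.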

Slainte mhath;
Nick Hodson, Athelstane e-Books, London, England, United Kingdom


----- Original Message ----- 
From: <Bowerbird@[redacted]>
To: <bookpeople@[redacted]>; <Bowerbird@[redacted]>
Sent: Friday, December 22, 2006 6:05 PM
Subject: [BP] the good, the bad, and the ugly


> it is perhaps fitting that nicholas would send a message on his
> 400th book on the same day i was ending my umichigan series.
>
> the workflow nicholas has developed can be described as _good_,
> while google's scans are _bad_, and the umichigan text is _ugly_...
>
> i encourage you to download one of the .pdfs that nicholas has
> created which are composed of the scans he's made of a book.

[...]