Re: A significant OCR milestone
- From: "Jim Weiler" <xooqi@[redacted]>
- Subject: Re: A significant OCR milestone
- Date: Mon, 27 Oct 2003 18:46:01 -0600
Good questions from Anders. Here are the answers. Mode: Verbose.
> > page images - HUGE and very slow, but WHAT results!
>
> I'm not sure I follow. If the job was very slow, why is the result
> so good? What metrics are you using for the evaluation? Is it clock time
> vs. work time you're measuring, or only errors per page?
Slow refers to clock time, as compared to scanning and recognizing the same
pages had they been scanned as BW TIFFs at 300 DPI.
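As a rough illustration of why gray page images are so much heavier than BW ones, the sketch below compares uncompressed data volume per square inch for the two settings mentioned in this message (400 DPI, 8-bit gray versus 300 DPI, 1-bit BW). It is only a back-of-the-envelope estimate: the function name is made up for this example, and real file sizes depend on compression (JPG quality, TIFF compression), which is not modeled here.

```python
def raw_bytes_per_square_inch(dpi: int, bits_per_pixel: int) -> float:
    """Uncompressed image data for one square inch of page.

    Hypothetical helper for illustration only; it ignores all
    file-format overhead and compression.
    """
    pixels = dpi * dpi              # pixels in one square inch
    return pixels * bits_per_pixel / 8


gray_400 = raw_bytes_per_square_inch(400, 8)  # 400 DPI, 8-bit gray
bw_300 = raw_bytes_per_square_inch(300, 1)    # 300 DPI, 1-bit black-and-white
ratio = gray_400 / bw_300

print(f"gray 400 DPI: {gray_400:.0f} bytes per square inch")
print(f"BW 300 DPI:   {bw_300:.0f} bytes per square inch")
print(f"gray is about {ratio:.1f}x more raw data")
```

Before compression, the gray scan carries roughly fourteen times as much data per page, which is in line with the slower loading and display described here.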
> What time *did* it take? From taking up the book to saving the final
> text? (And I'm talking work time, not clock time -- OCR is done on clock
> time, as it's possible to do something else while it's running, but
> proofreading is work time.)
I started by cutting out all the pages. Five minutes labor with a $15
guillotine cutter. Then I fed them through my Brother MFC 3100C (a sheetfeed
scanner) 20 sheets at a time. The scanning was slow, on the order of 2 or 3
sheets a minute, because I was scanning at 400 DPI, 8-bit gray. But my only
involvement was the effort it took to start each batch of sheets. 15 seconds
per batch times 25 batches - about 8 minutes labor. The program I used for
batch scanning is IrfanView. The result: 246 one-half-megabyte JPG graphics,
saved at 50% compression. These I loaded into FineReader version 6 on an old
233 MHz PC. About 1 minute's work for me. Hours elapsed while FineReader
slowly loaded each file in turn and added it to the recognition batch. When
I returned to the computer later that day I started batch OCR. Again, just a
few seconds' work for me. I don't know how long the task ran because I let it
run overnight. Recognizing the text on each page didn't take long, but once
again there was a load-delay as each page was loaded and displayed -
processing grays. (This step can be speeded up a LOT if you tell FineReader
to do all the displays in black-and-white. But it's still not as fast as
using BW TIFFs because of the intrinsic load-time of half-meg images.) Next
morning, the OCR was done. I ran FineReader's interactive spell-checker,
which took about half an hour. Saved as RTF for final layout. Total labor
time: about 45 minutes, mostly spent doing the unnecessary spell-check.
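The labor figures above can be tallied in a few lines. This is a sketch using only the numbers quoted in this message; the dictionary keys are labels invented for the example, and the "few seconds" needed to start the batch OCR is left out because no figure is given for it.

```python
# Hands-on (work-time) minutes for each step, as quoted in the message.
labor_minutes = {
    "cut out pages": 5.0,
    "start scan batches": 25 * 15 / 60,  # 25 batches x 15 s; the message rounds this to "about 8"
    "load images into FineReader": 1.0,
    "interactive spell-check": 30.0,
}

total = sum(labor_minutes.values())
spell_check_share = labor_minutes["interactive spell-check"] / total

for step, minutes in labor_minutes.items():
    print(f"{step:30s} {minutes:6.2f} min")
print(f"{'total':30s} {total:6.2f} min")
print(f"spell-check share of labor: {spell_check_share:.0%}")
```

The strict sum comes out a little under the "about 45 minutes" stated above, and the difference is just rounding. Either way, the spell-check accounts for roughly 70% of the hands-on time before layout and proofreading.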
> The metric I've been most interested in is work time, with the goal
> to get it as low as possible while still getting as close to zero
> errors as possible (compared to the original, that is, misprints and
> all).
Finally, the bulk of the work time: Layout and proofreading. Because you
CAN'T trust the machine-result. My proofing method is to print and bind the
book, then read it cover-to-cover, slowly (150-200 WPM) so as to catch
(most) spuriously recognized words - but NOT checking against the original
book unless I run into a suspect. That amounts to two evenings' spare time
for me - an hour to lay out, print and bind, and three or more hours of
reading. If you want to keep the misprints (like a word with three
consecutive letters "S") you'll be able to take note of them during the
interactive spell-check in FineReader. Then another 10 minutes to correct
layout mistakes and typos.
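For a sense of scale, the proofreading figures can be worked through as follows. This is a rough sketch based on the numbers in this message: three hours is the stated minimum ("three or more"), so the word counts are lower bounds, and the function name is invented for the example.

```python
def words_read(hours: float, words_per_minute: int) -> int:
    """Words covered when reading at a steady pace (hypothetical helper)."""
    return int(hours * 60 * words_per_minute)


# Reading pace quoted in the message: 150-200 WPM for at least 3 hours.
low = words_read(3, 150)
high = words_read(3, 200)

# Post-OCR work time: layout/print/bind + reading + final corrections.
post_ocr_minutes = 60 + 3 * 60 + 10

print(f"3 hours of proofreading covers about {low} to {high} words")
print(f"post-OCR work time: at least {post_ocr_minutes} minutes")
```

On these numbers, layout and proofreading take more than four hours, compared with under an hour of hands-on time for the scanning and OCR steps themselves.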
Then repeat the whole process because a hard-disk glitch changes all your
file names to variations on _. and cross-links every directory
entry.
> More interesting ... how much of the result has to do with the press
> quality and factors you have no control over? It works only if you
> select the book?
This is the very crux. Press quality is everything. I got great results
applying this method to an absolutely beautiful copy of Gunman's Reckoning
by Max Brand. Every letter on the pages was inked to perfection and the
paper smooth and bright. Applying the same methods to The Girl Scouts Rally,
with uneven inking and very coarse paper, produced results I'm more used to:
several errors per page, and numerous places which were over-inked where the
recognition produced an unusable hash of gibberish. Still, overall these
errors won't add more than an hour to the total book preparation time. Using
these methods on an old pulp copy of The Cipher Detective, so badly inked
and impressed that most characters are blotched or blurred and so yellowed
as to be brown-paper-bag brown, produced horrible results, even with
recognition-training. At a guess, every third word is wrong. Maybe every
other. The Cipher Detective will take only slightly less time to prepare
using OCR than it would take a fast typist to type it cover-to-cover.
> And even more interesting ... is this FineReader 7 you've been using?
> I haven't quite made my mind up yet whether to upgrade my old 6 or not ...
As noted earlier: FineReader 6. I gave the FineReader 7 free demo a quick
comparative test-run a few weeks ago and didn't see any improvement over
version 6 recognizing the same set of BW book-page scans. Both versions
produced excellent results with exactly the same errors. I tried only a few
pages. Clean 200- and 300 DPI scans of pages from 90-year-old books. Your
mileage may vary. Tax, title and license separate.
Jim Weiler
the xooqi guy