Book People Archive

Re: feedback to umichigan on "books and culture", part 5

From: "Nick Hodson" <nicholashodson@[redacted]>
Subject: Re: feedback to umichigan on "books and culture", part 5
Date: Tue, 3 Oct 2006 09:20:46 +0100

If you look at the list of words provided by Bowerbird, you will quickly
notice how often OCR reads a capital R as an E or a K. I should suppose that
most normal clean-up programs have a look at some point at capitalised words
to see which are the commonly used ones in the book, and which are likely to
be misreads. The Athelstane system does this as soon as OCR has been
completed, and then again at the very end of cleaning-up, at which point it
compares every capitalised word with every other one, reporting a list of
words that just might have been confused. Surprisingly, perhaps every third
book I process turns out to contain such a confusion.
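
As a rough illustration of the end-of-clean-up check described above, here is
a minimal Python sketch. It is not the Athelstane system's code: the rule used
(two capitalised words of equal length that differ in exactly one letter), the
function names, and the sample names are all assumptions made for the example.

```python
import re
from collections import Counter
from itertools import combinations


def capitalised_counts(text):
    """Count every word that starts with a capital letter."""
    return Counter(re.findall(r"\b[A-Z][A-Za-z]+\b", text))


def differ_by_one_letter(a, b):
    """True when a and b have equal length and exactly one differing letter."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def suspect_pairs(text):
    """Compare every capitalised word with every other one.

    Returns (rarer, commoner) pairs that might be misreads of each other.
    The one-letter rule is an assumption chosen for this sketch.
    """
    counts = capitalised_counts(text)
    pairs = []
    for a, b in combinations(sorted(counts), 2):
        if differ_by_one_letter(a, b):
            rare, common = sorted((a, b), key=lambda w: (counts[w], w))
            pairs.append((rare, common))
    return pairs


# Hypothetical sample text with one R-read-as-E error.
sample = (
    "Rowena spoke to Cedric. Eowena smiled, and Rowena left. "
    "Cedric followed Rowena."
)
print(suspect_pairs(sample))
```

The rarer spelling is listed first because it is the likelier misread, but a
human still has to decide, since the confusion may be in the original book.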

Sometimes this confusion has been in the original book itself. Even the
great Captain Marryat, in one of his books, introduces a character who
appears in the early chapters, and then not again until somewhere near the
end, where his name has been slightly changed, though it is obvious the same
person is intended.

On another point, Bowerbird quite rightly says that you should not use an
OCR program to write its output directly into .txt format, because that will
lose such markup as italics. He recommends the use of the rich text format
(.rtf), which can then be processed to leave such markup as is required. I
found some ten years ago, when I was using TextBridge, that this was the
optimal route for that OCR program. The alternative was to output the text
as html, but TextBridge seemed to have a problem in doing this accurately,
so I fixed on using rtf. Since then I have tried various OCR programs, and
found that they produced rtf files that were not always easy to clean up. So
when I started using ABBYY FineReader I fixed permanently on using html,
putting out a separate file for each page, and then using a program to
convert each page's text to marked-up text, followed by further automatic
processes that end up with a well marked-up file for each chapter. The result
is usually good enough that a close approximation to the chapter's audiobook
file can be derived from it, so you can start listening to the book even
before you begin the final clean-up of its chapter texts.
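
The page-by-page conversion step might look something like the Python sketch
below, which uses only the standard library's html.parser. It is an
assumption-laden example, not the program actually used: the underscore
convention for italics and the names ItalicKeeper and page_to_marked_text are
invented here, and real OCR output would need far more handling.

```python
from html.parser import HTMLParser


class ItalicKeeper(HTMLParser):
    """Collect page text, keeping italics as _underscores_ (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("i", "em"):
            self.parts.append("_")
        elif tag in ("p", "br"):
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("i", "em"):
            self.parts.append("_")
        elif tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def page_to_marked_text(page_html):
    """Convert one page of HTML to marked-up plain text, one paragraph per line."""
    parser = ItalicKeeper()
    parser.feed(page_html)
    parser.close()
    text = "".join(parser.parts)
    # Normalise runs of whitespace and drop the empty lines left by tags.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


page = "<p>He read <i>Ivanhoe</i> twice.</p><p>Then he slept.</p>"
print(page_to_marked_text(page))
```

Writing straight to .txt would have produced the same words with the italics
silently lost, which is the loss warned about above.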

The point is that html files are easier to clean up than rtf ones, but
you must use one or the other.

A very important and related point is that the format (marked-up plain text)
in which you store and edit your chapter files is not the format in which
you publish them. For instance, you need to be able to create, easily, html
files, .lit files for the Microsoft e-book format, Gutenberg-style plain
vanilla ASCII files, audiobook files for Fonix iSpeak, audiobook files for
TextAloud MP3, plain files for the TTS reader in your iPAQ, and a few others
besides.
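
To make the store-once, publish-many idea concrete, here is a small Python
sketch that derives two outputs from one marked-up chapter. It assumes,
purely for illustration, that italics are stored as _underscores_ and that
paragraphs are separated by blank lines; the function names are hypothetical
and only two of the formats listed above are shown.

```python
import html
import re

# Assumed storage markup: _word_ means italic.
ITALIC = re.compile(r"_(.+?)_")


def paragraphs(chapter):
    """Split a stored chapter into paragraphs on blank lines."""
    return [" ".join(p.split()) for p in chapter.split("\n\n") if p.strip()]


def to_html(chapter):
    """Publish as HTML: escape the text, then turn _x_ into <i>x</i>."""
    out = []
    for p in paragraphs(chapter):
        escaped = html.escape(p, quote=False)
        out.append("<p>" + ITALIC.sub(r"<i>\1</i>", escaped) + "</p>")
    return "\n".join(out)


def to_speech_text(chapter):
    """Publish for a text-to-speech reader: drop the italic markers."""
    return "\n".join(ITALIC.sub(r"\1", p) for p in paragraphs(chapter))


chapter = "He read _Ivanhoe_ twice.\n\nTom & Jerry\nlaughed."
print(to_html(chapter))
print(to_speech_text(chapter))
```

Each further target format would be another small function reading the same
stored file, so corrections are made once, in the marked-up source.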

Best wishes to all scanner-people.
Nick Hodson, Athelstane e-Books, London, England, United Kingdom.