feedback to umichigan on "books and culture", part 6
- From: Bowerbird@[redacted]
- Subject: feedback to umichigan on "books and culture", part 6
- Date: Tue, 3 Oct 2006 11:43:59 EDT
welcome back. here's part 6 of my "feedback to umichigan".
we're digitizing a book using o.c.r. text from umichigan:
> http://mdp.lib.umich.edu/cgi/pt?id=39015016881628
our goal is to turn this raw o.c.r. output into an e-book
with digital text that is searchable/copyable/reflowable.
***
ok, we've come to the final step in the _cleaning_ stage,
which will include some of the _formatting_ as well, so
it's now time to do the page-by-page visual inspection,
and i'll tell you what that entailed...
first, my reparagraphing routine was too conservative.
any automatic process is always a balancing act between
(a) trying to make as many right decisions as possible, and
(b) trying to make as few wrong decisions as possible...
my routine to restore the paragraphs very rarely created
a paragraph incorrectly, but it sure missed a lot of the ones
that had been in the paper-book. it was very easy to do
an edit to the text to restore those paragraphs, but it was
still a more time-consuming hassle than i wanted it to be.
i didn't bother to rework the routine because umichigan
really needs to figure out how to retain the paragraphing.
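(just to give a flavor of that balancing act, here's a rough sketch,
in python, of the _kind_ of conservative rule i mean; this is not
my actual routine, and the slack threshold is a number i made up
purely for the illustration...)

def restore_paragraphs(lines, slack=10):
    # insert a blank line only where a paragraph almost surely ended:
    # the line must end with terminal punctuation _and_ be clearly
    # shorter than the longest line on the page. requiring both
    # conditions means few wrong calls, but plenty of misses.
    if not lines:
        return []
    longest = max(len(line.rstrip()) for line in lines)
    out = []
    for i, line in enumerate(lines):
        text = line.rstrip()
        out.append(text)
        ends_sentence = text.endswith(('.', '!', '?', '."', ".'"))
        clearly_short = len(text) < longest - slack
        if i < len(lines) - 1 and ends_sentence and clearly_short:
            out.append('')   # a blank line marks the paragraph break
    return out

a paragraph that happens to end on a full-measure line gets missed
by a rule like that, which is exactly the kind of miss i had to fix
by hand on this run-through.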
second, my rehyphenation routine was also conservative,
in the beginning anyway. so what i did was to jack it up,
to see how many false-alarms it would create in that mode.
i found that it created relatively few, and some of them were
quite entertaining (e.g., "man/goes" became "man-/goes").
i always like it when a routine helps me to see words anew.
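to make that join-or-hyphenate decision concrete, here's a rough
python sketch; the word-list is just a stand-in, and my real routine
weighs more evidence than this, so treat it as an illustration only...

def restore_hyphen(first_half, second_half, known_words):
    # decide whether a word split across a line-break should get its
    # hyphen restored ("man-goes") or simply be joined ("mangoes").
    joined = first_half + second_half
    both_words = (first_half.lower() in known_words and
                  second_half.lower() in known_words)
    if both_words:
        # jacked-up mode: restore the hyphen, but mark it as a
        # possible false-alarm whenever the plain join is a word too.
        return first_half + '-' + second_half, joined.lower() in known_words
    return joined, False

words = {'man', 'goes', 'mangoes', 'passion'}
print(restore_hyphen('man', 'goes', words))   # ('man-goes', True)  <- flagged
print(restore_hyphen('pas', 'sion', words))   # ('passion', False)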
one difficult case was rehyphenating across a page-break,
because the routine needs to check words in 2 different files,
so i adapted the startup analysis routine to make that easier.
on the screenshot, in the listbox at the top of the interface,
you'll find the two columns labeled "lastword" and "firstword":
> http://www.greatamericannovel.com/mabie/scrapeclean.jpg
i put the last word from each page in the "lastword" column,
and the first word on the next page in the "firstword" column,
so i could easily see what a hyphenated combination would be.
if my auto-rehyphenation routine calls them separate words,
it separates them on the display. if it calls them a hyphenate,
it puts in a dash and joins them, for easy visual confirmation.
browsing quickly through these to verify them is how i found
the "man/goes" false-alarm (which you find on page 37/38)...
***
most of the work i did on this page-by-page run-through was
on reparagraphing. i didn't really _check_ the hyphenation,
except to check all the highlighted "possible false-alarms".
(because i'd upped the rehyphenation so as to avoid misses,
i created more false-alarms, so i highlighted possible ones.)
i turned on the routine finding "new paragraphs at page-top"
-- it beeps if it hits one -- and fixed those when necessary.
i could -- and will -- automate this little task down the line.
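the check itself is simple; here's a rough python sketch of what it
looks for, again with made-up page*.txt files, and with a print
where my tool beeps...

import glob

def ends_a_paragraph(page_lines, slack=10):
    # does the last line on this page look like a paragraph ending?
    # if it does, the _next_ page probably opens a new paragraph,
    # and since the page-break ate the blank line, it needs a look.
    lines = [line.rstrip() for line in page_lines if line.strip()]
    if not lines:
        return False
    longest = max(len(line) for line in lines)
    last = lines[-1]
    return (last.endswith(('.', '!', '?', '."', ".'")) and
            len(last) < longest - slack)

page_files = sorted(glob.glob('page*.txt'))    # same made-up files
for here, there in zip(page_files, page_files[1:]):
    with open(here, encoding='utf-8') as f:
        if ends_a_paragraph(f.readlines()):
            print('check the top of', there)   # my tool beeps instead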
i also reintroduced the italics that were lost by the o.c.r.
there were only four lines with italics, and from working
on this book before, i had a good idea of where they were,
so it was pretty easy for me to make those edits. however,
in an unfamiliar book that might have lots and lots of italics,
it would be an absolute pain to have to reintroduce italics.
so it's _crucially_ important that google/umichigan _change_
their workflow so that their o.c.r. does not lose text-styling.
likewise, i reintroduced the block-quote indentation in the
one place in this book where it exists. easy enough for me,
on this book, but a pain in the patooty on other books, so
_please_, google/umichigan, _fix_this_. indentation is vital!
***
now a few thoughts about the book used in this experiment...
i've been dealing with the actual o.c.r. that michigan has posted;
this is what they are showing to their students, staff, and faculty.
this is not a contrived example, pre-selected to give good results.
really, i didn't even _pick_ this book; it's the one that google itself
offered as its first example of a public-domain book it scanned...
***
now a few thoughts about the goal of this experiment...
my goal was to prepare a book for "continuous proofreading",
the system i've proposed where we invite the general public to
do "final" proofing of a book before it's certified as "error-free".
since it is this process of "continuous proofreading" that will take
a digitized book on its _march_toward_perfection_, my goal here
is merely to make the output "good enough" to offer to the public
so that they can _begin_ their task of "continuous proofreading".
in other words, i see my output as "commencement" for the book,
in the sense that that word means "a beginning" and not "an end".
putting it another way, i don't see my task now as making it perfect.
but a reasonable question to ask is "how good does it have to be,
before we consider it 'good enough' to offer up to the public?"
as many of you know, i've answered that question often here,
and my answer is this: "an average of 1 error every 10 pages."
there are 280 pages in this book, so i'd consider my effort a success
if the text i produced had _28_ or fewer errors in it.
as i have said, this book has already been finished to perfection,
by many different parties. you might remember jose menendez
did this book, and posted a comparison of _his_ results versus
the digitization produced by distributed proofreaders, finding
that the d.p. version was not as error-free as the one he made:
> http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005&post=2005-09-30,3
(coincidentally, it was almost exactly one year ago today that
jose posted his findings here, on september 30th of 2005...)
jose found that the d.p. version posted by project gutenberg
had 50+ errors, even including _one_whole_missing_page_...
(you might remember a long discussion about whether that
one missing page should count as _one_ error, or whether
every single _word_ on that missing page should count as
a separate one; even the former case showed 50+ errors.)
jose found, in comparing his version to the d.p. one, that he
had come up with an error-free product, all by his lonesome,
proving with pudding that one careful, dedicated proofer can
outperform the thousands over at distributed proofreaders...
it also means that i now have a perfect digitization against which
i can test my results, so we can know for sure what my quality is.
unlike jose, who i bet spent more than one hour on it,
i wasn't aiming for perfection. on the other hand, i was definitely
aiming to improve the error-filled text that umichigan presented,
since i would be embarrassed to post that kind of ugly garbage...
so, on the quality/effort tightrope, i'm looking for <=28 errors...
***
to cut immediately to the end, my output had 12 mistakes in it.
that's less than half the rate i needed to call my work a success,
so i'm happy with the results. i might add that, after looking at
the mistakes, i feel i could improve my processes immediately,
and reduce my errors to as low as 10, maybe even single digits.
but like i said, an accuracy-rate of roughly 1-error-every-23-pages
is more than good enough to be proud of after 1 hour of work,
good enough to send this book to "continuous proofreading".
remember, if you want to look up this text at the umichigan site,
you'll need to _add_6_ to the pagenumber to get the right page.
(the pagenumber is listed to the left of each of the errors here.)
so here are the errors that i failed to catch:
034 among them were. Holinshed's
034 among them were Holinshed's
034 ************** extraneous period
044 things which concern, most closely
044 things which concern most closely
044 ******************* extraneous comma
045 himself in his leisure hours, to think
045 himself, in his leisure hours, to think
045 ******* missing comma
080 and has beer to the leading races;
080 and has been to the leading races;
080 *********** stealth scanno
100 evoke not the shad; but the passion-
100 evoke not the shade, but the passion
100 ****************** stealth scanno
101 mass the facts about any given period,
101 mass the facts about any given period;
101 ************************************* wrong punctuation
129 can possess real culture who has hot,
129 can possess real culture who has not,
129 ********************************* stealth scanno
148 as it was originally put forth, out of
148 as it was originally put forth out of
148 ****************************** extraneous comma
156 there has always been, not only a
156 there has always been not only a
156 ********************* extraneous comma
198 first time, with dear eyes, the depth
198 first time, with clear eyes, the depth
198 ***************** stealth scanno
218 imagination in which, the poets have
218 imagination in which the poets have
218 ******************** extraneous comma
278 literature in the spiritual nature and,
278 literature in the spiritual nature and
278 ************************************** extraneous comma
***
and here's a summary, listed by the types of errors:
page = type of error
0034 = extraneous period
0044 = extraneous comma
0148 = extraneous comma
0156 = extraneous comma
0218 = extraneous comma
0278 = extraneous comma
0045 = missing comma
0101 = wrong punctuation
0080 = stealth scanno
0100 = stealth scanno
0129 = stealth scanno
0198 = stealth scanno
basically, it boils down to two main types of errors, namely
8 punctuation errors and 4 stealth scannos. interesting...
i don't have much to say about the punctuation errors.
a good look at each scan tells you _why_ they occurred.
as i noted the other day, it's unfair to call these "errors";
the o.c.r. is usually reporting a mark that's really there,
even if it's not the _punctuation_mark_ it thought it was.
it might be the case that some processing of the scans
would help to clean up the glitches that cause extra
punctuation to be "recognized". on the other hand,
processing the scans in this way might well _decrease_
the recognition of the existing punctuation. without
a good round of experimentation, it's difficult to know.
(as i don't even know what o.c.r. package google is using,
it is hard to know how to assess its overall performance.
in general, i think this recognition was rather outstanding,
which is not surprising because these scans were _clean_.
especially at 200% on the umichigan interface, they _pop_.)
just as a note, i _should_ have caught the missing comma on
page 45, since i corrected "himselt" to "himself" right there...
***
although i don't think i've said it _directly_ thus far, this effort
reflects my philosophy of looking only at the "problem" words,
not _all_ words. when 99%+ of the words are _correct_, i feel
it's a waste of human time and energy to look at _each_one_,
especially since -- with an accuracy rate that high -- the mind
is more apt to miss the occasional incorrect word _anyway_...
the best way to catch incorrect words is to _read_for_content_.
proofreading -- except by expert proofreaders -- will _never_
be as good at catching errors as a person reading for content.
that's why i want to move the "final" check out to real readers,
who know they're a "last check" as "continuous proofreaders",
but who nonetheless are reading the book for its _content_...
***
now, given this philosophy of only looking at "problem words",
some people with experience in digitizing are _sure_
to bring up the question of stealth scannos. these are words
that are recognized incorrectly, but _not_ flagged by a spellcheck
because -- in their erroneous form -- they spell _another_ word.
in this mabie text from umichigan, there were 4 stealth scannos.
one was "have beer to" instead of "have been to", which i could
easily incorporate into a check, since "have beer to" is gonna be
rare enough it wouldn't hurt to flag it. (a google search shows
it occurs in one p.g. e-text -- #6481 -- as "my doctor says that
i must have beer to give me strength".) another stealth scanno
was "who has hot" instead of "who has not", another triad that
could easily be set as a check. (google finds 3 p.g. e-texts with
the phrases -- "who was hot in pursuit", "who was hot-headed",
and "who was hot with anger".) yet another stealth scanno was
"shad" -- which is a type of herring -- when it should have been
"shade". whether we even need "shad" in the dictionary will be
a matter that we could think about, but for now, let's accept that.
(again google reports 3 such e-texts: 11118, 12815, and 16033.)
the final stealth scanno was "with dear eyes" for "with clear eyes".
(google reports 492 cases of "dear eyes", 1740 of "clear eyes"...)
the first three might be noticed immediately by a person who was
reading for content, and -- undoubtedly -- would be understood.
(indeed, i'd also think many readers would miss them completely.)
the final one might go unnoticed for a very long time, because
it makes the same sort of sense in its "mistaken" form, which
-- not coincidentally -- makes it a rather harmless error, no?
in this vein, i think that none of these stealth scannos are _serious_,
not in the sense that they have changed the _meaning_ of the text.
(for that matter, the _only_ serious stealth scanno i see regularly
is the switching of "not" and "now", since that often changes the
meaning of a sentence to its complete opposite. that's serious.
oh yes, i see later on the d.p. version of this text had a switch
of "none" to "more" in a sentence, which is another serious one.
but in general, stealth scannos don't change a text's meaning...)
so yes, this philosophy of "looking only at the flagged words"
has a flaw in that it leaves us open to stealth scannos, but still,
as we've shown here with this real-world example, the problem
might not be all _that_ serious; it certainly wasn't in this case...
(and this leaves aside the point i made above, that even if you
_do_ look "at every word", you're still open to a chance of errors.
in this regard, i do believe that i can write computer routines that
will be _far_ better than human eyes at catching stealth scannos.
but i'm still doing research on that, so i can't make any promises.)
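but just to show the direction i'm headed, here's a rough python
sketch of a plain phrase-list check; the list holds only the handful
of cases from this book, so it's purely illustrative, and anything i
actually release would weigh context instead of blindly matching...

import re

# phrases that are more often scannos than real text, at least here;
# a real routine would carry a much longer list than this.
SUSPECT_PHRASES = [
    r'\bhas beer to\b',    # probably "has been to"
    r'\bwho has hot\b',    # probably "who has not"
    r'\bthe shad\b',       # probably "the shade"
    r'\bdear eyes\b',      # sometimes "clear eyes"
]

def flag_stealth_scannos(text):
    # return (line-number, line) pairs that deserve a second look
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for pattern in SUSPECT_PHRASES:
            if re.search(pattern, line, flags=re.IGNORECASE):
                hits.append((number, line.strip()))
                break
    return hits

run over the raw o.c.r. text, a list like that just flags lines for
a human to eyeball, which is all i'd want it to do.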
in closing the issue of stealth scannos, i repeat that the _best_
way to find 'em is to have the public read the book for content.
***
at any rate, this final clean-up task took me 12 minutes, so i have
spent a grand total of 52 minutes so far, which leaves me 8 minutes
for a final formatting pass, which should be plenty of time for that.
task15: page-by-page clean-up -- 12 minutes -- total=52 minutes.
***
a recap of the tasks and times so far:
task01: pre-scraping examination -- 5 minutes -- total=5 minutes
task02: scraping the o.c.r. text-files -- 2 minutes -- total=7 minutes
task03: fixing all the running heads -- 2 minutes -- total=9 minutes
task04: fixing internal pagenumbers -- 2 minutes -- total=11 minutes
task05: finding some minor problems -- 1 minute -- total=12 minutes
task06: finding the one major problem -- 1 minute -- total=13 minutes
task07: generate name-list and review -- 2 minutes -- total=15 minutes
task08: continue on custom dictionary -- 1 minute -- total=16 minutes
task09: editing chapter-header pages -- 3 minutes -- total=19 minutes
task10: edit the front-matter pages -- 3 minutes -- total=22 minutes
task11: restore umichigan hyphens -- 5 minutes -- total=27 minutes
task12: finish the custom dictionary -- 2 minutes -- total=29 minutes
task13: correct the non-word words -- 10 minutes -- total=39 minutes
task14: re-test for remaining errors -- 1 minute -- total=40 minutes
task15: page-by-page clean-up -- 12 minutes -- total=52 minutes.
tomorrow we'll finish up with that final formatting pass,
and a summary of my thoughts on this project...
-bowerbird