feedback to umichigan on "books and culture", part 7
- From: Bowerbird@[redacted]
- Subject: feedback to umichigan on "books and culture", part 7
- Date: Fri, 22 Dec 2006 06:03:55 EST
this is "part 7" -- the concluding episode! -- of a series
that i started some three months back. i got _waylaid_
by some "moderation" (which i will explain shortly) and
never got around to posting this; but to close out 2006,
and have all posts on the same "bookpeople" archive page,
i'll finish it up now.
you might remember i scraped some text from umichigan,
for a book called "books and culture", for which i already had
some rather perfect text, to use as a digitization demonstration.
(having perfect text in hand allowed me to gauge my accuracy.)
the test was to see whether i could digitize the book in one hour,
taking raw o.c.r. all the way through to a highly polished e-book.
to cut directly to the chase, i made my goal, with 3 minutes to spare.
the u.r.l.'s for the first 6 posts in this series are, respectively:
> http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2006&post=2006-09-26,1
> http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2006&post=2006-09-27,6
> http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2006&post=2006-09-28,5
> http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2006&post=2006-09-29,5
> http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2006&post=2006-10-02,6
> http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2006&post=2006-10-03,2
you might remember (painfully) that some of these posts were
-- shall we say? -- "rather extended". i went into great detail...
(some might even say "_excruciating_ detail". sorry about that.)
all the grunt work was recorded in those posts. this was "dessert".
nonetheless, i felt i should make this last post to finish the series.
so here we go, back to the post as written originally in october...
***
you know that show called "24", where they
take a 24-hour period in a secret agent's life,
and compress it to a 1-hour television show?
well, i've done the opposite, taking a 1-hour
digitization project and expanding it out into
24 hours worth of listserve posts... ;+)
but anyway, today concludes this series of
messages giving feedback to umichigan on
their o.c.r. text for "books and culture" by
hamilton wright mabie, which is found here:
> http://mdp.lib.umich.edu/cgi/pt?id=39015016881628
***
previous days have found us cleaning the text.
now that that job is finished, we need to go on
to _format_ the text to make it a full-on e-book.
in some people's minds, this formatting job can be
just as difficult as the text-cleaning, or even more so.
that's because they are unfamiliar with z.m.l.,
which is also known as "zen markup language".
z.m.l. is a format that i've "invented" that allows
you to create a high-powered electronic-book
by following a few dirt-simple formatting rules
so as to structure your e-text in a way that is
understood by the "intelligent" z.m.l. viewer-app,
which then presents it to the end-user as a full-on,
highly-capable e-book with a raft of functionality...
there's no need to "mark up" the file with all those
bracket-codes that you might know from .html...
a file in z.m.l. format doesn't look much different
from the raw output that you receive from o.c.r.,
so it doesn't take you a lot of energy to "rework" it.
indeed, for the most part, it doesn't take much more
than simple button-clicks to transform the files
we've been working with into your z.m.l. file...
before doing that, however, i'll step through the pages
one final time just to make sure that they all look ok...
yep, they're fine. so i click the button, and i'm done...
total time on this task: a mere 5 minutes.
comparison against the perfect text shows it was done correctly.
so that's it. we're done. so let's take care of the final paperwork.
task16: output the text to z.m.l. -- 5 minutes -- total=57 minutes
***
so, in our final recap, from start to finish, we have:
task01: pre-scraping examination -- 5 minutes -- total=5 minutes
task02: scraping the o.c.r. text-files -- 2 minutes -- total=7 minutes
task03: fixing all the running heads -- 2 minutes -- total=9 minutes
task04: fixing internal pagenumbers -- 2 minutes -- total=11 minutes
task05: finding some minor problems -- 1 minute -- total=12 minutes
task06: finding the one major problem -- 1 minute -- total=13 minutes
task07: generate name-list and review -- 2 minutes -- total=15 minutes
task08: continue on custom dictionary -- 1 minute -- total=16 minutes
task09: editing chapter-header pages -- 3 minutes -- total=19 minutes
task10: edit the front-matter pages -- 3 minutes -- total=22 minutes
task11: restore umichigan hyphens -- 5 minutes -- total=27 minutes
task12: finish the custom dictionary -- 2 minutes -- total=29 minutes
task13: correct the non-word words -- 10 minutes -- total=39 minutes
task14: re-test for remaining errors -- 1 minute -- total=40 minutes
task15: page-by-page clean-up -- 12 minutes -- total=52 minutes
task16: output the text to z.m.l. -- 5 minutes -- total=57 minutes
we made it under one hour. woo-hoo!
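(if you want to double-check the bookkeeping, the running totals above are just a cumulative sum of the per-task minutes. a trivial sketch, with the minutes copied straight from the recap:)

```python
# per-task minutes, copied from the task01-task16 recap above
minutes = [5, 2, 2, 2, 1, 1, 2, 1, 3, 3, 5, 2, 10, 1, 12, 5]

total = 0
for i, m in enumerate(minutes, start=1):
    total += m
    print(f"task{i:02d}: {m} minutes -- total={total} minutes")

print("under one hour?", total < 60)  # total is 57, so: True
```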
as a meta-comment on the workflow, you see that _many_ steps were
performed on a book-wide basis. this gives a type of efficiency that
a shared process like distributed proofreaders cannot hope to match.
d.p. has its own efficiencies that might offset that, to be sure. but
even more telling, they could incorporate some of these book-wide
processes into their own workflow, and reap considerable benefits...
i don't think they're smart enough, or coordinated enough, to do it.
not yet anyway. but maybe in years to come they will work up to it...
***
now, if you remember, our input for this experiment was
flawed, in that it was missing em-dashes and quote-marks,
which we were unable to replace with any automatic routines,
along with a number of other missing features that we _were_
able to successfully replace automatically (more or less), such as
hyphens on end-of-line hyphenates, text styling and indentation,
and blank lines between paragraphs. but mostly, this poor text
was pretty much a trainwreck. nonetheless, it is being offered
by the university of michigan library to its students and faculty.
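(as an aside, the end-of-line hyphenate repair is the kind of thing a few lines of code can do automatically. this is _not_ my actual clean-up program, just a minimal sketch of the idea: when the o.c.r. dropped the hyphen, a word broken across a line-break can be rejoined by checking the two fragments against a wordlist. the little wordlist and sample lines here are invented for illustration.)

```python
# minimal sketch: restore words that were broken across a
# line-break but lost their hyphen in the o.c.r. output.
# "wordlist" is a stand-in for a real dictionary (or the
# book's own vocabulary); a real clean-up tool is more careful.

def rejoin_broken_words(lines, wordlist):
    out = list(lines)
    for i in range(len(out) - 1):
        words = out[i].split()
        next_words = out[i + 1].split()
        if not words or not next_words:
            continue
        last, first = words[-1], next_words[0]
        # join only when the fragments form a known word
        # and the first fragment isn't a word on its own
        if (last + first) in wordlist and last not in wordlist:
            words[-1] = last + first
            out[i] = " ".join(words)
            out[i + 1] = " ".join(next_words[1:])
    return out

lines = ["the quality of the digi", "tization was quite good"]
wordlist = {"the", "quality", "of", "was", "quite",
            "good", "digitization"}
print(rejoin_broken_words(lines, wordlist))
# -> ['the quality of the digitization', 'was quite good']
```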
on the one hand, i'd be embarrassed to offer such low-quality
text if it were me. on the other hand, i'm glad that umichigan
did make this text available, because then i could _scrape_ it
and turn it into a high-quality e-book. (on still another hand,
however, i could have gotten the scans from google and done
my own o.c.r. on them, saving myself a lot of clean-up work
in the process, which is certainly what i would do next time,
because life is too short to clean up someone else's messes.)
so although i was tremendously jazzed when i first heard that
umichigan would be setting free their digital text o.c.r. results,
my experience with this one book has soured that enthusiasm.
i'm still very _appreciative_ of the fact they've released the text,
and i still _applaud_ them -- loudly -- for doing the right thing,
but the fact of the matter is _this_ text, anyway, was worthless...
i don't know if the text from other books is as bad as this text;
honestly, i'm afraid to look, because i suspect that it is.
certainly the text from the "making of america" project, which is
also housed at michigan, has just as many problems as this text,
if not more, so maybe it was just plain silly to hope this project
would be better just because google prides itself on its quality.
(a person certainly wouldn't be able to verify it by _this_ book.)
so i'll repeat my suggestions to umichigan (and to google too),
and add one final one today ("number 9") just for good measure.
> suggestion1: let people download the full text in one fell swoop.
> suggestion2: in the name of your files, include their pagenumber.
> suggestion3: have your o.c.r. program retain the paragraphing!
> suggestion4: make sure your o.c.r. program picks up hyphens!
> suggestion5: ensure your workflow doesn't irreparably damage the text!
> suggestion6: if google doesn't do the o.c.r. right, then re-do it.
> suggestion7: make sure that o.c.r. results retain all text-styling.
> suggestion8: make sure that o.c.r. results retain text indentation.
> suggestion9: find a way to incorporate corrections by end-users.
that last one should be fairly obvious. i've got a perfect text now.
if umichigan can't manage to switch out their badly-flawed text
for the perfect text, there is no hope for their future...
***
in summary, although it pains me significantly to have to say it,
since i've been such a _staunch_ defender of google ever since
they announced this project almost 2 years ago: if they cannot
do a better job than they did on this book, then maybe they
should take some money out of their deep pockets and hire
somebody to do the cleanup job on their text correctly.
because the only fair way to characterize this text is "garbage"...
***
if you'd like to see this mabie book in "continuous proofreading",
you can visit a website where it has been up for some time now:
> http://www.greatamericannovel.com/mabie/mabiep001.html
if you've looked at any of the pages from this book on the site at
umichigan, then probably the most striking thing you will notice
about my "continuous proofreading" interface is that it displays
_both_ the scan and the text at the same time on the same page,
while umichigan's interface makes you choose one or the other...
***
by the way, in an interesting turn, re-doing this book led me
to discover two errors in the version that i had put up online,
which i _had_ thought was error-free, on pages 6 and 247...
and, with my continuous proofreading interface, i was able to
make a note about each error right on the page where it was,
which will serve as reminders so that i don't forget to fix them.
***
to see the z.m.l. file of this mabie book, go here:
> http://www.greatamericannovel.com/mabie/mabie.zml
to compare and contrast, see the o.c.r. results from umichigan,
which i have concatenated into one file and put online as well:
> http://www.greatamericannovel.com/mabie/umabie.txt
(i rearranged some frontmatter pages for best comparison,
and renamed those stupid umichigan filenames _correctly_.)
i strongly urge you to look at the files in two browser windows.
you'll find it's pretty easy to synch 'em up using the filenames,
which are easy to spot, enclosed as they are in double-curly-brackets.
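(in fact, those double-curly-bracket markers make it easy to sync the files programmatically too. here's a sketch that splits a concatenated e-text into pages keyed by marker; the marker style is taken from the filenames-in-double-curly-brackets convention described above, but the sample filenames and text are made up.)

```python
import re

# split a concatenated e-text into pages, keyed by the filenames
# that appear in double-curly-brackets, e.g. {{p001.txt}}.
# the sample string below is invented for illustration.

def split_on_markers(text):
    pages = {}
    # re.split with a capture group keeps each marker name:
    # [preamble, name1, body1, name2, body2, ...]
    parts = re.split(r"\{\{([^}]+)\}\}", text)
    for name, body in zip(parts[1::2], parts[2::2]):
        pages[name] = body.strip()
    return pages

sample = "{{p001.txt}}first page text\n{{p002.txt}}second page text"
pages = split_on_markers(sample)
print(sorted(pages))      # -> ['p001.txt', 'p002.txt']
print(pages["p002.txt"])  # -> second page text
```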
viewing the two files side by side, you'll see why _formatting_
is a rather small part of the overall job of making an e-book,
at least when zen markup is chosen as your "master format"...
indeed, it's _remarkable_ how similar these two files appear,
especially when you consider that one is straight out of o.c.r.,
while the other is a polished format that creates an e-book
that is high-powered and as functional as one can demand.
it just goes to show you how little work it takes to "polish" it.
and z.m.l. can easily be transformed to .pdf and .html, which is
what you are seeing when you look at the mabie book on the web:
> http://www.greatamericannovel.com/mabie/mabiep001.html
(and then, of course, the .html version can also be converted to
a wide variety of other formats, for handhelds, like plucker or
ms-lit, being that .html is now the "rosetta stone" of formats.)
***
also note that if we disregard the missing-character problems
with the umichigan text -- those quote-marks and hyphens --
and just look at the quality of the o.c.r., you see it's quite good.
but that's not surprising, because the scans were done very well.
i haven't computed the percentage of lines perfectly recognized,
but it wouldn't surprise me in the slightest to find it was very high.
and this was on a book that is 100 years old. if you attend to the
quality of your work every step of the way, digitizing is not hard.
clean scans, deskewed and cropped correctly, give great o.c.r.,
especially if you use one of the better programs now available.
from there, it's a relatively short trip to a highly polished e-book,
one that can even be automated to a surprisingly large degree, so
large-scale scanning projects like google and o.c.a. _can_ do it.
it's just a question of whether they have the will to prioritize it...
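(for what it's worth, that "percentage of lines perfectly recognized" is trivial to compute once you have a perfect text in hand to gauge accuracy against, as i did. a sketch -- the two little line-lists are invented for illustration:)

```python
# compare o.c.r. output line-by-line against a known-good text
# and report the percentage of lines recognized perfectly.

def percent_perfect_lines(ocr_lines, perfect_lines):
    pairs = list(zip(ocr_lines, perfect_lines))
    if not pairs:
        return 0.0
    perfect = sum(1 for o, p in pairs if o == p)
    return 100.0 * perfect / len(pairs)

ocr = ["it was the best of times,", "it was the w0rst of times,"]
good = ["it was the best of times,", "it was the worst of times,"]
print(percent_perfect_lines(ocr, good))  # -> 50.0
```

(a real comparison would also need to align the two texts first, since a dropped or merged line throws off every pair after it.)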
***
in closing this series, i have demonstrated that digitization of
a book can be a relatively simple and _fast_ process, _providing_
you have the right tool. i finished this book in under an hour,
because my clean-up program facilitated all the tasks involved.
those of you who have digitized books _without_ the right tool
know that it can take a _lot_ longer under those circumstances.
in fact, for literally years now, people have been informing me
that i'm just plain crazy for even suggesting that a book can be
digitized in one hour. i've known better, all along, and replied
"the proof is in the pudding". you've just been served pudding.
-bowerbird