feedback to umichigan on "books and culture", part 2
- From: Bowerbird@[redacted]
- Subject: feedback to umichigan on "books and culture", part 2
- Date: Wed, 27 Sep 2006 16:40:46 EDT
ok, yesterday i downloaded the text-files for a book
-- "books and culture", by hamilton wright mabie --
from the university of michigan, to use them to create
an e-book with reflowable, copyable, searchable text...
if you'd like to see this book at the university of michigan:
> http://mdp.lib.umich.edu/cgi/pt?id=39015016881628
this is the only place you can see the actual text i'm using.
you can also choose to see the page-scan for each page.
the downside is that umichigan uses the "sequence number"
instead of the actual pagenumber from the paper-book, so
-- when i use a pagenumber in this post -- you will have to
add _6_ to that pagenumber to get the sequence-number.
yeah, it's a hassle, but direct your complaints to umichigan...
or, to see the scans only, you can go to google:
> http://books.google.com/books?vid=0j28WDMVuyMddngbbu&id=yGZZXIrbUKQC&pg=PA5&lpg=PA5&dq=%22books+and+culture%22&as_brr=1
if you don't need to view the text, google's site is better
because you don't need to compute that silly offset on it...
or, for the best display, with scans and text side-by-side:
> http://www.greatamericannovel.com/mabie/mabiep001.htm
no silly offset, obvious urls, convenient table-of-contents, etc.
still, if you want to check on the actual _text_ that i'm using,
you'll have to use the umichigan site...
***
in order to make this e-book, we've got to clean up the
o.c.r. results, and then format all of the text correctly...
to help me do this cleaning and formatting, i've written
an app. more precisely, i've written _many_ such apps
across the years. and -- to do this job -- i did a rewrite
from scratch, just because it seems i like doing that;
i'm not sure why...
so here's a screenshot of my cleanup app, "scrape/clean":
> http://www.greatamericannovel.com/mabie/scrapeclean01.jpg
i'll just give you a quick orientation to the program now...
across the top is a listbox that's akin to a "table of contents";
when the app starts, it reads the file-directory of the folder
in which it is located, and records each .txt file in the folder.
(the program actually _opens_ each one of the files, and does
a preliminary analysis of it, and we'll discuss that more later.)
each row of the listbox holds one such file, with its name
being listed in the "filename" column. in this screenshot,
we see names from umabiep001.txt through umabiep020.txt,
but of course the names run all the way to umabiep280.txt...
by clicking on a row in the listbox, you can jump to that page.
at the lower-right is the _image-scan_ of that particular page.
and at the lower-left, you get the _actual_text_ for that page,
in a field that you can edit and save.
so we have an interface that gives us a good overview,
shows us each particular scan, and lets us edit the text.
that's pretty much all that we need.
instead of clicking the listbox, you can use the _cursor-keys_
to move from page to page, so navigation is quick and easy...
in addition, you can just type in a number and press [enter]
and you will be instantly jumped to that page. _very_ handy.
you can also use the [-] and [+] keys to jump to the start
of the previous chapter or the next chapter, respectively.
again, this ability to navigate the book _effortlessly_ will be
a tremendous asset to you in the process of cleaning it up.
***
one of the first things i do with a set of o.c.r. files like this
is to normalize the running-heads, and the pagenumbers.
remember that i have already _named_ the files correctly.
that is, each filename contains the pagenumber of the file;
the names are "umabiep001.txt" through "umabiep280.txt".
each .txt file has a .jpg counterpart with the same name...
it's _so_ handy to know -- just by looking at a file's name --
what pagenumber the content inside represents, i swear...
now what i want to ensure is that the proper pagenumber
is contained _inside_ the file as well. because when you're
actually _inside_ the file, you wanna know what pagenumber
that content represents as well. and, as fortune would have it,
most pages in books have their pagenumber printed on 'em
-- surprise! i am a master of the obvious, don't you think? --
and o.c.r. picks up those pagenumbers and recognizes them
-- not always correctly, just like the regular text, but still... --
so we need to check them, and correct them if they're wrong.
pagenumbers are typically found at the _bottom_ of the page,
or at the _top_ of the page along with the run-head, so my app
pulls out each file's first and last lines and puts 'em in the listbox,
as seen in the screenshot (i.e., "firstline" and "lastline" columns).
the app pulls out these lines when it analyzes each file at startup.
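(if you're curious, here's roughly what that startup analysis
boils down to, as a python sketch. my app isn't written in python,
and the names here are just for illustration, but the logic is the same:)

    import glob, os

    def analyze_folder(folder):
        # scan the folder for the page-files, and pull out the first
        # and last non-blank line of each one, for the listbox...
        rows = []
        for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
            with open(path, encoding="ascii", errors="replace") as f:
                lines = [ln.strip() for ln in f if ln.strip()]
            rows.append((os.path.basename(path),
                         lines[0] if lines else "",    # "firstline" column
                         lines[-1] if lines else ""))  # "lastline" column
        return rows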
***
ok, so first let's check the running-heads. these are the lines at the
top of book pages that give a shortened version of the chapter head,
the book's title, and/or the author's name, depending on the book...
some digitizers -- including distributed proofreaders -- routinely
_strip_ run-heads. some -- including d.p. in the past -- go so far
as to chop them off with a guillotine before they even scan a book!
that's crazy, because run-heads help orient you in the book while
you're digitizing it, just like they helped orient readers in the past
(which is the role that chapter-title run-heads have always played).
furthermore, especially when the run-heads contain pagenumbers,
an overview of them alerts you to missing or out-of-order pages...
one downside of run-heads is that they are _frequently_ recognized
quite badly by o.c.r. programs. i'm not sure of the reason for this --
it could be that their typically all-upper-case nature makes o.c.r. difficult,
or perhaps it's because they are frequently printed in a font that is
different from the body-text, or maybe they are kerned weirdly, but
whatever it is, the fact is that run-heads often produce shitty o.c.r.
fortunately, though, since there's so much _repetition_ in run-heads,
it's relatively simple to write routines that will do the necessary correction
across a range of them (e.g., all the run-heads for a certain chapter),
which makes this particular task go a _lot_ faster.
in the current book, however, we didn't need any of those routines,
because we got some quite fantastic recognition on the run-heads.
my program is able to locate any deviant run-heads, which it does by
comparing each run-head to its neighbors to see if they're identical,
and listing any anomalies so that they can be examined more closely.
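in python sketch-form, that detection is about this simple
(again, the names are mine, for illustration):

    def find_deviant_runheads(heads):
        # heads is a list of (filename, runhead) pairs, in page order.
        # a run-head normally repeats across a whole chapter, so any
        # head that matches neither neighbor deserves a closer look.
        # (for a book whose verso and recto run-heads differ, you'd
        # compare against the neighbors _two_ pages away instead.)
        bad = []
        for i, (name, head) in enumerate(heads):
            prev = heads[i - 1][1] if i > 0 else None
            nxt = heads[i + 1][1] if i + 1 < len(heads) else None
            if head != prev and head != nxt:
                bad.append((name, head))
        return bad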
so all i have to do is click a button, and up pops a list of bad run-heads.
these are the running heads that were irregular in the current book:
> umabiep001.txt----------BOOKS AND CULTURE
> umabiep002.txt----------@@[redacted]
> umabiep003.txt----------To
> umabiep004.txt----------[This page does not contain...
> umabiep117.txt----------Personaflty.
> umabiep134.txt----------The Logic of Free Life@[redacted]
> umabiep156.txt----------&eadth of Life.
> umabiep162.txt----------Ereadth of Life.
> umabiep207.txt----------The Unconscious Element,
> umabiep229.txt----------Ch@[redacted] XX.
> umabiep241.txt----------Culture tnrough Action.
> umabiep252.txt----------rhe Interpretation of Idealism.
> umabiep256.txt----------The Interpretatfon of Idealism.
> umabiep278.txt----------cation of a few formative ideas to life;
the first four are because we need to provide a dummy run-head
for the frontmatter. the rest are simple o.c.r. errors, easily fixed,
_except_ for the last one, where the o.c.r. app simply _missed_
the run-head, which is a mistake that happens not infrequently.
yet still, compared to most books, recognition of the run-heads
in this book was _very_ accurate, even approaching phenomenal,
and as a result it took me a mere 2 minutes to finish this task...
task3: fixing all the running heads -- 2 minutes -- total=9 minutes
***
next we're gonna look at the last lines, which on most pages
hold the _pagenumber_, there at the bottom of the p-book pages...
also remember that each row of the listbox includes the _name_
of the file (which, as noted above, contains the pagenumber in it).
therefore, we can _cross-check_ these two numbers. if there is
a discrepancy, then we can determine what is causing the glitch,
and fix it. it'll usually be an error in the o.c.r. of the pagenumber.
now, it would be silly to do this cross-checking ourselves, because
this is precisely the kind of task at which computers are excellent.
so of course i programmed a routine into my app to do just that.
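the heart of it, as a rough python sketch that reuses the rows
from the startup sketch above (the 8-character threshold here
is illustrative, not gospel):

    import re

    def crosscheck_pagenumbers(rows):
        # rows is the (filename, firstline, lastline) list from startup.
        # the filename says what the pagenumber _should_ be; the last
        # line says what the o.c.r. _thinks_ it is. flag any mismatch.
        problems = []
        for name, _first, last in rows:
            m = re.search(r"(\d{3})", name)    # umabiep013.txt -> "013"
            if not m:
                continue
            expected = m.group(1).lstrip("0")  # "013" -> "13"
            if len(last) > 8:                  # a long line of text means
                problems.append((name, m.group(1), "???"))  # no pagenumber
            elif last != expected:
                problems.append((name, m.group(1), last))
        return problems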
now, i just click a button and my clean-up app informs me of these:
> umabiep010.txt ----- 010 ----- I0
> umabiep011.txt ----- 011 ----- ???
> umabiep013.txt ----- 013 ----- 3
> umabiep014.txt ----- 014 ----- 4
> umabiep015.txt ----- 015 ----- 5
> umabiep016.txt ----- 016 ----- i6
> umabiep017.txt ----- 017 ----- 2
> umabiep018.txt ----- 018 ----- i8
> umabiep019.txt ----- 019 ----- L9
> umabiep020.txt ----- 020 ----- *0
> umabiep022.txt ----- 022 ----- ???
> umabiep031.txt ----- 031 ----- 3'
> umabiep033.txt ----- 033 ----- 3 33
> umabiep034.txt ----- 034 ----- 3+
> umabiep041.txt ----- 041 ----- 4'
> umabiep047.txt ----- 047 ----- ???
> umabiep049.txt ----- 049 ----- 4 4@[redacted]
> umabiep050.txt ----- 050 ----- ???
> umabiep051.txt ----- 051 ----- 5'
> umabiep056.txt ----- 056 ----- ???
> umabiep058.txt ----- 058 ----- ???
> umabiep060.txt ----- 060 ----- 6o
> umabiep061.txt ----- 061 ----- 6z
> umabiep062.txt ----- 062 ----- ???
> umabiep065.txt ----- 065 ----- s 6@[redacted]
> umabiep067.txt ----- 067 ----- 6@[redacted]
> umabiep071.txt ----- 071 ----- 7'
> umabiep076.txt ----- 076 ----- ???
> umabiep080.txt ----- 080 ----- 8o
> umabiep081.txt ----- 081 ----- 6 8@[redacted]
> umabiep083.txt ----- 083 ----- 8@[redacted]
> umabiep084.txt ----- 084 ----- b4
> umabiep085.txt ----- 085 ----- 8@[redacted]
> umabiep091.txt ----- 091 ----- 9'
> umabiep097.txt ----- 097 ----- 7 97
> umabiep100.txt ----- 100 ----- I00
> umabiep101.txt ----- 101 ----- ???
> umabiep102.txt ----- 102 ----- T02
> umabiep106.txt ----- 106 ----- zo6
> umabiep108.txt ----- 108 ----- io8
> umabiep110.txt ----- 110 ----- hO
> umabiep111.txt ----- 111 ----- III
> umabiep113.txt ----- 113 ----- 8
> umabiep114.txt ----- 114 ----- I 14
> umabiep115.txt ----- 115 ----- 5
> umabiep116.txt ----- 116 ----- ii6
> umabiep117.txt ----- 117 ----- 7
> umabiep118.txt ----- 118 ----- i i8
> umabiep129.txt ----- 129 ----- 9 129
> umabiep130.txt ----- 130 ----- 3
> umabiep131.txt ----- 131 ----- ???
> umabiep133.txt ----- 133 ----- 33
> umabiep135.txt ----- 135 ----- 35
> umabiep137.txt ----- 137 ----- 37
> umabiep139.txt ----- 139 ----- 39
> umabiep140.txt ----- 140 ----- 4
> umabiep143.txt ----- 143 ----- 43
> umabiep144.txt ----- 144 ----- 44
> umabiep145.txt ----- 145 ----- 10 145
> umabiep147.txt ----- 147 ----- 47
> umabiep149.txt ----- 149 ----- 49
> umabiep150.txt ----- 150 ----- so
> umabiep151.txt ----- 151 ----- Is'
> umabiep153.txt ----- 153 ----- 53
> umabiep154.txt ----- 154 ----- 54
> umabiep155.txt ----- 155 ----- 53
> umabiep156.txt ----- 156 ----- ???
> umabiep157.txt ----- 157 ----- 57
> umabiep158.txt ----- 158 ----- ???
> umabiep159.txt ----- 159 ----- 59
> umabiep160.txt ----- 160 ----- i6o
> umabiep161.txt ----- 161 ----- II i6i
> umabiep165.txt ----- 165 ----- i6@[redacted]
> umabiep166.txt ----- 166 ----- i66
> umabiep168.txt ----- 168 ----- i68
> umabiep171.txt ----- 171 ----- 7'
> umabiep173.txt ----- 173 ----- 73
> umabiep174.txt ----- 174 ----- 74
> umabiep175.txt ----- 175 ----- 75
> umabiep177.txt ----- 177 ----- 12 .177
> umabiep179.txt ----- 179 ----- 79
> umabiep180.txt ----- 180 ----- i8o
> umabiep181.txt ----- 181 ----- i8z
> umabiep185.txt ----- 185 ----- ???
> umabiep186.txt ----- 186 ----- i86
> umabiep187.txt ----- 187 ----- i8@[redacted]
> umabiep188.txt ----- 188 ----- i88
> umabiep191.txt ----- 191 ----- 9'
> umabiep193.txt ----- 193 ----- 3 193
> umabiep194.txt ----- 194 ----- nan - 1q4
> umabiep195.txt ----- 195 ----- 95
> umabiep197.txt ----- 197 ----- 97
> umabiep199.txt ----- 199 ----- 99
> umabiep209.txt ----- 209 ----- 4 209
> umabiep218.txt ----- 218 ----- @[redacted]
> umabiep241.txt ----- 241 ----- i6 241
> umabiep249.txt ----- 249 ----- P19
> umabiep257.txt ----- 257 ----- 7 257
> umabiep273.txt ----- 273 ----- i8 273
the "???" entries indicate that the last line of the file was a long
line of text, usually meaning the o.c.r. picked up no pagenumber.
these all look like simple recognition errors in the pagenumbers,
the kind that are quite understandable from an o.c.r. perspective,
with "1" mistaken as a lower-case "l", and zero for the letter "o".
we could edit each of these individually, although there are a lot.
but again, why do something manually that the machine can do?
so another routine i built into the app _auto-erases_ any
"looks-roughly-like-the-right-one" pagenumbers, like those
listed here, and replaces 'em with the _expected_ pagenumber,
given its filename and the pagenumbers of its neighboring files.
in addition, this routine handles the pages that simply don't have
a pagenumber on them. frontmatter pages are often like this, as
are chapter-heading pages. and sometimes an o.c.r. program just
plain misses the pagenumber, even though it's printed right there.
therefore, in all of these cases, my program simply "fills in" missing
pagenumbers, continuing the sequence from correct pagenumbers,
as verified by the pagenumber that is reflected in the actual filename.
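as a rough python sketch, that erase-and-fill-in logic might look
like this (same illustrative names and threshold as before):

    import os, re

    def fill_in_pagenumber(path, text):
        # overwrite whatever the o.c.r. left on the pagenumber line
        # with the number the filename says the page should be --
        # assuming, as in this book, bottom-of-page numbers and
        # trustworthy, sequential filenames.
        m = re.search(r"(\d{3})", os.path.basename(path))
        expected = str(int(m.group(1)))        # "013" -> "13"
        lines = text.rstrip("\n").split("\n")
        if lines[-1].strip() == expected:
            return text                        # already correct
        if len(lines[-1].strip()) <= 8:        # short junk: replace it
            lines[-1] = expected
        else:                                  # missing entirely: add it
            lines.append(expected)
        return "\n".join(lines) + "\n"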
some "fill-in" routines that i've written in the past have been
so sophisticated that they were even able to account for any
unnumbered image-plates that popped up in the sequencing.
no such complexity was needed for this book, however, since,
as you saw, most of the o.c.r. pagenumbers were accurately recognized,
and the rest were simple and straightforward for the app to "fill in"...
so i click the button that basically fixes all of these, automatically.
i spent about 50 seconds looking at this list, which left me with
10 whole seconds to click the button and _still_ finish in 1 minute.
then i have the app check again, for any still-remaining anomalies,
and it tells me that there is a glitch somewhere around pages 72-73.
specifically, there are two files where the o.c.r. reports pagenumber 73:
the file named umabiep072.txt and the one named umabiep073.txt.
(both were solid numbers in the o.c.r., so no fill-in was performed.)
this could be bad news.
it often means the scanner repeated one page (i.e., page 73),
accidentally, and -- since we do have the right number of files --
that might well mean that they also missed page 72 altogether.
and i'm maybe a tad bit worried, because if the scan-set is really
missing a page, it's usually a _huge_ hassle to have to replace it.
(ok, i'm not _really_ worried, because like i said, i already have
this digital text finished as a highly-polished electronic-book.
but i _would_ be worried if i was really doing this as a project.
after all, i want to _finish_ this digitization in _one_hour_, and
a trip to the library and back takes that much time all by itself.
plus, as mentioned yesterday, there actually _was_ such a miss
on this scan-set when it was originally posted, and the woman
who harvested the scan-set for distributed proofreaders had to
scramble to get the page replaced. and if i remember correctly,
john ockerbloom also went through the hassle of replacing it.)
at this time, however, it seems most everything is correct here.
the scan for page 72 is there. and so is the scan for page 73.
furthermore, the o.c.r. text for both pages seems to be correct;
_except_, that is, the pagenumber of page 72, which o.c.r. has
mistakenly recognized as "73". so we dodged a big bullet here.
it just takes me a second to edit the "73" to the correct "72", and
we don't have to make a trip to the library to try to find the book
and -- if we're _lucky_ and they have a copy -- make a xerox of
the missing page to bring home to lay on the scanner, so that...
...well, you get the picture...
it's a big pain in the ass if a scan-set is missing even one page.
anyway, thanks to my super-duper clean-up app doing the work,
i've managed to clean up all the pagenumbers in just 2 minutes.
woo-hoo!
task4: fixing internal pagenumbers -- 2 minutes -- total=11 minutes
***
thus far, this book has been very easy for us.
but our luck is about to change...
***
the next step is to page through the entire scan-set, checking the
individual pages for any weirdnesses that might reveal themselves.
so here we go...
...and right off the bat, we see that the paragraphing has been lost.
there's no empty line between paragraphs, and no indentation either,
so every page looks like one big paragraph, which is _so_ incorrect...
it's possible to save o.c.r. output in a way that retains paragraphs,
and that's what you _should_ do. but it's not what was done here.
this isn't a show-stopper for me, as i have written routines that will
analyze the text and then reintroduce (almost all) the paragraphs...
(text copied out of a .pdf loses all its paragraphing as well, so this
is a problem i have faced lots and lots and lots of times previously.)
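the core heuristic of that reparagraphing routine is roughly the
following python sketch. (the 0.75 threshold is just an illustration,
not my actual tuning.)

    def reparagraph(lines):
        # guess where the lost paragraph-breaks go: a line that ends
        # a sentence _and_ is noticeably shorter than the average line
        # probably ends a paragraph. crude, but it catches almost all
        # of 'em, and the page-by-page review catches the rest.
        avg = sum(len(ln) for ln in lines) / max(len(lines), 1)
        out = []
        for i, ln in enumerate(lines):
            out.append(ln)
            if (i + 1 < len(lines) and len(ln) < 0.75 * avg
                    and ln.rstrip().endswith((".", "!", "?"))):
                out.append("")                 # empty line = new paragraph
        return out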
but it's stupid to have to run this routine when the problem can be
avoided in the first place. thus, my next suggestion for umichigan:
suggestion3: have your o.c.r. program retain the paragraphing!
still, i've got a workaround for this problem, so i'll keep on working.
as i page through the entire book, the reparagraphing is done on
each file as it gets pulled in, so all i have to do is remember to save
the file when any paragraphing is added, and check each page-scan
to discern the cases where my routine failed to find a new paragraph.
since i eventually have to step through each page of the book anyway,
doing a visual scan to make sure that all is well, this is not a big deal...
***
so back to the page-by-page review...
one of the options that my program gives is to highlight any words
that are not present in its dictionary. just a few pages into the book
-- page 9, to be exact -- is an ugly sight that is _rife_ with highlights.
> http://www.greatamericannovel.com/mabie/page9typos.jpg
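(the check itself is nothing fancy -- roughly this, in python,
with the dictionary being a plain set of lowercase words:)

    import re

    def unknown_words(text, dictionary):
        # return the words on a page that aren't in the dictionary
        # (a plain set of lowercase words), so the interface can
        # highlight them for inspection.
        words = re.findall(r"[a-zA-Z']+", text)
        return sorted({w for w in words
                       if w.lower().strip("'") not in dictionary})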
the problem reveals itself immediately: the o.c.r. results have lost
almost all hyphens on words that were broken at the end of a line;
sure enough, this pattern repeats itself on other pages in the book.
bad news, dude.
this problem is bad enough that most people should abandon this book
and wait until google/michigan puts up a corrected copy. (a look at
google's copy confirms that their database has these words reunited
correctly, which means that _their_ copy must have had the hyphens)
-- but instead i chose to accept the challenge, and to write a routine
that would have my program insert the hyphens wherever required...
after all, the pseudo-code was very easy to write: "if there is a word
at the end of a line that is not in the dictionary, _and_ there is a word
at the start of the next line that is not in the dictionary, and the result
of joining these two words produces a word that _is_ in the dictionary,
then insert a hyphen at the end of the top line." that pseudo-code
would later be refined slightly (with the _and_ changed to an _or_),
but the results were very encouraging, in the sense that the routine
operates very quickly and rarely gives us erroneous results, even if
it sometimes (inevitably) fails to do the right thing in the edge case
where an end-of-line hyphenate was broken such that both pieces
are words that are in the dictionary. (one such example is "be-came".)
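in python, the refined version comes out to something like this
sketch, with the dictionary again a plain set of lowercase words:

    def insert_missing_hyphens(lines, dictionary):
        # the pseudo-code from above, with the refinement applied:
        # if the word at the end of a line _or_ the word at the start
        # of the next line is not in the dictionary, and joining them
        # makes a word that _is_, put the hyphen back. (note that a
        # case like "be-came", where both pieces are words, still
        # slips through, exactly as described above.)
        for i in range(len(lines) - 1):
            top, bottom = lines[i].split(), lines[i + 1].split()
            if not top or not bottom:
                continue
            tail = top[-1].lower()
            head = bottom[0].lower().strip(".,;:!?")
            if ((tail not in dictionary or head not in dictionary)
                    and tail + head in dictionary):
                lines[i] = lines[i].rstrip() + "-"
        return lines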
so again, we have dodged a major bullet with a fairly robust solution.
nonetheless, it means we have yet another suggestion for umichigan:
suggestion4: make sure your o.c.r. program picks up hyphens!
frankly, i am dismayed that these two hiccups managed to bypass the
quality-control system that umichigan should have in place, and i am
uncertain how that happened. but this _was_ a book done early on.
so maybe they've fixed their workflow since. let's hope that's the case.
at any rate, like the paragraphing, i have solved the hyphen problem...
task5: finding some minor problems -- 1 minute -- total=12 minutes
***
...but my joy that my routines worked sufficiently well was short-lived.
that was because i soon found a larger problem with these files, one that
cannot be so easily programmed around, namely that the o.c.r. results
also dropped all the quotation-marks and em-dashes from the book!
as i mentioned, this is a book that many people have already digitized,
and to my knowledge no one has had any trouble with the em-dashes
or the quotation marks. indeed, my o.c.r. program sees them just fine.
and google's database seems to include them too. so i'm guessing that
these characters were encoded as upper-ascii, perhaps even utf-8, and
somebody at umichigan very unwisely converted the files to lower-ascii,
thereby losing all of those characters. "what were they thinking?", i ask...
this problem _is_ a showstopper. it's not feasible to think about writing
a routine that would reintroduce the quotemarks. not only would it be
tremendously difficult to do, but there would be no use for it anywhere else.
you know the drill -- another suggestion:
suggestion5: ensure your workflow doesn't irreparably damage the text!
task6: finding the one major problem -- 1 minute -- total=13 minutes
***
but even though the results won't be usable, i will soldier on anyway,
and ignore the fact that i'm missing the quotemarks and em-dashes,
and continue with this book to see how long it takes me to finish it...
after all, i'll be able to repeat the process with the correct files later.
and remember, this is just an exercise anyway. i have this book
digitized already, so it's not as if i _need_ those quotemarks at all.
i can even still use the polished file i already have as my criterion,
simply by deleting the quotemarks and em-dashes from it first...
so i will indeed continue on.
but i'm frustrated enough by the incompetence thus far that i'm
gonna take the rest of the day off. be back tomorrow for more.
***
so let's review how much time i've used so far, doing what:
task1: pre-scraping examination -- 5 minutes -- total=5 minutes
task2: scraping the o.c.r. text-files -- 2 minutes -- total=7 minutes
task3: fixing all the running heads -- 2 minutes -- total=9 minutes
task4: fixing internal pagenumbers -- 2 minutes -- total=11 minutes
task5: finding some minor problems -- 1 minute -- total=12 minutes
task6: finding the one major problem -- 1 minute -- total=13 minutes
***
and let's recap my suggestions to the university of michigan:
suggestion1: let people download the full text in one fell swoop.
suggestion2: in the name of your files, include their pagenumber.
suggestion3: have your o.c.r. program retain the paragraphing!
suggestion4: make sure your o.c.r. program picks up hyphens!
suggestion5: ensure your workflow doesn't irreparably damage the text!
-bowerbird