feedback to umichigan on "books and culture", part 1
- From: Bowerbird@[redacted]
- Subject: feedback to umichigan on "books and culture", part 1
- Date: Tue, 26 Sep 2006 16:33:27 EDT
as i said last week, _big_ thanks to the university of michigan,
which is making o.c.r. results from the google scanning project
available to the public. this digital text means we can create
high-powered e-books -- with search/copy/resize utilities --
instead of just looking at the images of the scans of the pages.
as michael hart is fond of saying, a picture of a book is not a book.
so it's very nice of umichigan to give us so many _books_.
thus the issue now becomes _transforming_ their o.c.r. output files
-- one for each scanned page -- into a coherent electronic-book...
most of you are well aware that o.c.r. doesn't give us perfect text,
so we need to clean it up. and after cleaning, we have to _format_
the text so that it can be considered to be an _electronic-book_...
some people might want google or umichigan to do this work for us.
heck, i'd like it a lot if they did. but i _ain't_ gonna sit around and wait;
i'm gonna jump in and do the job myself. it's better than bellyaching.
furthermore, i've written software that can help _you_ do the job too.
and since i've gone through the procedure lots and lots of times now,
i have experience doing it the wrong way, and some clues about
how to do it right, so i'll be happy to share that experience with you.
to do that, i'll walk you through the process for one book.
specifically, i have scraped the o.c.r. output files from michigan
for a book -- "books and culture", by hamilton wright mabie --
and i'll be reporting on the process of transforming those files
into a finished e-book, formatted in my "zen markup language".
i'll treat this in "clean-room" fashion, starting from scratch, but
this particular book is one i've previously worked to conclusion;
this means that i have a very polished file to use as my criterion
to test how well i've performed in this do-it-from-scratch test...
(that's an important consideration. otherwise you don't know if
what you did is all right, or whether there are gaping holes in it.
i also have the luxury of knowing jose menendez did this book,
and he's a very careful and observant guy, so i have the _utmost_
confidence that my criterion-file is as accurate as possible.)
so here we go...
but first, although i didn't worry too much about scraping the text
for this one book, i might start to fret if i were planning on scraping
hundreds of books, because it's not clear what umichigan's policy
is toward scraping. it would be nice if they would say publicly that
"scraping is ok", but it's probably not reasonable to expect that...
and if scraping is totally unacceptable to them, then i would hope
that umichigan will come right out and say it clearly, so we know.
without a clear policy, i think it's unfair to ban users for scraping.
and -- just to be completely thorough -- it would be _very_nice_
if umichigan would just let us download the whole of the text in
one piece, rather than scraping each page individually. please?
suggestion1: let people download the full text in one fell swoop.
i'm rather hopeful that umichigan will implement this suggestion.
after all, google started off by making us scrape each page-scan,
one-by-one, and made anti-scraping noises, but before too long,
they bundled them all into a .pdf and let us download the full set
by simply clicking a button. the situation here is _fairly_ parallel.
(the distinction, of course, is that we would not expect google to
release the text, since in so doing they would be handing it over
to their search-engine competitors. however, umichigan is not in
the search-engine business, so they do not have that constraint.
indeed, since they _are_ in the knowledge-distribution business,
it is within their mission to make the text available, which should
bolster their position if google should try to dissuade them at all.)
public-domain books _belong_ to the public. treat them as such.
but for now, it's not the umichigan policy to hand us all the text,
so let's get to scraping each text-file. it's really not all that hard.
(indeed, you'll see i've made it not much more than a button-click.)
once again, the book i'm working on is called "books and culture".
you can see the umichigan copy for yourself by going to this url:
> http://mdp.lib.umich.edu/cgi/pt?id=39015016881628
or, to see the first page of the first chapter, go here:
> http://mdp.lib.umich.edu/cgi/m/mdp/pt?seq=13&id=39015016881628
before scraping, it's a good idea to see what you're getting into, so
the first course of action is to prod and poke around the scan-set...
i _refuse_ to work with badly-named files (e.g., files with names
that do not reflect the pagenumber of the page they represent),
so one of the first things i do is examine the filenaming structure.
i'll talk about filenaming conventions later, but for this moment,
the important thing to know about "books and culture" is that
the first page of the first chapter has the pagenumber of _7_:
> http://mdp.lib.umich.edu/cgi/m/mdp/pt?seq=13&id=39015016881628
(although one might think that paper-books would start on "page 1",
the fact is that a good many books do not. and this is one of them...)
the filename -- which in online books is essentially the u.r.l. --
almost always has a number in it, which is the important thing.
as you might guess, i say that number should be the pagenumber.
but, as you see, the file-sequence number in the u.r.l. is _13_,
while the _pagenumber_ on the scan from the p-book is _7_.
oops.
bad news.
so, as we knew from previous posts, umichigan fails this test;
their files are badly-named. my next suggestion, therefore, is
for them to fix this (terrible) flaw in their work-flow process...
suggestion2: in the name of your files, include their pagenumber.
it's worth noting that umichigan might well just receive the files
-- with their names -- from google, so google is the real culprit,
but i'm giving advice to umichigan now, and not google per se...
still, it would be great if google named things correctly, because
-- in the long run -- it is best for _everyone_involved_ if we all
_use_the_same_names_for_the_same_files._ pay attention, guys.
(it's interesting to note that google itself, in their interface that
presents the scans, actually corrects the discrepancy, so a person
can just enter the actual pagenumber that they want to jump to.
google even embeds that pagenumber into the url of each page,
so that someone looking at the url can _know_ the pagenumber
to which it points. and this is what umichigan needs to do too.
to do anything otherwise is to invite stupid needless confusion.)
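(just to make that offset arithmetic concrete, here's a tiny python sketch
of the mapping i'm describing. the url pattern is lifted straight from the
links in this post; the offset of 6 is specific to "books and culture" and
will be different for other scan-sets.)
> BASE = "http://mdp.lib.umich.edu/cgi/m/mdp/pt"
> BOOK_ID = "39015016881628"
> OFFSET = 6   # sequence-number minus pagenumber, for this scan-set only
>
> def url_for_page(pagenumber):
>     # build the umichigan url for a given p-book pagenumber
>     seq = pagenumber + OFFSET
>     return f"{BASE}?seq={seq}&id={BOOK_ID}"
>
> print(url_for_page(7))   # page 7 maps to seq=13, as in the url above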
all right, back to the poking-and-prodding.
we're lucky that this book starts on page 7, because that means
we can have _6_ pages of front-matter without being forced to
employ more than one sequence of pagenumbers, which means
you're spared that lesson for a later day, thank your lucky stars...
diagramming it, it will look like this:
> page 001 -- front matter
> page 002 -- front matter
> page 003 -- front matter
> page 004 -- front matter
> page 005 -- front matter
> page 006 -- front matter
> page 007 -- first page of the first chapter
> page 008 -- ...book interior continues...
...
> page 279 -- ...last page of the last chapter
> page 280 -- ...verso side of "the last page"
in looking at the front-matter pages, again we're in luck, because
the 6 pages that i would take are the 6 pages immediately prior to
page 7 (the first page), so we can scrape the files in a single sweep.
(often you'll find you have to pick-and-choose front-matter pages,
which means you must download them separately, a slight hassle.)
next, as we noted up above, page _7_ (first-page-of-first-chapter)
has the "sequence-number" of _13_ in its u.r.l., so we have to fix that.
as i said, i refuse to work with badly-named files like this, so i have
my scraper-program (which i wrote) rename the files _immediately_
on saving them to my hard-disk -- when that's possible, which isn't always --
so i _never_ have any of those smelly names, _ever_, on my hard-disk.
but what needs to be checked is whether the number/name _offset_
is the same throughout the entire scan-set. at the start, the offset is
6 (13-7), so we need to check what it is at the _end_ of the scan-set.
going to the last page of the book -- which i determine to be 280 --
> http://mdp.lib.umich.edu/cgi/m/mdp/pt?seq=286&id=39015016881628&view=text
(actually, 279 was the last page with _text_ on it. but a recto page
can't be "the last page" in a book, because it _must_ have a verso.)
we see that the offset is still 6. that's good, since a consistent offset
means we can do a uniform rename of the files, which is the easiest...
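(here's what i mean by a "uniform rename", as a small python sketch.
it assumes you already saved the files under their raw sequence-numbers;
the "seq###.txt" names are just a made-up example, not anything that
umichigan actually serves, so adjust the patterns to your own files.)
> import os
>
> OFFSET = 6   # consistent from the start (13-7) to the end (286-280)
>
> for seq in range(13, 287):
>     old_name = f"seq{seq:03d}.txt"               # hypothetical raw name
>     new_name = f"umabie{seq - OFFSET:03d}.txt"   # pagenumber = seq - offset
>     if os.path.exists(old_name):
>         os.rename(old_name, new_name)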
if there were any unnumbered image-plates in the book, they would
change the offset mid-book, and we'd need to scrape in sections, but
there aren't -- as evidenced by the consistent offset -- so it's easier...
another thing to check for is _skipped_scans_. (whether they were
skipped accidentally or intentionally is immaterial, they're missing.)
missing scans have the opposite effect of unnumbered image-plates,
in that they cause the offset to _decrease_ in the middle of the book...
but again here, since our offset was consistent from start to finish,
the indication here is that there were no pages that were skipped...
it's possible (and i've seen it happen) that some pages can be skipped
(accidentally) and others are duplicated, so the offset stays consistent
even though some of the internal pages are wrong. but it's too hard
to check that online, so what i do is, after i've scraped all of the files,
to go through the entire scan-set checking the pagenumbers to see
if any pages were duplicated or missed. more on that process later.
(but i do that _offline_ rather than _online_ because offline is faster.)
(for the record, when i said "i've seen it happen" up above, i meant
"i saw it happen on this exact book, when google first posted it".
so yeah, glitches do happen. after all, there are humans involved.)
for now, while i'm still just poking around the online files, what i do
is check the offset every 50 pages or so, to see whether it changes.
if it does, then i know something weird has happened in the interim.
this is a way to very quickly check for any missing or duplicated pages
and to make yourself aware of any unnumbered image-plate pages...
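(if you want to make that spot-check a bit less tedious, a few lines of
python will print the url for every 50th page; comparing the printed
pagenumber on each scan against seq-minus-offset is still done by eye.)
> OFFSET = 6
> LAST_PAGE = 280
>
> for page in range(50, LAST_PAGE + 1, 50):
>     seq = page + OFFSET   # if the offset has drifted, the page won't match
>     print(f"page {page} -> http://mdp.lib.umich.edu/cgi/m/mdp/pt?seq={seq}&id=39015016881628")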
however, as i said, this book was pretty sweet; the offset was consistent
throughout the book, so i was now ready to scrape the o.c.r. output files,
having my app rename the files so their names reflected their numbers...
all in all, this pre-scraping examination only took me about 5 minutes.
(it took me a lot longer than that to write it up, and it probably took you
5+ minutes just to read it. but now you know what to look for next time.)
task1: pre-scraping examination -- 5 minutes -- total=5 minutes
here's a screenshot of my scraper-program, so you can see what it's like.
> http://www.greatamericannovel.com/mabie/scraperscreenshot.jpg
to run my scraper-program, i just have to enter the "base" u.r.l.,
tell the program how many pages to download, and what offset
to use when renaming the files while saving them. the app then
does the job while i go out and smoke some pot. easy enough.
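(i'm not posting my own scraper's code, but here's a rough python sketch
of the same idea, so you can see how little there is to it. the url pattern
is the one shown above; the delay, the page-range, and the "umabie" names
are just my choices for this particular book.)
> import time
> import urllib.request
>
> BOOK_ID = "39015016881628"
> OFFSET = 6            # seq = pagenumber + 6 for this particular book
> FIRST, LAST = 1, 280
> DELAY = 5             # seconds between pages, to be nice to the servers
>
> for page in range(FIRST, LAST + 1):
>     seq = page + OFFSET
>     url = f"http://mdp.lib.umich.edu/cgi/m/mdp/pt?seq={seq}&id={BOOK_ID}&view=text"
>     data = urllib.request.urlopen(url).read()
>     # saved under its _pagenumber_, not its sequence-number; pulling the
>     # o.c.r. text out of any surrounding html is a separate, later step.
>     with open(f"umabie{page:03d}.txt", "wb") as f:
>         f.write(data)
>     time.sleep(DELAY)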
2 minutes to enter the data, and check for successful completion,
so i'm up to about 7 minutes on this.
task2: scraping the o.c.r. text-files -- 2 minutes -- total=7 minutes
actual downloading took about 10-30 minutes (not sure, i was high,
plus i put a delay between pages, to be nice to umichigan's servers).
speed will depend on your pipe's fatness, of course, but these are
text-files, so it's gonna be relatively fast even if you are on dialup.
i already had the scans, so i didn't bother to scrape those, and
my image-files were named correctly, so all names were in sync:
> umabie001.jpg
> umabie001.txt
> umabie002.jpg
> umabie002.txt
...
> umabie280.jpg
> umabie280.txt
since the image-files are pretty big, downloading 'em takes a while
if you are on dialup. still, it's a matter of clicking a button and then
letting the computer do the work, so you can click before bed and
get up in the morning to a full book's worth of scans. easy enough...
***
now all the files are on my hard-drive, so i'll call it a day for this post.
maybe go back outside and smoke some more pot... ;+)
more on this process tomorrow, when we start the _cleaning_
of the text, which is the fun part, where i ask the questions like
"what the heck's going on here?" and "what were they thinking?"
***
so let's review the time spent so far:
task1: pre-scraping examination -- 5 minutes -- total=5 minutes
task2: scraping the o.c.r. text-files -- 2 minutes -- total=7 minutes
just so you know, my goal is to digitize this book in 1 hour.
i don't know if i can do it. some people think it's impossible.
we'll see. as i'm fond of saying, the proof is in the pudding...
***
and to recap today's suggestions:
suggestion1: let people download the full text in one fell swoop.
suggestion2: in the name of your files, include their pagenumber.
-bowerbird