another year here on bookpeople
- From: Bowerbird@[redacted]
- Subject: another year here on bookpeople
- Date: Fri, 22 Dec 2006 13:24:46 EST
well, gee, another year wrapped up here on bookpeople.
i managed to make it through _most_ of the year
without bumping up against mommy moderator.
unfortunately, even one bump is one too many...
in years past, it was my "mean" posts that got sent back.
this year, though, it was actually the exact opposite;
the only rejection that i seem to remember was due to
my unwillingness to "provide evidence" for what i said,
because i felt that to do so would be _too_ "mean"...
perry -- you remember perry?, from the umichigan? --
had said something about how robust his system was,
and i replied something to the effect that he was deluded.
i didn't give specifics. but it wasn't because i didn't have any.
on the contrary, i could've given them, with tons of examples.
so many examples that perry would have scurried right away,
thereby ending the dialog. so i couldn't see the purpose of it.
so when john kicked back the post, i just had an (unproductive)
"chat" with him backchannel, and eventually gave up on all of it.
i don't mind being "mean" if doing so serves a constructive end.
but if an action is "mean" _and_ counterproductive, why do it?
turned out that perry scurried off anyway.
so i guess it didn't matter.
except that being "moderated" again -- even that once --
made me remember that i strongly dislike "moderation"...
strongly enough that i cut back on messages for a while
-- just 4 of 'em between october 4 and november 21 --
and i don't intend to post here very much at all in 2007...
i'm not leaving entirely -- that'd make it far too easy on y'all! --
but this certainly won't be one of my main avenues of expression.
there are too many soapboxes in cyberspace to accept being stifled,
especially when you know your heart is good, your intentions pure.
anyway, have a happy new year, and a wonderful life in 2007...
-bowerbird
p.s. i suppose someone will want specifics on the perry thing now.
suffice it to say that the umichigan use of the _sequence_ number
to name the file instead of the _page_ number is badly misguided;
the future will provide many ramifications of that incorrect system.
but that's just the beginning of their problems...
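
a rough sketch of why the sequence number is a shaky thing to name by, using only the page/sequence pairs that show up in the links further down this message. it assumes each "page N" i mention really does match the seq= value in its link, and the helper name is made up for illustration:

```python
# (collection, book id, printed page, sequence number), as paired in the links below.
# assumption: each "page N" label in this message matches the seq= value in its link.
OBSERVED = [
    ("moa", "AJH1964.0001.001", 123, 123),
    ("lincoln", "ABX9700.0001.001", 123, 127),
    ("glrr", "4737943.0001.001", 12, 15),
    ("glrr", "4737943.0001.001", 23, 26),
]


def offsets_by_book(observed):
    """Return {book id: set of (seq - page) offsets seen for that book}."""
    offsets = {}
    for _collection, idno, page, seq in observed:
        offsets.setdefault(idno, set()).add(seq - page)
    return offsets


if __name__ == "__main__":
    for idno, offs in offsets_by_book(OBSERVED).items():
        print(idno, sorted(offs))
```

if those pairs are right, the gap between sequence and page differs from book to book, so a file named by sequence alone can't tell you which printed page it holds unless you also carry a per-book lookup.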
further, their failure to give _unique_ names to each file will prove to
be the source of confusion in a multitude of foreseeable situations,
and probably a bunch of unforeseeable ones as well. perry bragged
his system "has millions of pages already, and will scale up to billions."
he seemed not to realize the stupidity of having millions of _different_
files all named as "00000000001.tif", with the only thing telling 'em
apart being the _folder-name_. ever put things in the wrong folder?
what do you do then? look at each one to determine what it really is?
a filenaming convention like that one is purely and simply _idiotic_.
it's not that hard to develop a smart filenaming system, if you just
follow the simple rules that your files should be named consistently,
with each one having a unique name, which describes its contents.
different files should _always_ have different names, without fail.
and to the greatest extent possible -- and counter-instances are
extremely rare -- the same file should always have the same name.
having 20 million files named as "00000000001.tif", with another
20 million named as "00000000002.tif", and another 20 million
as "00000000003.tif", etc., doesn't fit the bill. umichigan, ulose.
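
to make those rules concrete, here is a minimal sketch. the naming pattern (book id, then "_p", then the page label) is purely my own hypothetical, not anything umichigan or google uses, and both function names are invented for the example:

```python
def page_filename(book_id: str, page_label: str, ext: str = "tif") -> str:
    """Build a name that carries both the book and the printed page."""
    # hypothetical convention: <book id>_p<page label>.<ext>
    # numeric labels are zero-padded so they sort; roman numerals etc. pass through.
    label = page_label.zfill(4) if page_label.isdigit() else page_label
    return f"{book_id}_p{label}.{ext}"


def find_name_collisions(paths):
    """Return base names that occur under more than one folder."""
    seen = {}
    for path in paths:
        folder, _, name = path.rpartition("/")
        seen.setdefault(name, set()).add(folder)
    return {name: sorted(folders) for name, folders in seen.items() if len(folders) > 1}


if __name__ == "__main__":
    print(page_filename("39015011041731", "123"))
    print(find_name_collisions([
        "bookA/00000000001.tif",
        "bookB/00000000001.tif",
        "bookA/00000000002.tif",
    ]))
```

the second function is the check you end up needing under the folder-only scheme: every base name it reports is a file that becomes ambiguous the moment it leaves its folder.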
***
continuing, there doesn't seem to be in place a solid, reliable means of
coordination between umichigan and google regarding _corrections_.
just as one example, the .pdf for "a pair of patient lovers" which google
scanned at umichigan is _missing_its_last_2_pages_. (yes, it's true, and
it's funny too, if you think about it, after you've cried at how tragic it is.)
but in the umichigan system, the book's last 2 pages have reappeared!
which is a good thing, to be sure. still, aren't these two manifestations
supposed to be _the_same_thing_? they're listed together, one directly
above the other, right there in the umichigan catalog system, after all...
> http://mirlyn.lib.umich.edu/F/MKRJR8FK5LM2EBKMDPD5RL84F5UV8T57RE3DLYY93K5KPLI7K9-58978
(and hey, isn't _that_ a user-friendly u.r.l.?)
if you make a correction to one, shouldn't you make it to the other as well?
here's the "u-m online" ("pageturner") version:
> http://hdl.handle.net/2027/mdp.39015011041731
and here's the "google online" (pdf) version:
> http://books.google.com/books?vid=UOM39015011041731
of course, _google_ lists the u.r.l. for this same umichigan version as:
> http://books.google.com/books?vid=OCLC00647020&id=dHgTKZUFF5gC
as far as i can tell, they're the same one. same size, both missing those
2 pages. unlike the u-m one, which is complete. but i only know all that
because i looked.
is somebody going to have to _examine_ all of these various versions that are
floating around cyberspace to determine which ones are the same and/or not?
and don't get those confused with the version google scanned from harvard:
> http://books.google.com/books?vid=OCLC00647020&id=-mCjAdXl5cUC
the one from harvard _does_ contain the last 2 pages, thank goodness.
it's also 1.7 megs bigger. no, not because of those last 2 pages, silly.
because they scanned it at a larger page-size. yeah, pretty stupid to
waste disk space saving larger margins, isn't it? it'd be a lot smarter
and more efficient to crop the scans to the actual pagesize, _plus_
it would make the image-scan .pdfs ten times more pleasant to read.
and if they were deskewed too, they would become even more readable,
not to mention that then the o.c.r. would be improved _remarkably_...
if you can't tell this is a nightmare in the making, you have _zero_ experience.
(and hey, you might want to apply for a job... they're hiring people like you...)
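
on the question of whether somebody has to eyeball every version to know which ones are the same: a first pass can be automated by hashing the downloaded files. this is only a sketch, with a made-up function name, and it only catches copies that are byte-for-byte identical. two .pdfs of the same scan with different embedded metadata would still land in different groups:

```python
import hashlib


def group_by_digest(files):
    """files: {label: bytes}. Return sorted lists of labels whose bytes are identical."""
    groups = {}
    for label, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        groups.setdefault(digest, []).append(label)
    return sorted(sorted(labels) for labels in groups.values())


if __name__ == "__main__":
    # stand-in byte strings; real use would read each downloaded file's bytes.
    sample = {
        "copy-1": b"same bytes",
        "copy-2": b"same bytes",
        "copy-3": b"different bytes",
    }
    print(group_by_digest(sample))
```

anything this puts in separate groups still needs a human look, which is the point: without coordination on corrections, nothing in the files themselves tells you which one is right.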
***
and finally, perry's contention that problems in their text from google
-- missing paragraph indents, hyphens, quotes, and em-dashes --
were the fault of google was _ridiculous_on_its_face_. as i said then,
a search at google doesn't show those flaws. but i guess perry didn't
have a clue that i knew that many of those same problems are present
across the whole panoply of digitization projects there at umichigan.
or maybe perry himself doesn't know it. but it's ridiculous on its face.
yeah, yeah, i can hear you john, you want some "convincing evidence".
ok, here's a good place to start:
> http://www.hti.umich.edu/cgi/t/text/text-idx?page=browsecolls
this is a page that lists all of the various umichigan "collections".
quite an impressive set, it would seem, based on appearances, eh?
so whaddya say we wander around this great cyberspace library...
the "making of america" is one of their biggest collections.
it says it has almost 10,000 books, out of their 30,000 total.
so let's give it a whirl by clicking on it, which takes us here:
> http://www.hti.umich.edu/m/moa/
now click on the "browse moa books" button at the lower left, to go here:
> http://www.hti.umich.edu/cgi/t/text/text-idx?page=browse&cc=moa&c=moa
clicking on the first book, "abaddon, and mahanaim", by berg, will take you here:
> http://www.hti.umich.edu/cgi/t/text/text-idx?c=moa;cc=moa;view=toc;idno=AJH1964.0001.001
go down on that page and click the link to jump to page 123 (my favorite):
> http://www.hti.umich.edu/cgi/t/text/pageviewer-idx?c=moa;cc=moa;rgn=full%20text;idno=AJH1964.0001.001;didno=AJH1964.0001.001;view=image;seq=00000123
notice the italicized word in the top line?
and the paragraph break halfway down?
now use the "format" pop-up to change from image to text view:
> http://www.hti.umich.edu/cgi/t/text/pageviewer-idx?c=moa&cc=moa&idno=ajh1964.0001.001&frm=frameset&view=text&seq=123
look at the text. no italics. no paragraph break halfway down.
just all of the text run together. (even the darn running head.)
and yes, this is the same type of mess we saw with the google text.
except google didn't scan this book. the university of michigan did.
now click the "next page" button to go to page 124:
> http://www.hti.umich.edu/cgi/t/text/pageviewer-idx?c=moa;cc=moa;idno=ajh1964.0001.001;frm=frameset;view=text;seq=124;page=root;size=s
see the dash in the first line, before "that word".
now use the "format" pop-up to switch back over to image view:
> http://www.hti.umich.edu/cgi/t/text/pageviewer-idx?c=moa&cc=moa&idno=ajh1964.0001.001&frm=frameset&view=image&seq=124
the image shows it's supposed to be an em-dash, not a regular dash.
again, can't blame google for this, can we?
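
if you want to repeat the image-versus-text comparison on other pages without clicking through, the two links can be built from the same few pieces. this sketch just mirrors the shorter "&" form of the links above. the function name is invented, and it makes no claim about whether the server still answers at these addresses:

```python
from urllib.parse import urlencode

# base address copied from the links in this message
BASE = "http://www.hti.umich.edu/cgi/t/text/pageviewer-idx"


def pageviewer_url(collection, idno, seq, view):
    """Build the image-view or text-view link for one page of one book."""
    if view not in ("image", "text"):
        raise ValueError("view must be 'image' or 'text'")
    params = [
        ("c", collection),
        ("cc", collection),
        ("idno", idno),
        ("frm", "frameset"),
        ("view", view),
        ("seq", str(seq)),  # the sequence number, which is not always the printed page
    ]
    return BASE + "?" + urlencode(params)


if __name__ == "__main__":
    for view in ("image", "text"):
        print(pageviewer_url("moa", "ajh1964.0001.001", 124, view))
```

open the two results side by side and compare the dashes, italics, and paragraph breaks yourself.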
***
ok, well let's try another collection, give 'em a fair chance.
so we'll start here at the top again:
> http://www.hti.umich.edu/cgi/t/text/text-idx?page=browsecolls
let's say we choose the third one under "19th century america",
the "collected works of abraham lincoln". click that to go here:
> http://www.hti.umich.edu/l/lincoln/
next click on "browse":
> http://www.hti.umich.edu/cgi/t/text/text-idx?page=browse&c=lincoln
again, just take the first one, by p.a. hanaford:
> http://www.hti.umich.edu/cgi/t/text/text-idx?c=lincoln;cc=lincoln;view=toc;idno=ABX9700.0001.001
go look at the image of page 123:
> http://www.hti.umich.edu/cgi/t/text/pageviewer-idx?c=lincoln;cc=lincoln;rgn=full%20text;idno=ABX9700.0001.001;didno=ABX9700.0001.001;view=image;seq=00000127
got it?
ok, now look at the text:
> http://www.hti.umich.edu/cgi/t/text/pageviewer-idx?c=lincoln&cc=lincoln&idno=abx9700.0001.001&frm=frameset&view=text&seq=127
again, all run together. and thus totally useless. thanks again, michigan.
sheesh!
***
ok, one more. back to the top:
> http://www.hti.umich.edu/cgi/t/text/text-idx?page=browsecolls
under "science and technology", try the "great lakes digital library":
> http://www.hti.umich.edu/g/glrr/
click on "browse":
> http://www.hti.umich.edu/cgi/t/text/text-idx?page=browse&c=glrr
take the first one, by robertson:
> http://www.hti.umich.edu/cgi/t/text/text-idx?c=glrr;cc=glrr;view=toc;idno=4737943.0001.001
only 43 pages in this one, so try page 12:
> http://www.hti.umich.edu/cgi/t/text/pageviewer-idx?c=glrr;cc=glrr;rgn=full%20text;idno=4737943.0001.001;didno=4737943.0001.001;view=image;seq=00000015
it's a table. let's switch over and see what the text looks like:
> http://www.hti.umich.edu/cgi/t/text/pageviewer-idx?c=glrr&cc=glrr&idno=4737943.0001.001&frm=frameset&view=text&seq=15
oh oh. when you run a table all together, it's _especially_ bad.
what you could call "spectacularly uninformative and hard to fix".
ok, well then let's go back to the index:
> http://www.hti.umich.edu/cgi/t/text/text-idx?c=glrr;cc=glrr;view=toc;idno=4737943.0001.001
and now we'll try page 23:
> http://www.hti.umich.edu/cgi/t/text/pageviewer-idx?c=glrr;cc=glrr;rgn=full%20text;idno=4737943.0001.001;didno=4737943.0001.001;view=image;seq=00000026
bad news. another table. i'm almost afraid to look at the text:
> http://www.hti.umich.edu/cgi/t/text/pageviewer-idx?c=glrr&cc=glrr&idno=4737943.0001.001&frm=frameset&view=text&seq=26
um, yep. all run together again. sigh...
what the heck, now that we've prepared ourselves,
might as well see what the "table" of contents looks like:
> http://www.hti.umich.edu/cgi/t/text/pageviewer-idx?c=glrr;cc=glrr;rgn=full%20text;idno=4737943.0001.001;didno=4737943.0001.001;view=image;seq=00000002
and the text view?
> http://www.hti.umich.edu/cgi/t/text/pageviewer-idx?c=glrr&cc=glrr&idno=4737943.0001.001&frm=frameset&view=text&seq=2
as expected. crap. ok, let's stop this torture...
(i mean, hey, if you have the stomach to look at more books, be my guest.
let me know what you find. but i won't drag you through more examples.)
***
you can look at book after book after book, in one collection after another,
and see this same garbage. these are not isolated problems, and certainly
weren't caused by google. i don't know about you, but i'd be embarrassed
to show this kind of work to _anyone_, let alone to people at a _university_.
i don't want to leave the impression that everything at umichigan is this bad.
i _have_ been able to tap into some versions that are much better than these.
but the bad stuff... well, frankly, it's very bad. really makes me wonder.
isn't there _somebody_ there who is actually _looking_ at this output?
doesn't _anyone_ have the guts to say, "ann arbor, we have a problem"?
***
anyway, i could go on and on, but this "p.s." is longer than the message.
but i tell you, friends, these scanning projects are being _done_wrong_.
the future is gonna look at us and yell "what a bunch of bumbling fools!"
again, i didn't say all this back then, because i felt it was too "mean",
and i didn't want to poison the dialog. and i still feel that it was wrong
for john to "moderate" (i.e., censor) my post because he wanted to have
"more evidence". there's your evidence, john. what good does it do you?
so i'm gonna go get drunk now. ho ho ho. happy new year...