Book People Archive

Re: feedback to umichigan on "books and culture", part 1



perry said:
>   It's been our experience that
>   including any kind of semantics in filenames
>   is just a bad idea.

i thought this was a backchannel, so i replied backchannel.

now i discover perry sent that frontchannel.
(my mailbox bounced the bookpeople copy.)

if you consider the time i've spent looking at their work
and writing posts, and remind yourself how much i have
complimented umichigan folks for doing the right thing
and releasing text-files to the public, i think it's obvious
that i don't mean to be picking a fight here with perry...

to the contrary, it is only with the best of intentions that
i'm telling 'em that they need to examine their workflow.

and i'm willing to continue the discussion to tell 'em why.

but i've borne bad news enough to know people do not
like to hear such messages, and often tune me out, and
hey, that's why saying "i told you so" is one of my faults,
because it just so happens that i need to say it too often.
far too often.

you might also remember that i told perry he should feel
absolutely no obligation to respond to me.   i repeat that.
but ya know, perry, if you _are_ gonna respond to me,
you will have to do better than "that's just a bad idea"...

anyway, folks, here's how i responded to perry...

***

perry-

>   This is not on our development plan at the moment,
>   so it's not going to happen any time soon, if at all.

well, a scraping program would allow people
to collect all the text-files for a book _easily_,
and you wouldn't have to implement a thing...

would you object if i released such a program?

***

>   It's been our experience that
>   including any kind of semantics in filenames
>   is just a bad idea.

if you consider the pagenumber to be "semantics",
then your assessment of your experience is wrong.


>   We need to know two things with any page image:
>   know where it fits in the sequence of the book,
>   and know its page number as printed.

and of the two, the second is far more important.
it's the one that people have been using all along.
and it's the one burned in the essence of the page,
the one that it's been carrying since its inception...

the sequence number is more or less an accident.


>   Many people, including us many years ago,
>   have tried to use file names to record both, but
>   it creates all kinds of problems with large repositories.

we can discuss those "all kinds of problems" if you want,
because i know how to make filenames work correctly,
_if_ they are structured correctly.  or you can ignore me,
figuring you know better.  but i will tell you quite frankly
that you're wrong about this.  and your error will bite you.
and let me tell you, with 8 million books, it'll bite you bad...


>   A better method is to use the filename to record the sequence,

wrong.


>   and simply associate the page number with the page image
>   using a database or a standard metadata format such as METS.

wrong wrong wrong.

it is the _sequence_ which should be stored as "meta-data",
and the exact nature of that sequencing meta-data should be
_the_alphabetical_sort-order_of_the_filenames_, which means
it is "meta-data" inherent in the filenames themselves, and thus
doesn't need to be "stored" anywhere, since it can be determined,
on-demand, in a manner that's obvious (even to a fourth-grader),
by anyone who requires it.  this gives the content full robustness...

(it allows other pluses, such as an easy ability to rework or remix
the sequencing, but let's just concentrate on the core values now.)
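
to make that concrete, here's a minimal sketch in python.  the naming scheme is purely an assumption for illustration (a letter for the section, "f" for front-matter and "p" for numbered pages, then the zero-padded pagenumber as printed); it is not umichigan's scheme, and the function names are made up.  the point is just that a plain sort recovers the sequence, and the pagenumber is read straight off the name:

```python
import re

# hypothetical filenames, deliberately listed out of order.
# "f" = front-matter, "p" = numbered pages; digits = pagenumber as printed.
FILENAMES = ["p002.txt", "f001.txt", "p010.txt", "p001.txt", "f002.txt"]


def reading_order(filenames):
    """The sequence is nothing more than the alphabetical sort of the names."""
    return sorted(filenames)


def printed_page(filename):
    """Recover (section, pagenumber) from the filename itself."""
    match = re.fullmatch(r"([a-z])(\d+)\.\w+", filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename}")
    section, number = match.groups()
    return section, int(number)


# no database lookup is needed for either piece of information
for position, name in enumerate(reading_order(FILENAMES), start=1):
    print(position, name, printed_page(name))
```

zero-padding is what makes this work: without it, "p10" would sort ahead of "p2".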

i'm telling you now that you're doing it wrong.  remember that later.
because your method is going to tangle you up in all kinds of knots.
and then i _will_ come back to say "i told you so", it's one of my flaws.

-bowerbird

***

so ok, that's how i responded to perry on this matter.

i also informed him -- in that backchannel sent at noon,
his time -- about those missing characters in those files,
but i didn't hear back from him on that serious problem.

neither have i received an answer from him on the question
whether umichigan would mind if i released a scraping app.

at any rate, let me just remind people that _i_ do not have to
administer the system that umichigan has set up for itself, so
i don't care how much complexity they saddle themselves with.

nor am i student/staff/faculty at umichigan who'll have to use
the system created for them, as inconvenient as that might be.

all i need is to get the base u.r.l. for a book so that i can scrape
the image and text-files by clicking a single button.   that's it!

so i don't care one way _or_ the other if my points are heeded.

i give this feedback to umichigan for _their_ benefit, not mine.
whether they listen or not is up to them.   my conscience is clear.

***

jon noring said:
>    After working with two scanning projects, My Antonia,
>    and the Kama Sutra, I've learned very quickly as to what NOT
>    to do, and what needs to be done to keep track of everything.

oh good lord, two projects and jon thinks he's learned enough
to set policy for a project scanning "several thousand" books...

excuse me while i go take a deep breath...          :+)

...ok, i'm back...

it is bad practice to put the sequence-number in the filename,
because -- for various reasons -- sequence-numbers change.
(one fairly obvious example is when an introduction gets added.)

then you have two versions of the same file with different names.
that's _bad_.   you don't want the same file with different names,
and you don't want different files with the same name.   capiche?
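
a small python sketch of how that goes wrong.  the page labels and the 8-digit naming pattern here are assumptions made up for the example, not anybody's actual scheme:

```python
def sequence_names(pages):
    """Name each page by its position in the book: 00000001.txt, 00000002.txt, ..."""
    return {page: f"{i:08d}.txt" for i, page in enumerate(pages, start=1)}


# hypothetical book, before and after a two-page introduction gets added
before = ["title", "p001", "p002"]
after = ["title", "intro-i", "intro-ii", "p001", "p002"]

old = sequence_names(before)
new = sequence_names(after)

# same page, different names: every page after the insertion point is renamed
renamed = {page: (old[page], new[page]) for page in before if old[page] != new[page]}

# same name, different pages: old names now point at other content
old_by_name = {name: page for page, name in old.items()}
new_by_name = {name: page for page, name in new.items()}
reused = {
    name: (old_by_name[name], new_by_name[name])
    for name in old_by_name
    if name in new_by_name and new_by_name[name] != old_by_name[name]
}

print(renamed)
print(reused)
```

with names built from the printed pagenumber instead, "p001" keeps its name no matter what gets added in front of it.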

maybe on his 3rd book, his 11th, or his 242nd, jon will learn that.

-bowerbird