Re: "Scan This Book!"
- From: Bowerbird@[redacted]
- Subject: Re: "Scan This Book!"
- Date: Wed, 31 May 2006 16:44:25 EDT
lars said:
> Just like job openings on monster.com can be
> georeferenced and plotted on Google Maps,
> it would be possible to plot David Livingstone's
> Expedition to the Zambesi on a map.
> Has anybody done this? I can imagine that
> (1) it takes some skill, time, and effort to do it
> (but there are plenty of people who have that), and
> (2) for it to be useful, you need readers who are
> both interested in Livingstone's expedition and
> your novel user interface to the text. Here's the text,
> http://www.gutenberg.org/etext/2519
> Still, that kind of manual mash-up is
> not the automatic process that Kelly describes.
ok, first of all, your last point is important, because
you've implicitly set this up as a one-of-a-kind task.
i would want to think of it in more general terms,
and with a "predominantly-automatic" orientation.
since i'm always thinking in terms of ability-to-scale,
with a top-end of -- literally -- _billions_ of elements,
you cannot devote more than a sliver of human time
in the processing of any single particular document.
of course, we know that some humans will freely give
_much_ time to a particular document they care about,
so a very important part of the infrastructure has to be
concerned with _facilitating_ that care, _capturing_ it,
and then _re-distributing_ it back to the general public.
this means _some_ documents will get good attention.
but for the most part, while each book is "waiting for"
its dedicated caregiver to come along, we need to
ensure our automatic handling of it is "good enough".
so let me tell you a story. its relevance will become clear.
when writing my viewer-program for the p.g. e-texts,
it seemed important to me to incorporate any _pictures_
which were present in the original p-book. why not?
people think of project gutenberg in terms of _text_ files,
but for some time, p.g. has been willing to store images.
most of the time, if a book has images, an .html version
is made that incorporates them. the actual image-files
are stored in the "images" folder inside the folder that holds the .html version.
one problem i discovered is that in the plain-text file
-- which is the one that my viewer-program displays --
there is often no indication of exactly _which_ image-file
is the one which is to be displayed at any specific location.
there's usually a note like "[illustration here]" or some
other indication, such as the caption for the illustration,
but _no_ specific indication of _which_ image-file it is!
so, if a book has 20 image files, say, there is no way to
know _which_ of those 20 is the one that goes "[here]"!
ok, so when i discovered this, i figured the p.g. people
simply assumed that a text-viewer wouldn't be able to
"pull in" a picture for display, so there was no reason to
include the file-name information in the plain-ascii file.
and i believed that if i just asked them to start including
that information, so that my viewer-program could use it
to display the correct image at each appropriate location,
they'd see the reasoning, and cheerfully include that info.
(i assumed they'd be cheerful, because this ability to show
pictures along with the text is obviously an improvement.)
so i asked them to include the filename.
and boy, was i wrong about what their reaction would be.
they flatly said no, they wouldn't include that information,
not in the plain-ascii text-file.
i was absolutely floored.
they clearly _have_ the info about which file goes where,
because they include its filename in the "img src=" tag
in their .html version. so they actually go to _extra_work_
to _discard_ this information from the plain-ascii version!
unbelievable!
and when i asked them to _stop_ discarding that info,
they refused!
at first, i thought there must simply be a misunderstanding,
that they couldn't possibly be discarding that information
_on_purpose_. so i persisted. and jim tinsley let me know,
in no uncertain terms, that he knew exactly what i was asking,
and he was bound and determined that i would not receive it.
they were going to continue to discard that good information!
i could not understand that stupidity.
i still cannot.
but life has to go on, right? besides, even if he had said "ok",
there was still the matter of all the old texts that didn't have the info.
so i needed a solution.
my first thought was that i would have to go through and manually
match up exactly which picture went in which location, for every text.
then i would have to edit every text, in every such location, to insert
the exact filename of the specific image to be displayed at that place.
as you can imagine, that's a heck of a lot of work.
to help ease that burden, i decided to leverage the .html versions.
since each "img src=" tag gives the exact filename at the exact point
in the text, much of the work involved could be automated, not just
the detective work about what goes where, but also the grunt work of
_editing_the_files_.
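(if you want to picture that mining step, here's a rough python sketch
-- not my actual routine, and the filenames are made up -- that walks
the .html and notes each "img src=" filename along with the text right
after it, which is usually the caption:)

  # list each img src filename together with the text right after it
  from html.parser import HTMLParser

  class ImgPlacements(HTMLParser):
      def __init__(self):
          super().__init__()
          self.placements = []        # (image filename, nearby text) pairs
          self._want_text = False

      def handle_starttag(self, tag, attrs):
          if tag == "img":
              src = dict(attrs).get("src", "")
              self.placements.append([src.split("/")[-1], ""])
              self._want_text = True

      def handle_data(self, data):
          if self._want_text and data.strip():
              self.placements[-1][1] = data.strip()
              self._want_text = False

  # usage (filename illustrative):
  #   parser = ImgPlacements()
  #   parser.feed(open("alice30.htm", encoding="utf-8").read())
  #   for image, caption in parser.placements:
  #       print(image, "->", caption)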
but then i hit on something even better.
because even once i had edited all of those files, i would still have to
_distribute_ them, and somehow _supplant_ all of the original e-texts.
considering how widely those original e-texts have been disseminated,
that would be no small task.
what i _really_ wanted, was to be able to _use_ those original e-texts.
if someone already had all those e-texts on their machine, and the
image-files, i didn't want them to have to re-download all of them.
somehow, i had to make a system that worked with those files _as_is_.
once i had defined the problem in those terms, the answer came.
what i decided was to let the text itself, as it stood, call the picture,
and it had to call the appropriate picture at the appropriate place...
in some (rare) situations, this can happen relatively easily.
for instance, if we have images named "figure1" through "figure7",
then it's pretty easy to know exactly where those go in the e-text:
obviously, "figure1" goes where "figure 1" is mentioned in a caption,
and so on and so forth.
the spark was the idea to use each image-file's _name_ as its key.
more specifically, i would find the location where each image went,
locate some _unique_ text on that page (found nowhere else in the
entire book), and then _name_ the image-file with that unique text.
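(a rough python sketch of that "find some unique text" step --
illustrative only, not the routine i actually wrote -- might go like this:)

  # yield short word-sequences from the page that occur
  # nowhere else in the entire book
  import re

  def unique_phrases(page_text, book_text, max_words=3):
      words = re.findall(r"\w+", page_text.lower())
      book = " " + " ".join(re.findall(r"\w+", book_text.lower())) + " "
      for n in range(1, max_words + 1):
          for i in range(len(words) - n + 1):
              phrase = " ".join(words[i:i + n])
              if book.count(" " + phrase + " ") == 1:   # unique in the book
                  yield phrase

  # e.g. "alice swimming" would become the filename "alice_swimming.png"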
from the perspective of my viewer-program, then, what it does is to
examine the text that it has just displayed on-screen, in the normal
course of the user navigating through the book, comparing that text
to the filenames of the "available" graphic files (in the current folder).
if it finds a match, it knows that _that_ graphic-file is to be displayed.
for instance, if files named "figure1" through "figure7" are "available",
then when it displays a page that has the text "figure 4" inside of it,
it knows that it's time to display "figure4". so it puts up the image
in any "unused space" on the page, or a thumbnail if that's all that fits.
this methodology allows people to use their original p.g. e-texts
and original image-files -- they don't have to download new ones.
the only change that is needed is to the _names_ of the graphic-files.
so i find out which image goes in a certain place, find some unique text
near that place -- i wrote a routine that highlights this "unique text" --
and then create an "edit-file" that renames that image-file to that text;
so i just have to provide this "edit-file" to the user, and my program will
do the batch rename of the image-files, and then everything "just works".
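(a python sketch of applying such an "edit-file" -- the tab-separated
two-column format shown here is just an assumption, not the real format:)

  # rename the original image-files to their unique-text names
  import os

  def apply_edit_file(edit_file, folder="."):
      with open(edit_file, encoding="utf-8") as f:
          for line in f:
              if not line.strip():
                  continue
              old_name, new_name = line.rstrip("\n").split("\t")
              os.rename(os.path.join(folder, old_name),
                        os.path.join(folder, new_name))

  # each line maps an original name to its unique-text name, e.g.:
  #   original_name.gif <tab> unique_text_key.gif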
for example, here are the image-files as renamed for "alice in wonderland":
> alice_cramped.png
> alice_holding.png
> alice_meets.png
> alice_saying.png
> alice_speaks.png
> alice_swimming.png
> alice_taking.png
> alice_trying.png
> alice_upsets.png
> alice_watching.png
> arrives_hastily.png
> baby,_and_alice.png
> bill_flying.png
> blowing_trumpet.png
> cat_fades.png
> chats_with.png
> checking_watch.png
> dodo_presenting.png
> dog_looking.png
> executioner_argues.png
> frog_servants.png
> gryphon_asleep.png
> gryphon_demonstrating.png
> gryphon_singing.png
> hand_grabbing.png
> hare_dunk.png
> hastily_leaves.png
> hatter_engaging.png
> king_reflecting.png
> lobster_primping.png
> mouse_swimming.png
> mouse_telling.png
> queen_inspecting.png
> queen_pointing.png
> seven_painting.png
> stretched_tall.png
> the_mad.png
> tiny_door.png
> william_balancing.png
> william_having.png
> william_somersaulting.png
> william_standing.png
you will find these files here:
> http://snowy.arsc.alaska.edu/bowerbird/alice01/alice01/
these files were originally "alice01a.gif" through "alice042.gif".
(and, for those of you who might be thinking that those numbers in
those filenames indicate their order, well, they do, in the main, and
yes, that helps to sort them out, at least once you have realized that
the first image in the book -- the frontispiece -- was alice033.gif.)
if you search my z.m.l. version of this book, you'll find that
the text-strings reflected by the new filenames occur in the
_figure-captions_ where each image should be displayed...
not only did this methodology solve the problem at hand,
which was good, it also gave me a potent new capability, which is even better.
essentially, what this methodology allows is for you to
"inject" your own content _into_ a pre-existing e-text,
simply by making that content "available" to my viewer
(i.e., by placing it into the same folder as the text-file),
and giving it a name that determines where it is "injected".
so, for instance, if you have a picture of albert einstein,
and you named it "albert_einstein.jpg" and put it in the
folder with your text-file, my viewer-app would "pop up"
the picture on any page with the words "albert einstein"...
i've since taken this capability beyond image-files into
quicktime movies, audio files, and flash presentations,
and even other text files (which pop up as annotations),
which means you can turn an existing e-text into quite
the little multi-media circus, if you really want to...
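(the same name-as-key matching, extended to other media, could be
sketched in python like so -- the handler names are just placeholders
for whatever the viewer actually does with each file-type:)

  # pair each matching file in the folder with the kind of handler it gets
  import os

  HANDLERS = {
      ".png": "show_image", ".gif": "show_image", ".jpg": "show_image",
      ".mov": "play_movie",
      ".mp3": "play_audio",
      ".swf": "show_flash",
      ".txt": "popup_annotation",
  }

  def media_for_page(page_text, folder="."):
      page = " ".join(page_text.lower().split())
      for filename in os.listdir(folder):
          base, ext = os.path.splitext(filename)
          if ext.lower() in HANDLERS and base.replace("_", " ").lower() in page:
              yield filename, HANDLERS[ext.lower()]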
without going into all the ramifications here and now,
i'll just say that this enables a fantastic remix capacity.
however, this does take us back to lars and his question
about plotting the locations inside of the livingstone book.
to go about performing this task, i would pull from the text
all the _place-names_ that i wanted to map. then i would
obtain the g.p.s. coordinates of these places, after which
i would obtain map graphic-files for each of the locations;
the final step would be _naming_ those graphics correctly...
and it's not difficult to imagine how we could automate that.
first, a dictionary of place-names could (and eventually should)
be incorporated into viewer-applications, and i would expect
place-name dictionaries in the future to _include_ g.p.s. data.
so our viewer-program could use the place-names dictionary
to auto-collect the place-names in an e-text, then use their
g.p.s. coordinates to download maps, and name 'em correctly.
routines could even be written that would scour photo-sites
-- like flickr.com -- using their a.p.i. to search the photos' tags
for those collected place-names, auto-download some pictures,
and auto-name them using the appropriate text-strings. bingo.
all of a sudden, you've got yourself a travelogue.
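(a hypothetical python sketch of that pipeline -- the gazetteer entries,
the coordinates, and the flickr a.p.i. key below are all placeholders,
and the flickr call is just my reading of their photo-search method:)

  # collect place-names from the e-text, ask flickr for photos tagged
  # with each one, and name the saved photo so the viewer will match it
  import json
  import urllib.parse
  import urllib.request

  GAZETTEER = {"mount zomba": (-15.39, 35.32)}     # coordinates illustrative

  FLICKR_KEY = "YOUR_API_KEY"

  def place_names_in(text):
      lowered = text.lower()
      return [name for name in GAZETTEER if name in lowered]

  def flickr_search(tag):
      params = urllib.parse.urlencode({
          "method": "flickr.photos.search", "api_key": FLICKR_KEY,
          "tags": tag, "per_page": 1,
          "format": "json", "nojsoncallback": 1,
      })
      url = "https://api.flickr.com/services/rest/?" + params
      with urllib.request.urlopen(url) as response:
          return json.load(response)["photos"]["photo"]

  def viewer_name(place):
      return place.replace(" ", "_") + ".jpg"       # e.g. "mount_zomba.jpg"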
as a lark, i pulled out the names of mountains:
> Mount Chikala = 0
> Mount Chiperiziwa
> Mount Chiradzuru = 0
> Mount Choro
> Mount Clarendon = 18
> Mount Manyerere
> Mount Mochiru
> Mount Morambala = 0
> Mount Morumbwa
> Mount Mvai = 3
> Mount Njongone
> Mount Pirimiti = 0
> Mount Zomba = 11
the digits indicate the number of google hits on these terms
that do _not_ also include "livingstone", so we can see that
these mountains are _not_ very well represented in cyberspace.
and a search of google maps on a few of them returned no hits.
(it occurs to me now as i'm writing this that i should've searched
for the "mt." variation as well, but i won't go bother to do that.)
so you will have to wait for a while before this particular book
is a good candidate for this auto-fetching of additional content.
(right now, it's not even a good candidate for a _manual_ task.)
but i think you get the idea about how a program could go about
doing this automatically, so we wouldn't have to do it manually...
i know next-to-nothing about the mapping sites. for all i know,
they might already help users do this type of thing automatically.
further, the idea that you'd want to do this for a book seems to be
a not-uncommon one, as evidenced by a note on this blog entry:
> http://orweblog.oclc.org/archives/001031.html
that blog-post was written by dempsey _after_ i wrote this post,
but -- obviously -- before i submitted this post. as you can see,
he's talking about problems caused by "duplicate" place-names,
but certainly, by tagging the state/country onto these, they are
unambiguous again, with g.p.s. data furnished for each and all.
***
if you want to do some additional thinking on things like this,
you might want to read dan cohen (http://www.dancohen.org),
who has made cool apps (http://www.dancohen.org/software),
including "syllabi finder" (http://chnm.gmu.edu/tools/syllabi)...
syllabi finder goes on the web and finds syllabi (yes!), by searching
for certain criteria that it has found to be "earmarks" of a syllabus
(the main one being that the file contains the word "syllabus" -- d'oh,
sometimes these things really _are_ a lot simpler than they seem),
and then collects the syllabi in a database browsable by the public.
another good example is some research that i.b.m. has been doing.
led by irving wladawsky-berger, work on "unstructured data retrieval"
is right up the alley of this question of "books reading each other".
jon udell gives a good write-up on these computerized "analyzers":
> Using a potpourri of technologies, these analyzers pore through
> unstructured text looking for named entities (people, places,
> companies, or products, for example) and relationships among them.
> Then the analyzers tag these entities to enable structured search.
> Queries are XML fragments that can nest entities, such as "person"
> and "organization", inside relationships, such as "president_of"
>
> If such tagging were already present in the document, or linked to it
> by way of an external tagging service, you could skip the rocket-science
> analysis phase and proceed directly to the query endgame.
>
> Yeah, sure, and if pigs had wings they could fly. UIMA quite reasonably
> assumes that people cannot and will not compose texts using semantic markup
> to denote entities and relations. It also assumes that the semantic clues
> we can find on the public Web -- thanks to linking and, more recently,
> social tagging -- won't be as available in the enterprise, given its
> vastly smaller scale and complex security regime.
>
> http://www.infoworld.com/article/05/08/17/34OPstrategic_1.html
i love an observation which takes the assumption that ordinary people will
do semantic markup and responds with "if pigs had wings they could fly".
face it, folks, we're going to have to figure out this information automatically,
using routines that attack the content as it actually exists out in the real world.
i've been saying this for years. it's about time i.b.m. started catching up. :+)
the research is also written up here:
> http://irvingwb.typepad.com/blog/2005/08/index.html
lars, i hope these past two posts have shown that i'm not _totally_ clueless
on this general question of "having the books read each other". i certainly
don't have all the answers, which is why i raised the question to this list,
but i've done enough research to have some budding ideas about what to do...
-bowerbird