Book People Archive

on duplications

From: Bowerbird@[redacted]
Subject: on duplications
Date: Wed, 10 May 2006 14:26:07 EDT

david said:
> There may be various editions, but most public libraries
> won't waste space and budget on duplicate books.

that's true.

to a great extent, anyway.

however, it's something of
an empirical question too,
most especially _collectively_,
so it's a good thing someone
has looked at relevant data.
(the analysis is appended.)


> And unlike in the physical world,
> duplicate etexts are a complete waste

again, that's true.

but it's also true that, unlike in the physical world,
the resources of "space and budget" that are being
"wasted" are less scarce.  so even if a cyberlibrary is
50% redundant, it's not really _that_ big of a deal...


> if you want 1/3 million ebooks, I can make that
> on my hard drive in a couple minutes, if you
> let me count duplicate etexts or complete trash.

and you'd still have plenty of disk-space, most likely,
and the "cost" increase will have been quite negligible,
to reiterate the point i just made.  indeed, the cost that
is brought to the table by such duplication that's _most_
significant is the human cost of negotiating the catalog.

and of course, you wouldn't have any more books than
when you had started, so there's no reason to do that
exercise, which is the point you were making, i grasp.

the task in front of us, though, is a little less avoidable.
since michael and friends have collected e-books from
throughout cyberspace, there will be some duplications.

yes, these are, as you have aptly pointed out, wasteful.
but it will take time and energy to cull out the duplicates.
does the "waste" we are experiencing merit allocation of
the time and energy to do that culling?  i'm not convinced.

i think to the extent we can do a quick-and-dirty dismissal
of obvious duplicates, we should.  but a more thorough job
should probably be delayed until it can prove its worth...
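
as a rough sketch of what such a quick-and-dirty pass could look like
(this is an illustration only, not anything the project is known to run;
the function names and the sample files are made up), one could hash each
e-text after normalizing case and whitespace, then group files whose
hashes match.  it will only catch near-verbatim copies, not different
editions of the same book, which is exactly the "obvious" case.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text):
    """Hash a text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def obvious_duplicates(etexts):
    """Group file names whose normalized contents are identical.

    `etexts` maps a (hypothetical) file name to its full text.
    Returns a list of groups, each holding two or more names.
    """
    groups = defaultdict(list)
    for name, text in etexts.items():
        groups[fingerprint(text)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# made-up sample data: a.txt and b.txt differ only in case and spacing
sample = {
    "a.txt": "When in the Course of human events",
    "b.txt": "when in the  course of human\nevents",
    "c.txt": "Call me Ishmael.",
}
print(obvious_duplicates(sample))
```

anything subtler than this (different editions, different markup,
different front matter) is the "more thorough job" and costs real effort.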

after all, what michael is doing is trying to get the attention
of the general public to slap them upside the head with the
newspaper headline that "there are a heckuvalotta e-books."

i mean seriously, he has also compared the _weight_ of this
library-filling stack of books to that of "an elephant herd",
so it's fairly obvious that these are _conceptual_ arguments.
asking him to specify "the number of elephants in that herd"
indicates that one doesn't understand the nature of the point.

in the same way he enticed the masses to latch onto the idea
of an "e-book" by typing in the declaration of independence,
michael is now framing the concept of "an electronic library",
allowing people to see his collection of files on the web as
the parallel to their building full of books down the street...

let him stretch their brains.   they need it.

-bowerbird

p.s.  this research by lorcan dempsey brings data to bear.
the focus was on the 5 libraries that google is digitizing.
read through all the conclusions -- the last is interesting.
(and david, there's language stuff in there you'll appreciate.)

>   http://orweblog.oclc.org/archives/000800.html

* The proportion of the system-wide collection actually
covered by the Google 5 libraries, once duplicate holdings across
the five institutions are removed, is about one third (33 percent), or
10.5 million unique books out of the 32 million in the system-wide
collection.

* The pattern of cross-collection overlap implies that if each collection
were fully digitized, about four out of every ten books would be re-digitized
at least once, or in other words, the Google Print Libraries project reflects
a minimum redundancy rate of about 40 percent.

* The pattern of cross-collection overlap suggests that research library
collections may be less 'vanilla' than might have been thought,
or in other words, as the article states, 'rareness is common'.
Only 3% of the books in the 10.5M are held by all five libraries.

* More than 430 languages were identified in the Google 5 combined
collection.  English-language materials represent slightly less than half
of the books in this collection; German-, French-, and Spanish-language
materials account for about a quarter of the remaining books, with the
rest scattered over a wide variety of languages.  At first sight this
seems a strange result: the distribution between English and non-English
books would be more weighted to the former in any one of the library
collections. However, as the collections are brought together there is
greater redundancy among the English books.

* Approximately half of the print books in the combined Google 5 collection
were published after 1974.  Almost three-quarters were published after the
Second World War.  Using the year 1923 as a rough break-off point between
materials that are out of copyright and materials that are in copyright,
more than 80 percent of the materials in the Google 5
collections are still in copyright (this is of course an upper bound).
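
for anyone who wants to check the arithmetic behind the first and third
conclusions, here is a minimal sketch.  it uses only the figures quoted
above (10.5 million, 32 million, 3 percent); the variable names are
invented for the illustration, and the results are as rough as the
rounded inputs.

```python
# figures quoted in the conclusions above
unique_books = 10_500_000   # unique books across the Google 5
system_wide = 32_000_000    # books in the system-wide collection

# share of the system-wide collection covered by the Google 5
coverage = unique_books / system_wide
print(f"coverage: {coverage:.1%}")

# "only 3%" of the unique books are held by all five libraries
held_by_all_five = round(0.03 * unique_books)
print(f"held by all five: about {held_by_all_five:,} books")
```

so "about one third" is a shade under 33 percent, and the 3 percent
held in common works out to roughly 315,000 books.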