Re: What large collections are out there?
- From: John Mark Ockerbloom <ockerblo@[redacted]>
- Subject: Re: What large collections are out there?
- Date: Tue, 15 May 2007 09:52:34 -0400
Greg Lindahl wrote:
> I am building a site which aggregates and classifies online books, and
> so I'm on the lookout for metadata of large collections of online books.
> Currently I use archive.org, Project Gutenberg, and the Online Books
> Page data.
> Are there any other collections of > 10,000 books out there? Google is
> one example, but they don't make it easy to get their metadata.
The University of Michigan has an OAI feed covering over 200,000 items.
Not all of these are books, but they include some large sets that are.
For example, their set "oaiall:moabib", for Making of America Books,
has more than 10,000 volumes in it. (Note that multi-volume works get
a record for each volume in their feed; in my own index, I coalesce
multi-volume works into single entries, unless the volumes often stand alone.)
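That coalescing step can be sketched roughly as follows; the record fields and the volume-suffix pattern here are my own assumptions, not anything from the Michigan feed itself:

```python
import re
from collections import defaultdict

# Hypothetical per-volume records, like those an OAI feed might return.
records = [
    {"id": "moa1", "title": "Making of the Nation. v. 1"},
    {"id": "moa2", "title": "Making of the Nation. v. 2"},
    {"id": "moa3", "title": "A Standalone Work"},
]

# Trailing volume designations such as ". v. 2" or ", Vol. 10".
VOLUME_SUFFIX = re.compile(r"[\s.,;:]*\b(?:volume|vol|v)\.?\s*\d+\s*$",
                           re.IGNORECASE)

def base_title(title):
    """Strip a trailing volume designation from a title."""
    return VOLUME_SUFFIX.sub("", title).strip()

def coalesce(records):
    """Merge per-volume records sharing a base title into one entry."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[base_title(rec["title"])].append(rec["id"])
    return dict(grouped)

print(coalesce(records))
# {'Making of the Nation': ['moa1', 'moa2'], 'A Standalone Work': ['moa3']}
```

A real index would also need a rule for the "volumes often stand alone" exception, e.g. keeping volumes separate when each has its own distinct title.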
You might also be interested in their set "oaigroups:freetcbib", which
supposedly gives records for all their freely available text collections.
(I haven't yet looked at this in detail, but I presume it contains the
Making of America records plus other applicable collections. Some of
these texts are probably non-book texts.)
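For anyone rolling their own harvester, the mechanics are simple enough to sketch: a ListRecords request scoped to a set, then following resumptionTokens page by page. This is a minimal Python sketch (the base URL is a placeholder, and the sample response is fabricated for illustration):

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"

def list_records_url(base_url, set_spec, token=None):
    """Build an OAI-PMH ListRecords request URL for one set.

    Per the protocol, a resumptionToken request carries only the verb
    and the token, not the original metadataPrefix/set arguments.
    """
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc",
                  "set": set_spec}
    return base_url + "?" + urllib.parse.urlencode(params)

def parse_page(xml_text):
    """Return (titles, resumption_token) from one response page."""
    root = ET.fromstring(xml_text)
    titles = [t.text for t in root.iter(DC_NS + "title")]
    tok_el = root.find(".//" + OAI_NS + "resumptionToken")
    token = tok_el.text if tok_el is not None and tok_el.text else None
    return titles, token

# Fabricated single-record response page, for illustration only.
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><metadata>
      <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:title>Example volume</dc:title>
      </oai_dc:dc>
    </metadata></record>
    <resumptionToken>page2token</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("http://example.org/oai", "moabib"))
print(parse_page(SAMPLE))
```

A full harvester would loop, fetching each URL and feeding the returned token back into list_records_url until no token comes back.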
If you want to harvest these sets, there is various software you can use.
I know you're aware of the Perl library Net::OAI::Harvester, since you
sent me some code that uses it in backchannel. I've had some issues with
that library (for instance, I'm having trouble using it to get a full list
of sets from the Library of Congress' American Memory OAI feed), but it seems
to work fairly well for harvesting records. There are other packages as well
for other languages.
I'm glad to see people working with large-scale aggregation! I've started some
experiments toward that end myself, though they haven't yet been incorporated
into The Online Books Page listings.
One issue that I'm working on now, that I'd like to make some headway on before
I start doing large-scale integration, is normalizing subject metadata for
easier exploration. When I was at DLF, I gave a talk about issues involved
in browsing subjects across aggregated collections, with normalization being
one of the issues I brought up. I'll be posting the slides
from that talk shortly, and will be happy to give a pointer to them when
they're up.
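To give a flavor of what I mean by normalization: headings for the same subject often differ across collections only in case, spacing around subdivision separators, or trailing punctuation. A minimal sketch of collapsing those variants into one browse key (the specific rules here are just illustrative assumptions, not my full approach):

```python
import re

def normalize_subject(heading):
    """Collapse trivial variants of a subdivided subject heading:
    unify spacing around the '--' subdivision separator, strip
    trailing periods, and fold case. Display forms would be kept
    separately alongside the normalized key."""
    parts = [p.strip(" .") for p in re.split(r"\s*--\s*", heading)]
    return " -- ".join(p.lower() for p in parts if p)

variants = [
    "United States -- History--Civil War, 1861-1865.",
    "United States--History -- Civil War, 1861-1865",
]
print({normalize_subject(v) for v in variants})
# Both variants map to the single key
# 'united states -- history -- civil war, 1861-1865'.
```

The harder part, of course, is beyond string cleanup: reconciling genuinely different vocabularies and heading choices across collections, which string normalization alone won't solve.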