Re: Dublin Core records available for The Online Books Page database
- From: John Mark Ockerbloom <ockerblo@[redacted]>
- Subject: Re: Dublin Core records available for The Online Books Page database
- Date: Wed, 18 Jul 2007 16:13:14 -0400
[Here's an exchange with Kat Hagedorn of OAIster, whom I'd cc'd on the
Dublin Core records announcement. I mistakenly diverted the exchange
from the list; we return to it now, with a summary of the previous
discussion.]
Kat Hagedorn writes:
>>> This is fantastic. However I try not to duplicate data in OAIster. I'm
>>> perfectly happy to harvest selectively by set, though. Would it be possible
>>> to put records from different providers in their own sets?
I replied:
>> Not easily, since I have hundreds (possibly thousands) of different providers,
>> and some of the records include URLs from multiple providers (either as
>> alternative links, or in a few cases because no one provider had all the
>> volumes of a multi-volume work).
>>
>> If I were to break it down by provider, it would probably be based on URI.
>> You're welcome to look at the URI to throw out dupes from known repositories
>> if you like, if that fits with your workflow. I probably am going to provide
>> some provider sorting at some point (since some users have asked for it as
>> an advanced search feature) but it's not likely to happen anytime soon.
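As a rough illustration of the URI-based filtering suggested above, here is a minimal Python sketch. It assumes each harvested record has already been reduced to a dict with a "title" and a list of "urls"; that shape, the function names, and the sample records are all hypothetical, not anything the Dublin Core records define. Because a record may carry links from several providers, the sketch drops a record only when every one of its links points at an already-harvested domain.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def is_known(host, known_domains):
    """True if host equals, or is a subdomain of, any known domain."""
    return any(host == d or host.endswith("." + d) for d in known_domains)


def drop_known(records, known_domains):
    """Keep records that have at least one link outside the known domains."""
    kept = []
    for rec in records:
        hosts = [host_of(u) for u in rec["urls"]]
        # Drop only when the record has links and all of them are known.
        if hosts and all(is_known(h, known_domains) for h in hosts):
            continue
        kept.append(rec)
    return kept


# Hypothetical records; the paths are made up for illustration.
sample_records = [
    {"title": "Record A", "urls": ["http://name.umdl.umich.edu/a1"]},
    {"title": "Record B", "urls": ["http://name.umdl.umich.edu/b1",
                                   "http://www.canadiana.org/b2"]},
    {"title": "Record C", "urls": ["http://www.canadiana.org/c1"]},
]

for rec in drop_known(sample_records, {"umich.edu"}):
    print(rec["title"])
```

A stricter harvester could instead drop any record with even one known link; which rule fits depends on how much overlap is tolerable.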
She wrote:
> Could you list off some of the biggest providers you aggregate? I would
> really like to harvest your material, but if most of the records come from
> providers I already have, then I'll have to pass, for now.
Here's a very quick rundown, based on some quick grep and Perl scripts:
28,551 active records (ones with at least one active link)
31,525 active links
2,757 domains represented in the links
Top providers and notes (based on server domain,
or in some cases higher-level domain):
7460 onlinebooks.library.upenn.edu
This is for links that for some reason link back to a cover page on my own
site. The main subsets are
6911 Project Gutenberg links (there may be multiple such links per record)
340 "no US access" cover pages (public domain outside US, but not inside)
201 serial cover pages (not all listed serials have these)
5 complex book cover pages (where I had to stitch together items
with something more than just a list of links)
1620 www.canadiana.org -- Early Canadiana Online
1419 name.umdl.umich.edu -- Making of America, and a few other Michigan titles
980 *.nap.edu -- National Academy Press
700 *.loc.gov -- Library of Congress (mostly American Memory)
620 etext.lib.virginia.edu -- University of Virginia EText Center
598 www.sacred-texts.com -- Internet Sacred Texts Archive
472 ark.cdlib.org -- U California (mostly free UC Press titles)
413 www.ccel.org -- Christian Classics Ethereal Library
376 docsouth.unc.edu -- Documenting the American South
335 www.athelstane.co.uk -- Athelstane (Nick Hodson)
306 digital.library.upenn.edu -- A Celebration of Women Writers + a few others
303 www.hti.umich.edu -- Michigan's Humanities Text Initiative
300 www.bartleby.com -- Project Bartleby
264 *.indiana.edu -- Indiana (mostly Victorian Women Writers
and some Wright American Fiction)
241 www.wws.princeton.edu -- Office of Technology Assessment archive
228 *.archive.org -- Open-Access Text Archive and Wayback Machine
archived copies
228 www.cimmay.us -- Cimmay (Michael Rivard)
202 www.mainlesson.com -- Baldwin Children's Literature Project
188 *.cornell.edu -- Cornell, mostly from the Math Book collection
174 www.cs.arizona.edu -- Arizona (mostly their textiles collection)
154 www.perseus.tufts.edu -- Perseus Project
145 digital.lib.msu.edu -- Michigan State special collections
131 www.eldritchpress.org -- Eldritch press (Eric Eldred)
127 digital.library.wisc.edu -- Wisconsin digital collections
120 www.horrormasters.com -- Horror Masters
117 www.bibliomania.com -- Bibliomania
114 books.iuniverse.com -- IUniverse
111 www.kellscraft.com -- Kellscraft
110 socserv2.mcmaster.ca -- McMaster (Rod Hay's economics collection)
107 darkwing.uoregon.edu -- Renascence Editions (Risa Bear)
(There are a couple thousand more domains below this; the ones listed above
are those with more than 100 links in my current database. There's clearly
a lot more that could be listed from many of these sources. Google Books, for
instance, probably has over 100,000 free volumes by now but only 84 entries
in my database. But that's what I've got now.)
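For anyone who wants to produce a similar tally from the link URLs, here is a minimal Python sketch of the general idea. It is not the grep and Perl scripts used for the numbers above; the function names and the sample links are assumptions made up for illustration. It counts links per server domain and optionally folds hosts under a higher-level domain, in the style of the "*.nap.edu" entries.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def tally_hosts(urls):
    """Count links per host, ignoring strings with no recognizable host."""
    counts = Counter(host_of(u) for u in urls)
    counts.pop("", None)
    return counts


def rollup(counts, suffixes):
    """Fold hosts under the given higher-level domains, e.g. '*.nap.edu'."""
    rolled = Counter()
    for host, n in counts.items():
        for s in suffixes:
            if host == s or host.endswith("." + s):
                rolled["*." + s] += n
                break
        else:
            # No suffix matched: keep the full server domain.
            rolled[host] += n
    return rolled


# Made-up sample links; hosts and paths are illustrative only.
sample_links = [
    "http://www.canadiana.org/item/1",
    "http://www.canadiana.org/item/2",
    "http://www.nap.edu/book/1",
    "http://books.nap.edu/book/2",
    "http://books.nap.edu/book/3",
    "not a url",
]

for host, n in rollup(tally_hosts(sample_links), ["nap.edu"]).most_common():
    print(n, host)
```

Which domains to fold together is a judgment call; the suffix list here is supplied by hand rather than derived automatically.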
I hope this helps,
John