Book People Archive

Re: Dublin Core records available for The Online Books Page database



[Here's an exchange with Kat Hagedorn of OAIster, whom I'd cc'd on the
  Dublin Core records announcement.  I mistakenly diverted it from the
  list, so I'm returning it here now, with a summary of the previous
  discussion.]

Kat Hagedorn writes:

>>> This is fantastic. However I try not to duplicate data in OAIster. I'm
>>> perfectly happy to harvest selectively by set, though. Would it be possible
>>> to put records from different providers in their own sets?

I replied:

>> Not easily, since I have hundreds (possibly thousands) of different providers,
>> and some of the records include URLs from multiple providers (either as
>> alternative links, or in a few cases because no one provider had all the
>> volumes of a multi-volume work).
>>
>> If I were to break it down by provider, it would probably be based on URI.
>> You're welcome to look at the URI to throw out dupes from known repositories
>> if you like, if that fits with your workflow.  I probably am going to provide
>> some provider sorting at some point (since some users have asked for it as
>> an advanced search feature) but it's not likely to happen anytime soon.

She wrote:

> Could you list off some of the biggest providers you aggregate? I would
> really like to harvest your material, but if most of the records come from
> providers I already have, then I'll have to pass, for now.

Here's a quick rundown, based on some grep and Perl scripts:

28,551 active records (ones with at least one active link)
31,525 active links
 2,757 domains represented in the links

Top providers and notes (based on server domain,
  or in some cases higher-level domain):

7460 onlinebooks.library.upenn.edu

    This is for links that for some reason link back to a cover page on my own
      site.  The main subsets are

     6911 Project Gutenberg links (there may be multiple such links per record)
      340 "no US access" cover pages (public domain outside US, but not inside)
      201 serial cover pages (not all listed serials have these)
        5 complex book cover pages (where I had to stitch together items
            with something more than just a list of links)

 1620 www.canadiana.org      -- Early Canadiana Online
 1419 name.umdl.umich.edu    -- Making of America, and a few other Michigan titles
  980 *.nap.edu              -- National Academy Press
  700 *.loc.gov              -- Library of Congress (mostly American Memory)
  620 etext.lib.virginia.edu -- University of Virginia EText Center
  598 www.sacred-texts.com   -- Internet Sacred Texts Archive
  472 ark.cdlib.org          -- U California (mostly free UC Press titles)
  413 www.ccel.org           -- Christian Classics Ethereal Library
  376 docsouth.unc.edu       -- Documenting the American South
  335 www.athelstane.co.uk   -- Athelstane (Nick Hodson)
  306 digital.library.upenn.edu -- A Celebration of Women Writers + a few others
  303 www.hti.umich.edu         -- Michigan's Humanities Text Initiative
  300 www.bartleby.com          -- Project Bartleby
  264 *.indiana.edu           -- Indiana (mostly Victorian Women Writers
                                   and some Wright American Fiction)
  241 www.wws.princeton.edu   -- Office of Technology Assessment archive
  228 *.archive.org           -- Open-Access Text Archive and Wayback machine
                                     archived copies
  228 www.cimmay.us             -- Cimmay (Michael Rivard)
  202 www.mainlesson.com        -- Baldwin Children's Literature Project
  188 *.cornell.edu             -- Cornell, mostly from the Math Book collection
  174 www.cs.arizona.edu        -- Arizona (mostly their textiles collection)
  154 www.perseus.tufts.edu     -- Perseus Project
  145 digital.lib.msu.edu       -- Michigan State special collections
  131 www.eldritchpress.org     -- Eldritch press (Eric Eldred)
  127 digital.library.wisc.edu  -- Wisconsin digital collections
  120 www.horrormasters.com     -- Horror Masters
  117 www.bibliomania.com       -- Bibliomania
  114 books.iuniverse.com       -- IUniverse
  111 www.kellscraft.com        -- Kellscraft
  110 socserv2.mcmaster.ca      -- McMaster (Rod Hay's economics collection)
  107 darkwing.uoregon.edu      -- Renascence Editions (Risa Bear)

(and a couple thousand more domains below this, but those are the ones that
  have more than 100 links in my current database.  There's clearly a lot
  more that could be listed from many of these sources-- Google Books, for
  instance, probably has over 100,000 free volumes by now but only 84 entries
  in my database.  But that's what I've got now.)
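
A tally like the one above can be reproduced in a few lines.  What follows
is only a minimal Python sketch, not the grep and Perl scripts actually used.
It assumes the links are available as a plain list of URLs.  The function
names, the ROLLUP_SUFFIXES list, and the sample URL paths are all
hypothetical, chosen for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical list of higher-level domains to roll up into one entry,
# mirroring the "*.loc.gov"-style rows in the tally above.
ROLLUP_SUFFIXES = ("loc.gov", "nap.edu", "archive.org")


def provider_key(url, rollups=ROLLUP_SUFFIXES):
    """Return the server domain of a URL, or '*.suffix' if it rolls up.

    Returns an empty string when no host can be parsed from the URL.
    """
    host = (urlsplit(url).hostname or "").lower()
    for suffix in rollups:
        # Match the bare suffix or any subdomain of it, but not
        # unrelated hosts that merely end with the same letters.
        if host == suffix or host.endswith("." + suffix):
            return "*." + suffix
    return host


def tally_providers(urls, rollups=ROLLUP_SUFFIXES):
    """Count links per provider key, most common first.

    URLs with no parseable host are skipped.
    """
    keys = (provider_key(u, rollups) for u in urls)
    return Counter(k for k in keys if k).most_common()


if __name__ == "__main__":
    # Sample links; the hosts appear in the tally above, the paths are made up.
    links = [
        "http://www.canadiana.org/a",
        "http://memory.loc.gov/b",
        "http://www.loc.gov/c",
        "http://www.canadiana.org/d",
    ]
    for domain, count in tally_providers(links):
        print(f"{count:5d} {domain}")
```

A harvester wanting to skip records from repositories it already holds
could apply the same provider_key function to each record's URI and drop
those whose key is in its own list of known providers.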

I hope this helps,

   John