Re: [gweekly] Project Gutenberg Weekly Newsletter -- Week #33-2007
- From: John Mark Ockerbloom <ockerblo@[redacted]>
- Subject: Re: [gweekly] Project Gutenberg Weekly Newsletter -- Week #33-2007
- Date: Tue, 28 Aug 2007 14:31:00 -0400
Klaus Graf wrote:
> I have kindly asked for a list of all PG ebooks with page scans. It
> was not possible to get one. You have to click through 20.000+ ebooks
> to find out which have page scans.
While, as far as I know, Gutenberg itself does not export a machine-readable
master list of formats (they have an RDF catalog file, but it didn't
give format information last I checked), there are some alternatives to
clicking through 20,000+ titles to find the ones with page images.
I wrote a robot directory-scanner for Gutenberg that identifies formats of
the Gutenberg titles, for my own Gutenberg multiplexer. While Gutenberg
discourages using robots to fetch information that's available in their
downloadable bundles, they allow robots that follow the rules given at
as mine does.
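
To give a rough idea of what the format-identification step of such a scanner might look like, here is a minimal sketch in Python. It is not my actual code: the function names and the filename patterns (a plain `.txt` file, an `-h` HTML directory or file, a `-page-images` directory) are assumptions made for illustration, and the real directory layout may differ. It works on a list of entry names you have already fetched, so it does no network access itself, and it collects names it does not recognize instead of failing on them.

```python
import re

# Hypothetical naming patterns, assumed for illustration only; the real
# layout of the Gutenberg directories may differ from these.
PATTERNS = {
    "page-images": re.compile(r"^\d+-page-images(\.zip)?/?$"),
    "html": re.compile(r"^\d+-h(\.zip|\.htm|\.html)?/?$"),
    "text": re.compile(r"^\d+(-\d+)?\.txt$"),
}


def classify_entries(entries):
    """Sort directory entry names into known formats and unrecognized names."""
    found = {}
    unrecognized = []
    for name in entries:
        for fmt, pattern in PATTERNS.items():
            if pattern.match(name):
                found.setdefault(fmt, []).append(name)
                break
        else:
            # No pattern matched: record the name rather than raising an error.
            unrecognized.append(name)
    return found, unrecognized


def has_page_images(entries):
    """Return True if any entry looks like a page-image directory or bundle."""
    found, _ = classify_entries(entries)
    return "page-images" in found
```

Calling `has_page_images` on the entry names of each etext directory would then be enough to build a list of titles with page scans, under the naming assumptions above.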
Most of the page images posted to date are for Gutenberg etext numbers above
20,000, and I haven't yet scanned most of that range. (My scanner also
complains when filenames aren't in the form it expects, which happens
now and then.) But if you'd like to use something like this to look
for etexts with page images included, let me know and I can run or
send you an adaptation of my code. (For your purposes,
I'd probably modify it to only look for page image directories, so as to
make it easier to run it on autopilot without it getting confused by other
At the moment, books with page images included still appear to be in the
minority of new Gutenberg releases, though I have seen a number of
batches of page images brought on some weeks after the initial etext
releases. On the other hand, books with *original source citation* included
now appear to be the norm for new Gutenberg releases (as well as for
reposts of old Gutenberg titles), which is a welcome development.
I hope this helps.