Book People Archive

Google Books PDFs (was: More About Google's Deal with U of Cal)

From: John Mark Ockerbloom <ockerblo@[redacted]>
Subject: Google Books PDFs (was: More About Google's Deal with U of Cal)
Date: Thu, 31 Aug 2006 10:47:04 -0400

Nick Hodson quotes the Google help page on PDF viewers, which I found
after making my post:

> We recommend that you use the free Acrobat Reader version 7 for viewing PDFs
> downloaded from books.google.com. If for some reason you can't install
> Acrobat Reader, we also recommend the following applications:
> 
>   a.. xpdf. If you find that xpdf crashes with JBIG2 errors you should apply
> the patch available here.

I've upgraded my Xpdf from 3.00 to the latest version (3.01, as packaged by
CSW for Solaris), and indeed I can now view all the pages of _The Slipper
Point Mystery_ without difficulty or crashes.

> If you are a Mac user, Preview on OS X 10.4 is not known to have any issues
> with these PDF files, although users of earlier versions of Mac OS may
> experience problems.

That seems to explain my problems with Preview (I'm running OS X 10.3).
Apparently, Preview didn't get full PDF 1.5 compatibility until OS X 10.4,
and the PDF for _The Slipper Point Mystery_, and possibly other books,
uses PDF 1.5 features (despite identifying itself as a PDF 1.4 file).
That also explains my troubles with an earlier Acrobat 5 plugin; you need
Adobe Reader 6 or later to be able to decode parts of the file that
depend on PDF 1.5 features.  (Google recommends Adobe Reader 7; I haven't
checked to see how well the files do in Reader 6.)

Nick continues:

> My impression is that some of these scans are searchable, but not all.
> Anyway, the scans that Google has of a book are searchable.

Have you found any PDFs that are searchable?  _Slipper Point_ is searchable
on Google's own web site, since they have both image and OCR data, but
it looks like they're only putting image data into the PDF they export
(hence the downloaded PDF isn't searchable).
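
One rough way to guess whether a downloaded PDF is image-only is to look
for any font reference in the raw bytes, since a page cannot draw text
without a font. The following is a minimal Python sketch of that heuristic,
not a procedure described in this message. The function names are
hypothetical, and the check assumes the relevant dictionaries are stored
uncompressed. PDF 1.5 allows objects to be packed into compressed object
streams, in which case a font could be missed.

```python
import re

# /Font appears both in font dictionaries (/Type /Font) and in page resource
# dictionaries (/Font << ... >>). The lookahead skips longer names such as
# /FontDescriptor, so every hit is a plain /Font name.
FONT_NAME = re.compile(rb"/Font(?![A-Za-z0-9])")


def mentions_fonts(data: bytes) -> bool:
    """Return True if the raw PDF bytes contain a visible /Font name."""
    return FONT_NAME.search(data) is not None


def looks_image_only(data: bytes) -> bool:
    """Heuristic: no visible font reference suggests no text layer.

    A False result does not prove the file is searchable; it only means
    that some font is referenced somewhere in the uncompressed parts.
    """
    return not mentions_fonts(data)
```

To try it on a real file, read the file in binary mode and pass the bytes
to looks_image_only().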

> The pdfs are indeed compressed to a surprising extent. I have downloaded one
> that I thought ought to be about 12 megabytes, to find it scarcely more than
> three. Even more surprising is the way in which you can zoom in on the text
> without it becoming jagged. There is smooth text at 800 magnification, and
> very slight jaggedness at 1600. This would be a bit like a DjVu file saved
> in the unsearchable mode. It isn't a DjVu, by the way.

In this case, from a quick scan of the PDF tags, it appears to use JBIG2
(supported in PDF 1.4 and above) and JPEG-2000 (supported in PDF 1.5
and above).  I'm guessing it's using JBIG2 for the black and white pages
and JPEG-2000 for color.  I'm not an expert on these encodings, but I'm told
that they have compression features that are similar to DjVu's, so it's not
surprising you're seeing similar zooming characteristics.
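
A quick scan of this kind can be approximated in a few lines of Python.
This is only an illustrative sketch, not necessarily how the files were
inspected for this message. It counts the standard PDF filter names
JBIG2Decode and JPXDecode (the latter is the filter name used for
JPEG-2000 data) in the raw bytes. The function name is hypothetical, and
the count only sees names stored in uncompressed dictionaries.

```python
import re

# Standard PDF stream filter names and the PDF version that introduced them.
FILTER_MIN_VERSION = {
    b"JBIG2Decode": "1.4",
    b"JPXDecode": "1.5",
}


def count_image_filters(data: bytes) -> dict:
    """Count occurrences of each filter name in the raw PDF bytes."""
    counts = {}
    for name in FILTER_MIN_VERSION:
        # Match /Name only when it is not the prefix of a longer name.
        pattern = re.compile(rb"/" + re.escape(name) + rb"(?![A-Za-z0-9])")
        counts[name.decode("ascii")] = len(pattern.findall(data))
    return counts
```

A file with many JBIG2Decode hits and only a few JPXDecode hits would fit
the guess that most pages are bitonal and only a handful are in color.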

The files are not completely standards-conformant, which I can tell simply
by the fact that they identify themselves as PDF 1.4 while using constructs
not defined in 1.4.  (The version identified in the file header can be
legally overridden in the root Catalog object, but they don't seem to have
done that.) The PDF standard also imposes certain constraints on how JBIG2
and JPEG-2000 images can be encoded and embedded in a PDF file; I can't tell
at a glance whether they conform to those or not, as I can for the
version requirements, but if they don't, that might explain problems in
some other PDF viewers.  It seems fairly clear, though, that you'll need
a program that can handle PDF 1.5 to have any hope of making full use
of the files.
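
The version mismatch described above can be checked mechanically. The
Python sketch below is a simplified illustration under stated assumptions,
not a full conformance test. It reads the version from the %PDF- header,
looks for a /Version override with a naive pattern (assuming the Catalog is
stored uncompressed and no other dictionary uses that key), and compares
the later of the two against the minimum version needed by the filters it
can see. All function names are hypothetical.

```python
import re

# Minimum PDF version for each filter name, as (major, minor) tuples.
MIN_VERSION = {b"JBIG2Decode": (1, 4), b"JPXDecode": (1, 5)}


def _as_tuple(match):
    return (int(match.group(1)), int(match.group(2))) if match else None


def header_version(data: bytes):
    """Version from the %PDF-x.y header near the start of the file."""
    return _as_tuple(re.search(rb"%PDF-(\d)\.(\d)", data[:1024]))


def catalog_version(data: bytes):
    """Version from a /Version /x.y entry, if one is visible (naive match)."""
    return _as_tuple(re.search(rb"/Version\s*/(\d)\.(\d)", data))


def declared_version(data: bytes):
    """The later of the header version and any Catalog override."""
    found = [v for v in (header_version(data), catalog_version(data)) if v]
    return max(found) if found else None


def required_version(data: bytes):
    """Highest minimum version among the filters that appear in the file."""
    needed = [
        version
        for name, version in MIN_VERSION.items()
        if re.search(rb"/" + re.escape(name) + rb"(?![A-Za-z0-9])", data)
    ]
    return max(needed) if needed else None


def version_is_consistent(data: bytes) -> bool:
    """True if the declared version covers every filter that was found."""
    declared = declared_version(data)
    if declared is None:
        return False
    required = required_version(data)
    return required is None or declared >= required
```

For a file that declares 1.4 in its header, has no visible /Version
override, and contains a JPXDecode filter, version_is_consistent() returns
False, which matches the situation described above.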

> But here is the good news! ABBYY FineReader 8 will read the pdf (at least it
> read the one I downloaded as a sample), but extremely slowly, about thirty
> seconds per page. Now, why would that be? Still, if you leave it working on
> the pdf for three or four hours, you have the entire book read in. You can
> save each page as a separate tiff, or in whatever format you like, or you
> could save the entire book as a more conventional pdf.

Great!  I'll be interested to know how well the OCR comes out.

John