Book People Archive

PDF, DRM, and "open" formats, part 1



There's been some discussion lately of PDF and DRM, and open versus
proprietary formats.  I thought it might be worth talking a bit about
these, both to clear up possible misunderstandings about what PDF
files are, and to evaluate other online book formats that folks
might be interested in.

In this post, I'll give an overview of the PDF standard, discuss the
extent to which it's open or proprietary, and describe the optional
DRM (digital rights/restrictions management) available with the format.
I hope to follow this up with a post talking more broadly about the
implications of the PDF standard, and other standards like it, for
online books and formats that are intended to be "open" and freely usable.
Before I start, I should note that I'm not a lawyer (so don't rely on me
for legal advice) and am not an Adobe insider or PDF hacker (so I could
be wrong about some of the technical details).  This is just what I've
gleaned from published sources on PDF.

PDF (Portable Document Format) is a format invented by Adobe for its
Acrobat and Adobe Reader software.  It is widely used to represent
digital books and documents in a way that embodies not just a particular
text, but also a particular look and page layout.  That is sometimes exactly
what readers look for in digital editions.  (For example, I recently
listed some PDF editions of books that replicate the look of particularly
significant old print editions.  I already had listed other versions of the
books in HTML for those who just wanted to use the text.)  The alternative
digital book formats either do not have as broad reader software
support, or do not allow as detailed a specification of appearance, or
are not neatly packageable in a single portable file.
(There are lots of reasons *not* to use PDF for many uses of online
books, but there are some uses for which PDF is currently arguably
the best choice.  Your priorities may vary.)

The official specification for PDF is controlled and copyrighted by Adobe.
This type of arrangement is not universal, but it's not unusual either.
Even the official specification for ASCII is controlled and copyrighted
by ANSI, though ASCII as used by projects like Gutenberg is straightforward
enough that it's easily restated as a table of numbers and characters
that I doubt would be subject to ANSI copyright.  Of course, ANSI is a
nonprofit organization, whereas Adobe is a for-profit company.

Adobe publishes, and makes freely available online, a specification
of PDF.  This allows PDF files to be created and read by a variety of
software programs, both those by Adobe, and by other parties, including
groups that produce open source PDF software.   Specifications for the
various versions of PDF that have been released can be downloaded from

    http://partners.adobe.com/public/developer/pdf/index_reference.html

Several versions of the PDF specification have been published.  Each one
has its own version number, and every PDF file also specifies which version
of PDF applies to it.  The specification has grown in length over time, with
the latest version running over 1000 pages.  As far as I'm aware, the various
versions of PDF are largely backward compatible -- that is, newer PDF programs
usually have little trouble reading older versions of PDF.  The reverse
is not always true; some files that use newer versions of PDF are not
displayable in older PDF reader programs.
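
To make the version marker concrete: as the published specification
describes it, a PDF file opens with a comment line such as "%PDF-1.4".
The Python sketch below reads that marker and applies the rule of thumb
above (a reader is likely to cope with files at or below its own version).
The function names are hypothetical, not taken from any PDF library, and
the 1024-byte search window is an assumed tolerance, not a rule from the
specification.  Newer files may also override the header version elsewhere
in the document; this sketch ignores that.

```python
import re

def pdf_header_version(data: bytes):
    """Return the (major, minor) version named in the header, or None."""
    # Assumption: the "%PDF-x.y" marker appears within the first 1024 bytes.
    match = re.search(rb"%PDF-(\d+)\.(\d+)", data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def reader_likely_handles(file_version, reader_version):
    """Rule of thumb only: older or equal file versions usually display."""
    return file_version <= reader_version

sample = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
print(pdf_header_version(sample))
print(reader_likely_handles(pdf_header_version(sample), (1, 6)))
```

Comparing versions as tuples keeps the check simple, but it is only a
guess at compatibility; a file can declare a new version and still use
nothing an older reader would choke on.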

Adobe also claims copyright on the "data structures and operators" used in
PDF files, with the stated reason of protecting the integrity of the format,
though it's unclear to me whether this copyright claim is legally enforceable.
In any case, Adobe gives blanket permission for anyone's software to create
and use PDF files as long as the standard is followed, Adobe's copyright
notices are retained, and the software does not attempt to bypass use
restrictions that might be specified by a given PDF file. (In some
countries, such as the US, laws like the DMCA also prohibit bypassing DRM of
copyrighted files in many cases, independently of Adobe's claims.)

So, it's possible to create PDF files that are based completely on openly
published specifications, and that can be created, displayed, and analyzed
by a wide range of software, including open source and personally customized
software.  Most PDF files I encounter are open like this, and I read them
just fine in programs like Xpdf, an open source PDF viewer for Unix, and
Preview, a third-party viewer Apple created for the Mac.  (That's
what I'll typically use to look at PDFs; I don't actually use Adobe's
software or plugins very often.)

It's also possible to create files with digital restrictions (or DRM).
The simplest form of DRM in PDF is a set of bits in the file that say
"please don't allow printing", "please only print at reduced fidelity",
"please don't allow cut-and-paste", etc.
These are technically easy for software to circumvent (just ignore
or reset the bits) but doing so may be legally prohibited in some
jurisdictions, for reasons detailed above.
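
As an illustration of what those bits look like, here is a minimal Python
sketch that decodes a permissions value into named flags.  It only reads
the flags; it does not change or bypass them.  The bit positions used
(3 for printing, 4 for modifying, 5 for copying, 6 for annotating, 12 for
full-quality printing, counting from 1 at the low-order end) are my
reading of the permissions table in the published specification and
should be checked against the version you care about.  The function and
dictionary names are hypothetical.

```python
# Assumed bit positions (1-based, low-order bit first) from the published
# permissions table; verify against the specification before relying on them.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy": 5,
    "annotate": 6,
    "print_high_quality": 12,
}

def decode_permissions(p: int) -> dict:
    """Map a signed 32-bit permissions integer to named boolean flags."""
    flags = p & 0xFFFFFFFF  # view the signed value as an unsigned bit pattern
    return {
        name: bool(flags & (1 << (bit - 1)))
        for name, bit in PERMISSION_BITS.items()
    }

# A value with only the "print" bit set: printing is allowed, but not at
# full quality, and nothing else in this table is permitted.
print(decode_permissions(4))
```

A set bit means the operation is allowed, so "please only print at
reduced fidelity" corresponds to the print bit being set while the
high-quality print bit is clear.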

Content may also be encrypted, using algorithms published in the PDF
specification.  (There is also one unpublished algorithm mentioned, but
in the current standard that one is deprecated, and current versions of
Adobe Acrobat do not encrypt using it.  Adobe claims that at one point US
Department of Commerce restrictions required them to not publish
one algorithm for export, but that the restriction no longer holds,
and they now prefer different, openly published algorithms.  I don't
know offhand if the "unpublished" algorithm was the same one that
Russian programmer Dmitri Sklyarov cracked.)

If content is encrypted, you need to have a key to decrypt the content
so that you can read it and use it.  This key might be provided by
a user-entered password.  Or, it could be provided by a proprietary
extension in certain PDF reader programs, possibly based in part
on extra binary data embedded in the PDF file.  (I believe, though I'm
not positive, that Adobe Reader uses proprietary extensions to unlock
the "content-protected" PDF titles that some publishers sell.  An alternative
would be to take an ordinary PDF file and "wrap" it with some proprietary 
encryption, which is what's done when, say, OEBPS books are packaged
as proprietary Microsoft Reader files.  But that would appear to be more
complicated than necessary here, given PDF's built-in encryption capability.)
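
For the built-in encryption, the published specification has an encrypted
file point to its encryption settings through an "Encrypt" entry in the
file's trailer.  A rough Python sketch of checking for that entry follows.
It is a naive byte scan under stated assumptions, not a real parser: it
can be fooled if the same bytes happen to appear inside document content,
and it says nothing about proprietary wrappers applied outside the PDF
format.  The function name is hypothetical.

```python
import re

def looks_encrypted(data: bytes) -> bool:
    """Heuristic: does the raw file contain an /Encrypt dictionary key?"""
    # The negative lookahead skips longer names such as /EncryptMetadata.
    return re.search(rb"/Encrypt(?![A-Za-z])", data) is not None

plain = b"trailer\n<< /Size 6 /Root 1 0 R >>\n"
locked = b"trailer\n<< /Size 7 /Root 1 0 R /Encrypt 6 0 R >>\n"
print(looks_encrypted(plain), looks_encrypted(locked))
```

A check like this tells you only that a key will be needed, not where
the key comes from (a password, or a reader's proprietary extension).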

PDF can include both encoded text and images.  It's possible to create
a PDF that consists only of scanned images (and some of the larger PDFs I
list are of this type).  These PDFs will tend to be large, and will
not support cut-and-paste of text, since as far as the viewing program is
concerned, there's no text to copy, just a picture.  Other PDFs
combine text and images side by side or even overlaid (so that
the file encodes both the text of a word and the appearance of that
scanned text on the page).  Some programs also support optical
character recognition (OCR) of images in PDF files, allowing you
to copy the characters the program recognizes in the page image.  How clean
a copy you get, though, may depend on the page image quality and
the capabilities of the OCR program.  (Some programs may then try
to compress the file based on the text that was recognized.  You have
to watch this when preparing PDF files with OCR in some programs,
because some of them might "correct" the appearance of a mis-recognized
character to match the letter the program erroneously inferred.)
PDF files can also include other information, including tables of
contents, annotations, JavaScript code, and descriptive and
structural metadata.

It should be clear by now that PDF as a whole covers many kinds
of files.  Some PDF files are text and images with no DRM that
can be easily displayed, preserved, and processed by a wide variety
of programs, no matter what Adobe or publishers may do in the future.
Others are encoded page images that you can display, and perhaps OCR, but
can't do much else with.  Still others use encryption keys, restriction flags,
and proprietary key-provision methods that make them unusable except
through tightly controlled reader programs.

In short, on a spectrum of "open" to "closed", a given PDF file might
be effectively open, or effectively closed, or somewhere in between.
It depends on what's in the file, and how it's encoded.  (Indeed,
some communities have developed particular PDF profiles intended
for different user expectations.  The PDF/A standard, for example,
covers PDF files that use only a subset of PDF's capabilities, and that
are expected to be easy to preserve and reuse in digital libraries
and archives.)

On The Online Books Page, I list PDF files either of the "text with images"
variety or the "page images only" variety, depending on what's available
and what seems to be most useful to people.  I haven't listed any
encrypted PDF files to my knowledge, and don't intend to unless the
key is freely and openly available (in which case it might as well
not be encrypted).  Some of the files I list might have the
"please do not print" etc. bits turned on; I generally haven't checked.

All else being equal, for any given book I'll prefer PDF files that have
few or no restrictions over PDF files that have more restrictions,
so feel free to inform me if you find any PDF files I list that have
annoying restrictions, when good alternatives might be available.
(If the books are copyrighted, you're probably stuck with what the
publisher provides, but quite possibly not if the books are public
domain, or under a suitably liberal Creative Commons license.)

PDF is not the only online book format that can in practice be
either "open" or "closed" or somewhere in between.  In a followup
post, I hope to describe how some other formats can end up being
"open" or "closed" in practice, including formats that were initially
intended or promoted to be "open".  Stay tuned...

John Mark Ockerbloom