Book People Archive

Part 2 of PDF, DRM, and "open" formats

I'd like to talk some more about what makes a format open and desirable
for use for online books and documents.  In particular, I'm going to be
discussing other digital document formats, including OpenReader, which
Jon Noring just announced a spec for.  (Congratulations, Jon!)

In my "Part 1" post, I talked a lot about PDF, as a useful specific
(and fairly complex) example of an openly specified format that has
notable benefits and drawbacks.  My reason for doing that was not to promote
PDF-- though I believe it is a useful format for certain types of digital
publication-- but to give a factual grounding for evaluating it and
other formats.  Whether you like, dislike, or are indifferent to PDF, I hope
you found that analysis helpful.  (The ensuing discussion here was also quite
useful, and I thank everyone who participated in it.)

It's interesting to note that PDF's main benefits and drawbacks are often
the same features in different contexts.  For example, PDF's detailed page
layout and formatting capabilities give authors and publishers more control
over the appearance of their work than most other formats do-- but they also
make PDF documents more difficult to reformat for *other* views and uses than
documents in other formats.  The wide range of capabilities that PDF supports
also makes it difficult for a user to know in advance what they can do with a
random PDF document.  They might not have tools that can effectively use all
of PDF's capabilities, or the document itself might not support-- or might
even block-- the capabilities the user wants to use, such as copying text.

Similarly, PDF's DRM design still permits etext providers or publishers to
offer documents with few or no restrictions if they choose-- but the
*possibility* of those restrictions can make customers suspicious about what
gotchas might lurk in documents they're considering buying.
(Yes, if you have the right tools, you can tell whether a given PDF document
has DRM restrictions, and to some extent what kind of DRM.  But not everyone
has those tools, and you can't run them on a PDF document you haven't yet
bought.)
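
As a rough illustration of the kind of check such tools perform: an encrypted
PDF normally carries an /Encrypt entry in its trailer (or cross-reference
stream) dictionary.  The Python sketch below is only a heuristic that scans the
raw bytes for that key.  The function names are hypothetical, it does not parse
the file, it can give false positives, and it says nothing about *which*
restrictions apply.

```python
def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Heuristic check: does this PDF appear to declare encryption?

    Assumption: an encrypted PDF names an /Encrypt entry in its trailer
    or cross-reference stream dictionary, which is stored uncompressed.
    This is a byte scan, not a real parse, so it can be fooled.
    """
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("not a PDF file (missing %PDF- header)")
    return b"/Encrypt" in pdf_bytes


def file_looks_encrypted(path: str) -> bool:
    """Convenience wrapper (hypothetical name) that reads a file from disk."""
    with open(path, "rb") as f:
        return looks_encrypted(f.read())
```

A real tool would parse the encryption dictionary to report which permissions
(printing, copying text, and so on) are restricted; this sketch only answers
"is there any encryption declared at all?"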

This example illustrates an important point about "open" formats.  When
we talk about PDF being "open" in the sense that it has an openly published,
freely available specification that can be used with few restrictions,
we're talking about a kind of "openness" that just about everyone would
agree is desirable.  The *complexity* of the PDF spec may make this openness
difficult to take advantage of in practice, but the variety of PDF tools now
available from various sources shows that it can be done, particularly when
dealing with simpler PDF documents such as those conforming to PDF/A.

When we talk about PDF being "open" in the sense that its functionality
is largely open-ended, we're talking about a kind of "openness" that's
*not* necessarily desirable.  Users often want limits to open-endedness
in their formats and programs.  I'd like to know that an email message
that I view or a music disc I put in my CD-ROM drive is not going
to infect my computer with harmful code, as is now possible with some
formats and systems.  I'd like to know that the books I buy are going
to be ones that I can keep and pass along as I wish, and not self-destruct
or become unusable due to DRM or proprietary formats combined with shifting
technology and business practices.  These concerns may lead me to prefer
formats and programs that *exclude* certain features.   For example,
archivists tend to be more comfortable with PDF/A than with PDF in general,
because PDF/A restricts the range of features in PDF documents to mechanisms
that archivists believe will be easier to support and migrate over time.

The PDF example also shows that a document is not truly "open" if its
meaning and use depend on something that's *not* open.  Suppose I have a PDF
file that's encrypted using a key that's under the control of a proprietary
system.  The fact that the PDF specification itself is open
doesn't give me any assurance about how I can use the document, or whether
I can use it at all.  Instead, I am subject to whatever restrictions the
proprietary system imposes before I can take advantage of the open
specification of PDF and use the file.  A corollary of this principle
is that you can't honestly market a digital format to consumers as "open"
if you're happy to let people advertise books as being in that "open"
format despite having proprietary dependencies.

The recent debates over "open" office formats make this principle clear.
Microsoft's "Office Open XML" and OASIS' OpenDocument format (ODF) are
vying to be recognized as next-generation standard office formats.
As I understand it, both formats have openly published specifications,
but ODF specifically prohibits including arbitrary binary sections
that can affect how a document is displayed or interpreted, whereas
Microsoft's specification allows this.  Critics of Microsoft's format
charge that those arbitrary binary sections are in effect Trojan horses
that allow Microsoft to "embrace and extend" a supposedly open standard so
that its Office documents are still only usable through proprietary
mechanisms that they control.  That's why some government agencies have
stated they'll accept ODF, but not Microsoft's Office Open XML, as a standard
for long-lived documents and publications.

In the ebook world, "open" formats for commercial publishing have been
discussed for years, and the first "Open Ebook Publishing Structure"
(OEBPS) specification was released in September 1999.  In the press release
accompanying the first specification, the promoters wrote "The
specification is expected to accelerate the availability of electronic
reading material, because the single universal format will work on all
reading systems that are compliant with the specification.... The OEB
standard means that publishers can format their content once and still
make it available on all devices and software that support the Open
eBook specification.  This is a huge win for consumers..."

So, do you have any books in this single universal format that's now
been around for nearly 7 years?  Me neither.  That's because virtually
no one sold books in this native format.  To make a nice-looking book,
you needed to include supplementary files for pictures, fonts, and the
like.  The OEBPS format didn't specify a way to package these along
with the main text file.  Instead, publishers came up with their own packages,
and they were virtually all proprietary.  So, while the OEBPS format may have
been "open" for publishers, it did nothing to help consumers
use books the way they wanted on the platforms they chose.  Many of the
proprietary books now being sold do in fact include encoded OEBPS
documents wrapped up in some other format, but that makes no difference
to the consumer, who can't use the OEBPS directly.  And we now have what
I would consider a relatively tiny, unattractive consumer market for ebooks.

(I should note that the *academic library* market for electronic material, as
  opposed to the consumer market, is far from stagnant.  But in our market,
  nearly all the online content Penn pays for is provided in standard,
  non-DRM formats like HTML, unencrypted PDF, and standard image formats
  that will display in just about any browser.   There are also some
  companies selling DRM-format books for undergrad audiences, but they're
  a relatively minor part of the market, and our library spends little
  or no money on them.)

Now it's 2006, and many people would like to see consumer publishing get
it right this time.  That's where OpenReader claims to make a difference.
The OpenReader front page talks about the importance of consumers not having
to worry about "the software they use to read downloads of electronic
publications", or about their ebooks depending on the vendor staying in
business, failing to survive technology upgrades, or otherwise being subject
to proprietary shackles.  There's an implicit hope or promise expressed here
that OpenReader will help with these problems, just as there was hope
in 1999 that OEBPS would.

Indeed, the OpenReader folks have just released their first spec.  And,
as one of the people who had been prodding them for a long time to release
specs,  I want to express my congratulations and thanks.  But the Binder spec
that's just been released, like the original OEBPS, does not cover the
complete ebook.  I should add that the OpenReader folks don't claim that it
does-- they and I both acknowledge that it's just the first step.  The more
crucial step, though, and the one that will determine whether OpenReader can
do what's been promised for readers or just become another OEBPS, hasn't
been taken yet.  That step is fixing the rules on what gets to be called
an OpenReader book and what doesn't.

This is particularly important because the main OpenReader promoters on the
Net have indicated that their vision of OpenReader includes DRM.  But they
haven't been clear to date on what *sorts* of DRM will be permitted and
what won't.  I worry that if they choose a standard that's too open-ended
(or allow one by default by *not* choosing a standard), they'll end up
with the same problems that encrypted PDFs have-- namely, that readers can
end up just as hamstrung by proprietary restrictions as they were with
other formats.  (OpenReader promoters, for their part, apparently worry
that if they're too strict with DRM standards, or don't permit DRM
at all, publishers won't use the format to begin with.  On the other
hand, there are lots of people on this list who are perfectly willing to
"publish" works online in suitable formats without any DRM.)  It's also
clear from our discussion to this point that many, though not all, forms
of DRM effectively make a format no longer "open", at least as actually used.

OpenReader is of course not the only possible next-generation digital book
format.  Some people on and off this list are doing interesting things
with formats like ODF, DjVu, ZML, TEI, Wiki formats, or more direct follow-ons
to OEBPS, or they are finding clever new ways of using older formats.
Are we about to see breakthroughs in what we do with
these old and new formats?  Or are we just going to party like it's 1999?
I'm not sure myself, but I'll stop here, and would love to hear what other
folks have to say.