Book People Archive

Re: Why can't I manage [my text collection] like MP3s?



William Adams writes:
> The real problem here is the lack of standardization of metadata --- it's 
> really a shame Apple or Microsoft doesn't build in access in the filesystem to
> the metadata which is in pdfs.

William,

I've looked at the metadata in PDFs, and it's pretty poor -- certainly
not something you want to depend on.  In my field, most PDF papers are
produced from Word or LaTeX, and they usually don't have any
interesting metadata in them because the tools don't make setting that
metadata a priority.

The useful metadata fields a PDF exposes are "Title", "Author",
"PDF version", "CreationDate", "Subject", and "Keywords" (the version
comes from the file header, the rest from the document information
dictionary).  Of these, "PDF version" and "CreationDate" are usually
not interesting to the user.  The others are often blank or some
default such as
"Microsoft Word - foobar" or "untitled document" or "foobar.doc".  The
keywords defined by the author, if any, are often not the keywords the
user feels should be associated with the paper.
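
To make the point concrete, here is a rough sketch of the kind of
filtering a personal library tool would have to do before trusting any
of these fields.  It assumes the metadata has already been read into a
plain dict by some PDF tool; the names `PLACEHOLDER_PATTERNS`,
`is_placeholder`, and `usable_metadata` are made up for illustration,
and the patterns cover only the defaults mentioned above, not every
junk value found in the wild.

```python
import re

# Illustrative patterns for values that authoring tools tend to write
# by default. This list is an assumption, not an exhaustive catalogue.
PLACEHOLDER_PATTERNS = [
    re.compile(r"^\s*$"),                                # blank
    re.compile(r"^microsoft word - ", re.I),             # "Microsoft Word - foobar"
    re.compile(r"^untitled( document)?$", re.I),         # "untitled document"
    re.compile(r"\.(doc|docx|tex|dvi|ps|pdf)$", re.I),   # a bare file name
]

# The fields a reader actually cares about.
USER_FIELDS = ("Title", "Author", "Subject", "Keywords")


def is_placeholder(value):
    """Return True if a metadata value looks blank or tool-generated."""
    if value is None:
        return True
    return any(p.search(value) for p in PLACEHOLDER_PATTERNS)


def usable_metadata(info):
    """Keep only the user-facing fields whose values look meaningful."""
    return {
        field: info[field].strip()
        for field in USER_FIELDS
        if field in info and not is_placeholder(info[field])
    }
```

Run over a typical paper exported from a word processor, a filter like
this usually leaves very little behind, which is the problem.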

And metadata has been standardized for years -- see
http://www.loc.gov/marc/ or http://dublincore.org/.  The problem is
that the standardization efforts are targeted at publishers and
librarians doing generic cataloguing, not at individuals with
individual interests managing their personal libraries.  I think a
good effort to begin with would be to allow people to tag both PDF and
XHTML with the tags in a BibTeX entry for a document.
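
As a sketch of what that could look like on the XHTML side, the
following turns the fields of a BibTeX entry into <meta> elements for
a page's <head>.  The "bibtex." name prefix and the function name
`bibtex_to_xhtml_meta` are my own assumptions; no standard I know of
defines them.  The PDF side would need a PDF library and is not shown.

```python
from html import escape


def bibtex_to_xhtml_meta(entry_type, key, fields, prefix="bibtex"):
    """Render a BibTeX entry as XHTML <meta> elements.

    entry_type -- e.g. "article"
    key        -- the citation key, e.g. "knuth84"
    fields     -- dict of BibTeX field names to string values
    prefix     -- hypothetical namespace for the meta names
    """
    def meta(name, content):
        # Escape quotes and markup so the attribute value stays well-formed.
        return '<meta name="{}.{}" content="{}" />'.format(
            prefix, escape(name, quote=True), escape(content, quote=True)
        )

    lines = [meta("entrytype", entry_type), meta("key", key)]
    # Sort the fields so the output is stable from run to run.
    for name in sorted(fields):
        lines.append(meta(name.lower(), fields[name]))
    return "\n".join(lines)


if __name__ == "__main__":
    print(bibtex_to_xhtml_meta(
        "article",
        "knuth84",
        {"author": "Donald E. Knuth",
         "title": "Literate Programming",
         "year": "1984"},
    ))
```

The appeal of BibTeX fields here is that many people who keep papers
around already maintain them by hand for their own citations.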

Finally, Apple, at least, *does* in fact build in access in the
filesystem to metadata in all kinds of files, including PDFs, via
their Spotlight system.  From www.apple.com:

"When you search via Spotlight, you're actually accessing a
comprehensive, constantly updated index that sees all the metadata
inside supported files -- the "what, when and who" of every piece of
information saved on your Mac -- including the kind of content, the
author, edit history, format, size and many more details. Most
documents, including Microsoft Word documents, Photoshop images and
emails, already contain rich metadata. And because Spotlight indexes
content as well, your search results include what appears inside a
file or document, not just its title."

But most of that metadata is either useless or malformed.  And the
other major problem with Spotlight is that it searches *everything*,
not just a carefully user-selected collection of content.  The
signal-to-noise ratio is still astonishingly low.

Bill