Book People Archive

Re: PDF, DRM, and "open" formats, part 1



There seems to be a little confusion in this thread about what exactly you
would put into a pdf file of a book. I shall describe two kinds, one
definitely archival, and the other definitely not.

I find it very useful to put all the tiff and jpg scans of a book, every
page, including even the publisher's adverts often found at the back of the
book, into a pdf. FineReader does a very good and very quick job of
straightening scans, and there is good software that will crop the pages
nicely, while still leaving the printer's marks at the bottom of every
sixteenth page. Thus it is easy to produce a pdf which is as exact a
representation of the book as you will need. During the subsequent OCR and
editing process you can refer to the pdf: you do not need to use the
original book. Currently I have no fewer than ten books on my laptop waiting
to be edited and transcribed into xhtml. I can do this anywhere I happen to
be; for instance, I'll be in Budapest in Central Europe this weekend.
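
For anyone assembling such a pdf by script rather than by hand, the one step
that is easy to get wrong is page order, since a plain alphabetical sort puts
page10 before page2. What follows is only a sketch of that step in Python,
not a description of my own tools: the file names and the function names
natural_key and ordered_scans are made up for illustration, and it assumes
the scans are named with their page numbers.

```python
import re
from pathlib import PurePath

# Assumed scan formats: the tiff and jpg files mentioned above.
SCAN_SUFFIXES = {".tif", ".tiff", ".jpg", ".jpeg"}


def natural_key(name):
    """Split a file name into text and number chunks for sorting.

    re.split with a capturing group puts the digit runs at the odd
    positions, so those are turned into integers and compared as numbers.
    """
    parts = re.split(r"(\d+)", name.lower())
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


def ordered_scans(names):
    """Return only the scan files from `names`, in page order."""
    scans = [n for n in names if PurePath(n).suffix.lower() in SCAN_SUFFIXES]
    return sorted(scans, key=natural_key)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for name in ordered_scans(["page10.tif", "page2.jpg", "notes.txt", "page1.TIFF"]):
        print(name)
```

The ordered list can then be handed to whatever program builds the pdf, so
that every page, adverts included, lands where it sits in the book.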

I believe that the above kind of pdf is a true archive, and is valuable as
such.

On the other hand some people like to make their final transcription of a
book into a pdf. There are cases where the text of a book is so mixed up
with footnotes, tables, and other illustrations, that the pdf is what you
might have to resort to, though I would always try to make an xhtml file if
at all possible. I can agree here with Bowerbird that this way of using the
pdf format is not archival. You wouldn't go to it in a case of doubt to see
what the original book really said.

The first-mentioned type of pdf file can be processed by DjVu software to
produce a DjVu file. Now this also cannot be regarded as archival, because
the words are transcribed into what they possibly look like, rather than
into what they are.

Best regards to all, Nick Hodson