Book People Archive

Re: Why proofed and formatted digital text?

From: Jon Noring <jon@[redacted]>
Subject: Re: Why proofed and formatted digital text?
Date: Tue, 14 Feb 2006 14:46:22 MST

Tony posted:

> And if you post the document to Wikisource, you can hyperlink things like
> references in the text and notes at the end, using the Mediawiki software.
> You can also add links to Wikipedia articles, enabling the explanation
> of obscure terms, for example.

O.k., to put my Devil's Advocate hat on, let's look at the
competition:

   page scans + raw OCR text + bounding box information

Bounding box information is essentially the "coordinates" of the box
containing a word on a page scan. The work Brewster Kahle is doing at
the Internet Archive places all the raw OCR text, along with bounding
box information for each word, into an XML document.
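
To make that concrete, here is a minimal sketch in Python of what such
a document might look like and how it could be read. The element and
attribute names (page, word, x, y, w, h) are made up for illustration
and are not the Internet Archive's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical OCR output: one <word> element per recognized word, with
# the pixel coordinates of its box on the page scan. Not a real schema.
SAMPLE = """
<page number="12" image="page0012.png">
  <word x="102" y="340" w="58" h="22">proofed</word>
  <word x="168" y="340" w="91" h="22">formatted</word>
  <word x="267" y="340" w="60" h="22">digital</word>
  <word x="335" y="340" w="38" h="22">text</word>
</page>
"""

def read_words(xml_text):
    """Return a list of (word, (x, y, w, h)) tuples in document order."""
    page = ET.fromstring(xml_text.strip())
    words = []
    for el in page.iter("word"):
        box = tuple(int(el.get(k)) for k in ("x", "y", "w", "h"))
        words.append((el.text, box))
    return words

words = read_words(SAMPLE)
# The raw OCR text is just the words joined back together.
raw_text = " ".join(w for w, _ in words)
```

The raw text and the boxes come out of the same pass, so every word in
the text keeps a handle back to its position on the scan.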

Thus, it is possible to build XML-based links (e.g., XPointer) that
associate annotations, references, etc. with the exact place in a
page scan image! That spot can even be highlighted, since one knows
the coordinates of the word on the page scan.
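
As a rough sketch of how an annotation could be resolved to a spot on
the scan, the Python below looks up a word by page number and id and
returns a rectangle to highlight. It uses the small XPath subset in
Python's ElementTree as a stand-in for XPointer; the XML layout, the
id attributes, and the highlight_rect helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical scan metadata; each word carries an id an annotation can target.
SCAN_XML = """
<book>
  <page number="12" image="page0012.png">
    <word id="w1" x="102" y="340" w="58" h="22">obscure</word>
    <word id="w2" x="168" y="340" w="44" h="22">term</word>
  </page>
</book>
"""

def highlight_rect(xml_text, page_number, word_id, pad=2):
    """Resolve a target to (image, left, top, right, bottom), or None."""
    root = ET.fromstring(xml_text.strip())
    page = root.find(f"./page[@number='{page_number}']")
    if page is None:
        return None
    word = page.find(f"./word[@id='{word_id}']")
    if word is None:
        return None
    x, y, w, h = (int(word.get(k)) for k in ("x", "y", "w", "h"))
    # Pad the box slightly so the highlight does not clip the glyphs.
    return (page.get("image"), x - pad, y - pad, x + w + pad, y + h + pad)

# An annotation stores only a pointer into the scan data plus its note.
annotation = {"page": 12, "word": "w2", "note": "See the glossary entry."}
rect = highlight_rect(SCAN_XML, annotation["page"], annotation["word"])
```

A viewer could then draw that rectangle over the named page image and
show the note beside it.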

Now, mind you, I strongly believe PFDT (proofed formatted digital
text) is worthwhile to produce for many types of books and documents,
so my comment above is simply to lay out the opposing argument so that
we can strengthen our own.

So what says everyone?

Jon Noring