Re: Why proofed and formatted digital text?
- From: Jon Noring <jon@[redacted]>
- Subject: Re: Why proofed and formatted digital text?
- Date: Wed, 15 Feb 2006 20:44:17 MST
Bill Janssen wrote:
> Jon Noring asks:
>> "What are the advantages of proofed and formatted digital texts
>> (PFDT) over page scans and associated unproofed, raw OCR text?"
>>
>> And restating the question again:
>>
>> "Are the PFDTs produced by Project Gutenberg and Distributed
>> Proofreaders wasted effort? Should these projects instead focus on
>> acquiring page scans and implementing online access tools for the
>> scans (such as using the "reflowable" presentation method which
>> Bill Janssen worked on at Xerox PARC)?"
>>
>> Mind you, I fall squarely in the "producing high-quality PFDT is a
>> very good and worthwhile thing to do" camp. So I'm sort of playing
>> Devil's Advocate here in my line of questioning.
> I think I reject the dilemma Jon is posing here. We think of the
> work we're doing with UpLib (http://www.parc.com/janssen/pubs/TR-03-16.pdf)
> as indeed being all about "proofed and formatted digital texts"
> (PFDT). The paper "Document Icons and Page Thumbnails: Issues in
> Construction of Document Thumbnails for Page-Image Digital Libraries"
> (http://www.parc.com/janssen/pubs/TR-04-11.pdf) covers some of the
> formatting we do. We take a variety of approaches to proofing OCR
> results, both automatic (using both linguistic and statistical
> approaches), and manual with GUI tools. I've got a paper in
> preparation about some of these. Folks on this list might also be
> interested in ScanScribe
> (http://www2.parc.com/istl/groups/pda/scanscribe/), an image editing
> system designed for document pages.
Well, I've already defined PFDT as meaning encoded text which is both
highly proofed and formatted (structured), and gave examples to
explain what I meant by that.
Definitely a lot of proofing of raw OCR can be done by machine.
Bowerbird has addressed this. Groups like PG and DP have built tools
to assist with searching for OCR errors. (And Bill is talking about
tools as well.)
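To make this concrete, here is a minimal sketch (in Python, purely
illustrative -- it is not how the PG/DP tools actually work) of the
kind of machine pass I mean: flag character patterns that are rarely
legitimate English but are common OCR mis-reads ("scannos"), and
queue the hits for a human to review.

import re

# Purely illustrative scanno patterns; real proofing tools use much
# larger pattern lists plus dictionaries and statistics.
SCANNO_PATTERNS = [
    r"\btlie\b",                # "tlie" is a classic mis-read of "the"
    r"[0-9][a-z]|[a-z][0-9]",   # digit fused into a word, e.g. "th1s"
    r"\b[a-z]+[A-Z][a-z]*\b",   # stray capital in a word, e.g. "taIe"
]

def flag_suspicious(line):
    """Return the substrings on this line that deserve a human look."""
    hits = []
    for pat in SCANNO_PATTERNS:
        hits.extend(m.group() for m in re.finditer(pat, line))
    return hits

sample = [
    "tlie quick brown fox jumps over tlie lazy dog,",
    "and th1s is how the taIe ends.",
]
for lineno, line in enumerate(sample, 1):
    for hit in flag_suspicious(line):
        print(f"line {lineno}: check {hit!r}")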
Yet even here there are still a significant number of errors that
require human proofing to find, at least to get an encoded text
(e.g., Unicode) with an error rate low enough for direct reading by
even the most discerning audience -- like my wife, who can't stand to
see a single error in a book! (Bill does note the "manual with GUI
tools," acknowledging the need for people to do some proofing at
times.)
Then there is the issue of structuring texts so they are suitable for
full repurposing and rendering, which requires determining fairly
fine-grained document structure. I've yet to see automatic tools that
can infer the document structure of an arbitrary text with high
accuracy and precision; if they existed, the content conversion
houses (which turn PDF into highly-structured XML for major
publishers) would be out of business. "Here's some verse, there's a
blockquote, here's a level three header, etc." I've seen the raw
dumps from OCR, and I am pessimistic that automatic tools can do this
reliably enough for high-quality repurposing of the transcribed text.
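To give a flavor of what "inferring document structure" would have to
do, here's a deliberately naive sketch (again Python, again just
illustrative -- nobody's actual tool) that guesses structure from
surface cues. Real books defeat cues like these on nearly every page,
which is exactly my worry.

def guess_structure(line):
    # Crude surface cues standing in for real structure inference.
    stripped = line.strip()
    if not stripped:
        return "blank"
    if stripped.isupper() and len(stripped.split()) <= 6:
        return "header?"          # short all-caps line: a heading? or not
    if line.startswith("    ") and len(stripped) < 45:
        return "verse-or-quote?"  # indented short line: verse? blockquote?
    return "paragraph?"

sample = [
    "CHAPTER III",
    "    Shall I compare thee to a summer's day?",
    "It was the best of times, it was the worst of times.",
]
for line in sample:
    print(f"{guess_structure(line):16}| {line.strip()}")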
> One of my beliefs is that for many of the great, memorable books of
> our experience there's more than just text. Authors, publishers,
> illustrators, typographers, binders -- all contribute to the
> experience of a great paper book. Look at the books the British
> Library has put online. Text is the least interesting part of most of
> them. Transcriptions work for many books, but are a real loss for
> many others. A primary goal of the UbiText system Jon
> refers to above was the desire to preserve the typography and
> illustrations of Jepson's FLORA OF CALIFORNIA
> (http://ucjeps.berkeley.edu/jepson-project3.html), but still allow a
> digital PDA version of the book.
Yeah? Well tell this to a blind person. <smile/>
Anyway, I agree with Bill up to a point, but for the vast majority of
texts the content is the most important part. Once it is in proofed,
digital text form (a la PG/DP), it can be reformatted in an endless
variety of ways, including a reproduction of the original book's
typography, if that is desired. Just structure the text in TEI or
DocBook, then apply transformations, style sheets, etc.
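For instance, a toy TEI-flavored fragment plus a small XSLT
stylesheet (run here through Python's lxml; the element names are
illustrative, not a complete TEI setup) is enough to show the
"structure once, render many ways" idea:

from lxml import etree

# A tiny TEI-flavored fragment: a heading, a line of verse, a paragraph.
tei = etree.XML("""
<div type="chapter">
  <head>Chapter III</head>
  <lg><l>Shall I compare thee to a summer's day?</l></lg>
  <p>It was the best of times, it was the worst of times.</p>
</div>""")

# One of many possible renderings: a stylesheet mapping the structure
# to simple HTML. Swap the stylesheet to get a different presentation.
to_html = etree.XSLT(etree.XML("""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="div"><html><body><xsl:apply-templates/></body></html></xsl:template>
  <xsl:template match="head"><h3><xsl:apply-templates/></h3></xsl:template>
  <xsl:template match="l"><div class="verse"><xsl:apply-templates/></div></xsl:template>
  <xsl:template match="p"><p><xsl:apply-templates/></p></xsl:template>
</xsl:stylesheet>"""))

print(str(to_html(tei)))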
These days we can (and I believe should) have both: keep the original
page scans, preferably in archival quality, to preserve the
typography (and other purposes), but also have a very accurate,
properly structured and repurposeable transcription (PFDT) to go along
with the original page scans.
> UpLib isn't even the most extreme project we have in book simulation;
> check out 3Book
> (http://www2.parc.com/istl/projects/uir/pubs/items/UIR-2004-02-Card-ScalableBook.pdf),
> a 3D visualization of a codex book, which even allows us to capture
> elements of an original book's binding (3Book can read book data from
> UpLib servers, in fact).
Cool!
Hopefully you guys are talking with the Open Content Alliance (OCA):
http://www.opencontentalliance.org/
I notice Xerox is a Contributing Organization. Does this include you
guys at PARC?
The Working Groups in OCA are particularly interesting:
http://www.opencontentalliance.org/nextsteps.html
Jon Noring