Re: Why proofed and formatted digital text?
- From: Bowerbird@[redacted]
- Date: Tue, 21 Feb 2006 16:57:46 EST
john ockerbloom said:
> The exchange above is starting to look like
> an argument over degrees.
i don't think so.
my end-state goal is to work diligently toward _perfect_text_.
and my e-book will be formatted _very_much_ like a p-book.
(it doesn't have to look _exactly_ like a p-book, because a lot of
conventions developed there were intended to _save_paper_ and
we don't have that constraint when we're working with e-books;
most specifically, horizontal space is cheap, so we should use it.
but without reservation i aim at the highest quality of typography,
something that is almost entirely indistinguishable from p-books.)
> Right now, there are some projects that have online books
> whose text has not been proofread by humans at all before posting.
> Will that become the norm? Would it be a good thing if it did?
it all depends. how accurate is that text? i have always been _very_clear_
that my rate of "acceptable error" for posting a text is 1-error-in-10-pages,
because a rate that low can be turned over to "continuous proofreading",
which is my system allowing the general public to help in error-correction.
i think it's altogether proper to involve the public in this fashion, since
they are the beneficiaries of the posting. the public will take the text to
perfection.
notice that i believe that a "reasonable" amount of time spent proofing and
formatting -- which i define as less time than it took to scan the book and
process the images (e.g., deskew and crop) and do the actual o.c.r. itself --
can drive the error-rate to something far better than 1-error-in-10-pages
for the majority of books, something far closer to 1-error-in-_50_-pages,
or less, but i think 1-error-in-10-pages covers the vast majority of books...
now, if you're talking about some of the projects i have seen, where they
have 10-20 errors on a page, then it's obvious i think that's _totally_
unacceptable.
those project managers should be absolutely embarrassed by that kind of text.
as another point of reference, the project manager of google book search said
recently that they are now obtaining an average error-rate of 1-error-per-page.
i believe their text-savvy researchers will clean that up considerably in the
future, but it probably serves the indexing purpose for which google is mainly
aiming...
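to make the arithmetic behind these rates concrete, here is a minimal sketch in python. the function names and the 300-page book are hypothetical, picked only for illustration; the only figures taken from the discussion above are the rates themselves (1-error-in-10-pages as the posting threshold, 1-error-in-50-pages, 1-error-per-page, and 10 errors on a page).

```python
from fractions import Fraction

# posting threshold named above: 1 error in 10 pages
POSTING_THRESHOLD = Fraction(1, 10)

def errors_per_page(errors, pages):
    """return the error-rate as an exact fraction of errors per page."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return Fraction(errors, pages)

def ok_to_post(errors, pages, threshold=POSTING_THRESHOLD):
    """true if the error-rate is at or below the posting threshold."""
    return errors_per_page(errors, pages) <= threshold

if __name__ == "__main__":
    pages = 300  # hypothetical book length, for illustration only
    for label, errors in [
        ("1-error-in-50-pages", 6),
        ("1-error-in-10-pages", 30),
        ("1-error-per-page", 300),
        ("10-errors-per-page", 3000),
    ]:
        print(label, errors_per_page(errors, pages), ok_to_post(errors, pages))
```

on that hypothetical 300-page book, the threshold allows at most 30 errors in the whole text, while 1-error-per-page means 300 of them, which is ten times over the threshold.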
> If the answer to either question is "no", how much human involvement
> would be appropriate or necessary for preparing an online edition?
well, i believe you've got the equation reversed. we should set a standard,
and allocate as much "involvement" and "time" as we might need to reach it,
rather than specify a certain amount of "involvement" no matter what...
i might add that a good deal of the time spent on "proofing and formatting"
can be eliminated by spending more time experimenting with earlier steps:
1. scanning at higher resolutions _might_well_ increase the o.c.r. accuracy.
2. wise workflows (e.g., on things like naming conventions) _will_ save time.
3. doing sharp quality-control on the scanning _will_ save gobs of time later.
4. cleaning up scans (deskew and crop) _will_ increase the o.c.r. accuracy.
5. experimenting with various o.c.r. options _will_ increase the o.c.r. accuracy.
6. "double-keying" with separate o.c.r. programs, and comparing their output
might prove to be a _tremendous_ time-saver when it comes to the proofing.
7. options correctly chosen in the o.c.r. program _will_ save time formatting.
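the comparison in step 6 can be sketched roughly as follows. this is only an illustration under stated assumptions, not a description of any particular tool: it assumes the two o.c.r. programs have each produced plain text for the same page, it compares them word by word using python's standard difflib module, and the function name and sample strings are hypothetical.

```python
import difflib

def disagreements(text_a, text_b):
    """list the spots where two o.c.r. outputs of the same page differ.

    returns tuples of (word_index_in_a, words_from_a, words_from_b).
    """
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False so that frequent words are never silently ignored
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            flagged.append((i1, words_a[i1:i2], words_b[j1:j2]))
    return flagged

if __name__ == "__main__":
    # hypothetical outputs from two different o.c.r. programs
    program_one = "the quick brown fox jumps over the lazy dog"
    program_two = "the quick brovvn fox jumps over the 1azy dog"
    for index, from_a, from_b in disagreements(program_one, program_two):
        print(index, from_a, from_b)
```

a proofer then only has to look at the flagged spots instead of reading every word. the obvious limit is that a spot where both programs make the _same_ mistake will not be flagged, so this narrows the proofing rather than replacing it.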
the digitization process is only as strong as its weakest link. and one weak
link makes links further down in the chain even weaker than they would be
otherwise.
_far_ too many people think that "proofing" happens after the o.c.r. plops out,
and pay insufficient attention to the earlier steps in the process, and then
they complain because "the proofing is so hard". if you do everything right,
it's easy.
other people want to factor in the time and energy of applying heavy markup,
because they have convinced themselves that it is a necessary component, but
they're wrong. a dirt-simple plain-text "markup" is _all_ that is required
for an _extremely_powerful_ e-book experience for many books, perhaps even
_most_, depending on what types of books you want. examples coming soon will
prove it...
-bowerbird