Book People Archive

Re: Why proofed and formatted digital text?




Jon Noring quoted Daniel P. B. Smith:
>>But I'm wondering what exactly you get with proofed conversion to a
>>text format, that you wouldn't get with a good page image -- assuming
>>you could use the Xerox reflowable-word-image trick if you liked, and
>>assuming you could perform searches via a linked, _unproofed_ text
>>conversion.

We've gone over the relative advantages of book page images vs.
transcriptions here a number of times.   I think Michael Hart has
posted a canonical "etexts vs. pictures of books" piece at least
once before, arguing for the benefits of transcriptions, but I can't
find the URL right now.

For me, general advantages of transcriptions (either proofed
or decent quality unproofed OCR) include:

    -- Searchable and indexable by a wider variety of search tools
        (text "embedded" in a system or image format might only
         be searchable by system-specific tools, and not by more
         general tools or other search engines)
    -- For the same reasons, easily cuttable and pastable
    -- Easy to adapt and reuse in all kinds of text-aware ways
          (which covers a *wide* range, from simple things like
           changing fonts to making audio versions to producing
           new works or analysis based on the texts)
    -- Usually more transparent formats (you can easily see what they
        contain without needing specialized tools, and can modify
        and migrate as needed).  Such formats are also often more durable.
    -- Smaller size.  This is still important. Although I've
        argued that serious libraries should be able to handle
        the storage and bandwidth requirements of page images, there
        are lots of distribution and analysis scenarios where fewer
        bytes per book is a big win.
    -- On the production side, still easier to produce by a wider
        range of producers than page image ebooks.  This is important when
        considering that a lot of online texts get produced by small-scale
        operations or individuals that can't easily produce or distribute
        high-resolution page-scan ebooks.

*Proofread* transcriptions have additional advantages, including:

     -- Higher-quality search results.  Though I've heard that OCR
        is quite good these days, in practice the text indexes I've
        seen from projects like Google, MOA, and Internet Archive are
        fairly dirty, and can miss a fair bit.
     -- Higher-quality text extraction for cut and paste, reformatting,
         and the various other text reuses alluded to above.  Depending
         on the improvement from proofreading, this may make the difference
         between usability and unusability in practice for these reuses.
     -- The human proofing process is also a time when formatting, anchors,
         and other markup can be added to a document, if desired.  (These
         steps can be automated as well, but the automated tools I've seen
         in widespread use often aren't very good at it.)  These
         enhancements can further increase an etext's usability.
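
To make the search-quality point concrete, here is a minimal sketch in
Python.  The sample strings and the count_hits helper are hypothetical,
and the OCR errors are invented for illustration rather than taken from
any of the projects named above.  It only shows how an exact-match index
can miss a word that unproofed OCR has garbled.

```python
# Hypothetical sample text: the same sentence as proofed text and as raw OCR.
# The OCR errors ("bcst", "tirnes") are invented for illustration.
proofed = "It was the best of times, it was the worst of times"
raw_ocr = "It was the bcst of tirnes, it was the worst of times"


def count_hits(text, word):
    """Count exact, case-insensitive whole-word matches, as a simple index might."""
    tokens = [t.strip(".,;:!?\"'").lower() for t in text.split()]
    return tokens.count(word.lower())


proofed_hits = count_hits(proofed, "times")
ocr_hits = count_hits(raw_ocr, "times")
print("proofed:", proofed_hits, "raw OCR:", ocr_hits)
```

A search for "times" finds both occurrences in the proofed text but only
one in the raw OCR, because "tirnes" no longer matches.  Fuzzy matching
can recover some of these, at the cost of more false hits.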

I don't think there's any absolute answer to *when* it's worth producing
a transcription, proofed or not.  If I want to produce a transcription,
and I go ahead and do it on my own time, then it's worthwhile to me.  If
it's worthwhile to others, then so much the better, but if it isn't, they
can just ignore it and it's still a net gain to the world, because of its
benefit from my perspective and its neutrality at worst from the rest of
the world's perspective.  (In practice, my volunteer work *is* produced with
the hope of benefiting others, but even if it wasn't, the point still holds.)

These different perspectives also mean that different projects can fill
different niches.  What may be not worth the effort for a mass-production
commercial project might be worthwhile for a volunteer or specialized
project.  What might not be a particularly useful ebook for an academic
might be useful to a lay reader, or vice versa.  The Imperfect Ebook
in the hand may be worth more to me than The Perfect Ebook in the bush.
Or not, depending on my priorities.

For these reasons, I don't find questions like "Are Project X's efforts
a waste of time?" to be useful ones, especially when you're talking
about other people's projects, and volunteer projects.  The people
involved participated because they found their efforts worthwhile.
And people that then took the time to read them or otherwise put them
to their own purposes evidently found them worthwhile too, at least
to some extent.

Now, it *is* a useful question to ask "Would doing things This Way
be a better use of time than what Project X is doing now?"  If you've
got a half-reasonable process, some folks may conclude that for them it is.
Others may conclude that for them it isn't.  Maybe even Project X itself will
decide to do things This Way.  But if not, and you've got folks that
really believe in This Way enough to commit their time or resources,
then those are the folks that can start Project Y, if you want to
organize such an effort.

And maybe, once Y gets going, X and Y can do even more useful
things working together.

Happy Valentine's Day!

John