Re: Google Books
- From: Lars Aronsson <lars@[redacted]>
- Subject: Re: Google Books
- Date: Fri, 17 Aug 2007 11:13:08 +0200 (CEST)
Klaus Graf wrote:
> So what? Runeberg has much more better quality than PG with its
> often lousy unsourced editions.
Klaus, you're so wrong. All of Project Gutenberg, Project
Runeberg, the Internet Archive, Wikisource and Google Book Search
contain both good and bad. All projects are in development,
trying new methods, trying to improve. There's no final winner or
loser among them.
When Project Runeberg (founded by me in 1992) made the transition
from pure e-text to scanning facsimile images in 1998, I knew that
this was a major improvement, but I didn't throw away all e-texts
that we had produced in our first six years, because I knew that
in a few years we would make other improvements, and we just can't
keep throwing away everything old when we do something new.
Two and a half years have passed since Google Book Search was
announced, so the project has completed the first quarter of its
ten-year plan. The kind of article that Paul Duguid wrote in
First Monday is almost exactly what I received from a Swedish
literature scholar in 1996-1997 when Project Runeberg was only a
few years into its existence. In the eyes of a scholarly text
editor, I had made every mistake possible. I realized that there
were problems, but the proposed solution of more rigid textual
criticism was never going to work for a broad volunteer project
with zero budget. Since I'm a programmer and not a literature
scholar, I aimed for a solution that took advantage of
improvements in technology (broadband was about to replace modems,
web browsers were improving, disk storage becoming cheaper, etc.)
rather than requiring more specialized training of humans.
It can be useful to read such criticism. Of all the errors you
know to have committed, the critic will only observe or emphasize
a select few. The critic wants to be you, but he isn't. Some of
the things Paul Duguid points out should have been fixed by
Google long ago, such as proper metadata for multivolume works.
Other things simply miss the target.
It's interesting to observe that Paul Duguid found
the 1904 edition of Tristram Shandy to belong in Stanford's
Auxiliary library. In a library catalog, we seldom think of shelf
placement as an indicator of literary quality. If I were in
charge, either at Google or Stanford, the Auxiliary library is
where I'd begin too. If any mistakes are going to be made, they
will be made in the early phases of the project. Let's make the
early mistakes in the Auxiliary library, so that the process works
to perfection when it is time to digitize the valuable works.
The fact that Google has marked up the list of illustrations as
the table of contents is of course such a mistake. But fixing
that error doesn't mean the book has to be scanned anew. The same
goes for OCR errors. If you produce good scans today, you can run
improved OCR tomorrow. The fact that some scans are bad is a more
serious error. That book needs to be scanned again. Luckily, it
was one from the Auxiliary library, which they can afford to pass
through the scanner two or three times. I'm sure this book is
already on Google's list of books that need to be rescanned. But
they are probably in no hurry. If they wait a few years, they can
afford color cameras with higher resolution, rather than the
current black and white imaging.
Some observations that Paul Duguid fails to make are:
* OCR quality is sometimes really bad in non-English works. This
is the current state of OCR. But with good scans, OCR can be
redone later, when algorithms and dictionaries have improved.
* We have seen some experiments in Google Book Search, where place
names mentioned in the book are placed on a map. That kind of
experiment is where I'd expect Google to focus. You need a
certain amount of scanned books before you can start, but these
books don't need to be perfect editions. Perhaps that's why
we're seeing "bad books", because they aren't so bad for
experimentation in new search algorithms.
* Google's images are black and white, even though leading
projects now use color cameras. Perhaps Google too is using
color cameras, but for the moment converts images to black and
white for display?
* Google's books are mostly of the boring kind. Perhaps they're
saving the encyclopedias and richly illustrated works for later?
* Google's books are mostly small format. Perhaps they're saving
large format books for later?
An important factor is that digital cameras are now quickly
improving, and prices are falling. Today a 17 megapixel camera
costs ten times more than a 10 megapixel camera. To make really
good reproductions of large format books, you'd rather wait a few
years before you even buy the camera.
--
Lars Aronsson (lars@[redacted])
Project Runeberg - free Nordic literature - http://runeberg.org/