Book People Archive

Re: Google Books



Klaus Graf wrote:

> So what? Runeberg has much more better quality than PG with its 
> often lousy unsourced editions.

Klaus, you're so wrong.  All of Project Gutenberg, Project 
Runeberg, the Internet Archive, Wikisource and Google Book Search 
contain both good and bad.  All projects are in development, 
trying new methods, trying to improve.  There's no final winner or 
loser among them.

When Project Runeberg (founded by me in 1992) made the transition 
from pure e-text to scanning facsimile images in 1998, I knew that 
this was a major improvement, but I didn't throw away all e-texts 
that we had produced in our first six years, because I knew that 
in a few years we would make other improvements, and we just can't 
keep throwing away everything old when we do something new.

Two and a half years have passed since Google Book Search was 
announced, so the project has completed the first quarter of its 
ten year plan.  The kind of article that Paul Duguid wrote in 
First Monday is almost exactly what I received from a Swedish 
literature scholar in 1996-1997 when Project Runeberg was only a 
few years into its existence.  In the eyes of a scholarly text 
editor, I had made every mistake possible.  I realized that there 
were problems, but the proposed solution of more rigid textual 
criticism was never going to work for a broad volunteer project 
with zero budget. Since I'm a programmer and not a literature 
scholar, I aimed for a solution that took advantage of 
improvements in technology (broadband was about to replace modems, 
web browsers were improving, disk storage becoming cheaper, etc.) 
rather than requiring more specialized training of humans.

It can be useful to read such criticism.  Of all the errors you 
know to have committed, the critic will only observe or emphasize 
a select few.  The critic wants to be you, but he isn't.  Some of 
the things Paul Duguid points out should have been fixed by 
Google long ago, such as the missing metadata for multi-volume 
works.  Other things simply miss the target.

It's an interesting observation that Paul Duguid found the 1904 
edition of Tristram Shandy to belong in Stanford's Auxiliary 
library.  In a library catalog, we seldom think of shelf 
placement as an indicator of literary quality.  If I were in 
charge, either at Google or Stanford, the Auxiliary library is 
where I'd begin too.  If any mistakes are going to be made, they 
will be made in the early phases of the project.  Let's make the 
early mistakes in the Auxiliary library, so that the process works 
to perfection when it is time to digitize the valuable works.

The fact that Google has marked up the list of illustrations as 
the table of contents is of course such a mistake.  But fixing 
that error doesn't mean the book has to be scanned anew.  The same 
goes for OCR errors.  If you produce good scans today, you can run 
improved OCR tomorrow.  The fact that some scans are bad is a more 
serious error.  That book needs to be scanned again.  Luckily, it 
was one from the Auxiliary library, which they can afford to pass 
through the scanner two or three times.  I'm sure this book is 
already on Google's list of books that need to be rescanned.  But 
they are probably in no hurry.  If they wait a few years, they can 
afford color cameras with higher resolution, rather than the 
current black and white imaging.

Some observations that Paul Duguid fails to make are:

* OCR quality is sometimes really bad in non-English works.  This 
  is the current state of OCR.  But with good scans, OCR can be 
  redone later, when algorithms and dictionaries have improved.

* We have seen some experiments in Google Book Search, where place 
  names mentioned in the book are placed on a map.  That kind of 
  experiment is where I'd expect Google to focus.  You need a 
  certain number of scanned books before you can start, but these 
  books don't need to be perfect editions.  Perhaps that's why 
  we're seeing "bad books", because they aren't so bad for 
  experimentation in new search algorithms.

* Google's images are black and white, even though leading 
  projects now use color cameras. Perhaps Google too is using 
  color cameras, but for the moment converts images to black and 
  white for display?

* Google's books are mostly of the boring kind.  Perhaps they're 
  saving the encyclopedias and richly illustrated works for later?

* Google's books are mostly small format.  Perhaps they're saving 
  large format books for later?

An important factor is that digital cameras are now quickly 
improving, and prices are falling.  Today a 17 megapixel camera 
costs ten times more than a 10 megapixel camera.  To make really 
good reproductions of large format books, you'd rather wait a few 
years before you even buy the camera.


-- 
  Lars Aronsson (lars@[redacted])
  Project Runeberg - free Nordic literature - http://runeberg.org/