Re: another year here on bookpeople
- From: Perry Willett <pwillett@[redacted]>
- Subject: Re: another year here on bookpeople
- Date: Thu, 28 Dec 2006 11:33:42 -0500 (EST)
Bookpeople readers have been subjected to Bowerbird's name-calling, faulty
logic and self-importance for some time, but he has sunk beneath his usual
low standards in his latest attack on me and the University of Michigan. I
hope he resolves in the New Year to stick to the issues and drop the ad
hominem drivel. I'm grateful to John for his efforts in moderating this
list.
As expected, Bowerbird leaps to unfounded conclusions. For instance, I
haven't "scurried" anywhere--I have chosen not to respond to his earlier
posts, not seeing anything needing my response. And how courageous of
him to attack me on a list when he thinks I'm not reading it.
He has also "proved" through sheer inference that Michigan is doing
something to the text files for the Google project, but he's sadly
mistaken. I'll state again that we get text files directly from Google
and put them online without any further processing. We have a flowchart
of our entire process at
<http://www.lib.umich.edu/mdp/MDP_Workflow_Chart_final.png>
Bowerbird goes on to disparage the rest of the digital library at the
University of Michigan, primarily for the poor quality of the OCR. This is
what moves me to respond with a few words about our goals. This is
unlikely to mollify Bowerbird (and I'm sure we'll hear of his
dissatisfaction at great length), but other readers of this list might be
interested.
We OCR'd over 5 million pages in FY05/06 beyond what Google provided to
us. We do OCR for 2 reasons: 1) it greatly improves access for keyword and
phrase searching, and 2) we can do it cheaply. We make no effort to
correct it, obviously, or even to optimize the OCR for any particular
volume. We send every volume through the process using standard settings.
We use PrimeRecognition, which harnesses 6 different OCR software engines.
You can read more about PrimeRecognition at <http://www.primerec.com>.
PrimeRecognition runs each page past each of the 6 OCR engines, and
conducts a voting scheme. This improves the accuracy marginally, and OCR
is all about marginal improvements in accuracy. This software is fairly
expensive, but given our volume, the per-page costs are very low.
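The voting idea can be sketched in a few lines. This is only a toy illustration, not PrimeRecognition's actual algorithm: real systems align engine outputs far more carefully before voting, but the principle is the same. Independent engines rarely make the same mistake on the same character, so a simple majority vote recovers the likely reading.

```python
from collections import Counter

def vote_ocr(engine_outputs):
    """Pick the majority character at each position across engine outputs.

    Hypothetical sketch: assumes every engine returns a same-length string
    for the page, so a position-by-position vote is meaningful. Real OCR
    voting must first align outputs that differ in length.
    """
    result = []
    for chars in zip(*engine_outputs):
        # Counter.most_common(1) yields the top (char, count) pair;
        # the most frequent character at this position wins the vote.
        result.append(Counter(chars).most_common(1)[0][0])
    return "".join(result)

# Three engines each misread one character; the vote corrects all three.
readings = [
    "Mak1ng of America",   # '1' for 'i'
    "Making of Amer1ca",   # '1' for 'i'
    "Making 0f America",   # '0' for 'o'
]
print(vote_ocr(readings))  # -> Making of America
```

Each error appears in only one engine's output, so the other two outvote it; accuracy improves precisely at the margins where individual engines disagree.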
Bowerbird points to examples of poor OCR from the Making of America
collection. I could point to other examples with much better quality. We
do not have statistics on our overall OCR accuracy. A U-M student did a
study in 1998 on OCR accuracy in the Making of America, published here:
<http://www.hti.umich.edu/m/moagrp/moaocr.html>. We show the OCR no matter
how bad it is, because we don't have anything to hide.
It's true that we have not optimized our digital library systems for those
people who wish to correct our OCR. There's no good reason why we haven't
done this, other than time, resources and priorities. We've had several
key vacancies in the past year that have slowed down our software
development efforts. At full staffing, however, we're a fairly small
organization, and even with the best of intentions we cannot move as
quickly as we'd like. We have a long list of features and functionality
we'd like to implement.
We are talking within the library about new features for MBooks (our
Google content) that may please some readers of this list. I won't go into
detail at this point, but I should have more to say in the coming months.
Best wishes for the New Year,
Perry Willett
Head, Digital Library Production Service
300 Hatcher North
University of Michigan
Ann Arbor MI 48109-1205
Ph: 734-764-8074
Fax: 734-647-6897
Email: pwillett@[redacted]