Re: another year here on bookpeople
- From: Perry Willett <pwillett@[redacted]>
- Subject: Re: another year here on bookpeople
- Date: Thu, 28 Dec 2006 11:33:42 -0500 (EST)
Bookpeople readers have been subjected to Bowerbird's name-calling, faulty
logic and self-importance for some time, but he has sunk beneath his usual
low standards in his latest attack on me and the University of Michigan. I
hope he resolves in the New Year to stick to the issues and drop the ad
hominem drivel. I'm grateful to John for his efforts in moderating this
list.
As expected, Bowerbird leaps to unfounded conclusions. For instance, I
haven't "scurried" anywhere--I have chosen not to respond to his earlier
posts, not seeing anything needing my response. And how courageous of
him to attack me on a list when he thinks I'm not reading it.
He has also "proved" through sheer inference that Michigan is doing
something to the text files for the Google project, but he's sadly
mistaken. I'll state again that we get text files directly from Google
and put them online without any further processing. We have a flowchart
of our entire process at
<http://www.lib.umich.edu/mdp/MDP_Workflow_Chart_final.png>
Bowerbird goes on to disparage the rest of the digital library at the
University of Michigan, primarily for the poor quality of the OCR. This is
what moves me to respond with a few words about our goals. This is
unlikely to mollify Bowerbird (and I'm sure we'll hear of his
dissatisfaction at great length), but other readers of this list might be
interested.
We OCR'd over 5 million pages in FY05/06 beyond what Google provided to
us. We do OCR for 2 reasons: 1) it greatly improves access for keyword and
phrase searching, and 2) we can do it cheaply. We make no effort to
correct it, obviously, or even to optimize the OCR for any particular
volume. We send every volume through the process using standard settings.
We use PrimeRecognition, which harnesses 6 different OCR software engines.
You can read more about PrimeRecognition at <http://www.primerec.com>.
PrimeRecognition runs each page past each of the 6 OCR engines, and
conducts a voting scheme. This improves the accuracy marginally, and OCR
is all about marginal improvements in accuracy. This software is fairly
expensive, but given our volume, the per-page costs are very low.
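The voting idea can be sketched in a few lines. This is only a toy illustration, not PrimeRecognition's actual algorithm: real systems align engine outputs far more carefully before voting, but the principle is the same. Independent engines rarely make the same mistake on the same character, so a simple majority vote recovers the likely reading.

```python
from collections import Counter

def vote_ocr(engine_outputs):
    """Pick the majority character at each position across engine outputs.

    Hypothetical sketch: assumes every engine returns a same-length string
    for the page, so a position-by-position vote is meaningful. Real OCR
    voting must first align outputs that differ in length.
    """
    result = []
    for chars in zip(*engine_outputs):
        # Counter.most_common(1) yields the top (char, count) pair;
        # the most frequent character at this position wins the vote.
        result.append(Counter(chars).most_common(1)[0][0])
    return "".join(result)

# Three engines each misread one character; the vote corrects all three.
readings = [
    "Mak1ng of America",   # '1' for 'i'
    "Making of Amer1ca",   # '1' for 'i'
    "Making 0f America",   # '0' for 'o'
]
print(vote_ocr(readings))  # -> Making of America
```

Each error appears in only one engine's output, so the other two outvote it; accuracy improves precisely at the margins where individual engines disagree.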
Bowerbird points to examples of poor OCR from the Making of America
collection. I could point to other examples with much better quality. We
do not have statistics on our overall OCR accuracy. A U-M student did a
study in 1998 on OCR accuracy in the Making of America, published here:
<http://www.hti.umich.edu/m/moagrp/moaocr.html>. We show the OCR no matter
how bad it is, because we don't have anything to hide.
It's true that we have not optimized our digital library systems for those
people who wish to correct our OCR. There's no good reason why we haven't
done this, other than time, resources and priorities. We've had several
key vacancies in the past year that have slowed down our software
development efforts. At full staffing, however, we're a fairly small
organization, and even with the best of intentions we cannot move as
quickly as we'd like. We have a long list of features and functionality
we'd like to implement.
We are talking within the library about new features for MBooks (our
Google content) that may please some readers of this list. I won't go into
detail at this point, but I should have more to say in the coming months.
Best wishes for the New Year,
Perry Willett
Head, Digital Library Production Service
300 Hatcher North
University of Michigan
Ann Arbor MI 48109-1205
Ph: 734-764-8074
Fax: 734-647-6897
Email: pwillett@[redacted]