Book People Archive

Re: HP's open-source Tesseract OCR, any experience?

From: D Garcia <donovan@[redacted]>
Subject: Re: HP's open-source Tesseract OCR, any experience?
Date: Mon, 20 Mar 2006 23:05:32 EST


This was forwarded to me; I'm not on the list.
> Subject: [BP] HP's open-source Tesseract OCR, any experience?
> Tom Breuel pointed out to me a new project up at sourceforge, called
> "tesseract-ocr", with "lvincent" listed as admin -- presumably Luc
> Vincent (a document image processing expert now at Google).  There are
> no files there, but they do seem to be at the University of Nevada -
> Las Vegas ISRI site, at
> http://www.isri.unlv.edu/downloads/ocr-prerelease-20051201.tar.bz2,
>
> I was wondering if any adventurous explorer had tried it out yet, and
> if so, what the results were like?

It is only configured to build under MSVC++ 6 for Windows.
It only accepts uncompressed bitonal tiffs.
It's command-line only. No GUI.
It performed abysmally on the provided testimage.tif.
But it did build. :)

Also in the directory you mentioned there is a utility called ocrspell, which is
crufty code that I can't get to configure properly on a modern Linux system. To
give you an idea, it is hardcoded for ispell 3.1.08 and its dependent files,
while most systems now use aspell 0.50.x or 0.60.x (or ispell 3.1.20 or higher).
The other problem is that the dictionaries are very different from what the
program expects.

Granted, this was a fairly quick look, but I don't see this as being useful 
very soon without a lot of gnashing of teeth.