Book People Archive

Re: Google Books PDFs (was: More About Google's Deal with U of Cal)



JMO wrote
> Great!  I'll be interested to know how well the OCR comes out.

The book I used as a test had 337 pages. I wanted to check 
whether FineReader was indeed reading the pages or not, seeing that it 
was taking such a very long time to deal with each one. I clicked "stop" 
after it had done 20 pages, so I got 21. Some of these were not worth 
testing the OCR on, being title page, pictures etc, but a few of them 
were text. On these I did OCR and got extremely good results. There was 
on average one misread per page, and every time FineReader had highlighted 
the misread in blue. Of course that's just a small portion of one book, 
so it does not tell us that the OCR is always excellent, but it tells us 
that it _can_ be.

As for the book that I found to be searchable, I am afraid it was one 
of many that I browsed, and at the time I did not think it was of 
special significance. The index pages, the last few of the book, were 
hyperlinked to the body text, and from this fact I was led to see that 
the book was searchable.

You will realise from my location in England that I was using Anonymizer 
to do this work.

I looked at a dump of the first few bytes of the pdf, and it looks 
pretty much like what you get from other pdfs, but I expect an expert on 
pdfs could tell more from it.

00000000   25 50 44 46 2D 31 2E 34  20 28 61 67 6C 29 0A 30   %PDF-1.4 (agl).0
00000010   30 30 30 31 20 30 20 6F  62 6A 0A 3C 3C 20 2F 54   0001 0 obj.<< /T
00000020   79 70 65 20 2F 45 78 74  47 53 74 61 74 65 20 2F   ype /ExtGState /
00000030   54 52 20 2F 49 64 65 6E  74 69 74 79 20 3E 3E 0A   TR /Identity >>.
00000040   65 6E 64 6F 62 6A 0A 30  30 30 30 32 20 30 20 6F   endobj.00002 0 o
00000050   62 6A 0A 3C 3C 20 2F 54  79 70 65 20 2F 45 78 74   bj.<< /Type /Ext
00000060   47 53 74 61 74 65 20 2F  63 61 20 30 2E 33 20 3E   GState /ca 0.3 >
00000070   3E 0A 65 6E 64 6F 62 6A  0A 30 30 30 30 33 20 30   >.endobj.00003 0
00000080   20 6F 62 6A 0A 3C 3C 20  2F 4C 65 6E 67 74 68 20    obj.<< /Length
00000090   32 38 39 32 33 20 2F 53  75 62 74 79 70 65 20 2F   28923 /Subtype /
000000A0   49 6D 61 67 65 20 2F 46  69 6C 74 65 72 20 2F 4A   Image /Filter /J
000000B0   50 58 44 65 63 6F 64 65  20 2F 42 69 74 73 50 65   PXDecode /BitsPe
000000C0   72 43 6F 6D 70 6F 6E 65  6E 74 20 38 20 2F 43 6F   rComponent 8 /Co
000000D0   6C 6F 72 53 70 61 63 65  20 2F 44 65 76 69 63 65   lorSpace /Device
000000E0   52 47 42 20 2F 48 65 69  67 68 74 20 37 35 30 20   RGB /Height 750
000000F0   2F 57 69 64 74 68 20 31  38 30 30 3E 3E 0A 73 74   /Width 1800>>.st
00000100   72 65 61 6D 0A 00 00 00  0C 6A 50 20 20 0D 0A 87   ream.....jP  ...
00000110   0A 00 00 00 14 66 74 79  70 6A 70 32 20 00 00 00   .....ftypjp2 ...
00000120   00 6A 70 32 20 00 00 00  47 6A 70 32 68 00 00 00   .jp2 ...Gjp2h...
00000130   16 69 68 64 72 00 00 02  EE 00 00 07 08 00 03 07   .ihdr...........
00000140   07 01 00 00 00 00 0F 63  6F 6C 72 01 00 00 00 00   .......colr.....
00000150   00 10 00 00 00 1A 72 65  73 20 00 00 00 12 72 65   ......res ....re
00000160   73 64 0F 1E 80 00 0F 1E  80 00 05 05 00 00 00 00   sd..............
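
For anyone who wants to read more out of the dump, here is a minimal
Python sketch that decodes the JPEG 2000 bytes following "stream" (offset
0x105 onward). It assumes the standard JP2 box layout (a 4-byte big-endian
length, a 4-byte type, then the payload) and the usual field layouts of the
ihdr, colr and resd boxes. The names walk_boxes, SUPERBOXES and JP2_HEX are
made up for this illustration, and the dump is truncated, so only the boxes
that are completely present get decoded.

```python
import struct

# Bytes copied from the hex dump above, starting right after "stream\n".
JP2_HEX = """
00 00 00 0C 6A 50 20 20 0D 0A 87
0A 00 00 00 14 66 74 79 70 6A 70 32 20 00 00 00
00 6A 70 32 20 00 00 00 47 6A 70 32 68 00 00 00
16 69 68 64 72 00 00 02 EE 00 00 07 08 00 03 07
07 01 00 00 00 00 0F 63 6F 6C 72 01 00 00 00 00
00 10 00 00 00 1A 72 65 73 20 00 00 00 12 72 65
73 64 0F 1E 80 00 0F 1E 80 00 05 05 00 00 00 00
"""
data = bytes.fromhex("".join(JP2_HEX.split()))

# Boxes that only contain other boxes (assumption: just the two seen here).
SUPERBOXES = {b"jp2h", b"res "}


def walk_boxes(buf, start=0, end=None):
    """Yield (type, payload) for every complete box, descending into superboxes."""
    end = len(buf) if end is None else end
    pos = start
    while pos + 8 <= end:
        length, box_type = struct.unpack_from(">I4s", buf, pos)
        if length < 8 or pos + length > end:
            break  # truncated or special-length box: stop rather than guess
        if box_type in SUPERBOXES:
            yield from walk_boxes(buf, pos + 8, pos + length)
        else:
            yield box_type, buf[pos + 8:pos + length]
        pos += length


boxes = dict(walk_boxes(data))

# ihdr: height, width, component count, bit depth minus one, and three flags.
height, width, ncomp, bpc, _comp, _unk, _ipr = struct.unpack(">IIHBBBB", boxes[b"ihdr"])
bits = (bpc & 0x7F) + 1

# colr: method, precedence, approximation, enumerated colour space.
_meth, _prec, _approx, enum_cs = struct.unpack(">BBBI", boxes[b"colr"])

# resd: vertical and horizontal resolution as num/den * 10**exp, per metre.
vn, vd, hn, hd, ve, he = struct.unpack(">HHHHbb", boxes[b"resd"])
v_dpi = vn / vd * 10 ** ve * 0.0254
h_dpi = hn / hd * 10 ** he * 0.0254

print("image:", width, "x", height, "components:", ncomp, "bits:", bits)
print("enumerated colour space:", enum_cs)
print("resolution (dpi):", round(h_dpi), "x", round(v_dpi))
```

If the layout assumptions hold, the ihdr values (1800 wide, 750 high,
3 components, 8 bits) agree with the /Width, /Height, /DeviceRGB and
/BitsPerComponent entries in the PDF image dictionary, the colour space
code 16 is the enumerated value for sRGB, and the resd box works out to
roughly 300 dpi.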

Nick Hodson, Athelstane, London, England, United Kingdom

[Moderator: For devotees of hexdumps, I should note that I've added some
 . characters beyond those that were in the dump Nick sent me, replacing
 some non-ASCII characters.  - JMO]