Re: Google Books PDFs (was: More About Google's Deal with U of Cal)
- From: "Nick Hodson" <nicholashodson@[redacted]>
- Subject: Re: Google Books PDFs (was: More About Google's Deal with U of Cal)
- Date: Thu, 31 Aug 2006 18:13:49 +0100
JMO wrote
> Great! I'll be interested to know how well the OCR comes out.
In the book I used as a test there were 337 pages. I wanted to check
whether FineReader was indeed reading the pages or not, seeing that it
was taking such a very long time to deal with each one. I clicked "stop"
after it had done 20 pages, so I got 21. Some of these were not worth
testing the OCR on, being the title page, pictures etc, but a few of
them were text. On these I did OCR and got extremely good results. There
was on average one misread per page, and every time FineReader had
highlighted the misread in blue. Of course that's just a small portion
of one book, so it does not tell us that the OCR is always excellent,
but it tells us that it _can_ be.
As to the book that I found to be searchable, I am afraid it was one of
many that I browsed, and at the time I did not think it was of special
significance. The index pages, the last few of the book, were
hyperlinked to the body text, and it was from this that I realised the
book was searchable.
You will realise from my location in England that I was using Anonymizer
to do this work.
I looked at a dump of the first few bytes of the pdf, and it looks
pretty much like what you get from other pdfs, but I expect an expert on
pdfs could tell more from it.
00000000 25 50 44 46 2D 31 2E 34 20 28 61 67 6C 29 0A 30 %PDF-1.4 (agl).0
00000010 30 30 30 31 20 30 20 6F 62 6A 0A 3C 3C 20 2F 54 0001 0 obj.<< /T
00000020 79 70 65 20 2F 45 78 74 47 53 74 61 74 65 20 2F ype /ExtGState /
00000030 54 52 20 2F 49 64 65 6E 74 69 74 79 20 3E 3E 0A TR /Identity >>.
00000040 65 6E 64 6F 62 6A 0A 30 30 30 30 32 20 30 20 6F endobj.00002 0 o
00000050 62 6A 0A 3C 3C 20 2F 54 79 70 65 20 2F 45 78 74 bj.<< /Type /Ext
00000060 47 53 74 61 74 65 20 2F 63 61 20 30 2E 33 20 3E GState /ca 0.3 >
00000070 3E 0A 65 6E 64 6F 62 6A 0A 30 30 30 30 33 20 30 >.endobj.00003 0
00000080 20 6F 62 6A 0A 3C 3C 20 2F 4C 65 6E 67 74 68 20 obj.<< /Length
00000090 32 38 39 32 33 20 2F 53 75 62 74 79 70 65 20 2F 28923 /Subtype /
000000A0 49 6D 61 67 65 20 2F 46 69 6C 74 65 72 20 2F 4A Image /Filter /J
000000B0 50 58 44 65 63 6F 64 65 20 2F 42 69 74 73 50 65 PXDecode /BitsPe
000000C0 72 43 6F 6D 70 6F 6E 65 6E 74 20 38 20 2F 43 6F rComponent 8 /Co
000000D0 6C 6F 72 53 70 61 63 65 20 2F 44 65 76 69 63 65 lorSpace /Device
000000E0 52 47 42 20 2F 48 65 69 67 68 74 20 37 35 30 20 RGB /Height 750
000000F0 2F 57 69 64 74 68 20 31 38 30 30 3E 3E 0A 73 74 /Width 1800>>.st
00000100 72 65 61 6D 0A 00 00 00 0C 6A 50 20 20 0D 0A 87 ream.....jP ...
00000110 0A 00 00 00 14 66 74 79 70 6A 70 32 20 00 00 00 .....ftypjp2 ...
00000120 00 6A 70 32 20 00 00 00 47 6A 70 32 68 00 00 00 .jp2 ...Gjp2h...
00000130 16 69 68 64 72 00 00 02 EE 00 00 07 08 00 03 07 .ihdr...........
00000140 07 01 00 00 00 00 0F 63 6F 6C 72 01 00 00 00 00 .......colr.....
00000150 00 10 00 00 00 1A 72 65 73 20 00 00 00 12 72 65 ......res ....re
00000160 73 64 0F 1E 80 00 0F 1E 80 00 05 05 00 00 00 00 sd..............
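A dump in this layout can be reproduced with a few lines of Python; the
sketch below is only an illustration (not the tool actually used), and
"book.pdf" is a placeholder name for the downloaded file.

# Minimal sketch: hexdump the first 0x160 bytes of a PDF in a layout
# similar to the dump above. "book.pdf" is a placeholder filename.
with open("book.pdf", "rb") as f:
    data = f.read(0x160)

for offset in range(0, len(data), 16):
    chunk = data[offset:offset + 16]
    hex_part = " ".join(f"{b:02X}" for b in chunk)
    # show printable ASCII, replace everything else with '.'
    ascii_part = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
    print(f"{offset:08X}  {hex_part:<47}  {ascii_part}")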
Nick Hodson, Athelstane, London, England, United Kingdom
[Moderator: For devotees of hexdumps, I should note that I've added some
. characters beyond those that were in the dump Nick sent me, replacing
some non-ASCII characters. - JMO]