Searchable PDF - OCR Recognition Accuracy
September 28, 2003
Introduction
This comparison was inspired by an interchange that occurred on the PDF Listserv over the past weekend. The transcript of this interchange follows the results below.
In the interchange, Ari from cVisionTech used a sample PDF file (an Annual Report) available from the Adobe website to compare recognition accuracy between Acrobat 6 Paper Capture and the ScanSoft OCR engine used in cVisionTech PDF Compressor.
We have now extended this comparison by adding the OCR recognition results from the ABBYY FineReader OCR engine used in JRAPublish, and the Expervision OCR engine used in LizardTech Document Express 4.0.
The results:
Number of Words found
| Search Word | Acrobat 6 Paper Capture Plugin | Document Express 4.0 (Expervision OCR Engine) | cVision PDF Compressor (ScanSoft OCR Engine) | JRAPublish (ABBYY FineReader OCR Engine) |
| --- | --- | --- | --- | --- |
| commission | 50 | 87 | 169 | 169 |
| section | 4 | 13 | 19 | 19 |
| recall | 31 | 55 | 58 | 60 |
| requirements | 18 | 21 | 31 | 31 |
| corrective | 13 | 16 | 23 | 24 |
| regulations | 11 | 11 | 18 | 18 |
| TOTAL | 127 | 203 | 318 | 321 |
| ACCURACY % | 40% | 63% | 99% | 100% |
The document, an Annual Report scanned at 300 dpi, is an example of a clean scanned document. Based on our visual inspection of the pages, the ABBYY FineReader OCR engine achieved 100% recognition accuracy on the six search words, while the ScanSoft OCR engine was just 3 words short of perfect (318 of 321). The Expervision OCR engine and the Acrobat Paper Capture plugin were both considerably less accurate, at 63% and 40% respectively.
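The ACCURACY % row appears to be each engine's total hit count divided by the ABBYY FineReader total (321), which the visual inspection treated as the complete count. The following is a minimal Python sketch of that arithmetic, under that assumption; the variable names and the short engine labels are ours, not part of the original study.

```python
# Keyword hit counts per engine, copied from the results table.
# Order: commission, section, recall, requirements, corrective, regulations.
hits = {
    "Acrobat 6 Paper Capture": [50, 4, 31, 18, 13, 11],
    "Document Express 4.0 (Expervision)": [87, 13, 55, 21, 16, 11],
    "cVision PDF Compressor (ScanSoft)": [169, 19, 58, 31, 23, 18],
    "JRAPublish (ABBYY FineReader)": [169, 19, 60, 31, 24, 18],
}

# Assumption: the ABBYY total is the 100% baseline, because visual
# inspection found that engine recognized every occurrence.
BASELINE = "JRAPublish (ABBYY FineReader)"

# Sum the six keyword counts for each engine.
totals = {engine: sum(counts) for engine, counts in hits.items()}

# Express each total as a whole-number percentage of the baseline total.
accuracy = {
    engine: round(100 * total / totals[BASELINE])
    for engine, total in totals.items()
}

for engine in hits:
    print(f"{engine}: total={totals[engine]}, accuracy={accuracy[engine]}%")
```

Note that this measures only how many occurrences of six chosen words each engine made searchable; it is not a character-level or full-text accuracy figure.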
Searchable PDF
(from the PDF discussion list, 9-27-2003)
Post: I'm looking for a reliable, accurate and inexpensive software that will convert scanned images to searchable PDF files.
Reply: Do you have a bunch of previously scanned documents you need to do or are you looking to do it as you scan in documents in the future? Is this an occasional task on up to 20 or so pages or do you have reams of information that needs this action?
Acrobat 5.0 has a Paper Capture plug-in that will OCR a scanned image
saved as a PDF and does a decent job. You have to open each PDF and run Paper Capture on it. We're using it in conjunction with the scanning capabilities from our Canon copier/printer to scan in and OCR contracts and some other documents.
Reply: Acrobat 6.0 (Standard) or Acrobat 6.0 (Professional) will do that. Paper Capture (OCR) is built into both. You can create the PDF directly from the scanner into PDF using the "create PDF function" or convert the tiff scan to PDF using Acrobat - and then OCR using Acrobat 6.0 which will create searchable text within the PDF file. The Image Exact option for OCR in Acrobat is particularly nice, since it adds searchable text to the PDF file but does not alter the integrity of the scanned image so that a paper print of the PDF scanned image is true to the original scanned image.
Reply from Leonard - PDFSages: You should consider an upgrade to Acrobat 6, where Paper Capture is now MUCH better, faster and can be automated in batch.
Reply from Ari - cVisionTech: Acrobat 6 Paper Capture is not reliable, nor accurate, and runs very slow. Its accuracy is in no way comparable to either ScanSoft's OmniPage Pro 12 or ABBYY FineReader 6.0. I've seen it fail to process in Paper Capture mode some very standard TIFF files. It also runs very slow. If you're serious about producing searchable image PDFs, check out PdfCompressor 2.1 http://www.cvisiontech.com/cvistapdf.html
Reply from Leonard - PDFSages: First, I only said it was better than 5.0. However, under certain circumstances, it is EXACTLY the same as FineReader 6 since that's the engine being used! Paper Capture uses multiple engines based on internally determined criteria (language, quality, platform, etc.) - one of those engines is the ABBYY FineReader one.
Question to Leonard - PDFSages: How did you find out what OCR engine is running beneath Adobe Acrobat 6.0? I am interested in the various Document Capture technology products but I can never find out what engines they are running? Is there a source of information that I am missing?
Reply from Max - cVisionTech: OK, so it is same as FineReader for Asian languages (Fine32.dll is in the Asian subfolder of PaperCapture)? That does not help us non-asians too much. As you said, it is better than 5.0 ;)
Reply from Leonard - PDFSages:
> How did you find out what OCR engine is running beneath Adobe Acrobat 6.0?
I looked ;). Then I asked...
Open up the PaperCapture plugin (or subfolders of the plugin
folder) and you'll see various DLLs with informative names...
> I am interested in the various Document Capture technology
>products but I can never find out what engines they are running? Is
>there a source of information that I am missing?
You ask the company...
Reply from Ari - cVisionTech:
>If Acrobat Paper Capture uses FineReader 6, why is it that given the SAME document ABBYY will get 25 hits on a given word search and Paper Capture will only get 18 hits?
We've looked at thousands of documents and, with statistical consistency, Acrobat's OCR underperforms both ScanSoft OmniPage Pro 12 and ABBYY FineReader by very wide margins. Acrobat's Paper Capture results are not at all up to par with either ScanSoft or FineReader, so it seems highly problematic that FineReader is actually bundled into Paper Capture.
PdfCompressor 2.1 uses ScanSoft's OCR engine, so let's conduct a simple test using these two systems. Here's a comparison of some keyword hits for the same document (the file is posted
Running Acrobat 6's Paper Capture vs. PdfCompressor 2.1, we have the
following hit results:
                     # of keyword hits
keyword          Acrobat 6          CVISION
                 Paper Capture      PdfCompressor 2.1
commission       50                 169
section          4                  19
recall           31                 58
requirements     18                 31
corrective       13                 23
regulations      11                 18
Reply from Leonard - PDFSages:
>If Acrobat Paper Capture uses FineReader 6,
I didn't say it used it ALL the time, I said it uses it under
certain circumstances. For example, as Max (who works for you ;)
noted, it is used for non-Roman OCR on Windows. On Mac OS, it's used
more often but not 100% there either...
>We've looked at thousands of documents and, with statistical consistency, Acrobat's OCR underperforms both ScanSoft's OmniPage Pro 12 and ABBYY FineReader by very wide margins.
I don't disagree with you! Both of those are MUCH better
professional OCR solutions - and I would recommend them in a
heartbeat over using Capture for serious OCR.
But Capture works nicely for what it is designed for - the
occasional OCR job on a received fax or the like! It is NOT
designed to replace the dedicated tools...