Broadside Newspaper Comparison
This page compares file formats for a broadside newspaper page scanned as a bitonal, 300 dpi image with automatic halftoning.
The newspaper page is rendered as a searchable-image file in three formats:
PDF Searchable Image with regular CCITT G4 compression
PDF Searchable Image with new JBIG2-PDF compression
DjVu Searchable Image
In the table below, click the size to open that file:
PDF Searchable Image - CCITT G4 | |
PDF Searchable Image - JBIG2 | |
DjVu - Searchable Bitonal (JB2 compression) | |
These files were created with the JRAPublish application.
Acrobat Reader 5.0 or above is required to view the JBIG2-PDF file.
The DjVu Plug-In is required to view the DjVu file.
Working with text that has a shaded background
When a newspaper page has text on a shaded background and the page is scanned as an enhanced bitonal image, the shaded background is rendered as a dot pattern that simulates the shading. This works well for simulating gray shades when the image is viewed or printed, but it interferes with OCR because the OCR engine cannot isolate the text from the dot pattern that surrounds it.
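To make the dot pattern concrete, here is a minimal sketch of ordered dithering, one common halftoning technique. The source does not say which halftoning method the scanner used, so the 4x4 Bayer matrix and the `halftone` function are assumptions chosen purely for illustration.

```python
# 4x4 Bayer threshold matrix (values 0..15), a common ordered-dither pattern.
# Assumption: the scanner's actual halftoning method is not documented here.
BAYER_4X4 = [
    [0, 8, 2, 10],
    [12, 4, 14, 6],
    [3, 11, 1, 9],
    [15, 7, 13, 5],
]


def halftone(gray):
    """Convert a grayscale image (rows of 0..255 ints, 0 = black) to bitonal.

    Returns rows of 0/1 values where 1 means a black pixel.
    """
    out = []
    for y, row in enumerate(gray):
        out_row = []
        for x, value in enumerate(row):
            # Scale the matrix entry into the 0..255 range and compare.
            threshold = (BAYER_4X4[y % 4][x % 4] + 0.5) * 16
            out_row.append(1 if value < threshold else 0)
        out.append(out_row)
    return out


# A uniform light-gray patch (value 192) standing in for a shaded sidebar.
patch = [[192] * 8 for _ in range(8)]
dots = halftone(patch)
for row in dots:
    print("".join("#" if pixel else "." for pixel in row))
```

The uniform gray patch comes out as a regular grid of isolated black dots. Text printed over such a patch ends up touching or surrounded by these dots, which is what confuses the OCR engine.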
The dot-pattern problem can be overcome by scanning the page in grayscale or color. The OCR engine can then isolate the text from the shaded background, which is no longer a dot pattern but a distinct color or true gray shade.
The OCR problem with dot shading is visible in the sample files above, on the right side of the image. Try to select the text in the sidebar and you will see that little or no text was recognized.
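The reason a grayscale scan helps can be sketched with a simple global threshold: when the background is a true gray level rather than dots, any cutoff between the ink level and the shade level separates the two cleanly. How a particular OCR engine binarizes its input is not stated in the source, so the `binarize` function, the threshold of 128, and the pixel values below are all hypothetical.

```python
def binarize(gray, threshold=128):
    """Return rows of 0/1 where 1 marks pixels darker than the threshold."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


# Hypothetical 5x7 crop: a dark letter "I" (value 30) on a gray shade (value 192).
BG, INK = 192, 30
crop = [
    [BG, BG, BG, BG, BG, BG, BG],
    [BG, BG, INK, INK, INK, BG, BG],
    [BG, BG, BG, INK, BG, BG, BG],
    [BG, BG, INK, INK, INK, BG, BG],
    [BG, BG, BG, BG, BG, BG, BG],
]

text_only = binarize(crop)
for row in text_only:
    print("".join("#" if pixel else "." for pixel in row))
```

Only the ink pixels survive, with no stray dots from the shading. A real page would likely need an adaptive threshold, since shade and ink levels vary across a scan.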
The second test below uses a section of a newspaper page with a shaded background, scanned as bitonal, color, enhanced color, and grayscale. All of the text was correctly OCRed in every version that was not bitonal. In the bitonal version, the dot shading prevented effective OCR.
Image Depth | PDF | DjVu |
Bitonal | | |
Color | | |
Enhanced Color | | |
Grayscale | | |
In a further test, we took the bitonal scan and processed it with the dot-shading removal of ScanFix from TMSSequoia. This effectively removed the shading from around the text. The table below shows the resulting bitonal PDF files with normal CCITT G4 compression and with the new JBIG2-PDF compression.
File Type | PDF |
Bitonal - with dot shading | |
Bitonal - dot shading removed | |
Bitonal - with dot shading - JBIG2 | |
Bitonal - dot shading removed - JBIG2 | |
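The idea behind dot-shading removal can be illustrated with a small connected-component filter that clears groups of black pixels too small to be part of a character stroke. This is not ScanFix's algorithm, which the source does not describe; `remove_small_components` and its `min_size` parameter are hypothetical names for a generic despeckle sketch.

```python
from collections import deque


def remove_small_components(image, min_size=4):
    """Clear 8-connected groups of black pixels smaller than min_size.

    image is a list of rows of 0/1 values (1 = black). A new image is
    returned and the input is left unchanged.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search to collect one connected component.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Components below the size limit are treated as shading dots.
            if len(component) < min_size:
                for cy, cx in component:
                    result[cy][cx] = 0
    return result


# A vertical stroke (column 2) surrounded by isolated halftone dots.
page = [
    [1, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1, 0, 0],
]
cleaned = remove_small_components(page)
for row in cleaned:
    print("".join("#" if pixel else "." for pixel in row))
```

A size-only filter like this can also erase legitimate small marks such as periods and the dots on "i", and it cannot separate dots that touch a character, so a production tool presumably uses more than component size to decide what counts as shading.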