Intelligent Documents for the Web
The Situation Today
While HTML files are native to the web and web browsers, other file formats are not. A few file formats besides HTML can be displayed by the web browser natively. Others require web browser plug-ins.
Each format behaves differently on the web, both because of the architecture of the file format and because of the way the web browser handles it.
In this mixed and evolving environment, SearchPDF is putting forward operational standards for what constitutes "intelligent documents" for the web.
Basic Ingredients of an "Intelligent Document"
For our purposes, an intelligent document has a full-text layer that can be search-indexed and searched, with a hit-term highlighting capability.
An intelligent document can also store index data (metadata) within the file itself. Because the index data travels with the file, an "intelligent" document is also a "portable" one.
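The full-text requirement described above can be illustrated with a minimal sketch. It assumes the text layer has already been extracted into a page-number-to-text mapping. The names `build_index` and `highlight`, the sample pages, and the `[[ ]]` highlight markers are hypothetical and chosen only for illustration. This is not how SearchPDF itself is implemented.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the set of page numbers containing it."""
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_no)
    return index


def highlight(text, term):
    """Wrap each whole-word, case-insensitive match of term in [[ ]]."""
    pattern = re.compile(r"\b(" + re.escape(term) + r")\b", re.IGNORECASE)
    return pattern.sub(r"[[\1]]", text)


# Hypothetical text layer: page number -> extracted text
pages = {
    1: "Intelligent documents carry a full-text layer.",
    2: "The text layer can be hidden behind a page image.",
}

index = build_index(pages)
hit_pages = sorted(index["layer"])      # pages on which the term occurs
marked = highlight(pages[2], "layer")   # page text with the hit term marked
```

Looking a term up in the index gives the pages to navigate to, and `highlight` marks the hit term on each of those pages.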
Five Intelligent Document Types for use with PDF MetaMaker
PDF, HTML, XML, DjVu, Mr. Sid
We have started with PDF. PDF has a full-text layer, which can be either "exposed" (scalable text is visible) or "hidden" (text sits behind an image of the page). PDF has only 4 standard index fields, held in a section of the PDF file called the DocInfo section. We have overcome this limitation by developing the ability to store and use an unlimited number of index fields in PDF.
HTML and XML can store any number of index fields in Meta Tags.
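As a rough illustration of index fields stored in Meta Tags, the sketch below reads every `<meta name="..." content="...">` pair from an HTML page using Python's standard `html.parser` module. The field names (`author`, `department`, `doc-year`), the sample page, and the helper `read_index_fields` are assumptions made for this example. They are not fields or functions defined by SearchPDF.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        # Skip meta tags that lack either attribute (e.g. charset declarations)
        if name is not None and content is not None:
            self.fields[name] = content


def read_index_fields(html_text):
    """Return a dict of index fields found in the document's meta tags."""
    parser = MetaTagParser()
    parser.feed(html_text)
    return parser.fields


# Hypothetical document with three custom index fields
sample = """<html><head>
<title>Annual Report</title>
<meta name="author" content="J. Smith">
<meta name="department" content="Finance">
<meta name="doc-year" content="1999">
</head><body><p>Report text.</p></body></html>"""

fields = read_index_fields(sample)
```

Because any number of `<meta>` elements can appear in the document head, the same approach works no matter how many index fields a collection defines.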
SearchPDF provides support for PDF, HTML and XML.
Of these three formats, PDF is currently the most full-featured for search and retrieval, because the PDF viewer plug-in has built-in support for hit-term highlighting and hit-page navigation. In addition, almost any type of document can be converted to PDF to gain the benefits of search and retrieval on the web.
DjVu and Mr. Sid Support Under Development
LizardTech is continuing to develop the DjVu and Mr. Sid formats. DjVu will sport a hidden-text layer and a metadata storage structure in a future release. Mr. Sid will sport a similar metadata storage structure. DjVu and Mr. Sid will then find a place alongside PDF, HTML and XML on the "intelligent document list" of PDF WebSearch.
DjVu in particular is well suited for Image + Hidden Text documents, because it delivers far better compression, and therefore smaller files, than PDF. The smaller size is better suited to today's bandwidth standards (56K modems).
PDF WebSearch Intelligent Document Strategy
Our strategy is to create digital document collections consisting of the document types we deem to be "intelligent". Other document types (such as MS Office documents) are accommodated by conversion to an "intelligent type" (e.g. PDF).