Metadata at the Sub-Document Level
Intelligent documents, such as PDF and DjVu, can contain multiple pages.
There is intelligence to be encapsulated in metadata at the document level, and in many cases there is also intelligence to be encapsulated at the sub-document level.
Today's tools address intelligence only at the document level. This is perhaps carried over from the way web search engines work. They search single-page HTML files and read the Meta Tag information in these files. HTML pages do not have the multipage structure that paper documents have, a structure that the PDF and DjVu formats preserve.
Most web search engines do not enter PDF or DjVu files to index them at all. PDF WebSearch indexes PDF files by reading custom DocInfo fields in the PDF and treating them the same as Meta Tags in HTML.
This approach is fine for short documents, such as memos, reports, or invoices. Each of these documents is a discrete unit, and each has a title which can be displayed on a search results list.
The Problem of Large Multipage Documents
But what about books and service manuals? These are larger documents that are more complex. They can be seen as collections of subdocuments. A book, for example, can have chapters, each with an individual chapter title. Would we prefer to search for chapters instead of books and see a list of chapter titles on our search results screen?
A service manual might have tabbed sections, each with a title. Would we prefer to search for tabbed sections instead of whole manuals and see a list of tab titles on our search results screen?
Breaking a Large Document into Smaller Documents by Section
A straightforward approach that has been applied in these circumstances is to break the multi-page PDF or DjVu document down into sub-documents, for example one PDF file for each chapter. The PDF Title field then holds the chapter title, and we have one set of index fields for each chapter. However, we have broken the single PDF file into subfiles, and we might still want to present the document for download as a single file.
The Problem when Page Breaks do not correspond with Section Breaks
Let's say that the manual on hand is for a set of different models. A model can end on a given page and the next model can start on the same page. We want to search and retrieve by model titles. Now we have a problem with the page in which one model ends and the next one starts. How do we treat this page for search and retrieval?
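The shared-page problem can be made concrete with a small sketch. It assumes each section is described by a hypothetical (title, first page, last page) tuple; the function name and the sample data are illustrative only. Any page claimed by more than one section cannot be assigned cleanly to a single sub-document.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (title, first_page, last_page), pages inclusive. Hypothetical layout.
Section = Tuple[str, int, int]

def shared_pages(sections: List[Section]) -> Dict[int, List[str]]:
    """Return {page: [titles]} for pages that belong to more than one section."""
    owners = defaultdict(list)
    for title, first, last in sections:
        for page in range(first, last + 1):
            owners[page].append(title)
    return {page: titles for page, titles in owners.items() if len(titles) > 1}

# Model A ends on page 4 and Model B starts on the same page.
manual = [("Model A", 1, 4), ("Model B", 4, 7), ("Model C", 8, 10)]
conflicts = shared_pages(manual)
print(conflicts)
```

Splitting this manual into one file per model would force page 4 to be duplicated in two files or assigned to only one of them.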
The Problem with Articles in Periodicals and Newspapers
Periodicals and newspapers typically move advertisements to the front of the issue and move the back-ends of articles to the back pages. The result is that you have an article which "threads" its way over multiple pages. Suppose we want to perform search and retrieval on Article Titles instead of Issue Numbers?
Now it is clearly impossible to segregate articles by dividing the document into page segments.
Threads are the Answer to these Problems
The PDF format sports a little-used feature called Article Threads. An Article Thread is a series of bounding boxes that define the regions on the pages where an article occurs. Each bounding box has a pointer to the next one, so the boxes line up in a sequence that corresponds to the text flow of the article. Each Article Thread in the PDF format can have its own index fields as well.
If a search engine can follow these thread-definitions and extract the full-text of the Article from the multiple pages of the PDF file, then it is possible to search upon articles, view a Results List of article titles, and open to the page in the PDF where the article begins.
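As a rough illustration of that extraction step, the sketch below gathers the words that fall inside each bounding box of a thread, in thread order. It is not tied to any PDF library: the `Word` and `Bead` classes, the coordinate convention, and the sample pages are all assumptions made for the example, and a real extractor would obtain word positions from the PDF content streams.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (left, bottom, right, top) in page units. Assumed convention for this sketch.
Rect = Tuple[float, float, float, float]

@dataclass
class Word:
    text: str
    x: float
    y: float

@dataclass
class Bead:
    page: int   # page on which this bounding box sits
    rect: Rect  # region of the page covered by the article

def inside(word: Word, rect: Rect) -> bool:
    left, bottom, right, top = rect
    return left <= word.x <= right and bottom <= word.y <= top

def extract_article(beads: List[Bead], pages: Dict[int, List[Word]]) -> str:
    """Concatenate the words under each bead, following the thread order."""
    parts: List[str] = []
    for bead in beads:
        parts.extend(w.text for w in pages.get(bead.page, []) if inside(w, bead.rect))
    return " ".join(parts)

# Hypothetical issue: the article starts on page 1 and continues on page 2,
# while an unrelated advertisement shares page 1.
pages = {
    1: [Word("Threads", 100, 700), Word("begin", 160, 700), Word("SALE!", 400, 100)],
    2: [Word("and", 100, 700), Word("continue.", 140, 700)],
}
article = [Bead(1, (50, 600, 300, 750)), Bead(2, (50, 600, 300, 750))]
text = extract_article(article, pages)
print(text)
```

Because the text is selected by region rather than by page, the advertisement on page 1 is left out, and the same page could contribute text to a second thread without conflict.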
PDF Article Threads as a feature was developed when the PDF file format was introduced by Adobe. The primary purpose of Article Threads at that time was for online reading. The 14" monitors of the day did not permit sufficient resolution for online reading at the "fit width" magnification level, so zooming in further (eliminating white margin space) was a practical solution. Article Threads allow this, plus the ability to continue to navigate the text-flow defined by the Article Thread from page to page.
Now that most of us have 17" or larger monitors, and display in a resolution of at least 800 x 600, we find that the "fit width" magnification level is sufficient for reading, and as a result Article Threads have less importance than they once did. We naturally prefer to view the entire page image, not just a magnified portion of it. When we view a paper page, we see it as a whole, and we want it this way on the computer screen as well.
Index Fields in PDF Article Threads
Each Article Thread in a PDF file sports the standard four index fields that are provided for the document as a whole: Title, Subject, Author and Keywords.
It is possible to display a list of Article Threads in the PDF file while using Acrobat Exchange 3.0 or Acrobat 4.x. This display box lists the Article Threads using the Title field. The other fields appear to have no practical use until you consider their use in search indexing. Indeed, the Keywords field is a field for searching and not for labeling, so one could conclude that the architects of the PDF file format had Article Thread searching in mind from the outset. The Acrobat Catalog product, however, was introduced without the ability to search Article Threads, and so this function remains unavailable today.
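To suggest what thread-level searching might look like, here is a minimal sketch of an index keyed by thread rather than by document. The `ThreadRecord` class and `search` function are hypothetical names, and the matching is a plain substring test rather than anything a production search engine would use.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ThreadRecord:
    title: str
    subject: str
    author: str
    keywords: str
    start_page: int  # page of the thread's first bounding box
    text: str = ""   # full text gathered from the thread, if available

def search(records: List[ThreadRecord], term: str) -> List[Tuple[str, int]]:
    """Return (title, start_page) for each thread whose fields or text contain term."""
    needle = term.lower()
    hits = []
    for r in records:
        haystack = " ".join([r.title, r.subject, r.author, r.keywords, r.text]).lower()
        if needle in haystack:
            hits.append((r.title, r.start_page))
    return hits

# Two hypothetical threads in one service manual; both begin on page 3.
records = [
    ThreadRecord("Model 100 Maintenance", "Service", "Field Support", "pump, seal", 3),
    ThreadRecord("Model 200 Maintenance", "Service", "Field Support", "motor, belt", 3),
]
results = search(records, "belt")
print(results)
```

The results list carries thread titles and the page where each thread begins, which is what a viewer would need in order to open the PDF at the start of the matching article.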
*****
from John Warnock, President of Adobe - Seybold 99 Boston:
"What you need to seek in the future are plug-ins and utilities to extract text from article threads; to be able to re-use data; to be able to tag data inside of PDF files; to re-use the tagged information so that the document is not just the static representation of the layout, but carries a great deal of the semantics of the content with it."
Article Threads Defined:
An article thread identifies related elements in a document, enabling a user to follow a flow of information that may span multiple columns or pages.
A PDF document may include one or more article threads. Each thread has a title and indexing fields (Author, Subject, Keywords) and a list of thread elements, which are referred to as beads. A viewer may allow the user to select a particular thread and then navigate through it; the viewer automatically maintains a comfortable zoom level for reading and moves from one bead to the next, rather than from one page to the next.
If a document includes any threads, they are stored in an array as the value of the Threads key in the Catalog object. Each thread and its beads are dictionaries.
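The structure described above can be modeled with ordinary Python dictionaries. The single-letter key names (F for the first bead, I for the thread information, N and V for next and previous bead, P for the page, R for the rectangle, T for the owning thread) follow Adobe's PDF Reference as best I understand it; the page is simplified to a plain page number here, whereas a real file stores a reference to the page object. The `walk_beads` helper is a hypothetical name for this sketch.

```python
def walk_beads(thread: dict) -> list:
    """Return the thread's beads in reading order by following N links.

    The bead list in a PDF is circular (the last bead points back to the
    first), so we stop as soon as we reach a bead we have already visited.
    """
    beads = []
    seen = set()
    bead = thread.get("F")
    while bead is not None and id(bead) not in seen:
        seen.add(id(bead))
        beads.append(bead)
        bead = bead.get("N")
    return beads

# A two-bead thread spanning pages 1 and 2 (illustrative values only).
b1 = {"Type": "Bead", "P": 1, "R": [50, 600, 300, 750]}
b2 = {"Type": "Bead", "P": 2, "R": [50, 600, 300, 750]}
b1["N"], b1["V"] = b2, b2
b2["N"], b2["V"] = b1, b1

thread = {
    "Type": "Thread",
    "F": b1,
    "I": {"Title": "Sample Article", "Author": "", "Subject": "", "Keywords": ""},
}
b1["T"] = thread

catalog = {"Type": "Catalog", "Threads": [thread]}

for t in catalog["Threads"]:
    bead_pages = [b["P"] for b in walk_beads(t)]
    print(t["I"]["Title"], bead_pages)
```

Tracking visited beads rather than counting on a terminator keeps the walk safe on both circular and malformed bead lists.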