XMP in PDF Metadata Dictionaries and Streams
Adobe's PDF Reference, third edition (which documents PDF 1.4), specifies two ways to store metadata in PDF files: in a dictionary and in a metadata stream.
Metadata can be stored in a PDF document in either of the following ways:
 In the document information dictionary within the PDF file
 In a metadata stream (PDF 1.4) within the PDF file and associated with the document or a component of the document
Document Information Dictionary
The optional Info entry in the trailer of a PDF file can hold a document information dictionary containing metadata for the document; Table 1.1 shows its contents. Any entry whose value is not known should be omitted from the dictionary, rather than included with an empty string as its value.
Some applications, like SearchPDF, permit searches on the contents of the document information dictionary. To facilitate browsing and editing, all keys in the dictionary are fully spelled out, not abbreviated. New keys (field names) should be chosen with care so that they make sense to users.
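The rule that unknown entries are omitted rather than written as empty strings can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `pdf_literal_string` and `build_info_dictionary` are hypothetical, the sketch handles only text-string entries (not dates or the Trapped name), and it assumes ASCII values, so it ignores the encoding rules that apply to other characters.

```python
def pdf_literal_string(text: str) -> str:
    """Wrap text as a PDF literal string, escaping backslashes and parentheses."""
    escaped = (
        text.replace("\\", "\\\\")
            .replace("(", "\\(")
            .replace(")", "\\)")
    )
    return f"({escaped})"


def build_info_dictionary(fields: dict) -> str:
    """Serialize known text fields as a PDF dictionary, skipping unknown values."""
    lines = []
    for key, value in fields.items():
        if value is None or value == "":
            continue  # unknown value: leave the key out instead of writing ()
        lines.append(f"/{key} {pdf_literal_string(value)}")
    return "<< " + "\n".join(lines) + "\n>>"


info = build_info_dictionary({
    "Title": "PostScript Language Reference, Third Edition",
    "Author": "Adobe Systems Incorporated",
    "Subject": "",      # not known: omitted from the output
    "Keywords": None,   # not known: omitted from the output
})
print(info)
```

Only the Title and Author keys appear in the printed dictionary; Subject and Keywords are dropped because their values are not known.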
Table 1.1 Entries in a PDF document information dictionary
KEY | TYPE | VALUE
Title | text string | (Optional; PDF 1.1) The document's title.
Author | text string | (Optional) The name of the person who created the document.
Subject | text string | (Optional; PDF 1.1) The subject of the document.
Keywords | text string | (Optional; PDF 1.1) Keywords associated with the document.
Creator | text string | (Optional) If the document was converted to PDF from another format, the name of the application (for example, Adobe FrameMaker) that created the original document from which it was converted.
Producer | text string | (Optional) If the document was converted to PDF from another format, the name of the application (for example, Acrobat Distiller) that converted it to PDF.
CreationDate | date | (Optional) The date the document was created, in human-readable form.
ModDate | date | (Optional; PDF 1.1) The date the document was last modified, in human-readable form.
Trapped | name | (Optional; PDF 1.3) A name object indicating whether the document has been modified to include trapping information.
Example 1.1 shows a typical document information dictionary.
Example 1.1
1 0 obj
<< /Title (PostScript Language Reference, Third Edition)
/Author (Adobe Systems Incorporated)
/Creator (Adobe FrameMaker 5.5.3 for Power Macintosh)
/Producer (Acrobat Distiller 3.01 for Power Macintosh)
/CreationDate (D:19970915110347-08'00')
/ModDate (D:19990209153925-08'00')
>>
endobj
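The CreationDate and ModDate values in Example 1.1 use the PDF date syntax, which, as far as the PDF Reference describes it, has the form D:YYYYMMDDHHmmSS followed by an optional offset from UTC such as -08'00'. The following Python sketch converts such a string to a `datetime`. The function name `parse_pdf_date` is hypothetical, and the sketch assumes that omitted month and day fields default to 1 and omitted time fields default to 0.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional offset: Z, +HH'mm', or -HH'mm'.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?:(?P<tzh>\d{2})'(?:(?P<tzm>\d{2})'?)?)?)?$"
)


def parse_pdf_date(value: str) -> datetime:
    """Convert a PDF date string into a datetime (naive if no offset is given)."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    parts = match.groupdict()
    tzinfo = None  # no offset present: relationship to UTC is unknown
    if parts["sign"] == "Z":
        tzinfo = timezone.utc
    elif parts["sign"] in ("+", "-"):
        offset = timedelta(
            hours=int(parts["tzh"] or 0),
            minutes=int(parts["tzm"] or 0),
        )
        tzinfo = timezone(-offset if parts["sign"] == "-" else offset)
    return datetime(
        int(parts["year"]),
        int(parts["month"] or 1),
        int(parts["day"] or 1),
        int(parts["hour"] or 0),
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
        tzinfo=tzinfo,
    )


created = parse_pdf_date("D:19970915110347-08'00'")
print(created.isoformat())  # 1997-09-15T11:03:47-08:00
```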
Metadata Streams
Metadata, both for an entire document and for components within a document, can be stored in PDF streams called metadata streams (PDF 1.4). The advantages of metadata streams over the document information dictionary include the following:
 PDF-based workflows often embed metadata-bearing artwork as components within larger documents. Metadata streams provide a standard way of preserving the metadata of these components for examination downstream. PDF-aware applications should be able to derive a list of all metadata-bearing document components from the PDF document itself.
 PDF documents are often made available on the World Wide Web or in other environments, where many tools routinely examine, catalog, and classify documents. These tools should be able to understand the self-contained description of the document even if they do not understand PDF.
Besides the usual entries common to all stream dictionaries, the metadata stream dictionary contains the additional entries listed in Table 1.2.
The content of a metadata stream is the metadata, represented in Extensible Markup Language (XML). This information is visible as plain text to tools that are not PDF-aware only if the metadata stream is both unfiltered and unencrypted.
Table 1.2 Additional entries in a metadata stream dictionary
KEY | TYPE | VALUE
Type | name | (Required) The type of PDF object that this dictionary describes; must be Metadata for a metadata stream.
Subtype | name | (Required) The type of metadata stream that this dictionary describes; must be XML.
The format of the XML representing the metadata is defined as part of an XML metadata framework specified by Adobe. This framework provides a way to use XML to represent metadata describing documents and their components, and is intended to be adopted by a wider class of applications than just those that process PDF. It includes a method to embed XML data within non-XML data files in a platform-independent format that can be easily located and accessed by simple scanning rather than requiring the document file to be parsed. A metadata stream can be attached to a document through the Metadata entry in the document catalog.
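As a rough illustration of how the entries in Table 1.2 fit together, the Python sketch below serializes an unfiltered, unencrypted metadata stream as an indirect PDF object. The function name `build_metadata_stream`, the object number, and the placeholder XML are assumptions made for this example. Length is one of the usual stream dictionary entries and gives the byte count of the stream data.

```python
def build_metadata_stream(object_number: int, xml_packet: bytes) -> bytes:
    """Serialize XML metadata as an unfiltered metadata stream object."""
    header = (
        f"{object_number} 0 obj\n"
        f"<< /Type /Metadata\n"
        f"/Subtype /XML\n"
        f"/Length {len(xml_packet)}\n"   # byte count of the stream data only
        f">>\n"
        f"stream\n"
    ).encode("ascii")
    # No filter is applied, so the XML stays readable to non-PDF tools.
    return header + xml_packet + b"\nendstream\nendobj\n"


# Placeholder XML standing in for a real metadata packet.
packet = b"<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'/>"
stream_object = build_metadata_stream(3, packet)
print(stream_object.decode("ascii"))
```

With this numbering, the document catalog would point at the stream with an entry such as /Metadata 3 0 R.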
PDFMetamaker writes metadata to both Dictionaries and Streams
The PDFMetamaker application writes collected document metadata into both the DocInfo dictionary of a PDF file and the metadata stream, where it is stored as XML packet data.
SearchPDF can index metadata from either Dictionaries or Streams
SearchPDF today can index the metadata in the DocInfo dictionary. In a future upgrade it will index the metadata from either the DocInfo dictionary or from the XML Stream.
How Metadata is stored in PDF files
Metadata is stored in the DocInfo dictionary of the PDF file and, optionally, also as XMP-compliant XML metadata packets. If metadata is also stored as XML packets, the PDF file must be saved in the PDF version 1.4 format (earlier versions of PDF do not support XML data storage).
The DocInfo dictionary contains the standard fields Title, Author, Subject, and Keywords that are displayed in the Acrobat Reader. The DocInfo dictionary can also contain custom-defined fields that are part of the metadata schema for a document collection.
XMP Schema Definition -- Is XMP just Adobe's metadata schema?
A schema defines the structure of information records. Typically it is expressed as a set of properties, each with an associated type. For example, an informal description of the schema of a customer database would be something on the order of: (1) Name: string of up to 80 characters; (2) Customer ID: number of up to 10 digits; (3) Orders: a list of Order records. XMP is not Adobe's metadata schema. Rather, XMP is an extensible framework built on RDF that can be used to represent any number of schemas: some are standards, such as Dublin Core; others are recommended by Adobe, such as schemas for asset management; and some can be defined and used by customers or specific industry segments for their own needs.
What file formats does XMP support?
XMP is designed to work with any file format. Using the XMP toolkit, it is possible to extract XML packets containing XMP metadata from any file, even if the file format is unknown. Further, it might be possible to modify existing XML packets in place if the file format allows it. Adobe will be providing guidelines on how to add XML packets to existing publicly documented file formats that support extensibility, such as JPEG, GIF, TIFF, PNG, HTML, XML, and SVG.
What is a Schema Description Language (SDL)?
A Schema Description Language, commonly referred to as SDL, is a language for writing machine-readable descriptions of a schema. XMLSchema, RDFSchema, and DTDs are schema description languages. XMP doesn't support a specific SDL at the moment. As the technology matures, we will support XMLSchemas or RDFSchemas as a way to describe the schema in use in machine-readable form.
Why not use a simple DTD or XMLSchema description?
The simple answer is that they are not sufficient frameworks. For example, let's say that two different schemas need to be used: one to represent basic information about a document, such as keywords, and another to represent a list of people who have approved the document. The basic data structures of RDF, in this case a bag, can be used in both cases. Without RDF, the data structure and how to represent it in XML would have to be described in each schema, potentially in a different way, making the metadata more difficult to process.
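The point about sharing one data structure across schemas can be sketched in Python: a single helper reads an RDF bag no matter which schema the property belongs to. The `approvedBy` property and its namespace are hypothetical, invented only for this illustration, as are the sample values and the helper name `read_bag`. The keywords are shown under the Dublin Core `subject` property.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
APPROVAL = "http://example.com/ns/approval/"  # hypothetical schema namespace

SAMPLE = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:dc="{DC}" xmlns:ap="{APPROVAL}">
  <rdf:Description rdf:about="">
    <dc:subject>
      <rdf:Bag><rdf:li>metadata</rdf:li><rdf:li>PDF</rdf:li></rdf:Bag>
    </dc:subject>
    <ap:approvedBy>
      <rdf:Bag><rdf:li>Jane Doe</rdf:li><rdf:li>John Q. Public</rdf:li></rdf:Bag>
    </ap:approvedBy>
  </rdf:Description>
</rdf:RDF>
"""


def read_bag(root, namespace, name):
    """Return the items of the rdf:Bag stored under the given property."""
    prop = root.find(f".//{{{namespace}}}{name}")
    if prop is None:
        return []
    return [li.text for li in prop.findall(f"{{{RDF}}}Bag/{{{RDF}}}li")]


root = ET.fromstring(SAMPLE.strip())
keywords = read_bag(root, DC, "subject")
approvers = read_bag(root, APPROVAL, "approvedBy")
print(keywords)
print(approvers)
```

Neither schema has to define its own list syntax, so the same `read_bag` code serves both properties.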
Can files containing XML packets be used by applications that are not XMP aware?
Yes. For example, a JPEG file containing an XML packet can be displayed in a standard web browser.
Does XMP require RDF to be supported in its entirety?
No. XMP requires only the subset of RDF that is appropriate for representing metadata. For example, reification, the ability to express statements about statements such as "John Q. Public believes that Jane Doe is the author of this document", is not required.
What is an XML packet?
An XML packet is a way to embed arbitrary XML fragments into another file, whether that file is a binary format such as JPEG or TIFF, or a text format such as HTML or SVG.
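Locating packets by simple scanning, without parsing the host file format, might look like the Python sketch below. It assumes the packet is delimited by `<?xpacket begin=...?>` and `<?xpacket end=...?>` processing instructions, as described in Adobe's XMP documentation rather than in this article, and that the packet uses an 8-bit encoding such as UTF-8. The function name `find_xml_packets` and the sample bytes are hypothetical.

```python
import re

# Header PI, packet body, trailer PI; DOTALL lets the body span line breaks.
_PACKET = re.compile(
    rb"<\?xpacket\s+begin=.*?\?>(?P<body>.*?)<\?xpacket\s+end=.*?\?>",
    re.DOTALL,
)


def find_xml_packets(data: bytes) -> list:
    """Return the body of every XML packet found in raw file bytes."""
    return [match.group("body").strip() for match in _PACKET.finditer(data)]


# Made-up bytes standing in for a binary file with one embedded packet.
sample = (
    b"\xff\xd8 binary image data "
    b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    b"<doc>hello</doc>"
    b'<?xpacket end="w"?>'
    b" more binary data \xff\xd9"
)
print(find_xml_packets(sample))  # [b'<doc>hello</doc>']
```

Because the scan works on raw bytes, the same function applies to a JPEG, a TIFF, or an unfiltered PDF metadata stream, which is the property the packet format is designed to provide.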