XML: Document and Information Management

by Todd Freter,
Program Manager, Sun Microsystems
A classic article

This article was part of a series of articles about Extensible Markup Language published in 2000 that explored XML's origins, some early potential uses in applications, and its relationship to HTML, the current markup language for documents published on the Web. Those articles are at Sun.com. This article has been lightly edited from the original version (Copyright 2000, Sun Microsystems, Inc.).

The attention paid to XML (Extensible Markup Language), whose 1.0 standard was published February 10, 1998, is impressive. XML has been heralded as the next important internet technology, the next step following HTML, and the natural and worthy companion to the Java programming language itself. Enterprises of all stripes have rapturously embraced XML. An important role for XML is in managing not only documents but also the information components on which documents are based.

Document Management: Organizing Files

Document management as a technology and a discipline has traditionally augmented the capabilities of a computer's file system. By enabling users to characterize their documents, which are usually stored in files, document management systems enable users to store, retrieve, and use their documents more easily and powerfully than they can do within the file system itself.

Long before anyone thought of XML, document management systems were originally developed to help law offices maintain better control over and access to the many documents that legal professionals generate. The basic mechanisms of the first document management systems performed, among others, these simple but powerful tasks:

  • Add information about a document to the file that contains the document
  • Organize the user-supplied information in a database
  • Create information about the relationships between different documents

In essence, document management systems created libraries of documents in a computer system or a network. The document library contained a "card catalog" where the user-supplied information was stored and through which users could find out about the documents and access them. The card catalog was a database that captured information about a document, such as these:

  • Author: who wrote or contributed to the document
  • Main topics: what subjects are covered in the document
  • Origination date: when was it started
  • Completion date: when was it finished
  • Related documents: what other documents are relevant to this document
  • Associated applications: what programs are used to process the document
  • Case: to which legal case (or other business process) is the document related
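As a rough illustration, the card catalog described above can be modeled as a set of records that are searched by their metadata rather than by file name. The Python sketch below is only an assumption about how such a catalog might look: the CatalogEntry class, its field names, and the sample file paths are hypothetical and are not taken from any particular product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogEntry:
    """One hypothetical 'card' describing a document stored in a file."""
    path: str
    author: str
    topics: List[str]
    case: str
    related: List[str] = field(default_factory=list)


def find_by_topic(catalog: List[CatalogEntry], topic: str) -> List[str]:
    """Return the paths of all documents whose card lists the given topic."""
    return [entry.path for entry in catalog if topic in entry.topics]


# Illustrative sample data only.
catalog = [
    CatalogEntry("docs/a17.doc", "J. Pringle", ["contracts", "leases"], "Case 101"),
    CatalogEntry("docs/b02.doc", "R. Winkelhoff", ["leases"], "Case 102",
                 related=["docs/a17.doc"]),
]

lease_documents = find_by_topic(catalog, "leases")
print(lease_documents)
```

The file names here reveal nothing about the contents of the files, which is the point: the lookup goes through the catalog, not the directory listing.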

Armed with a database of such information about documents, users could find information in more sensible and intuitive ways than scanning different directories' lists of contents, hoping that a file's name might reveal what the file contained. Many people consider the first achievement of document management systems to have been the creation of "a file system within the file system."

Soon, document management systems began to provide additional and valuable functionality. By enriching the databases of information about the documents (the metadata), these systems provided these capabilities:

  • Version tracking: see how a document evolves over time
  • Document sharing: see in what business processes the document is used and re-used
  • Electronic review: enable users to add their comments to a document without actually changing the document itself
  • Document security: refine the different types of access that different users need to the document
  • Publishing management: control the delivery of documents to different publishing process queues
  • Workflow integration: associate the different stages of a document's life-cycle with people and projects with schedules

These critical capabilities (among others) of document management systems have proven enormously successful, fueling a multibillion-dollar business.

XML: Managing Document Components

XML and its parent technology, SGML (Standard Generalized Markup Language), provide the foundation for managing not only documents but also the information components of which the documents are composed. This is due to some notable characteristics of XML data.

Documents vs. Files

In XML, documents can be seen independently of files. One document can comprise many files, or one file can contain many documents. This is the distinction between the physical and logical structure of information. XML data is primarily described by its logical structure. In a logical structure, principal interest is placed on what the pieces of information are and how they relate to each other, and secondary interest is placed on the physical items that constitute the information.

Rather than relying on file headers and other system-specific characteristics of a file as the primary means for understanding and managing information, XML relies on the markup in the data itself. A chapter in a document is not a chapter because it resides in a file called chapter1.doc but because the chapter's content is contained in the <chapter> and </chapter> element tags.

Because elements in XML can have attributes, the components of a document can be extensively self-descriptive. For example, in XML you can learn a lot about the chapter without actually reading it if the chapter's markup is rich in attributes, as in <chapter language="English" subject="colonial economics" revision_date="19980623" author="Joan X. Pringle" thesis_advisor="Ramona Winkelhoff">. When the elements carry self-describing metadata with them, systems that understand XML syntax can operate on those elements in useful ways, just as a traditional document management system can. But there is a major difference.
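To make the idea concrete, here is a minimal sketch of how a program might read the chapter's metadata without looking at its text. It uses Python's standard xml.etree.ElementTree module; the <para> child element and its placeholder text are assumptions added only so that the chapter has some content.

```python
import xml.etree.ElementTree as ET

# The start tag is the one quoted in the article; the <para> child is assumed.
chapter_xml = (
    '<chapter language="English" subject="colonial economics" '
    'revision_date="19980623" author="Joan X. Pringle" '
    'thesis_advisor="Ramona Winkelhoff">'
    '<para>Chapter text would go here.</para>'
    '</chapter>'
)

chapter = ET.fromstring(chapter_xml)

# The attributes are available as a mapping, independent of the chapter body.
metadata = dict(chapter.attrib)
print(metadata["subject"])
print(metadata["revision_date"])
```

Nothing in this lookup depends on the name of the file that holds the chapter; the element's tag and attributes alone identify and describe it.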

Information vs. Documents

XML markup provides metadata for all components of a document, not merely the object that contains the document itself. This makes the pieces of information that constitute a document just as manageable as the fields of a record in a database. Because XML data follows syntactic rules for well-formedness and proper containment of elements, document management systems that can correctly read and parse XML data can apply the functions of a document management system, such as those mentioned above, to any and all information components inside the document.
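The following sketch suggests what applying a management function to every component might look like. It walks a small document and reports each component revised on or after a given date, at whatever depth it occurs. The <book> and <section> elements, the id attributes, and the sample dates are hypothetical; only the revision_date attribute name comes from the chapter example earlier in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; every component carries its own metadata.
document_xml = """
<book author="Joan X. Pringle">
  <chapter id="c1" subject="colonial economics" revision_date="19980623">
    <section id="s1" subject="trade routes" revision_date="19980501"/>
    <section id="s2" subject="tariffs" revision_date="19980623"/>
  </chapter>
  <chapter id="c2" subject="currency" revision_date="19980410"/>
</book>
"""

root = ET.fromstring(document_xml)


def components_revised_since(element, date):
    """Return (tag, id) for every component revised on or after `date`.

    Dates are compared as YYYYMMDD strings, which sort chronologically.
    Components without a revision_date attribute are skipped.
    """
    return [
        (node.tag, node.get("id"))
        for node in element.iter()
        if node.get("revision_date", "") >= date
    ]


recent = components_revised_since(root, "19980601")
print(recent)
```

A file-level system could report only that the book changed; here the same question is answered for individual chapters and sections.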

XML's focus on information rather than documents offers some important capabilities:

  • Reuse of information: While standard document management systems do offer some measure of information reuse through file sharing, information management systems based on XML or SGML enable people to share pieces of common information without storing the piece of information in multiple places.

  • Information harvesting: By enabling people to focus on information components that make up documents rather than on the documents themselves, these systems can identify and capture useful information components that have ongoing value "buried" inside documents whose value as documents is limited. That is, a particular document may be useful only for a short time, but chunks of information inside that document may be reusable and valuable for a longer period.

  • Fine-granularity text-management applications: Because the information components in XML documents are identifiable, manipulatable, and manageable, XML information management technology can support real economies in applications such as translation of technical manuals.
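The reuse capability in particular lends itself to a short sketch. In the Python example below, a single warning is stored once and pulled into two manuals when they are assembled. The <include ref="..."> convention, the element names, and the assemble function are all assumptions made for illustration; they are not part of the XML standard or of any specific product.

```python
import copy
import xml.etree.ElementTree as ET

# A hypothetical component store: each reusable piece is kept exactly once.
components = {
    "warning-voltage": ET.fromstring(
        '<warning id="warning-voltage">Disconnect power before servicing.</warning>'
    ),
}

# Two documents that refer to the shared component instead of copying it.
install_guide = ET.fromstring(
    '<manual title="Installation Guide"><include ref="warning-voltage"/></manual>'
)
service_guide = ET.fromstring(
    '<manual title="Service Guide"><include ref="warning-voltage"/></manual>'
)


def assemble(document, store):
    """Return a copy of `document` with each <include> replaced by its component."""
    result = copy.deepcopy(document)
    # Collect the elements first so the tree is not modified while it is walked.
    for parent in list(result.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == "include":
                replacement = copy.deepcopy(store[child.get("ref")])
                replacement.tail = child.tail
                parent[index] = replacement
    return result


print(ET.tostring(assemble(install_guide, components), encoding="unicode"))
print(ET.tostring(assemble(service_guide, components), encoding="unicode"))
```

Under this arrangement, a correction or a translation of the warning would be made once in the store and would appear in every manual the next time it is assembled.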

Evaluating Product Offerings

While the general world of document management and information management is moving toward adoption of structured information and use of XML and SGML, some product offerings distinguish themselves by using underlying database management products with native support for object-oriented data. Object-oriented data matches the structure of XML data quite well, and database systems that comprehend object-oriented data adapt well to the tasks of managing XML information.

By contrast, other information management products that comprehend XML or SGML data use relational database systems and provide their own object-oriented extensions to those systems in order to handle object-oriented data such as XML or SGML data. Products relying on such implementations have also garnered success and respect in the document management marketplace.
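To give a feel for what the relational approach involves, the sketch below stores a small XML tree in a single table, with each row pointing to its parent row. This is one common general technique for fitting tree-shaped data into relational tables, shown here as an assumption; it does not describe how any particular product implements its extensions. The table layout, column names, and sample elements are hypothetical, and the example uses Python's standard sqlite3 module with an in-memory database.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample data.
root = ET.fromstring(
    '<chapter subject="colonial economics">'
    '<section subject="trade routes"/>'
    '<section subject="tariffs"/>'
    '</chapter>'
)

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE element ("
    "id INTEGER PRIMARY KEY, parent_id INTEGER, tag TEXT, subject TEXT)"
)


def store(element, parent_id=None):
    """Insert one element as a row, then recurse into its children."""
    cursor = db.execute(
        "INSERT INTO element (parent_id, tag, subject) VALUES (?, ?, ?)",
        (parent_id, element.tag, element.get("subject")),
    )
    row_id = cursor.lastrowid
    for child in element:
        store(child, row_id)


store(root)

# Containment has to be recovered with a join on the parent reference.
rows = db.execute(
    "SELECT child.subject FROM element AS child "
    "JOIN element AS parent ON child.parent_id = parent.id "
    "WHERE parent.tag = 'chapter' ORDER BY child.id"
).fetchall()
print(rows)
```

The containment that XML expresses directly by nesting has to be rebuilt here through parent references and joins, which hints at why products built on relational systems add their own layer for handling hierarchical data.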


About the Author

Todd Freter is Program Manager, Advanced Development and Industry Initiatives, Java Web Services, Sun Microsystems, Inc. He has over 23 years of experience in the software business. His role in XML evangelism began with his work at Novell, where he worked on the team that used XML's predecessor, SGML, to deliver one of the first wide-distribution online documentation disks in the software industry, a documentation CD for NetWare 4. At Sun, Freter has managed programs using SGML and XML to publish large amounts of technology information and documentation on the Internet, and from there he moved into wider XML evangelism and standardization efforts, including the use of XML for B2B transports (ebXML) in a variety of vertical industries. Freter was an early online publisher of information about XML starting in 1998, discussing its possible applications, many of which have been fully realized today.


Citation

Freter, T., "XML: Document and Information Management", DSSResources.COM, 09/03/2004.


Sun Microsystems, Inc. has provided permission to archive and feature this article, including some minor revisions to the original approved by the author, at DSSResources.COM on September 3, 2004. This article was posted at DSSResources.COM on September 3, 2004.