EXCERPT
Corporate Information Factory
William H. Inmon Ryan Sousa

The Metadata Component

The most important yet most ambiguous, most amorphous component of the Corporate Information Factory (CIF) is the metadata. From the standpoint of cohesiveness and continuity of structure across the many different components of the CIF, metadata is easily the most important component.

What Is Metadata?

A very common way of thinking about metadata is that it is data about data. An alternate way of describing metadata is that it is everything about data needed to promote its administration and use. These two widely accepted definitions of metadata, however, do not do justice to what metadata really is. A more practical approach is to describe metadata by citing some examples:

  • Data layout. The customer-record layout contains the list of attributes, along with their relative position and format on the storage media:
      • Cust-id char (15)
      • Cust-name varchar (45)
      • Cust-address varchar (45)
      • Cust-balance dec fixed (15, 2)
  • Content. There are 150,000 occurrences of transaction X in table PLK.
  • Indexes. Table XYZ has indexes on the following columns:
      • Column HYT
      • Column BFD
      • Column MJI
  • Refreshment scheduling. Table ABC is refreshed every Tuesday at 2:00 P.M.
  • Usage. Only 2 of the 10 columns in table ABC have been used over the past six months.
  • Referential integrity. Table XYZ is related to table ABC by means of the key QWE.
  • General documentation. "Table ABC was designed in 1975 as part of the new accounts payable system. Table ABC contains accounts overdue data as calculated by . . . "

These examples of metadata only begin to scratch the surface of the possibilities. The final form will only be limited by your imagination and those needs that govern the use and administration of the CIF.
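As a rough illustration, several of the examples above can be written down as plain records. The sketch below is only one possible shape; the class names, field names, and the table name "customer" are assumptions made for illustration and do not come from the CIF itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ColumnMeta:
    name: str
    data_type: str  # physical format, e.g. "char(15)"


@dataclass
class TableMeta:
    name: str
    columns: List[ColumnMeta] = field(default_factory=list)
    indexed_columns: List[str] = field(default_factory=list)
    refresh_schedule: str = ""
    documentation: str = ""


# Data layout: the customer-record layout from the list above.
# The table name "customer" is hypothetical.
customer = TableMeta(
    name="customer",
    columns=[
        ColumnMeta("cust_id", "char(15)"),
        ColumnMeta("cust_name", "varchar(45)"),
        ColumnMeta("cust_address", "varchar(45)"),
        ColumnMeta("cust_balance", "dec fixed(15, 2)"),
    ],
)

# Indexes and refreshment scheduling, again taken from the list above.
table_xyz = TableMeta(name="XYZ", indexed_columns=["HYT", "BFD", "MJI"])
table_abc = TableMeta(name="ABC", refresh_schedule="every Tuesday at 2:00 P.M.")


def describe_layout(table: TableMeta) -> List[str]:
    """Return one 'name type' line per column, in storage order."""
    return [f"{c.name} {c.data_type}" for c in table.columns]
```

Calling describe_layout(customer) returns the four attribute lines in the order they were declared, which is the kind of answer a tool would need when it asks how a record is laid out.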

The reason why metadata is so important to the corporate information factory, and its different components, is that metadata is the glue that holds the architecture together. Figure 12.1 illustrates this role of metadata.

Without metadata, the different components of the CIF are merely standalone structures with no relationship to any other structure. It is metadata that gives the different structures (the components of the architecture) an overall cohesiveness. Through metadata, one component of the architecture is able to interpret and make sense of what another component is trying to communicate.

The Conflict within Metadata

Despite all the benefits of metadata, a conflict exists: metadata has a need to be shared and a propensity to be managed and used in an autonomous manner. Unfortunately, these propensities are in direct conflict with each other. Equally unfortunate is that the pull of metadata is very, very strong in both directions at the same time. Because of this conflict, metadata can be thought of as polarized, as shown in Figure 12.2.

Because the pull is so strong and so diametrically opposite, metadata is sometimes said to be schizophrenic. This leads to some fairly extreme conflicts within the CIF. As an example of the propensity of metadata to be shared, consider an architect who is building an operational database. If the operational database is to be integrated, the basic integrated, modeled design of the data needs to be shared from the data modeling environment. As the data ages and is pushed off to the data warehouse, the structure of the data needs to be shared within the data warehouse. As the data is moved into a data mart, the data once again needs to be shared. If there is to be any consistency across the corporate information factory, then metadata must be shared.

Is Centralization the Answer?

In response to the need for a central, unified source of data definition, structure, content, and use across the corporate information factory, the notion of a central repository arises (see Figure 12.3).

The central repository approach is a good solution to the needs for sharability of metadata. However, is a central repository the answer to the organization's needs?

Consider the scenario of a DSS analyst working in Excel who is deep into solving an analytical problem. At 2:17 A.M. the DSS analyst has an epiphany and makes changes to the spreadsheet. Does the analyst need to call the data warehouse administrator and ask permission to define and analyze a new data structure? Does the DSS analyst need permission to derive a new data element on the fly? Shouldn't this new data structure and new data element be documented in the metadata repository?

Of course, the DSS analyst does not need or want anyone telling him or her what can and cannot be done at an early morning hour in the middle of a creative analysis of a problem. The DSS analyst operates in a state of autonomy and the central repository is neither welcome nor effective. The repository simply gets in the way of the DSS analyst using Excel. The DSS analyst does what is needed to bypass the repository, usually ignoring updates to the metadata in the repository.

The need for autonomy at the DSS analysis level is so overwhelming and the tools of DSS access and analysis are so powerful that a central repository does not stand a chance as a medium for total metadata management.

Is Autonomy the Answer?

If a powerful case can be made for why a central repository is not the answer, consider the opposite of the central repository where everybody "does their own thing," as shown in Figure 12.4.

In the figure, autonomy exists in that every environment and every tool has its own unique facility for interpreting metadata. Because no constraints of any type can be found anywhere in the environment, complete autonomy is in place. The autonomy suggested by the figure is pleasing to the DSS analyst in that no one or no authority is telling the DSS analyst what to do. However, the following questions arise:

  • What happens when one DSS user wants to know how data is defined and used elsewhere?
  • What happens when a DSS user wants to know what a unit of data means?
  • What happens when one DSS user wants to correlate results with another DSS analyst?
  • What happens when a DSS analyst needs to reconcile results with the source systems providing the data for analysis?
  • What happens when data is not interpreted consistently? Which profit figures are correct?

The result is chaos. There simply is no uniformity or cohesiveness anywhere to be found in the autonomous environment. The purely autonomous environment is as impractical and unworkable as the central repository environment.

Achieving a Balance

Neither approach is acceptable in the long run to the architect wanting to make the corporate information factory a professionally organized environment.

In order to be successful, the CIF architect must balance the legitimate need to share metadata with the need for autonomy. Understanding the problem is the first step to a solution. In order to achieve a balance, a different perspective of metadata is required: distributed metadata.

The structure for metadata suggested in Figure 12.5 states that there must be a separation of metadata at each component in the architecture between metadata that is sharable and metadata that is autonomous. Metadata must be divided at each component of the CIF:

  • Applications
  • Operational data store (ODS)
  • Data warehouse
  • Data mart
  • Exploration/data mining warehouse

Furthermore, all metadata at an individual node must also fit into either a shared or autonomous category. There can be no metadata that is neither sharable nor autonomous. Likewise, metadata cannot be sharable and autonomous at the same time.

But Figure 12.5 has other ramifications. The metadata that is managed at an individual component must be accessible by tools that reside in that component. The tools may be tools of access, analysis, or development. In any case, whether the metadata is sharable or autonomous, the metadata needs to be available to and usable by the different tools that reside at the architectural component.

Another implication of the figure is that sharable metadata must be able to be replicated from one architectural component to another. Once the sharable metadata is replicated, it can be used and incorporated into the processing that occurs at other components of the corporate information factory.
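The separation and replication just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a prescribed design: the ComponentMetadata class, the Scope enumeration, and the entry names are all hypothetical.

```python
from enum import Enum


class Scope(Enum):
    SHARABLE = "sharable"
    AUTONOMOUS = "autonomous"


class ComponentMetadata:
    """Metadata held at one CIF component (hypothetical structure)."""

    def __init__(self, component):
        self.component = component
        self.entries = {}  # entry name -> (Scope, value)

    def put(self, name, value, scope):
        # Each entry carries exactly one scope, so no entry can be
        # both sharable and autonomous, and none can be neither.
        if not isinstance(scope, Scope):
            raise ValueError("metadata must be sharable or autonomous")
        self.entries[name] = (scope, value)

    def sharable(self):
        """Return only the sharable entries as name -> value."""
        return {
            name: value
            for name, (scope, value) in self.entries.items()
            if scope is Scope.SHARABLE
        }

    def replicate_to(self, other):
        """Copy sharable entries to another component; autonomous ones stay put."""
        for name, value in self.sharable().items():
            other.put(name, value, Scope.SHARABLE)


warehouse = ComponentMetadata("data warehouse")
warehouse.put("table XYZ layout", "attributes and formats", Scope.SHARABLE)
warehouse.put("table XYZ row count", 150_000, Scope.AUTONOMOUS)

mart = ComponentMetadata("data mart")
warehouse.replicate_to(mart)
```

After replication, the data mart holds the table layout but not the row count, which remains local to the data warehouse.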

Differentiating Sharable and Autonomous Metadata

The following are examples of sharable metadata:

  • The table name and attributes shared among applications and the data warehouse
  • Some definitions of data shared among the enterprise data model, the data warehouse, and the data mart
  • Physical attributes shared among the applications and the ODS
  • Physical attributes shared from one application to another
  • Description of how shared data is transformed as it moves through the integration and transformation (I & T) layer

Ultimately, very commonly used metadata needs to be sharable.

An example of autonomous metadata might be the indexes a table has for its use in an application. At the data mart level, the results of a given analysis might be autonomous. At the data warehouse level, the metrics of the table content may well be autonomous metadata. At the ODS level, the response time achieved in access and analysis of the ODS is a form of autonomous metadata. In short, many varied forms of autonomous metadata exist. In general, much more autonomous metadata can be found than shared metadata.

Defining the System of Record

In order to make the structure workable as suggested in Figure 12.5, there needs to be a clear and formal definition of the system of record (i.e., authoritative source) for shared metadata (see Figure 12.6). The system of record for shared metadata implies that each shared metadata element must be owned and maintained by only one component of the corporate information factory (such as a data warehouse, data mart, ODS, etc.). In contrast, this shared metadata can be replicated for use by all components of the corporate information factory.

For example, suppose that the definition of shared metadata is made at the data warehouse level (a very normal assumption). The implication is that the definition can be used throughout the CIF but cannot be altered anywhere except at the data warehouse. In other words, the data mart DSS analyst can use the metadata but cannot alter the metadata.
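The ownership rule can be sketched as a small registry in which every shared element has exactly one owning component. The class name, method names, and the sample definition below are assumptions made for illustration only.

```python
class SharedMetadataRegistry:
    """Tracks the single owning component for each shared element (a sketch)."""

    def __init__(self):
        self._owner = {}       # element -> owning component
        self._definition = {}  # element -> current definition

    def define(self, element, component, definition):
        """Create or alter a definition; only the system of record may alter it."""
        owner = self._owner.get(element)
        if owner is not None and owner != component:
            raise PermissionError(
                f"{element!r} is owned by {owner}; {component} may use it but not alter it"
            )
        self._owner[element] = component
        self._definition[element] = definition

    def read(self, element):
        """Any component may read the shared definition."""
        return self._definition[element]


registry = SharedMetadataRegistry()
# Hypothetical definition made at the data warehouse level.
registry.define("product", "data warehouse", "an item offered for sale")

try:
    registry.define("product", "data mart", "anything on an invoice")
    mart_change_accepted = True
except PermissionError:
    mart_change_accepted = False
```

Here the data mart can still call read("product"), but its attempt to redefine the element is rejected because the data warehouse is the system of record.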

Establishing the system of record for metadata is a defining factor in the successful implementation and evolution of the CIF. Another important implication of the approach to distributed metadata is that there be an exchange of this meta object among the different architectural components (see Figure 12.7).

The sharable meta object needs to be passed efficiently and on an on-demand basis. One of the more important implications of this sharability is that it be shared across multiple technologies, as illustrated in Figure 12.8.

Using Metadata

The whole notion of sharing metadata across different components of the architecture along with the autonomy of metadata at each component level results in a larger architecture. The larger picture shown in Figure 12.9 shows that indeed a balance exists between sharability and autonomy. With this architecture, the end user has all the autonomy desired, and the CIF architect has all the sharability and uniformity of definition that is desired. The conflict between sharability and autonomy that is evident in metadata is resolved quite nicely by the architecture outlined in Figure 12.9.

Consider the usage of this architecture in Figure 12.10, where a DSS analyst at the data mart level desires to know the lineage of attribute ABC in table XYZ. The data mart analyst recognizes that sharable information about table XYZ is available. The DSS analyst asks the data warehouse what information is available. The data warehouse offers sharable information about the table and the attribute in question. If the DSS analyst has what is needed, the question is resolved. If the DSS analyst is still unsure about the table and its attributes, he or she will push further into the network of sharable metadata and go back to the application. The DSS analyst then learns even more about the metadata in question. If still further questions are being asked, the DSS analyst can look inside the I & T layer or even go back to the enterprise model to see more information about the table and attribute.
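The lineage search just described amounts to asking each component in turn what sharable metadata it holds, and stopping as soon as the analyst is satisfied. The sketch below assumes each component exposes its sharable metadata as a simple lookup keyed by table and attribute; the descriptions stored in it are invented placeholders.

```python
# Components in the order the analyst consults them. The descriptions are
# hypothetical and stand in for whatever each component actually publishes.
SEARCH_ORDER = [
    ("data warehouse", {("XYZ", "ABC"): "loaded from the source application"}),
    ("application", {("XYZ", "ABC"): "captured at data entry"}),
    ("I & T layer", {("XYZ", "ABC"): "reformatted during integration"}),
    ("enterprise model", {("XYZ", "ABC"): "business definition of the attribute"}),
]


def lineage(table, attribute, components):
    """Yield (component, description) pairs, one component at a time.

    Because this is a generator, the caller can stop after the first
    answer or keep pulling to push further back toward the source.
    """
    for name, sharable in components:
        description = sharable.get((table, attribute))
        if description is not None:
            yield name, description


search = lineage("XYZ", "ABC", SEARCH_ORDER)
first_answer = next(search)   # what the data warehouse offers
second_answer = next(search)  # pushing further, back to the application
```

A generator fits the narrative because the analyst decides after each answer whether to continue; nothing further is looked up unless it is asked for.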

Another use of sharable metadata is that of impact analysis (see Figure 12.11). In the figure, an application programmer is getting ready to make a change to some part of the application. The CIF architect will ask the following questions:

  • When will the change be made in the application?
  • What elements of data are going to be affected across the CIF?

With the formalized structure of metadata, it is a relatively simple matter to determine what data in what environment will be impacted by a change in the applications environment.
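One way to picture impact analysis is as a walk over "feeds" relationships recorded in the sharable metadata. The sketch below is an illustration only: the element names and the FEEDS mapping are hypothetical, and a real environment would derive these links from its own metadata.

```python
from collections import deque

# Hypothetical "feeds" links: each element maps to the elements derived from it.
FEEDS = {
    "application.orders.amount": ["warehouse.order_fact.amount"],
    "warehouse.order_fact.amount": [
        "finance_mart.revenue.total",
        "sales_mart.orders.amount",
    ],
}


def impacted(element, feeds):
    """Return every downstream element reachable from `element`, breadth-first."""
    seen = set()
    order = []
    queue = deque([element])
    while queue:
        current = queue.popleft()
        for downstream in feeds.get(current, []):
            if downstream not in seen:
                seen.add(downstream)
                order.append(downstream)
                queue.append(downstream)
    return order


affected = impacted("application.orders.amount", FEEDS)
```

The result lists the data warehouse element first and then the two data mart elements that depend on it, so the architect can see how far a change in the application would reach.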

Operational versus DSS Usage

Metadata plays a very different role in the DSS and the operational environments. Consider an analogy: after driving a familiar route for a period of time, drivers take road signs for granted and ignore their information because it is simply not needed.

If the same activity is repeated over and over, metadata can get in the way of the end user. After the tenth time that an end user repeats the same activity, the end user hardly glances at whatever metadata is present and may even complain that the metadata display gets in the way of doing his or her job.

However, when drivers are in unfamiliar territory, road signs make all the difference in the world. The same can be said for metadata in the DSS environment. When the DSS analyst is doing a new report, metadata is invaluable in telling the DSS analyst what he or she needs to know to get started and do an effective DSS analysis. Metadata plays an entirely different role in the world of DSS than it does in the world of operational systems.

Because of this difference, it is worth noting how metadata relates to the CIF based on the differences between operational processing and DSS processing. Figure 12.12 shows that metadata in the operational environment is a by-product of processing. In fact, metadata in the operational environment is of most use to the developer and the designer. Metadata in the DSS environment, on the other hand, is of great use to the DSS analyst as an active part of the analytical, informational effort. This is due largely to the fact that, in this environment, the end user has more control over how data is interpreted and used.

Versioning of Metadata

One of the characteristics of the data warehouse environment is that it contains a robust supply of historical information. It is not unusual for it to contain 5 to 10 years' worth of information. As such, history is a dimension of the data warehouse that is not present or important elsewhere in the CIF.

Consider the DSS analyst who is trying to compare 1996 data with 1990 data. The DSS analyst may be having a difficult time for a variety of reasons:

  • 1996 data had a different source of data than 1990 data.
  • 1996 data had a different definition of a product than 1990 data's definition.
  • 1996 had a different marketing territory than 1990's marketing territory.

Potentially many different factors may make the data from 1996 incompatible with data from 1990. The DSS analyst needs to know what those factors are if a meaningful comparison of data is to be made across the years. In order to be able to understand the difference in information across the years, the DSS analyst needs to see metadata that is versioned.

Versioned metadata is metadata that is tracked over time. Figure 12.13 shows that as changes are made to data over time, those changes are reflected in the metadata as different versions of metadata are created.

One of the characteristics of versioned metadata is that each occurrence of it contains a from-date and a to-date, resulting in a continuous state record.

Once the continuous versions of metadata are created, the DSS analyst can use those versions to understand the content of data in its historical context. For example, the DSS analyst can solicit answers to such key questions as the following:

  • On December 29, 1995, what was the source of data for file XYZ?
  • On October 14, 1996, what was the definition of a product?
  • On July 20, 1994, what was the price of product ABC?

Versioned metadata for the data warehouse adds an extremely important dimension of data.
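A from-date and a to-date on each version are enough to answer "as of" questions like those above. The sketch below assumes inclusive dates and treats the record as continuous when each version starts the day after the previous one ends; the two source-system names are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MetadataVersion:
    from_date: date
    to_date: date  # inclusive
    value: str


# Hypothetical history of "source of data for file XYZ".
SOURCE_OF_XYZ = [
    MetadataVersion(date(1990, 1, 1), date(1995, 6, 30), "legacy order system"),
    MetadataVersion(date(1995, 7, 1), date(1996, 12, 31), "new order system"),
]


def as_of(versions, when):
    """Return the version in effect on `when`, or None if none applies."""
    for version in versions:
        if version.from_date <= when <= version.to_date:
            return version
    return None


def is_continuous(versions):
    """True when each version begins the day after the previous one ends."""
    ordered = sorted(versions, key=lambda v: v.from_date)
    return all(
        (later.from_date - earlier.to_date).days == 1
        for earlier, later in zip(ordered, ordered[1:])
    )


answer = as_of(SOURCE_OF_XYZ, date(1995, 12, 29))
```

With this sample history, the question about December 29, 1995 resolves to the second version, and is_continuous confirms there is no gap between the two versions.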

Archiving and Metadata

In the same vein as versioned metadata, the CIF architect must consider the role of metadata as data is archived. When data is removed from the data warehouse, data mart, or the ODS, it can be discarded or, more likely, archived onto secondary storage. As this happens, it makes sense to store the metadata relating to the archived data along with the archived data. By doing this, the CIF architect ensures that at a later point in time, archival information will be available in the most efficient and effective manner. Figure 12.14 shows that metadata should be stored with the archival information.
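A simple way to keep metadata with archived data is to write both into the same archive unit. The sketch below uses a JSON bundle purely as an illustration; the bundle layout, function names, and sample row are assumptions, not a format prescribed by the CIF.

```python
import json


def build_archive(rows, metadata):
    """Bundle archived rows with the metadata that describes them."""
    return json.dumps({"metadata": metadata, "rows": rows}, sort_keys=True)


def read_archive(blob):
    """Recover (metadata, rows) from a bundle written by build_archive."""
    bundle = json.loads(blob)
    return bundle["metadata"], bundle["rows"]


# Hypothetical archive of one customer row and its layout.
archive = build_archive(
    rows=[{"cust_id": "000000000000001", "cust_balance": "125.00"}],
    metadata={
        "table": "customer",
        "columns": {"cust_id": "char(15)", "cust_balance": "dec fixed(15, 2)"},
    },
)
restored_metadata, restored_rows = read_archive(archive)
```

Because the layout travels with the rows, whoever restores the archive later does not depend on the original system still being around to explain it.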

Capturing Metadata

The Achilles heel of metadata has always been in its capture. When applications were built in the 1960s, no documentation or captured metadata existed. Organizations realized in the late 1980s and early 1990s that no metadata had been developed for the systems that were written years ago. Trying to go back in time and reconstruct metadata that was attached to systems that were written decades ago was a daunting task. The obstacles of trying to reconstruct metadata 20 years after the fact were many:

1. The people who wrote the systems originally were not available because:

  • They had been promoted.
  • They had left for another job.
  • They had forgotten.
  • They never understood the data in the first place.

2. The budget for development had dried up years ago. Trying to show management tangible benefits from a data dictionary project or a repository project required a huge imagination on the part of the manager footing the bill.

3. Physically gathering the metadata information was its own challenge. In many cases, source code had been lost a long time ago.

4. Even if metadata could be recovered from an old application system, only some metadata was available. A complete picture of metadata was almost impossible to reconstruct.

5. Even if metadata could be recovered from an old legacy system, keeping the metadata manager (a dictionary or a repository) in sync with updates made to the system was almost impossible.

For these reasons and more, capturing applications' metadata after the fact is a very difficult thing to do.

A much better alternative is the capturing of metadata during the active development process. When a tool of automation is used for the development process and is able to produce metadata as a by-product of its code that is produced, then the possibility of creating metadata automatically arises.

Figure 12.15 shows a tool that produces metadata as a by-product of the creation of code. This approach has many advantages:

  • Metadata creation does not require another budgetary line item; it comes automatically.
  • Updating metadata as changes are made is not a problem.
  • Metadata versions are created every time a new version of code is created.
  • Programmers do not know that they are creating metadata. Programmers think they are building systems. The creation of the metadata comes spontaneously, unknown to the programmer.
  • The logic of transformation can be trapped. The transformation tool can understand exactly what conversion and reformatting logic is occurring.

The automatic creation of code offers many advantages in the capture of metadata and, ultimately, in productivity.
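The idea of metadata as a by-product can be sketched with a toy generator that emits code and a metadata record from the same specification. This is an illustration under stated assumptions, not a description of any particular tool; the function name, the record layout, and the version number are hypothetical.

```python
def generate_table(name, columns, code_version):
    """Return (ddl, metadata) built from one specification.

    `columns` is a list of (column_name, column_type) pairs. The metadata
    record is produced in the same step as the code, so the two cannot
    drift apart, and each code version yields a matching metadata version.
    """
    column_sql = ",\n".join(f"    {col} {typ}" for col, typ in columns)
    ddl = f"CREATE TABLE {name} (\n{column_sql}\n);"
    metadata = {
        "table": name,
        "columns": [{"name": col, "type": typ} for col, typ in columns],
        "code_version": code_version,
    }
    return ddl, metadata


ddl, metadata = generate_table(
    "customer",  # hypothetical table name
    [("cust_id", "CHAR(15)"), ("cust_name", "VARCHAR(45)")],
    code_version=2,
)
```

The programmer only asks for the table; the metadata record arrives alongside the generated statement without any separate documentation step.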

Meta-Process Information

While metadata is undoubtedly the center of attention in the CIF, other types of meta objects are available for usage. One such type of meta object is meta-process information. Although metadata is descriptive data about data, meta-process information is descriptive information about code or processes. Meta-process information is useful anywhere there is a large body of code. The three most obvious places in the CIF where this occurs are:

1. At the I & T layer

2. Within the applications

3. As data passes to the data mart from the data warehouse

Uses at the Integration and Transformation Layer

The I & T layer meta-process information is interesting to the DSS analyst as he or she tries to determine how a unit of data was derived from the applications. When the DSS analyst looks at a value of $275 in the data warehouse and sees that the source value was $314, he or she needs to know what was going on in terms of processing logic inside the I & T interface. The DSS analyst needs to see meta-process information about the I & T interface.
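Meta-process information of this kind can be pictured as a recorded trace of each processing step. In the sketch below, the single rule that turns $314 into $275 is invented purely for illustration; the text does not say what logic the I & T interface actually applied.

```python
from decimal import Decimal

# Hypothetical I & T rule. The deduction of 39 is an assumption chosen only
# so that the example reproduces the two values mentioned in the text.
STEPS = [
    ("subtract returned merchandise (hypothetical rule)",
     lambda value: value - Decimal("39")),
]


def apply_with_trace(value, steps):
    """Apply each step in order and record (description, before, after)."""
    trace = []
    for description, transform in steps:
        result = transform(value)
        trace.append((description, value, result))
        value = result
    return value, trace


warehouse_value, trace = apply_with_trace(Decimal("314"), STEPS)
```

The trace is the meta-process information: it lets the analyst see which step changed the value and by how much, rather than only the before and after figures.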

Uses within Applications

Within the applications, much editing, capturing, and updating of data occurs. The analyst who will specify the steps needed for integration needs to know what processing will occur in the applications. This description is meta-process information.

Uses from the Data Warehouse to the Data Mart

As data passes from the data warehouse to the data mart, it is customized and summarized to meet the individual demands of the department to which it is being shipped. This process is very interesting to the DSS analyst who must do drill-down processing. In drill-down processing, the DSS analyst goes to successively lower levels of detail in order to explain to management how a unit of summarization came to be. The DSS analyst, or the tool he or she is using, occasionally needs to drill down past the data mart into the data warehouse. At this point, the DSS analyst needs to see meta-process information about the interface between the data mart and the data warehouse.

Summary

Metadata is the glue that holds the CIF together. Without metadata, the CIF is just a collection of components that manage and use data with no continuity or cohesiveness.

Metadata presents different challenges in that it needs to be both autonomous and sharable at the same time. Unfortunately, these goals are mutually exclusive. In order to be successful, however, both goals need to be simultaneously achieved.

One approach to sharability is through a central repository. It satisfies many of the needs for sharability but does not satisfy the need for autonomy. The opposite approach, autonomy of metadata, is for "everyone to do their own thing." This approach achieves autonomy, but there is no sharability.

An alternate approach is distributed metadata. In distributed metadata, some metadata is shared and other metadata is autonomous. There needs to be a rigorously defined system of record (that is, an authoritative source) for the shared portion of distributed metadata.

One of the challenges of shared metadata is that of crossing many different lines of technology. Another challenge is the transport of meta objects.

Metadata plays a very different role in the operational environment than it does in the DSS environment. In the operational environment, the applications developer is the primary user of metadata. In the DSS environment, the end user is the primary user of metadata.

Now that we have taken a look at the basic building blocks of the corporate information factory, let's take a look at how these blocks are used to deliver decision support capabilities.