Big Data - getting it right: A checklist to evaluate your environment

by W. H. Inmon,
Forest Rim Technology LLC

Anyone who has been awake for the past few years cannot help but have noticed that there is a groundswell of interest in Big Data.

Big Data is the technology that has the following properties:

  • Very large amounts of data can be handled
  • The storage medium for the data is inexpensive
  • Data is managed by means of the “Roman Census” method
  • Data managed under Big Data is unstructured

Big Data certainly has potential. There is a wealth of information that is available in Big Data. But the reality of unlocking that potential is such that most corporations are failing. Consider the following three anecdotal points of reference:

Article WALL STREET JOURNAL, Dec 2013 – “a recent survey states that the return on investment of Big Data has been a disappointing $.55 per dollar spent…”

Manager of Big Data for a large consulting firm – “in the past 18 months our firm has done over 150 proofs of concept for Big Data. Five of those projects ended up going into production. The remaining projects were abandoned. The failure rate for Big Data is over 95%...”

Large financial institute – “we have been working on Big Data for two years now. We have bought everything our vendors have told us to buy. We have tried everything our vendors have told us to try. But after two years time we have nothing in the way of business value to show for our efforts and investment.”

So there is no question that there is great potential in Big Data. And there is no question that organizations have put a lot of resources and effort into trying to make Big Data succeed. But where are the concrete results? In fact, is it possible to find concrete results? And why are concrete results so difficult to come by with Big Data?

To that end I have written this short simple paper – what do you need to do in order to achieve concrete results with your Big Data project?

There are twelve recommendations that we need to follow in order to ensure concrete results:


Do not build your Big Data infrastructure and hope that you can find business value. Before you start to build even the first part of your Big Data infrastructure you need to have a clear idea of what you expect to find and what value that will have for your organization. In addition you need to understand whether your value is in the form of operational, day to day data or whether you expect the value to be in the form of informational, analytical data. Before you start, you need to have a very clear idea what output you expect, the business value of that output, and whether that output will be used operationally or informationally. If you do not have a clear and concise definition of your expectations before you start, you should not be doing a Big Data project.


The structure of Big Data is fundamentally different from the structure of classical data base management systems. You cannot use classical analytical tools against Big Data. But in order for you to derive business value out of Big Data you must determine how you are going to do analysis. Analyzing Big Data is a completely different proposition than analyzing classical structured systems.


There is a fundamental difference between search and analysis. Search is a simple count of objects. Analysis requires context in order to qualify the object during the search process. In order to do analysis you need to be able to derive the context of your Big Data. Do not think that just because you can count raw data you can analyze it. You need to understand that there is a fundamental difference between search and analysis. In almost every case, in order to get business value from data you have to do analysis, not a simple count of objects. If you don’t understand the difference between search and analysis you shouldn’t be doing a Big Data project.
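
As a rough illustration of that difference, here is a minimal Python sketch. The sample records, the "source" label standing in for derived context, and both function names are invented for this example; it assumes the context has already been derived for each piece of text, which is the hard part in practice.

```python
# Hypothetical sample records. The "source" field stands in for context that
# would have to be derived from the raw text; it is an assumption of this sketch.
documents = [
    {"source": "safety_report", "text": "A fire broke out in the warehouse."},
    {"source": "hr_memo", "text": "We had to fire the contractor."},
    {"source": "safety_report", "text": "The fire alarm was tested."},
]


def search_count(docs, term):
    """Search: count the documents that mention the term, ignoring context."""
    return sum(1 for d in docs if term in d["text"].lower())


def analyze_count(docs, term, context):
    """Analysis: count mentions only where the context qualifies the term."""
    return sum(
        1 for d in docs
        if term in d["text"].lower() and d["source"] == context
    )


raw_hits = search_count(documents, "fire")
qualified_hits = analyze_count(documents, "fire", "safety_report")
print(raw_hits, qualified_hits)
```

The raw count treats every mention of "fire" the same way. The qualified count keeps only the mentions whose context makes them relevant to the question being asked.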


Do not attempt to build your Big Data infrastructure in a Big Bang approach. There is no reason to build your environment all at once and there is every reason to build your infrastructure a step at a time. Building it a step at a time ensures that you can make mistakes and have minimal consequences. Given that the Big Data environment is brand new, it is a sure thing that there will be mistakes. Make sure that the consequences of those mistakes are minimal and recoverable, not large, expensive, and politically embarrassing.


Many shops build their Big Data environment as if the Big Data environment were going to exist on a different planet than the existing operational/analytical environment. Such will NEVER be the case. Such mundane subjects as how to transfer data, how to find and compare keys, how to audit the quality of data, and how to create a unified effort/result from the Big Data/operational/analytical environment need to be addressed BEFORE the Big Data infrastructure is built. The successful Big Data environment is one that is integrated smoothly with the existing corporate analytical environment.


It is inevitable that some Big Data will be more useful than other Big Data. That simply is the nature of Big Data. And trying to lump all of your Big Data together is a terrible strategy. You need to be able to separate your Big Data according to its usefulness. Determining that some data is less than useful does not mean that you should throw the less useful data away. It simply means that there is a hierarchy of data based on importance and usefulness of data. This hierarchy should be recognized in your architecture.


No one vendor has a complete solution (despite what the vendor tells you). If you say that all you are going to use is technology from a single vendor, then you are going to greatly limit your chances of success.


Is your user the IT department? Is your user marketing? Sales? Finance? Management? In many cases it is not clear who the user is. Understanding who your user is (and is not) really helps you to understand your objectives. There are a thousand good reasons for catering to your ultimate corporate end user – political, economic, technological, and so forth.


Unless you have a clearly stated objective and a clearly stated means of measuring success, you will never be able to tell whether your Big Data project has been successful or not. The measurement of success can take many forms. It can take the form of availability of new data, of new queries being written and satisfied, of increased revenue, of increased sales leads, and so forth. If you are serious about success with a Big Data project, you will outline the criteria for success at the outset.


One thing is certain with Big Data, and that certainty is that there will be new data to exploit. But exploiting data is an art. An infrastructure is required for exploitation. But the right people with the right motivation are required as well. Exploration of data requires a different mind set and a different set of skills than most organizations are staffed for. Most organizations are geared up for creating and examining a set number of key performance indicators (KPIs). While the organization certainly needs people with the repetitive KPI mindset, with Big Data there needs to be a complementary set of skills that are geared for finding new KPIs and new opportunities.


All Big Data is unstructured. As such there is NO context to be found with Big Data in the normal sense of context. There are no attributes, no keys, no records. But context is necessary in order to do sophisticated analytical processing. Therefore it is mandatory that the organization understand how to do textual disambiguation. With textual disambiguation the context that is naturally in the unstructured text is found and structured into a form that is familiar to analytical processing. The problem is that vendors of Big Data technology have little or no understanding of the technology of textual disambiguation.


Underlying all of Big Data is the fact that in order to do effective analytical processing it is necessary to support metadata. In Big Data there are different forms of metadata, and all of them are needed in order to effectively use and analyze the information found in Big Data.

A simple little self readiness test can then be constructed. This test is like solitaire. You are only cheating yourself if you fudge on the answers:

  1. Do I have a clear vision of my business objectives for Big Data? Yes/no
  2. Do I know how to do sophisticated analysis on Big Data when I get it captured? Yes/no
  3. Do I understand the differences between search and analysis? Yes/no
  4. Is my infrastructure built (or to be built) iteratively? Yes/no
  5. Can I relate the data found in my Big Data environment to the data in my existing analytical environment? Yes/no
  6. Do I know how to separate my useful Big Data from my less than useful Big Data? Yes/no
  7. Am I open to technology from multiple vendors? Yes/no
  8. Do I know who my end user of the data found in Big Data will be? Yes/no
  9. Do I have a clear and concise way to measure the success of the Big Data project? Yes/no
  10. Do I have the people and tools with the know-how to explore new types of data? Yes/no
  11. Do I or my vendor know what textual disambiguation is and why it is central to success with Big Data? Yes/no
  12. Do I know how I am going to identify and support metadata from the Big Data environment? Yes/no

If you scored 12 yeses, your chances of success are very high. If you scored from 9 to 11 yeses, you have a reasonable chance of success. If you scored from 5 to 8 yeses, you would be well advised to do some more research and preparation before you waste money on a Big Data project. If you scored fewer than 5 yeses, Big Data is almost sure to be a wasted effort in your environment.
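
For readers who want to tally the test mechanically, the scoring bands above can be sketched in a few lines of Python. The function name and the wording of the returned verdicts are illustrative choices for this sketch, not part of the test itself.

```python
def readiness_verdict(answers):
    """Map twelve yes/no answers (True means yes) to the scoring bands above."""
    if len(answers) != 12:
        raise ValueError("the self-test has exactly 12 questions")
    yeses = sum(1 for answer in answers if answer)
    if yeses == 12:
        return yeses, "very high chance of success"
    if yeses >= 9:
        return yeses, "reasonable chance of success"
    if yeses >= 5:
        return yeses, "do more research and preparation first"
    return yeses, "almost sure to be a wasted effort"


# Example: yes to the first ten questions, no to the last two.
score, verdict = readiness_verdict([True] * 10 + [False] * 2)
print(score, verdict)
```

As with solitaire, the function is only as honest as the answers fed into it.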

The truth is that ALL of these factors are needed in order for Big Data to be a success. Unfortunately MOST of them are ignored by the vendors of Big Data. The vendors of Big Data focus on what is familiar and known to them. If it is out of their comfort zone then they simply ignore the factor and try to push more of their technology down the throat of their customers.

At the end of the day the vendor is there to sell his/her product, not to make your organization successful.

Question from Daniel Power. What does it mean when data is managed by means of the “Roman Census” method?

Bill Inmon Response:

Once upon a time the Romans decided that they wanted to tax everyone in the Roman empire. But in order to tax the citizens of the Roman empire the Romans first had to have a census. The Romans quickly figured out that trying to get every person in the Roman empire to march through the gates of Rome in order to be counted was an impossibility. There were people in North Africa, in Spain, in Germany, in Greece, in Persia, in Israel, and so forth. So creating a census where the processing (i.e., the counting) was done centrally was an impossibility. The Romans solved the problem by creating a body of "census takers". The census takers were sent all over the Roman empire and on the appointed day a census was taken. Then the census takers headed back to Rome where the results were tabulated centrally.

In such a fashion the work being done was sent to the data, rather than trying to send the data to a central location and doing the work in one place. By distributing the processing, the Romans solved the problem of creating a census over a large diverse population. Most people are very familiar with the Roman census method without realizing it. You see there once was a story about two people - Mary and Joseph - who had to travel to a small city - Bethlehem - for the taking of a Roman census. On the way there Mary had a little baby boy - named Jesus - in a manger. Thus born was the religion many people are familiar with - Christianity. The Roman census approach is intimately entwined with the birth of Christianity.

The Roman census method then says that you don't centralize processing if you have a large amount of data to process. Instead you send the processing to the data. You distribute the processing. In doing so you can service the processing over an effectively unlimited amount of data.
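
The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the province names and residents are made up, the function names are invented for this sketch, and a thread pool inside one process stands in for what would be separate machines in a real Big Data environment.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data that "lives" in each province; all names are invented.
provinces = {
    "Hispania": ["Gaius", "Livia"],
    "Africa": ["Marcus", "Julia", "Titus"],
    "Graecia": ["Helena"],
}


def take_census(residents):
    """The census taker does the counting where the data lives."""
    return len(residents)


def roman_census(data_by_province):
    """Send the work to each province, then tabulate the small results centrally."""
    with ThreadPoolExecutor() as pool:
        # pool.map returns results in input order, so they line up with the keys.
        counts = pool.map(take_census, data_by_province.values())
        local_counts = dict(zip(data_by_province, counts))
    return local_counts, sum(local_counts.values())


local_counts, total = roman_census(provinces)
print(local_counts, total)
```

Only the small per-province counts travel back to the center; the residents themselves never move. That is the property that lets the approach keep working as the amount of data grows.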

About the Author

Bill Inmon is President of Forest Rim Technology LLC. Best known as the “Father of Data Warehousing”, Bill Inmon has become the most prolific and well-known author worldwide in the data warehousing and business intelligence arena. In addition to authoring more than 50 books and 650 articles, Bill has been a monthly columnist with the Business Intelligence Network, EIM Institute and Data Management Review. In 2007, Bill was named by Computerworld as one of the “Ten IT People Who Mattered in the Last 40 Years” of the computer profession. Having 35 years of experience in database technology and data warehouse design, he is known globally for his seminars on developing data warehouses and information architectures. Bill has been a keynote speaker in demand for numerous computing associations, industry conferences and trade shows. Bill Inmon also has an extensive entrepreneurial background: He founded Pine Cone Systems, later named Ambeo, in 1995, and founded, and took public, Prism Solutions in 1991. Bill consults with a large number of Fortune 1000 clients, and leading IT executives on Data Warehousing, Business Intelligence, and Database Management, offering data warehouse design and database management services, as well as producing methodologies and technologies that advance the enterprise architectures of large and small organizations world-wide. He has worked for American Management Systems and Coopers & Lybrand. Bill received his Bachelor of Science degree in Mathematics from Yale University, and his Master of Science degree in Computer Science from New Mexico State University. He makes his home in Colorado.


Inmon, W.H., "Big Data - getting it right: A checklist to evaluate your environment", DSSResources.COM, 01/16/2014.

Bill Inmon provided permission by email to post this article. This article was posted at DSSResources.COM on January 16, 2014.