This book has a companion web site at www.data-miners.com
"The book focusses on methodology and techniques and the way that data mining can best be employed. It does not discuss products and vendors. The web site is a place to add your own reviews of vendors and products and exchange data mining war stories with other practitioners. It also provides links to other sites related to data mining."
1 Introduction to Data Mining
The introduction motivates the book by underlining the shift from mass marketing to one-to-one marketing and making the case that the tools and techniques described in this book can help make that shift possible. The introduction takes the reader on a whirlwind tour of exciting data mining success stories in order to establish the wide applicability of knowledge discovery techniques to real-world business problems.
2 The Virtuous Cycle of Data Mining
This chapter defines data mining and introduces the process model that informs the rest of the book. Data mining is shown to be a continuing process by which an organization develops an ever better understanding of its business, its markets, and its customers. The four major phases of the cycle are:
1. Generate ideas and hypotheses;
2. Validate ideas based on patterns in the data;
3. Transform results into actionable segments; and
4. Measure the results.
3 The Virtuous Cycle in Practice
This chapter outlines several case studies from the authors' experience. Examples are drawn from the telecommunications, banking, and automotive industries. We draw our examples from different industries both to illustrate the generality of our approach and to increase the likelihood that the reader's own industry is represented. In later chapters, we return to the case studies to illustrate new points as they come up.
4 What Can Data Mining Do?
This chapter defines and describes the broad classes of tasks that can be accomplished through the use of advanced data mining techniques. These classes are:
Affinity Grouping or Market Basket Analysis
Sequential Pattern Matching or Time Series Analysis
5 Data Mining Methodology
This chapter describes a methodology that delivers useful, measurable results from data mining regardless of the particular tools and techniques employed. The methodology consists of the following steps:
1. Obtain clean, pre-classified data with as much detail and as many fields as possible.
2. Determine an initial data mining goal (a field to be predicted or a result to be explained).
3. Develop a model of the incremental business value (usually measured in dollars) of better prediction or classification of the goal.
4. Divide the data into training, test, and evaluation sets.
5. Use the training set to develop a preliminary model.
6. Use the evaluation set to remove the effects of overtraining.
7. Use the test set to measure the predictive power of the model.
8. Apply the model to unclassified data to create actionable segments.
9. Carefully compare the actual behavior of the newly defined segments with that of a control group.
10. Using the model developed in step 3, measure the value and cost-effectiveness of the new actionable segments.
11. Feed the results of the experiment back into the database.
12. Generate new hypotheses for testing.
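As an illustration of step 4, the following minimal Python sketch divides a set of records into training, evaluation, and test sets. The function name, the 60/20/20 proportions, and the fixed random seed are our own assumptions for the example; the methodology itself does not prescribe them.

```python
import random

def split_three_ways(records, train_frac=0.6, eval_frac=0.2, seed=0):
    """Shuffle the records and split them into three disjoint sets.

    The fractions and the seed are illustrative assumptions only.
    Whatever is left after the training and evaluation sets becomes
    the test set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_train = int(len(shuffled) * train_frac)
    n_eval = int(len(shuffled) * eval_frac)
    training = shuffled[:n_train]
    evaluation = shuffled[n_train:n_train + n_eval]
    test = shuffled[n_train + n_eval:]
    return training, evaluation, test

# Stand-in data: 100 record identifiers.
training, evaluation, test = split_three_ways(range(100))
print(len(training), len(evaluation), len(test))
```

The point of the sketch is only that the three sets are disjoint and together cover all of the pre-classified data, so that the test set used in step 7 has played no part in building or tuning the model.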
6 Measuring the Effectiveness of Data Mining
This chapter introduces the various statistical measures used to evaluate the results of data mining. It also introduces the concept of lift and explains how it can be used to compare the effectiveness of different approaches to data mining. The simple lift model is then developed into a model that takes into account the costs and benefits of the business actions enabled through data mining. It is only through such a model that the expense of data mining can be justified in terms of return on investment.
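To make the idea of lift concrete, here is a minimal Python sketch that uses one common definition: the response rate among the records a model scores highest, divided by the response rate in the population as a whole. The function name, the sample scores, and the choice of the top 20 percent are assumptions made for the example, and the sketch covers only the simple lift model, not the cost and benefit model built on top of it.

```python
def lift_at(scored, fraction):
    """Return the lift in the top `fraction` of records ranked by score.

    `scored` is a list of (model_score, responded) pairs, where
    `responded` is 1 for a responder and 0 otherwise. The overall
    response rate must be non-zero.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(len(ranked) * fraction))
    top_rate = sum(responded for _, responded in ranked[:top_n]) / top_n
    overall_rate = sum(responded for _, responded in ranked) / len(ranked)
    return top_rate / overall_rate

# Hypothetical model scores and actual responses for ten customers.
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
          (0.4, 0), (0.3, 0), (0.2, 1), (0.1, 0), (0.05, 0)]
lift = lift_at(scored, 0.2)
print(lift)
```

In this made-up data, both of the two highest-scored customers responded, against four responders in ten overall, so the lift in the top 20 percent is 2.5.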
7 Overview of Techniques for Advanced Data Mining
This chapter introduces the major techniques used for data mining and compares them with each other and with standard statistics, explaining the advantages and disadvantages of each. This chapter gives the reader a general feeling for how the techniques work and which ones are likely to be applicable to his or her own problem. The following chapters examine each technique in some depth, complete with examples of how the technique has been applied successfully. Each of the techniques to which we devote a chapter could be, and in most cases has been, the subject of an entire book. Our goal in these chapters is to cover each technique in enough technical depth to allow the reader to make an informed and intelligent judgment as to its applicability in a given situation. It is certainly not our goal to teach the reader how to implement any of the techniques. Each chapter ends with suggestions for further reading.
8 Association Rules and Market Basket Analysis
This is one of the most widely applied techniques. This chapter explains the statistical underpinnings of the technique and gives examples of its use in retailing and banking.
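The statistics behind an association rule can be illustrated with a short Python sketch. It computes three measures commonly used for a rule of the form "customers who buy A also buy B": support, confidence, and lift. The function name and the five sample baskets are invented for the example, and the chapter itself may use different terminology for these measures.

```python
def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    `baskets` is a list of sets of items; `antecedent` and `consequent`
    are sets of items. At least one basket must contain each of them.
    """
    n = len(baskets)
    has_a = sum(1 for b in baskets if antecedent <= b)
    has_c = sum(1 for b in baskets if consequent <= b)
    has_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = has_both / n            # share of all baskets containing both sides
    confidence = has_both / has_a     # share of antecedent baskets with the consequent
    lift = confidence / (has_c / n)   # confidence relative to the consequent's base rate
    return support, confidence, lift

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]
support, confidence, lift = rule_stats(baskets, {"bread"}, {"milk"})
print(support, confidence, lift)
```

Here the rule "bread implies milk" has a confidence of 0.75, but milk already appears in 80 percent of the baskets, so the lift is below 1 and the rule tells us less than it first appears to.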
9 Memory-Based Reasoning
Memory-based reasoning is a less common but very powerful technique that is especially useful with hard-to-categorize data such as free text. The technique is explained and examples are given from automatic classification of news stories and analysis of census returns.
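The core idea can be sketched in a few lines of Python: classify a new record by finding the most similar records already in memory and letting them vote. This sketch assumes numeric fields, Euclidean distance, and a simple majority vote among the three nearest neighbors; all of these, along with the function names and sample data, are our assumptions for illustration. Real applications such as free text need a distance function suited to the data.

```python
from collections import Counter
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(known, new_point, k=3, distance=euclidean):
    """Label `new_point` by majority vote among its k nearest known records.

    `known` is a list of (point, label) pairs.
    """
    nearest = sorted(known, key=lambda rec: distance(rec[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical pre-classified records.
known = [((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.0), "low"),
         ((6.0, 6.0), "high"), ((7.0, 5.5), "high"), ((6.5, 7.0), "high")]
label = classify(known, (6.2, 6.1))
print(label)
```

Because the distance function is passed in as a parameter, it can be swapped for one better suited to a given kind of data without changing the rest of the procedure.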
10 Automatic Cluster Detection
Cluster detection differs from most of the other techniques described in the book in that it is generally not directed. Cluster detection algorithms find groups of records that are self-similar. The chapter explains several clustering algorithms and gives examples of their use from the retailing and banking industries.
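As one example of what a cluster detection algorithm does, the following Python sketch implements a bare-bones version of k-means, a widely used clustering method. We have chosen k-means here purely for illustration; the outline does not say which algorithms the chapter covers. The starting centroids, iteration count, and sample points are likewise assumptions.

```python
def k_means(points, centroids, iterations=10):
    """Group points around the given starting centroids.

    Each pass assigns every point to its nearest centroid, then moves
    each centroid to the mean of the points assigned to it. A centroid
    that attracts no points is left where it is.
    """
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical two-field records with two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(clusters)
```

Note that no field is singled out as the thing to be predicted. The algorithm is given only the records and a notion of distance, which is what makes the technique undirected.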
11 Link Analysis
Link analysis uses graph theory to analyze the connections between records. The technique is explained and examples are given from the telecommunications and insurance industries.
12 Decision Trees and Rule Induction
Decision trees are a popular and powerful technique for deriving classification and prediction rules from data. One of the chief advantages of these techniques is that the rules produced are easily expressible in SQL or English. This chapter explains several variations on the decision tree theme, including CART and CHAID. Examples of decision trees being applied in practice come from the banking and manufacturing industries.
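To show why the rules from a decision tree translate so readily into SQL, here is a small Python sketch that walks a hand-built tree and prints one query per leaf. The tree, the field names (`income`, `years_at_address`), the table name (`applicants`), and the outcome labels are all hypothetical. The sketch does not build a tree from data as CART or CHAID would; it only shows how a finished tree reads as rules.

```python
def tree_to_rules(node, conditions=()):
    """Return one (conditions, label) rule for each leaf of the tree.

    An internal node is a dict with 'field', 'threshold', 'yes', and 'no';
    the 'yes' branch is taken when field <= threshold. A leaf is a dict
    with a single 'label' key.
    """
    if "label" in node:
        return [(conditions, node["label"])]
    test = f"{node['field']} <= {node['threshold']}"
    negated = f"{node['field']} > {node['threshold']}"
    return (tree_to_rules(node["yes"], conditions + (test,))
            + tree_to_rules(node["no"], conditions + (negated,)))

# A hypothetical, hand-built tree for a loan decision.
tree = {
    "field": "income", "threshold": 40000,
    "yes": {"label": "decline"},
    "no": {
        "field": "years_at_address", "threshold": 2,
        "yes": {"label": "review"},
        "no": {"label": "approve"},
    },
}

rules = tree_to_rules(tree)
for conditions, label in rules:
    where = " AND ".join(conditions)
    print(f"SELECT * FROM applicants WHERE {where};  -- {label}")
```

Each path from the root to a leaf becomes a conjunction of simple tests, which is exactly the form of a SQL WHERE clause and is just as easy to state in English.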
13 Artificial Neural Networks
Artificial neural networks are the best established of the advanced data mining techniques, despite some serious drawbacks. This chapter explains how neural networks work and describes some of the many variations that readers are likely to encounter. Examples of neural networks in use are drawn from the credit card industry.
14 Genetic Algorithms
This technique has attracted a lot of interest recently. The technique is explained and an example is given in which genetic algorithms are used to optimize the distance function used in memory-based reasoning.
15 Data Mining and the Corporate Data Warehouse
Although it is certainly not necessary to have a data warehouse in order to do data mining, if a warehouse has been built, data mining will greatly increase its value. If a warehouse has not yet been built, data mining can provide guidelines for maximizing the return from the investment in the warehouse. This chapter will discuss how to integrate data mining with an existing data warehouse and how to design a data warehouse that will effectively support data mining.
16 Where Does OLAP Fit In?
The advanced data mining techniques discussed in this book complement more familiar analytic tools such as multi-dimensional database front-ends, SQL query generation tools, statistical packages and spreadsheets. This chapter places data mining within this larger context and shows how data mining can improve the effectiveness of other decision support tools.
17 Choosing the Right Tool for the Job
In this chapter, we discuss the issues that must be considered when selecting software for data mining. We present a checklist of desirable features and explain the importance of each.
Transparency of access to databases and files
Levels of interface
Comprehensibility of output
Availability of graphics and visualization
Ability to handle diverse data types
Ease of use
Availability of support
18 Putting Data Mining to Work
This chapter provides a road map for introducing advanced data mining into a corporation. We draw on our experience with successful data mining engagements with MRJ.