This book has a companion web site at www.data-miners.com
"The book focusses on methodology and techniques and the way that data mining can best be employed. It does not discuss products and vendors. The web site is a place to add your own reviews of vendors and products and exchange data mining war stories with other practitioners. It also provides links to other sites related to data mining."
1 Introduction to Data Mining
The introduction motivates the book by underlining the shift from
mass marketing to one-to-one marketing and making the case
that the tools and techniques described in this book can help make
that shift possible. The introduction takes the reader on a whirlwind
tour of exciting data mining success stories in order to establish the
wide applicability of knowledge discovery techniques to real-world
business problems.
2 The Virtuous Cycle of Data Mining
This chapter defines data mining and introduces the process model
that informs the rest of the book. Data mining is shown to be a
continuing process by which an organization develops better and
better understanding of its business, its markets, and its
customers. The four major phases of the cycle are:
1. Generate ideas and hypotheses;
2. Validate ideas based on patterns in the data;
3. Transform results into actionable segments; and,
4. Measure the results.
3 The Virtuous Cycle in Practice
This chapter outlines several case studies from the authors'
experience. Examples are drawn from the telecommunications,
banking and automotive industries. We draw our examples from
different industries both to illustrate the generality of our approach
and to increase the likelihood that the reader's own industry is
represented. In later chapters, we return to the case studies to
illustrate new points as they come up.
4 What Can Data Mining Do?
This chapter defines and describes the broad classes of tasks that
can be accomplished through the use of advanced data mining
techniques. These classes are:
Classification
Clustering
Estimation
Affinity Grouping or Market Basket Analysis
Prediction
Sequential Pattern Matching or Time Series Analysis
5 Data Mining Methodology
This chapter describes a methodology that delivers useful,
measurable results from data mining regardless of the particular
tools and techniques employed. The steps are:
1. Obtain clean, pre-classified data with as much detail and as
many fields as possible.
2. Determine an initial data mining goal (a field to be predicted or a
result to be explained).
3. Develop a model of the incremental business value (usually
measured in dollars) of better prediction or classification of the goal.
4. Divide the data into training, test and evaluation sets.
5. Use the training set to develop a preliminary model.
6. Use the evaluation set to remove the effects of overtraining.
7. Use the test set to measure the predictive power of the model.
8. Apply the model to unclassified data to create actionable segments.
9. Carefully compare the actual behavior of the newly defined segments
with that of a control group.
10. Using the model developed in step 3, measure the value and
cost-effectiveness of the new actionable segments.
11. Feed the results of the experiment back into the database.
12. Generate new hypotheses for testing.
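The division of data in step 4 can be sketched in Python as follows. This is a minimal illustration rather than the authors' procedure: the function name split_records, the 60/20/20 proportions, and the fixed random seed are all assumptions made for the example.

```python
import random

def split_records(records, train_pct=60, eval_pct=20, seed=0):
    """Shuffle records and divide them into training, evaluation, and test sets.

    The percentages are illustrative assumptions; whatever is left after the
    training and evaluation sets are taken becomes the test set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = len(shuffled) * train_pct // 100
    n_eval = len(shuffled) * eval_pct // 100
    train = shuffled[:n_train]
    evaluation = shuffled[n_train:n_train + n_eval]
    test = shuffled[n_train + n_eval:]
    return train, evaluation, test

# Hypothetical data: 100 record identifiers standing in for customer rows.
train, evaluation, test = split_records(range(100))
```

Shuffling before slicing keeps the three sets disjoint and avoids any ordering present in the source file leaking into one set.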
6 Measuring the Effectiveness of Data Mining
This chapter introduces the various statistical measures used to
evaluate the results of data mining. It also introduces the concept of
lift and explains how it can be used to compare the effectiveness of
different approaches to data mining. The simple lift model is then
developed into a model that takes into account the costs and
benefits of the business actions enabled through data mining. It is
only through such a model that the expense of data mining can be
justified in terms of return on investment.
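As a rough sketch of the idea, lift is commonly computed as the response rate in a model-selected group divided by the response rate in the population as a whole. The Python below assumes that definition and adds a very simple profit calculation. The function names and all of the figures are hypothetical, chosen only to make the arithmetic easy to follow; the chapter's own cost-benefit model may differ.

```python
def lift(selected_responders, selected_size, total_responders, total_size):
    """Response rate in the selected group divided by the overall response rate."""
    # Cross-multiplied form of (sr / ss) / (tr / ts), which avoids
    # rounding in the two intermediate rates.
    return (selected_responders * total_size) / (selected_size * total_responders)

def campaign_profit(responders, contacted, revenue_per_response, cost_per_contact):
    """Revenue from responders minus the cost of contacting everyone selected."""
    return responders * revenue_per_response - contacted * cost_per_contact

# Hypothetical campaign: 10,000 prospects, 500 of whom would respond (5%).
# A model picks 1,000 prospects, 150 of whom respond (15%).
model_lift = lift(150, 1000, 500, 10000)
profit_targeted = campaign_profit(150, 1000, revenue_per_response=40, cost_per_contact=2)
profit_everyone = campaign_profit(500, 10000, revenue_per_response=40, cost_per_contact=2)
```

With these made-up numbers the model-selected group responds at three times the overall rate, and contacting only that group is profitable while contacting everyone merely breaks even.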
7 Overview of Techniques for Advanced Data Mining
This chapter introduces the major techniques used for data mining
and compares them with each other and with standard statistics,
explaining the advantages and disadvantages of each. This chapter
gives the reader a general feeling for how the techniques work and
which ones are likely to be applicable to his or her own problem.
The following chapters examine each technique in some depth,
complete with examples of how the technique has been applied
successfully. Each of the techniques to which we devote a single
chapter could be, and in most cases has been, the subject of an
entire book. Our goal in these chapters is to cover the techniques in
enough technical depth to allow the reader to make an informed
and intelligent judgment as to the applicability of each in a given situation. It
is certainly not our goal to teach the reader how to implement any
of the techniques. Each chapter will end with suggestions for
further reading.
8 Association Rules and Market Basket Analysis
This is one of the most widely applied techniques. This chapter
explains the statistical underpinnings of the technique and gives
examples of its use in retailing and banking.
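Two measures commonly used to evaluate an association rule of the form "if A then B" are support (the fraction of all transactions containing both A and B) and confidence (the fraction of transactions containing A that also contain B). The sketch below assumes those standard definitions; the outline does not say which measures the chapter covers, and the function name and the baskets are invented for illustration.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    both = antecedent | consequent
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    support = n_both / len(transactions)
    # Confidence is undefined when the antecedent never occurs; report 0.0.
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
support, confidence = rule_stats(baskets, {"bread"}, {"milk"})
```

Here bread and milk appear together in two of the four baskets, and two of the three baskets containing bread also contain milk.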
9 Memory-Based Reasoning
Memory-based reasoning is a less common but very powerful technique
that is especially useful with hard-to-categorize data such as free
text. The technique
is explained and examples are given from automatic classification
of news stories and analysis of census returns.
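One common form of memory-based reasoning classifies a new record by finding the most similar known records under a distance function and letting them vote. The sketch below assumes that nearest-neighbor form with a Euclidean distance on numeric fields. The function names, the choice of k, and the sample records are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(known, query, k=3, distance=euclidean):
    """Label query by majority vote among its k nearest known records.

    known is a list of (features, label) pairs. The distance function is a
    parameter because its choice largely determines how well the method works.
    """
    neighbors = sorted(known, key=lambda rec: distance(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical pre-classified records with two numeric fields each.
known = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((9, 8), "B"), ((8, 9), "B"),
]
label_near_a = classify(known, (2, 2))
label_near_b = classify(known, (7, 9))
```

For free text, the numeric tuples would be replaced by some representation of the documents and a matching distance function; that part is left out here.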
10 Automatic Cluster Detection
Cluster detection differs from most of the other techniques
described in the book in that it is generally not directed. Cluster
detection algorithms find groups of records that are self-similar.
The chapter explains several clustering algorithms and gives
examples of their use from the retailing and banking industries.
11 Link Analysis
Link analysis uses graph theory to analyze the connections
between records. The technique is explained and examples are
given from the telecommunications and insurance industries.
12 Decision Trees and Rule Induction
Decision trees are a popular and powerful technique for deriving
classification and prediction rules from data. One of the chief
advantages of these techniques is that the rules produced are easily
expressible in SQL or English. This chapter explains several
variations on the decision tree theme, including CART and CHAID.
Examples of decision trees being applied in practice come from the
banking and manufacturing industries.
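To illustrate why tree-derived rules translate readily into SQL, the sketch below turns one root-to-leaf path, written as a list of conditions, into a query. The function name, the table name, and the field names are hypothetical, and the string formatting is for illustration only: it does no escaping and is not safe for untrusted input.

```python
def rule_to_sql(conditions, table):
    """Render one decision tree path as a SQL query.

    conditions is a list of (field, operator, value) triples, one per
    split on the path from the root of the tree to a leaf.
    """
    where = " AND ".join(f"{field} {op} {value!r}" for field, op, value in conditions)
    return f"SELECT * FROM {table} WHERE {where}"

# Hypothetical path: customers under 30 with a balance of at least 5000.
path = [("age", "<", 30), ("balance", ">=", 5000)]
query = rule_to_sql(path, "customers")
```

The same path reads in English as "customers whose age is under 30 and whose balance is at least 5000", which is the transparency the chapter summary refers to.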
13 Artificial Neural Networks
Artificial neural networks are the best established of all the advanced
data mining techniques despite some serious drawbacks. This chapter explains
how neural networks work and describes some of the many
variations that readers are likely to encounter. Examples of neural
networks in use will be drawn from the credit card industry.
14 Genetic Algorithms
This technique has excited a lot of interest recently. The technique
is explained and an example is given where genetic algorithms are
used to optimize the distance function used in memory-based
reasoning.
15 Data Mining and the Corporate Data Warehouse
Although it is certainly not necessary to have a data warehouse in
order to do data mining, if a warehouse has been built, data mining
will greatly increase its value. If a warehouse has not yet been built,
data mining can provide guidelines for maximizing the return from
the investment in the warehouse. This chapter will discuss how to
integrate data mining with an existing data warehouse and how to
design a data warehouse that will effectively support data mining.
16 Where Does OLAP Fit In?
The advanced data mining techniques discussed in this book
complement more familiar analytic tools such as multi-dimensional
database front-ends, SQL query generation tools, statistical
packages and spreadsheets. This chapter places data mining within
this larger context and shows how data mining can improve the
effectiveness of other decision support tools.
17 Choosing the Right Tool for the Job
In this chapter, we discuss the issues that must be considered when
selecting software for data mining. We present a checklist of
desirable features and explain the importance of each.
Scalability
Platform independence
Transparency of access to databases and files
Levels of interface
Comprehensibility of output
Availability of graphics and visualization
Ability to handle diverse data types
Ease of use
Availability of support
18 Putting Data Mining to Work
This chapter provides a road map for introducing advanced data
mining into a corporation. We draw on our experience with
successful data mining engagements with MRJ.