Data Mining Notes
What is Data Mining?
Many Definitions
– Non-trivial extraction of implicit, previously unknown and
potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic means, of
large quantities of data in order to discover meaningful patterns
Data --> selection --> pre-processing -->
Example Data Mining Questions
Which of the approximately 30K VT students will buy a
new car within the next year?
Which of our customers would be interested in paying a
premium for the latest technology?
Would the residents of Virginia or North Carolina likely
provide more revenue for our business if we chose to
relocate?
Which of our high risk mortgage applicants are likely to
file for bankruptcy?
Origins of Data Mining
Ideas come from many disciplines including
machine learning/AI, pattern recognition, statistics,
and database systems
Traditional techniques may be unsuitable due to:
– Enormity of data
– High dimensionality of the data
– Heterogeneous, distributed nature of data
Types of Data Mining Algorithms
Supervised algorithms (Classification) -- "we know what we're looking for"
Learning by example
– Use training data which has correct answers (class label
attribute)
– Create a model by running the algorithm on the training
data
– Identify a class label for the incoming new data
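The three steps above (labelled training data, build a model, label new data) can be sketched in a few lines of Python. This is only a toy illustration, not one of the algorithms named in these notes: the "model" simply remembers the most common class label seen for each attribute value. The function names `train` and `predict` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def train(values, labels):
    """Build a toy model from training data that has correct answers."""
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1
    # For each attribute value, keep the class label seen most often.
    table = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    # Fall back to the overall most common label for unseen values.
    default = Counter(labels).most_common(1)[0][0]
    return table, default

def predict(model, value):
    """Identify a class label for an incoming new record."""
    table, default = model
    return table.get(value, default)

# Hypothetical training data: one attribute (occupation) and a class label (buys?).
training_values = ["student", "student", "retired", "employed", "employed", "employed"]
training_labels = ["no", "no", "no", "yes", "yes", "no"]

model = train(training_values, training_labels)
print(predict(model, "employed"))  # yes (2 of 3 employed records are "yes")
print(predict(model, "unknown"))   # no (overall majority label)
```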
Unsupervised algorithms (Clustering)
– Do not use training data
– Classes may not be known in advance
A lot more complicated & messy (bottom-up approach)
Supervised (Classification)
Decision trees
Regression
Neural Networks
Support Vector Machines
K-Nearest Neighbor approach
Bayesian Classification
Classification: Description
Given a collection of records
– Each record contains a set of attributes, one of the
attributes is the dependent variable/class
Find a model to predict the class attribute as a function of the values of the other attributes
• Goal: previously unseen records should be
assigned to a class as accurately as possible
– A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to build the
model and test set used to validate it
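A minimal sketch of the training/test split and the accuracy measurement described above, using only the Python standard library. The helper names `train_test_split` and `accuracy`, the 25% test fraction, and the fixed random seed are assumptions made for illustration. Accuracy here is the fraction of test records whose predicted class matches the true class.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into (training, test) lists."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predicted, actual):
    """Fraction of records whose predicted class equals the true class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

records = list(range(8))           # stand-ins for real records
train_set, test_set = train_test_split(records)
print(len(train_set), len(test_set))                           # 6 2
print(accuracy(["a", "b", "a", "a"], ["a", "b", "b", "a"]))    # 0.75
```

The model is built only from the training list. The test list is held back so that accuracy is measured on records the model has not seen.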
Classification: Example 1
Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of consumers
likely to buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before
• We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision (binary attribute) forms
the class attribute
• Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model
Example 2: Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
– Approach:
• Use credit card transactions and the associated account-
holder information as attributes.
– When does a customer buy, what does he/she buy, how
often does he/she pay on time, etc.
• Label past transactions as fraud or fair transactions. This
forms the class attribute.
• Train the model
• Use this model to detect fraud by observing credit card
transactions on an account
Example 3: Customer Attrition/Churn
Goal: To predict whether a customer is likely to be lost to
a competitor.
– Approach:
• Use detailed records of transactions with each past and
present customer to find attributes.
– How often the customer calls, where he/she calls, what
time of day he/she calls most, his/her financial status,
marital status, etc.
• Label the customers as loyal or disloyal
• Develop a model for loyalty
Example classification approach: k-Nearest Neighbor
Basic idea:
– Look at characteristics / attributes
– “If it walks like a duck and quacks like a duck, then
it’s probably a duck”
Nearest-Neighbor Classifier
Requires three things
(1) The set of stored records
(2) Distance Metric to compute the
distance between records
(3) The value of k, the number of nearest
neighbors to retrieve
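The notes do not say which distance metric to use. Two common choices for numeric attributes are Euclidean and Manhattan distance, sketched below. The function names are hypothetical, and the records are assumed to be equal-length tuples of numbers.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences, attribute by attribute."""
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```

With either metric, an attribute measured on a large scale (e.g., income) can dominate one measured on a small scale (e.g., age), so attributes are often rescaled to comparable ranges first.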
• To classify an unknown record:
– Compute distance to other training
records
– Identify k nearest neighbors
– Use class labels of nearest neighbors
to determine the class label of unknown
record (e.g., by taking a majority vote, or
by weighting each vote by distance)
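The classification steps above can be sketched as follows. This is a minimal illustration, assuming numeric attributes, Euclidean distance (via the standard library's `math.dist`, Python 3.8+), and a plain majority vote. The function name `knn_classify` and the sample records are hypothetical.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """training: list of (attributes, label) pairs; query: tuple of numbers."""
    # Step 1: compute the distance from the query to every stored record
    # and order the records from nearest to farthest.
    by_distance = sorted(training, key=lambda rec: math.dist(rec[0], query))
    # Step 2: identify the k nearest neighbors.
    nearest = by_distance[:k]
    # Step 3: majority vote over the neighbors' class labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical stored records: two numeric attributes and a class label.
training = [
    ((1.0, 1.0), "duck"),
    ((1.2, 0.8), "duck"),
    ((0.9, 1.3), "duck"),
    ((5.0, 5.0), "goose"),
    ((5.5, 4.5), "goose"),
]

print(knn_classify(training, (1.1, 1.0), k=3))  # duck
print(knn_classify(training, (5.2, 4.8), k=3))  # goose (2 goose votes vs 1 duck)
```

This version does no model building: all the work happens at classification time, which is why the stored records themselves are one of the three requirements. Choosing an odd k avoids tied votes when there are two classes.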
