Data Mining Notes

What is Data Mining?

Many Definitions

– Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Data --> selection --> pre-processing -->

Example Data Mining Questions
- Which of the approximately 30K VT students will buy a new car within the next year?
- Which of our customers would be interested in paying a premium for the latest technology?
- Would the residents of Virginia or North Carolina likely provide more revenue for our business if we chose to relocate?
- Which of our high-risk mortgage applicants are likely to file for bankruptcy?

Origins of Data Mining
- Ideas come from many disciplines, including machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to:
– Enormity of data
– High dimensionality of the data
– Heterogeneous, distributed nature of data

Types of Data Mining Algorithms
- Supervised algorithms (Classification) -- "we know what we're looking for"
– Learning by example
– Use training data which has correct answers (class label attribute)
– Create a model by running the algorithm on the training data
– Identify a class label for the incoming new data

- Unsupervised algorithms (Clustering)
– Do not use labeled training data
– Classes may not be known in advance
– A lot more complicated & messy (bottom-up approach)
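
The supervised workflow above (labeled training data, building a model, labeling new data) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the income figures and the "yes"/"no" labels are made up, and the learner is a one-attribute threshold rule chosen for brevity, not a method these notes prescribe.

```python
from typing import List, Tuple

# Hypothetical training data: (annual income in $1000s, class label).
TRAINING_DATA: List[Tuple[float, str]] = [
    (20.0, "no"), (35.0, "no"), (50.0, "no"),
    (65.0, "yes"), (80.0, "yes"), (95.0, "yes"),
]


def train_threshold_model(data: List[Tuple[float, str]]) -> float:
    """Return the threshold that misclassifies the fewest training records.

    The learned rule is: predict "yes" when value >= threshold.
    Needs at least two records, otherwise there are no candidate thresholds.
    """
    values = sorted(value for value, _ in data)
    # Candidate thresholds are the midpoints between consecutive values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold: float) -> int:
        # Count records whose predicted label disagrees with the true label.
        return sum(
            (value >= threshold) != (label == "yes") for value, label in data
        )

    return min(candidates, key=errors)


def predict(threshold: float, value: float) -> str:
    """Assign a class label to a new, unlabeled record."""
    return "yes" if value >= threshold else "no"


model = train_threshold_model(TRAINING_DATA)  # step 2: build the model
print(model)                 # 57.5
print(predict(model, 72.0))  # yes
```

The "model" here is just a single number; real classifiers such as the ones listed next learn richer structures, but the train-then-predict shape is the same.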

Supervised (Classification)
- Decision trees
- Regression
- Neural Networks
- Support Vector Machines
- K-Nearest Neighbor approach
- Bayesian Classification

Classification: Description
- Given a collection of records
– Each record contains a set of attributes; one of the attributes is the dependent variable/class
- Find a model to predict the class attribute as a function of the values of the other attributes
- Goal: previously unseen records should be assigned to a class as accurately as possible
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
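
A minimal sketch of the training/test division described above, using only the Python standard library. The 30% test fraction, the toy records, and the majority-class "model" are assumptions made for illustration; any classifier could stand in for the baseline.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

Record = Tuple[Tuple[float, ...], str]  # (attribute values, class label)


def train_test_split(
    records: Sequence[Record], test_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[Record], List[Record]]:
    """Shuffle the records and hold out a fraction of them as the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train_majority_model(training: Sequence[Record]) -> str:
    """Baseline model: always predict the most common training label."""
    return Counter(label for _, label in training).most_common(1)[0][0]


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of test records whose predicted label matches the true one."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical data set: one numeric attribute and a "yes"/"no" class label.
RECORDS: List[Record] = [((float(i),), "yes" if i % 3 else "no") for i in range(10)]

training_set, test_set = train_test_split(RECORDS, test_fraction=0.3, seed=0)
model_label = train_majority_model(training_set)   # built from training data only
predictions = [model_label for _ in test_set]      # applied to unseen records
print(accuracy(predictions, [label for _, label in test_set]))
```

The key design point is that the test records play no part in building the model, so the reported accuracy estimates performance on previously unseen records.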

Classification: Example 1
- Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before
• We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision (binary attribute) forms the class attribute
• Collect various demographic, lifestyle, and company-interaction related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model

Example 2: Fraud Detection

– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
• Use credit card transactions and the associated account-holder information as attributes.
– When does a customer buy, what does he/she buy, how often does he/she pay on time, etc.
• Label past transactions as fraud or fair transactions. This forms the class attribute.
• Train the model
• Use this model to detect fraud by observing credit card transactions on an account

Example 3: Customer Attrition/Churn
– Goal: Predict whether a customer is likely to be lost to a competitor.
– Approach:
• Use detailed records of transactions with each past and present customer to find attributes.
– How often the customer calls, where he/she calls, what time of day he/she calls most, his/her financial status, marital status, etc.
• Label the customers as loyal or disloyal
• Develop a model for loyalty

Example classification approach: k-Nearest Neighbor
- Basic idea:
– Look at characteristics / attributes
– “If it walks like a duck and quacks like a duck, then it’s probably a duck”

Nearest-Neighbor Classifier

- Requires three things
(1) The set of stored records
(2) A distance metric to compute the distance between records
(3) The value of k, the number of nearest neighbors to retrieve
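
The notes do not say which distance metric to use. As an illustration only, here are two common choices for records with numeric attributes, Euclidean and Manhattan distance; the function names and the sample points are assumptions.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance: square root of the summed squared differences."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """City-block distance: sum of the absolute attribute differences."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(abs(x - y) for x, y in zip(a, b))


p = (1.0, 2.0)
q = (4.0, 6.0)
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7.0
```

Both metrics treat every attribute on the same scale, so attributes measured in large units (such as income) can dominate small ones (such as age) unless the data is rescaled first.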

- To classify an unknown record:
– Compute its distance to each training record
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote, or by weighting each vote by distance)
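
The classification steps above can be sketched as follows. This is a minimal sketch, not a tuned implementation: it assumes numeric attributes, Euclidean distance via `math.dist` (Python 3.8+), and inverse-distance weights for the weighted vote. The stored records and the "duck"/"goose" labels are made up.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

Record = Tuple[Tuple[float, ...], str]  # (attribute values, class label)

# Hypothetical stored records with two numeric attributes each.
STORED: List[Record] = [
    ((1.0, 1.0), "duck"), ((1.2, 0.8), "duck"), ((0.9, 1.1), "duck"),
    ((3.0, 3.0), "goose"), ((3.2, 2.9), "goose"),
]


def k_nearest(training: Sequence[Record], query: Sequence[float], k: int) -> List[Record]:
    """Return the k stored records closest to the query record."""
    return sorted(training, key=lambda rec: math.dist(rec[0], query))[:k]


def classify_majority(training: Sequence[Record], query: Sequence[float], k: int) -> str:
    """Each of the k neighbors casts one vote; the most common label wins."""
    votes = Counter(label for _, label in k_nearest(training, query, k))
    return votes.most_common(1)[0][0]


def classify_weighted(training: Sequence[Record], query: Sequence[float], k: int) -> str:
    """Each neighbor's vote is weighted by 1 / distance, so closer counts more."""
    weights: Dict[str, float] = defaultdict(float)
    for attributes, label in k_nearest(training, query, k):
        distance = math.dist(attributes, query)
        weights[label] += 1.0 / (distance + 1e-9)  # small term avoids division by zero
    return max(weights, key=lambda label: weights[label])


print(classify_majority(STORED, (1.1, 1.0), k=3))  # duck
print(classify_majority(STORED, (2.9, 2.9), k=5))  # duck
print(classify_weighted(STORED, (2.9, 2.9), k=5))  # goose
```

The last two calls show why the choice of k and of voting scheme matters: with k = 5 every stored record votes, so the three distant ducks outvote the two nearby geese, while distance weighting lets the close neighbors decide.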