Data Mining

Last updated 7:49 PM on 4/21/26

28 Terms

1

Why data mining?

• More intense competition at the global scale.

• Recognition of the value in data sources.

• Availability of quality data on customers, vendors, transactions, Web, etc.

• Consolidation and integration of data repositories into data warehouses.

• The exponential increase in data processing and storage capabilities

• Movement toward conversion of information resources into nonphysical form.

2

Data Mining definition

The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases (Fayyad et al., 1996).

• Other Names

• knowledge extraction

• pattern analysis

• knowledge discovery

• information harvesting

• pattern searching

3

Data Mining is a blend of what disciplines?

Statistics

Artificial intelligence

Machine learning and pattern recognition

Information visualization

Database management and data warehousing

Management science and information systems

4

Data Mining Characteristics & Objectives

• Source of data for DM is often a consolidated data warehouse (not always!).

• DM environment is usually a client-server or a Web-based architecture.

• Data is the most critical ingredient for DM; it may include soft/unstructured data.

• The miner is often an end user.

• Striking it rich requires creative thinking.

• Data mining tools’ capabilities and ease of use are essential

5

How does Data Mining work?

DM extracts patterns from data

• Pattern?

• A mathematical (numeric and/or symbolic) relationship among data items

Types of patterns

• Association

• Prediction

• Cluster (segmentation)

• Sequential (or time series) relationships

6

Data Mining Core Techniques

Classification

Prediction

Association Rules & Recommenders

Data & Dimension Reduction

Data Exploration

Visualization

7

Data Mining Process

• A manifestation of the best practices

• A systematic way to conduct DM projects

• Moves DM projects from art to science

• Everybody has a different version

8

Data Mining Process – Most Common Standard Processes

• CRISP-DM (Cross-Industry Standard Process for Data Mining)

• SEMMA (Sample, Explore, Modify, Model, and Assess)

• KDD (Knowledge Discovery in Databases)

9

Data Mining Process – CRISP-DM

• Cross Industry Standard Process for Data Mining

• Proposed in the 1990s by a European consortium

• Composed of six consecutive phases

• Step 1: Business Understanding

• Step 2: Data Understanding

• Step 3: Data Preparation

• Step 4: Model Building

• Step 5: Testing and Evaluation

• Step 6: Deployment

The first three steps account for 85% of the total project time.

10

Data Mining Process – SEMMA

Developed by SAS Institute

Sample

Explore

Modify

Model

Assess

11

Data Mining Process – KDD

KDD (Knowledge Discovery in Databases) Process

12

Other Data Mining Patterns/Tasks

Time-series forecasting

• Part of the sequence or link analysis

Visualization

• Another data mining task

• Covered in earlier lessons

Data Mining versus Statistics

• Are they the same?

• What is the relationship between the two?

13

Data Mining Applications: CRM and Banking

Customer Relationship Management

• Maximize return on marketing campaigns

• Improve customer retention (churn analysis)

• Maximize customer value (cross-, up-selling)

• Identify and treat most valued customers

Banking & Other Financial

• Automate the loan application process

• Detect fraudulent transactions

• Maximize customer value (cross-, up-selling)

• Optimize cash reserves with forecasting

14

Data Mining Applications: Retail and Manufacturing

Retailing and Logistics

• Optimize inventory levels at different locations

• Improve the store layout and sales promotions

• Optimize logistics by predicting seasonal effects

• Minimize losses due to limited shelf life

Manufacturing and Maintenance

• Predict/prevent machinery failures

• Identify anomalies in production systems to optimize the use of manufacturing capacity

• Discover novel patterns to improve product quality

15

Data Mining Applications: Brokerage and Insurance

Brokerage and Securities Trading

• Predict changes in certain bond prices

• Forecast the direction of stock fluctuations

• Assess the effect of events on market movements

• Identify and prevent fraudulent activities in trading

Insurance

• Forecast claim costs for better business planning

• Determine optimal rate plans

• Optimize marketing to specific customers

• Identify and prevent fraudulent claim activities

16

Data Mining Applications

• Computer hardware and software

• Science and engineering

• Government and defense

• Homeland security and law enforcement

• Travel, entertainment, sports

• Healthcare and medicine

• … virtually everywhere

17

Data Mining Steps

• Define/understand purpose

• Obtain data (may involve random sampling)

• Explore, clean, pre-process data

• Reduce the data; if supervised DM, partition it

• Specify task (classification, clustering, etc.)

• Choose the techniques (regression, CART, neural networks, etc.)

• Iterative implementation and “tuning”

• Assess results – compare models

• Deploy best model

18

Data Mining Methods: Classification

• Most frequently used DM method

• Part of the machine-learning family

• Employs supervised learning

• Learn from past data, classify new data

• The output variable is categorical (nominal or ordinal) in nature

• Classification versus regression?

• Classification versus clustering?

19

Assessment Methods for Classification

Predictive accuracy

• Hit rate

Speed

• Model building versus predicting/usage speed

Robustness

Scalability

Interpretability

• Transparency, explainability

20

Accuracy of Classification Models

Accuracy = (TP + TN) / (TP + TN + FP + FN)

True Positive Rate = TP / (TP + FN)

True Negative Rate = TN / (TN + FP)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
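
The formulas above can be checked with a small sketch. The function name and the confusion-matrix counts below are made up for illustration; only the five formulas come from the card.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return the card's five measures from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),  # identical to recall
        "true_negative_rate": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 0.850, the true positive rate and recall are both 0.800, the true negative rate is 0.900, and precision is about 0.889. True positive rate and recall are the same quantity under two names.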

21

Estimation Methodologies for Classification: Single/Simple Split

Simple split (or holdout or test sample estimation)

• Split the data into two mutually exclusive sets: training (~70%) and testing (~30%)

• For neural networks, the data is split into three subsets: training (~60%), validation (~20%), and testing (~20%)
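
A minimal sketch of the two-way holdout split, using only the standard library. The function name, the fixed seed, and the list of integers standing in for data rows are assumptions for illustration.

```python
import random


def simple_split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


records = list(range(100))  # stand-in for 100 data rows
train, test = simple_split(records)
print(len(train), len(test))  # 70 30
```

Shuffling before cutting matters when the rows are stored in a meaningful order (for example by date or by class); otherwise the test set would not resemble the training set.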

22

Estimation Methodologies for Classification: k-Fold Cross Validation (rotation estimation)

Data is split into k mutually exclusive subsets, and k training/testing experiments are conducted; each experiment tests on one subset and trains on the remaining k – 1
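
The rotation can be sketched as an index generator. The function name and the way the folds are formed (every k-th shuffled index) are illustrative choices, not something the card prescribes.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), len(train_idx))
```

Every sample lands in the test set exactly once across the k experiments, and the k accuracy estimates are usually averaged.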

23

Additional Estimation Methodologies for Classification

Leave-one-out

• Similar to k-fold where k = number of samples

Bootstrapping

• Random sampling with replacement

Jackknifing

• Similar to leave-one-out

Area Under the ROC Curve (AUC)

• ROC: receiver operating characteristics (a term borrowed from radar image processing)
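
As a sketch of bootstrapping from the list above: draw as many indices as there are samples, with replacement. Treating the never-drawn samples as a test set is a common convention assumed here, not something stated on the card; the function name is also illustrative.

```python
import random


def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement; return them and the unused ones."""
    drawn = [rng.randrange(n_samples) for _ in range(n_samples)]
    left_out = sorted(set(range(n_samples)) - set(drawn))
    return drawn, left_out


rng = random.Random(0)
drawn, left_out = bootstrap_sample(20, rng)
print(len(drawn), len(set(drawn)), len(left_out))
```

Because sampling is with replacement, some indices typically appear more than once in `drawn`, and the ones that never appear end up in `left_out`.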

24

Data Exploration – Sampling

Data mining typically deals with huge databases

Sampling is required for piloting/prototyping

Algorithms and models are applied to a sample to produce statistically-valid results

Once a final model is developed and selected, you use it to “score” (predict values or classes for) the observations in the larger database

25

Data Exploration – Over Sampling

Often the event of interest is rare

Examples: response to mailing, fraud in taxes, …

Sampling may yield too few “interesting” cases to effectively train a model

A popular solution: oversample the rare cases to obtain a more balanced training set

Later, the results must be adjusted to account for the oversampling
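
One simple way to oversample is to duplicate randomly chosen rare cases until the classes are the same size. This is only one of several oversampling schemes and is an assumption here; the function name and the mailing data are invented for illustration.

```python
import random


def oversample_rare(records, is_rare, seed=0):
    """Duplicate randomly chosen rare records until both classes are equal in size."""
    rare = [r for r in records if is_rare(r)]
    common = [r for r in records if not is_rare(r)]
    if not rare or len(rare) >= len(common):
        return list(records)
    rng = random.Random(seed)
    extra = [rng.choice(rare) for _ in range(len(common) - len(rare))]
    balanced = common + rare + extra
    rng.shuffle(balanced)
    return balanced


# Hypothetical data: 5 responders (label 1) among 100 mailed customers.
records = [{"id": i, "responded": 1 if i < 5 else 0} for i in range(100)]
balanced = oversample_rare(records, lambda r: r["responded"] == 1)
print(sum(r["responded"] for r in balanced), len(balanced))  # 95 190
```

The balanced set is meant for training only. Because responders now make up half of it instead of 5%, predicted rates from a model trained on it overstate the true rate.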

26

Data Mining – Explore, clean, pre-process data

Variable types determine the pre-processing needed and the algorithms used

Main distinction: categorical vs. numeric

Numeric

• Continuous

• Integer

Categorical

• Ordered (low, medium, high)

• Unordered (male, female)

27

Data Mining – Variable Handling

Numeric

• Most algorithms can handle numeric data

• May occasionally need to “bin” into categories

Categorical

• Naïve Bayes can use as-is

• In most other algorithms, must create binary dummies (number of dummies = number of categories – 1) [see Table 2.6 for R code]
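
Both operations can be sketched in plain Python. This is not the R code from Table 2.6; the function names, the choice of reference category, and the age cut points are assumptions for illustration.

```python
def dummy_encode(values):
    """Create m-1 binary dummies for a variable with m categories.

    The first category in sorted order is the reference level (all zeros).
    """
    categories = sorted(set(values))
    kept = categories[1:]
    return [{f"is_{c}": int(v == c) for c in kept} for v in values]


def bin_numeric(value, cut_points, labels):
    """Map a number to a label; cut_points are ascending, exclusive upper bounds.

    labels must contain one more entry than cut_points.
    """
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


sizes = ["low", "high", "medium", "low"]
encoded = dummy_encode(sizes)
print(encoded)

ages = (25, 45, 70)
binned = [bin_numeric(a, [30, 60], ["young", "middle", "senior"]) for a in ages]
print(binned)
```

Three categories produce two dummy columns; "high" is the reference level here, so it is encoded as all zeros. Dropping one category avoids a redundant column, since its value is implied by the others.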

28

Data Mining Mistakes

• Selecting the wrong problem for data mining

• Ignoring what your sponsor thinks data mining is and what it really can/cannot do

• Beginning without the end in mind

• Not leaving sufficient time for data acquisition, selection, and preparation

• Looking only at aggregated results and not at individual records/predictions