Data Mining


Last updated 7:49 PM on 4/21/26


28 Terms

New cards

Why data mining?

• More intense competition at the global scale.

• Recognition of the value in data sources.

• Availability of quality data on customers, vendors, transactions, Web, etc.

• Consolidation and integration of data repositories into data warehouses.

• The exponential increase in data processing and storage capabilities

• Movement toward conversion of information resources into nonphysical form.

New cards

Data Mining definition

The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases (Fayyad et al., 1996).

• Other Names

• knowledge extraction

• pattern analysis

• knowledge discovery

• information harvesting

• pattern searching

New cards

Data Mining is a blend of what disciplines?

Statistics

Machine learning and pattern recognition

Information visualization

Database management and data warehousing

Management science and information systems

New cards

Data Mining Characteristics & Objectives

• Source of data for DM is often a consolidated data warehouse (not always!).

• DM environment is usually a client-server or a Web-based architecture.

• Data is the most critical ingredient for DM, and it may include soft/unstructured data.

• The miner is often an end user.

• Striking it rich requires creative thinking.

• Data mining tools’ capabilities and ease of use are essential

New cards

How does data mining work?

DM extracts patterns from data

• Pattern?

• A mathematical (numeric and/or symbolic) relationship among data items

Types of patterns

• Association

• Prediction

• Cluster (segmentation)

• Sequential (or time series) relationships

New cards

Data Mining Core Techniques

Classification

Prediction

Association Rules & Recommenders

Data & Dimension Reduction

Data Exploration

Visualization

New cards

Data Mining Process

• A manifestation of the best practices

• A systematic way to conduct DM projects

• Moving from art to science for DM projects

• Everybody has a different version

New cards

Data Mining Process: Most Common Standard Processes

• CRISP-DM (Cross-Industry Standard Process for Data Mining)

• SEMMA (Sample, Explore, Modify, Model, and Assess)

• KDD (Knowledge Discovery in Databases)

New cards

Data Mining Process – CRISP-DM

• Cross Industry Standard Process for Data Mining

• Proposed in the 1990s by a European consortium

• Composed of six consecutive phases

• Step 1: Business Understanding

• Step 2: Data Understanding

• Step 3: Data Preparation

• Step 4: Model Building

• Step 5: Testing and Evaluation

• Step 6: Deployment

The first three steps account for about 85% of total project time

New cards

Data Mining Process – SEMMA

Developed by SAS Institute

Sample

Explore

Modify

Model

Assess

New cards

Data Mining Process – KDD

KDD (Knowledge Discovery in Databases) Process

New cards

Other Data Mining Patterns/Tasks

Time-series forecasting

• Part of the sequence or link analysis

Visualization

• Another data mining task

• Covered in earlier lessons

Data Mining versus Statistics

• Are they the same?

• What is the relationship between the two?

New cards

Data Mining Applications: CRM and Banking

Customer Relationship Management

• Maximize return on marketing campaigns

• Improve customer retention (churn analysis)

• Maximize customer value (cross-, up-selling)

• Identify and treat most valued customers

Banking & Other Financial

• Automate the loan application process

• Detect fraudulent transactions

• Maximize customer value (cross-, up-selling)

• Optimize cash reserves with forecasting

New cards

Data Mining Applications: Retail and Manufacturing

Retailing and Logistics

• Optimize inventory levels at different locations

• Improve the store layout and sales promotions

• Optimize logistics by predicting seasonal effects

• Minimize losses due to limited shelf life

Manufacturing and Maintenance

• Predict/prevent machinery failures

• Identify anomalies in production systems to optimize the use of manufacturing capacity

• Discover novel patterns to improve product quality

New cards

Data Mining Applications: Brokerage and Insurance

Brokerage and Securities Trading

• Predict changes in certain bond prices

• Forecast the direction of stock fluctuations

• Assess the effect of events on market movements

• Identify and prevent fraudulent activities in trading

Insurance

• Forecast claim costs for better business planning

• Determine optimal rate plans

• Optimize marketing to specific customers

• Identify and prevent fraudulent claim activities

New cards

Data Mining Applications

• Computer hardware and software

• Science and engineering

• Government and defense

• Homeland security and law enforcement

• Travel, entertainment, sports

• Healthcare and medicine

• Sports,… virtually everywhere…

New cards

Data Mining Steps

• Define/understand purpose

• Obtain data (may involve random sampling)

• Explore, clean, pre-process data

• Reduce the data; if supervised DM, partition it

• Specify task (classification, clustering, etc.)

• Choose the techniques (regression, CART, neural networks, etc.)

• Iterative implementation and “tuning”

• Assess results: compare models

• Deploy best model

New cards

Data Mining Methods: Classification

• Most frequently used DM method

• Part of the machine-learning family

• Employs supervised learning

• Learn from past data, classify new data

• The output variable is categorical (nominal or ordinal) in nature

• Classification versus regression?

• Classification versus clustering?

New cards

Assessment Methods for Classification

Predictive accuracy

• Hit rate

Speed

• Model building versus predicting/usage speed

Robustness

Scalability

Interpretability

• Transparency, explainability

New cards

Accuracy of Classification Models

Accuracy = (TP + TN) / (TP + TN + FP + FN)

True Positive Rate = TP / (TP + FN)

True Negative Rate = TN / (TN + FP)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
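As a rough illustration, the measures on this card can be computed directly from the four confusion-matrix counts. The function name and the counts below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the card's five measures from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # same quantity as the true positive rate
    }


# Hypothetical counts, not taken from any real model.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that recall and the true positive rate are the same ratio under two names, which is why the last two entries always agree.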

New cards

Estimation Methodologies for Classification: Single/Simple Split

Simple split (or holdout or test sample estimation)

• Split the data into 2 mutually exclusive sets: training (~70%) and testing (~30%)

– For Neural Networks, the data is split into three subsets (training [~60%], validation [~20%], testing [~20%])
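A minimal sketch of both splits, assuming the records fit in a Python list and that a shuffled cut is acceptable (no stratification). The function names, the seed, and the toy data are illustrative assumptions.

```python
import random


def simple_split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def three_way_split(records, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle and cut into training, validation, and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    first = round(len(shuffled) * fractions[0])
    second = first + round(len(shuffled) * fractions[1])
    return shuffled[:first], shuffled[first:second], shuffled[second:]


records = list(range(100))  # stand-in for 100 observations
train, test = simple_split(records)
nn_train, nn_valid, nn_test = three_way_split(records)
print(len(train), len(test))
print(len(nn_train), len(nn_valid), len(nn_test))
```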

New cards

Estimation Methodologies for Classification: k-Fold Cross Validation (rotation estimation)

Data is split into k mutually exclusive subsets, and k training/testing experiments are conducted, each using one subset for testing and the remaining k - 1 subsets for training
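The rotation can be sketched with index lists alone. This is an illustrative version with no shuffling, and the function name is an assumption; in practice the indices would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k experiments."""
    indices = list(range(n_samples))
    # Deal indices into k mutually exclusive folds, round-robin.
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Every observation lands in the test fold exactly once, so the k test results together cover the whole data set.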

New cards

Additional Estimation Methodologies for Classification

Leave-one-out

• Similar to k-fold where k = number of samples

Bootstrapping

• Random sampling with replacement

Jackknifing

• Similar to leave-one-out

Area Under the ROC Curve (AUC)

• ROC: receiver operating characteristics (a term borrowed from radar image processing)
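Two of these ideas are small enough to sketch. The bootstrap sample below draws with replacement, and the AUC is computed through its pairwise interpretation: the share of (positive, negative) pairs in which the positive case gets the higher score, with ties counted as half. The function names, labels, and scores are hypothetical.

```python
import random


def bootstrap_sample(records, seed=0):
    """Draw len(records) items at random, with replacement."""
    rng = random.Random(seed)
    return rng.choices(records, k=len(records))


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical class labels and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(labels, scores))
print(bootstrap_sample(list(range(10))))
```

The pairwise loop is quadratic in the number of cases, which is fine for a flashcard-sized example but not for large data sets.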

New cards

Data Exploration – Sampling

Data mining typically deals with huge databases

Sampling required for piloting/prototyping

Algorithms and models are applied to a sample to produce statistically-valid results

Once you have a final model, you use it to “score” (predict values or classes for) the observations in the larger database

New cards

Data Exploration – Over Sampling

Often the event of interest is rare

Examples: response to mailing, fraud in taxes, …

Sampling may yield too few “interesting” cases to effectively train a model

A popular solution: oversample the rare cases to obtain a more balanced training set

Later, need to adjust results for the oversampling
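One possible sketch of this, assuming a two-class problem where the rare class is duplicated at random until the classes are balanced. The later adjustment is shown as per-class weights (original proportion divided by oversampled proportion), which is one common way to restore the original class mix; the function names and the 5%/95% toy data are assumptions.

```python
import random
from collections import Counter


def oversample_rare(records, label_of, rare_label, seed=0):
    """Add random duplicates of the rare class until both classes are equal in size."""
    rng = random.Random(seed)
    rare = [r for r in records if label_of(r) == rare_label]
    common = [r for r in records if label_of(r) != rare_label]
    extra = rng.choices(rare, k=len(common) - len(rare))
    return common + rare + extra


def class_weights(original, balanced, label_of):
    """Weight per class that maps the oversampled mix back to the original mix."""
    orig_counts = Counter(label_of(r) for r in original)
    bal_counts = Counter(label_of(r) for r in balanced)
    return {
        label: (orig_counts[label] / len(original)) / (bal_counts[label] / len(balanced))
        for label in orig_counts
    }


def label_of(record):
    return record[0]


# Hypothetical data: 5 fraud cases among 100 records.
records = [("fraud", i) for i in range(5)] + [("ok", i) for i in range(95)]
balanced = oversample_rare(records, label_of, "fraud")
weights = class_weights(records, balanced, label_of)
print(Counter(label_of(r) for r in balanced), weights)
```

Multiplying each case by its class weight when summarizing results counts a fraud case as 0.1 and a normal case as 1.9, which recovers the original 5% rate.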

New cards

Data Mining – Explore, clean, pre-process data

Variable types determine the types of pre-processing needed and the algorithms used

Main distinction: categorical vs. numeric

Numeric

• Continuous

• Integer

Categorical

• Ordered (low, medium, high)

• Unordered (male, female)

New cards

Data Mining – Variable Handling

Numeric

• Most algorithms can handle numeric data

• May occasionally need to “bin” into categories

Categorical

• Naïve Bayes can use as-is

• In most other algorithms, must create binary dummies (number of dummies = number of categories - 1) [see Table 2.6 for R code]
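The card points to R code in Table 2.6, which is not reproduced here. As a stand-in, the following Python sketch shows the same two ideas: creating one fewer dummy than there are categories, and binning a numeric value. The function names, the `is_` column prefix, the bin edges, and the toy values are all assumptions.

```python
def dummy_encode(values, drop_first=True):
    """Turn a categorical column into 0/1 dummies, dropping one reference category."""
    categories = sorted(set(values))
    kept = categories[1:] if drop_first else categories
    return [{f"is_{c}": int(v == c) for c in kept} for v in values]


def bin_numeric(value, edges, labels):
    """Return the label of the first bin whose upper edge exceeds the value."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]  # at or above the last edge


sizes = ["low", "high", "medium", "low"]
encoded = dummy_encode(sizes)
print(encoded)  # three categories become two dummy columns

ages = [22, 35, 75]
print([bin_numeric(a, edges=[30, 60], labels=["low", "medium", "high"]) for a in ages])
```

The dropped category ("high" here, the first in sorted order) is represented by all dummies being 0, so no information is lost.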

New cards

Data Mining Mistakes

• Selecting the wrong problem for data mining

• Ignoring what your sponsor thinks data mining is and what it really can/cannot do

• Beginning without the end in mind

• Not leaving sufficient time for data acquisition, selection, and preparation

• Looking only at aggregated results and not at individual records/predictions