Why data mining?
• More intense competition at the global scale.
• Recognition of the value in data sources.
• Availability of quality data on customers, vendors, transactions, Web, etc.
• Consolidation and integration of data repositories into data warehouses.
• The exponential increase in data processing and storage capabilities
• Movement toward conversion of information resources into nonphysical form.
Data Mining definition
The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases (Fayyad et al., 1996).
• Other Names
• knowledge extraction
• pattern analysis
• knowledge discovery
• information harvesting
• pattern searching
Data Mining is a blend of what disciplines?
Statistics
Artificial intelligence (AI)
Machine learning and pattern recognition
Information visualization
Database management and data warehousing
Management science and information systems
Data Mining Characteristics & Objectives
• Source of data for DM is often a consolidated data warehouse (not always!).
• DM environment is usually a client-server or a Web-based architecture.
• Data is the most critical ingredient for DM, and it may include soft/unstructured data.
• The miner is often an end user.
• Striking it rich requires creative thinking.
• Data mining tools’ capabilities and ease of use are essential
How does Data Mining work?
DM extracts patterns from data
• Pattern?
• A mathematical (numeric and/or symbolic) relationship among data items
Types of patterns
• Association
• Prediction
• Cluster (segmentation)
• Sequential (or time series) relationships
Data Mining Core Techniques
Classification
Prediction
Association Rules & Recommenders
Data & Dimension Reduction
Data Exploration
Visualization
Data Mining Process
• A manifestation of the best practices
• A systematic way to conduct DM projects
• Moving from Art to Science for DM projects
• Everybody has a different version
Data Mining Process – Most common standard processes
• CRISP-DM (Cross-Industry Standard Process for Data Mining)
• SEMMA (Sample, Explore, Modify, Model, and Assess)
• KDD (Knowledge Discovery in Databases)
Data Mining Process – CRISP-DM
• Cross-Industry Standard Process for Data Mining
• Proposed in the 1990s by a European consortium
• Composed of six consecutive phases
• Step 1: Business Understanding
• Step 2: Data Understanding
• Step 3: Data Preparation
• Step 4: Model Building
• Step 5: Testing and Evaluation
• Step 6: Deployment
• The first 3 steps account for 85% of total project time
Data Mining Process – SEMMA
Developed by SAS Institute
Sample
Explore
Modify
Model
Assess
Data Mining Process – KDD
KDD (Knowledge Discovery in Databases) Process
Other Data Mining Patterns/Tasks
Time-series forecasting
• Part of the sequence or link analysis
Visualization
• Another data mining task
• Covered in earlier lessons
Data Mining versus Statistics
• Are they the same?
• What is the relationship between the two?
Data Mining Applications: CRM and Banking
Customer Relationship Management
• Maximize return on marketing campaigns
• Improve customer retention (churn analysis)
• Maximize customer value (cross-, up-selling)
• Identify and treat most valued customers
Banking & Other Financial
• Automate the loan application process
• Detecting fraudulent transactions
• Maximize customer value (cross-, up-selling)
• Optimizing cash reserves with forecasting
Data Mining Applications: Retail and Manufacturing
Retailing and Logistics
• Optimize inventory levels at different locations
• Improve the store layout and sales promotions
• Optimize logistics by predicting seasonal effects
• Minimize losses due to limited shelf life
Manufacturing and Maintenance
• Predict/prevent machinery failures
• Identify anomalies in production systems to optimize the use of manufacturing capacity
• Discover novel patterns to improve product quality
Data Mining Applications: Brokerage and Insurance
Brokerage and Securities Trading
• Predict changes in certain bond prices
• Forecast the direction of stock fluctuations
• Assess the effect of events on market movements
• Identify and prevent fraudulent activities in trading
Insurance
• Forecast claim costs for better business planning
• Determine optimal rate plans
• Optimize marketing to specific customers
• Identify and prevent fraudulent claim activities
Data Mining Applications
• Computer hardware and software
• Science and engineering
• Government and defense
• Homeland security and law enforcement
• Travel, entertainment, sports
• Healthcare and medicine
• … virtually everywhere …
Data Mining Steps
• Define/understand purpose
• Obtain data (may involve random sampling)
• Explore, clean, pre-process data
• Reduce the data; if supervised DM, partition it
• Specify task (classification, clustering, etc.)
• Choose the techniques (regression, CART, neural networks, etc.)
• Iterative implementation and “tuning”
• Assess results (compare models)
• Deploy best model
Data Mining Methods: Classification
• Most frequently used DM method
• Part of the machine-learning family
• Employ supervised learning
• Learn from past data, classify new data
• The output variable is categorical (nominal or ordinal) in nature
• Classification versus regression?
• Classification versus clustering?
Assessment Methods for Classification
Predictive accuracy
• Hit rate
Speed
• Model building versus predicting/usage speed
Robustness
Scalability
Interpretability
• Transparency, explainability
Accuracy of Classification Models
Accuracy = (TP + TN) / (TP + TN + FP + FN)
True Positive Rate = TP / (TP + FN)
True Negative Rate = TN / (TN + FP)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
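The formulas above can be checked with a small sketch. The function name and the example counts below are made up for illustration and are not from the course material.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the five measures from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
        "precision": tp / (tp + fp),
        # Recall is the same quantity as the true positive rate.
        "recall": tp / (tp + fn),
    }


# Hypothetical counts: 100 cases, 50 actual positives, 50 actual negatives.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```

Note that the sketch does not guard against zero denominators (for example, a model that never predicts the positive class makes precision undefined).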
Estimation Methodologies for Classification: Single/Simple Split
Simple split (or holdout or test sample estimation)
• Split the data into two mutually exclusive sets: training (~70%) and testing (~30%)
• For neural networks, the data is split into three subsets: training (~60%), validation (~20%), and testing (~20%)
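A minimal sketch of a simple split, assuming the records fit in memory and that a random shuffle is acceptable. The function name, the 70% default, and the fixed seed are illustrative choices.

```python
import random


def simple_split(records, train_fraction=0.7, seed=0):
    """Shuffle the records and cut them into two mutually exclusive sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data set of 100 record ids.
train, test = simple_split(range(100))
print(len(train), len(test))
```

A three-way split for neural networks could be built the same way by cutting the shuffled list at two points instead of one.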
Estimation Methodologies for Classification: k-Fold Cross Validation (rotation estimation)
The data is split into k mutually exclusive subsets, and k training/testing experiments are conducted, each using a different subset for testing and the remaining k - 1 subsets for training
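The rotation can be sketched as an index generator. This is an illustrative version that assigns samples to folds round-robin without shuffling; in practice the data is usually shuffled first. The function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k experiments."""
    indices = list(range(n_samples))
    # Fold i holds every k-th index starting at i, so folds never overlap.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Hypothetical example: 10 samples, 5 folds.
splits = list(k_fold_indices(10, 5))
for train_idx, test_idx in splits:
    print(test_idx, "held out; trained on", len(train_idx), "samples")
```

Each sample is used for testing exactly once and for training k - 1 times.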
Additional Estimation Methodologies for Classification
Leave-one-out
• Similar to k-fold where k = number of samples
Bootstrapping
• Random sampling with replacement
Jackknifing
• Similar to leave-one-out
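A rough sketch of two of these methods in index form. The function names are hypothetical. Testing a bootstrap model on the samples that were never drawn ("out-of-bag") is a common convention assumed here, not something stated in the cards.

```python
import random


def leave_one_out(n_samples):
    """k-fold with k = n_samples: each sample is the test set exactly once."""
    for i in range(n_samples):
        train_idx = [j for j in range(n_samples) if j != i]
        yield train_idx, [i]


def bootstrap_sample(n_samples, seed=0):
    """Draw n_samples indices with replacement; the rest are out-of-bag."""
    rng = random.Random(seed)
    train_idx = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(train_idx)
    out_of_bag = [j for j in range(n_samples) if j not in chosen]
    return train_idx, out_of_bag


loo_splits = list(leave_one_out(5))
boot_train, boot_oob = bootstrap_sample(10)
print(len(loo_splits), "leave-one-out experiments")
print("bootstrap training indices:", boot_train)
```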
Area Under the ROC Curve (AUC)
• ROC: receiver operating characteristics (a term borrowed from radar image processing)
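One way to compute AUC without drawing the curve uses the fact that it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below assumes binary labels coded 1/0; the example labels and scores are made up.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise positive/negative comparison."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical model scores for six cases.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc_value = auc(labels, scores)
print(round(auc_value, 3))
```

A value of 1.0 means every positive outranks every negative, and 0.5 corresponds to random ranking. The pairwise loop is quadratic, so it suits small examples only.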
Data Exploration – Sampling
Data mining typically deals with huge databases
Sampling is required for piloting/prototyping
Algorithms and models are applied to a sample to produce statistically valid results
Once you have a final model, you use it to “score” (predict values or classes for) the observations in the larger database
Data Exploration – Over Sampling
Often the event of interest is rare
Examples: response to mailing, fraud in taxes, …
Sampling may yield too few “interesting” cases to effectively train a model
A popular solution: oversample the rare cases to obtain a more balanced training set
Later, the results need to be adjusted for the oversampling
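A minimal sketch of one way to oversample: duplicate rare cases at random until the two classes are the same size. The function name, the `is_rare` predicate, and the exact 50/50 target are assumptions made for illustration. Drawing rare cases at a higher sampling rate from the full database is another common approach, and the later adjustment step is not shown.

```python
import random


def oversample_rare(records, is_rare, seed=0):
    """Return a training set where rare cases are duplicated to match the rest."""
    rng = random.Random(seed)
    rare = [r for r in records if is_rare(r)]
    common = [r for r in records if not is_rare(r)]
    if not rare or len(rare) >= len(common):
        return list(records)  # nothing to balance
    # Sample rare cases with replacement to fill the gap.
    extra = [rng.choice(rare) for _ in range(len(common) - len(rare))]
    balanced = common + rare + extra
    rng.shuffle(balanced)
    return balanced


# Hypothetical data: 5 fraud cases among 100 records.
records = [("fraud", i) for i in range(5)] + [("ok", i) for i in range(95)]
balanced = oversample_rare(records, lambda r: r[0] == "fraud")
print(len(balanced), sum(1 for r in balanced if r[0] == "fraud"))
```

Only the training set should be oversampled this way; evaluation data should keep the original class proportions.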
Data Mining – Explore, clean, pre-process data
Variable types determine the pre-processing needed and the algorithms used
Main distinction: categorical vs. numeric
Numeric
• Continuous
• Integer
Categorical
• Ordered (low, medium, high)
• Unordered (male, female)
Data Mining – Variable Handling
Numeric
• Most algorithms can handle numeric data
• May occasionally need to “bin” into categories
Categorical
• Naïve Bayes can use as-is
• In most other algorithms, must create binary dummies (number of dummies = number of categories – 1) [see Table 2.6 for R code]
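The cards point to R code for dummies. The sketch below shows the same two ideas in plain Python as an illustration only: binning a numeric value, and creating (number of categories – 1) binary dummies. The function names, bin edges, and the `is_` column prefix are hypothetical.

```python
import bisect


def bin_numeric(value, edges, labels):
    """Map a number to a category; labels must have len(edges) + 1 entries."""
    return labels[bisect.bisect_right(edges, value)]


def make_dummies(values):
    """One 0/1 column per category, dropping the first (reference) category."""
    categories = sorted(set(values))
    kept = categories[1:]  # m categories -> m - 1 dummies
    return [{f"is_{c}": int(v == c) for c in kept} for v in values]


# Hypothetical numeric variable binned at 30 and 60.
binned = [bin_numeric(x, [30, 60], ["low", "medium", "high"]) for x in (25, 30, 75)]
print(binned)

# Three categories produce two dummy columns; "high" is the reference level.
dummies = make_dummies(["low", "medium", "high", "low"])
print(dummies)
```

A record in the dropped reference category is represented by all dummies being 0, which is why one fewer column than the number of categories is enough.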
Data Mining Mistakes
• Selecting the wrong problem for data mining
• Ignoring what your sponsor thinks data mining is and what it really can/cannot do
• Beginning without the end in mind
• Not leaving sufficient time for data acquisition, selection, and preparation
• Looking only at aggregated results and not at individual records/predictions