Data Science 3

Session 3

• In this session, we will explore several commonly used data science techniques that develop rules, relationships, and models from predictor information, which can then be applied to classify outcomes for new, unseen data. 

• Examples: 

– “Which customers are likely to leave the company when their contracts expire?” 

– “Which potential customers are likely not to pay off their account balances?” 


Predictive Model

A predictive model is a mathematical algorithm that predicts a target variable from a number of explanatory variables 


Classification - Regression


• Most predictive analytics problems can be categorized into either classification or numeric predictions problems 

• In classification or class prediction, we try to use the information from the predictors or independent variables to sort the data samples into 2 or more distinct classes or buckets. 

• In the case of numeric prediction, we try to predict the numerical value of a dependent variable using the values assumed by the independent variables, as is done in a traditional regression modeling.
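
A minimal sketch of this contrast, assuming scikit-learn is available (the feature values and targets below are made up purely for illustration):

```python
# Sketch: the same predictor matrix X can feed either a classifier or a
# regressor, depending on whether the target is a class label or a number.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

X = [[25, 30000], [40, 60000], [35, 45000], [50, 80000]]  # e.g. age, income

# Classification: target is a discrete class label ("yes"/"no")
y_class = ["no", "yes", "no", "yes"]
clf = DecisionTreeClassifier().fit(X, y_class)
print(clf.predict([[45, 70000]]))      # -> a class label

# Numeric prediction: target is a continuous value (e.g. yearly spending)
y_numeric = [500, 1500, 900, 2200]
reg = LinearRegression().fit(X, y_numeric)
print(reg.predict([[45, 70000]]))      # -> a number
```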


Classification Examples


− Credit Risk Assessment 

• Attributes: your age, income, debts, … 

• Class: Will your bank grant you credit? 

− Marketing 

• Attributes: previously bought products, browsing behavior 

• Class: are you a target customer for a new product? 

− SPAM Detection 

• Attributes: words and header fields of an e-mail 

• Class: regular e-mail or spam e-mail? 

− Identifying Tumor Cells 

• Attributes: features extracted from x-rays or MRI scans 

• Class: malignant or benign cells


Rule Induction 

• Rule induction is a data science process of deducing if-then rules from a data set. 

• These symbolic decision rules explain an inherent relationship between the independent variables and target variable in the data set. 

• The easiest way to extract rules from a data set is from a decision tree that is developed on the same data set. 

• A decision tree splits data on every node and leads to the leaf where the class is identified. 

• If we trace back from the leaf to the root node, we can combine all the split conditions to form a distinct rule.
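
A small sketch of this idea, assuming scikit-learn: a tree is trained on the Iris data set and its split conditions are printed, so each root-to-leaf path can be read off as one if-then rule.

```python
# Rule induction via a decision tree: each printed root-to-leaf path
# (a chain of split conditions ending in a class) reads as one if-then rule.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Trace any leaf back to the root: the combined conditions along the path
# form a rule of the form "IF <condition> AND <condition> THEN class = ...".
print(export_text(tree, feature_names=list(iris.feature_names)))
```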


K-Nearest Neighbors (k-NN) 


• ‘Eager learners’ vs. ‘lazy learners’

– ‘Eager learners’ develop a mathematical relationship between the input and target variables.

– ‘Lazy learners’ use a lookup table to match the input variables and find the outcome 

• Central logic of k-NN: similar records congregate in a neighborhood in n-dimensional space and share the same target class label. 

• The entire training data set is “memorized” 

• When unlabeled example records need to be classified, the input attributes of the new unlabeled records are compared against the entire training set to find the closest matches. 

• The class label of the closest training record is the predicted class label for the unseen test record.


Require 3 things 

1. A set of stored records 

2. A distance measure to compute distance between records 

3. The value of k, the number of nearest neighbors to consider 


• To classify an unknown record: 

1. Compute distance to each training record 

2. Identify k-nearest neighbors 

3. Use class labels of nearest neighbors to determine the class label of unknown record 

• by taking majority vote or 

• by weighing the vote according to distance
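
A from-scratch sketch of these three steps (Euclidean distance, simple majority vote); the toy records and labels are illustrative, not from the slides.

```python
# Minimal k-NN classifier: memorize the training set, then compare a query
# record against every stored record and vote among the k nearest ones.
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    # 1. Compute the distance from the query to each training record
    distances = [math.dist(query, x) for x in train_X]
    # 2. Identify the k nearest neighbors
    nearest = sorted(range(len(train_X)), key=lambda i: distances[i])[:k]
    # 3. Majority vote over the neighbors' class labels
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_X = [[1.0, 1.1], [1.2, 0.9], [6.0, 6.2], [5.8, 6.1]]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, [5.9, 6.0], k=3))  # -> "B"
```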


Model Performance for a Predictive Model 


• How do I know that my predictive model is any good? 

• Two important types of model evaluation:

– Domain knowledge validation 

• sanity checking by data scientists 

• using model as interface between data scientists and stakeholders

– important to have model that is comprehensible to stakeholders

– can get domain expert (e.g. Product Manager) assessment of model

• validate that data mining (training/testing) and use contexts are acceptably similar

– Test set validation


Model Evaluation 


How good is a model at classifying unseen records? 

1. Methods for Model Evaluation 

How to obtain reliable estimates? 

2. Metrics for Model Evaluation 

How to measure the performance of a model?


1. Methods for Model Evaluation


Test Set – Generalization 


• Evaluation on training data provides no assessment of how well the model generalizes to unseen cases 

• Idea: “Hold out” some data for which we know the value of the target variable, but which will not be used to build the model 

→ “Test set” 

• Predict the values of the test set with the model and compare them with the actual values → generalization performance 

• Generalization is the property of a model or modeling process whereby the model applies to data that were not used to build the model 

• Data Science needs to create models that generalize beyond training data
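
A minimal hold-out sketch, assuming scikit-learn and its Iris data set: the model is fit on the training split and its generalization performance is estimated on the held-out test split.

```python
# Hold-out evaluation: train on one split, estimate generalization on the rest.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # often optimistic
print("test accuracy: ", model.score(X_test, y_test))    # generalization estimate
```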


Overfitting 


• Overfitting is the tendency of Data Science procedures to tailor models to the training data, at the expense of generalization to previously unseen data points. 

• Trade-off between model complexity and the possibility of overfitting 

• Recognize overfitting and manage complexity in a principled way


Over-fitting in Decision Tree 

• Decision tree: recursively split the data on important, predictive attributes into smaller and smaller subsets 

• Generally: A procedure that grows trees until the leaves are pure tends to overfit 

• The complexity of a tree lies in the number of nodes 




Avoiding Overfitting for Decision Tree

• Decision tree induction will likely result in large, overly complex trees that overfit the data 

• Pre-pruning: stop growing the tree before it gets too complex 

• Post-pruning: prune back a tree that has grown too large (reduce its size) 
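
A pre-pruning sketch, assuming scikit-learn: limiting max_depth and min_samples_leaf stops the tree before its leaves become pure, trading a little training fit for a smaller, less overfit model.

```python
# Compare a fully grown tree with a pre-pruned one on a held-out test set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                                random_state=0).fit(X_train, y_train)

print("full tree  - nodes:", full.tree_.node_count,
      "test acc:", round(full.score(X_test, y_test), 3))
print("pre-pruned - nodes:", pruned.tree_.node_count,
      "test acc:", round(pruned.score(X_test, y_test), 3))
```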


Cross-Validation 

• Cross-validation is a more sophisticated training and testing procedure 

• It yields not only a simple estimate of the generalization performance, but also statistics on the estimated performance (mean, variance, ...) 

• How does the performance vary across datasets? 

– assessing confidence in the performance estimate 

• Cross-validation computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing 
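
A cross-validation sketch, assuming scikit-learn: five splits give both a mean estimate of the generalization performance and its spread across folds.

```python
# 5-fold cross-validation: systematically swap out one fold for testing.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print("fold accuracies:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```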


Generalization Performance 

• Different models may have different performance on the same data 

• Different training sets may result in different generalization performance 

• Different test sets may result in different estimates of the generalization performance 

• If the training set size changes, you may also expect different generalization performance of the model


2. Metrics for Model Evaluation


Confusion Matrix for a Binary Classification 

• Focus on the predictive capability of a model

• The confusion matrix counts the correct and false classifications 

• The counts are the basis for calculating different performance metrics
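
A small sketch, assuming scikit-learn's metrics module; the true and predicted labels below are invented for illustration.

```python
# Count TP/FP/FN/TN from the confusion matrix and derive common metrics.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
```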


Gains & Lift Charts 

• Gain or lift is a measure of the effectiveness of a classification model, calculated as the ratio between the results obtained with and without the model 

• Gain & lift charts are visual aids for evaluating the performance of a classification model 

• A lift chart evaluates model performance on a portion of the population
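
A rough lift computation, assuming NumPy (not mentioned in the slides); the scores and responses are simulated purely to illustrate the ratio of the model's top-decile response rate to the overall, no-model baseline.

```python
# Lift in the top decile = response rate among the 10% of records the model
# scores highest, divided by the overall response rate (the "no model" baseline).
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(1000)                    # model scores (simulated)
actual = rng.random(1000) < scores * 0.4     # higher score -> more responders

order = np.argsort(-scores)                  # sort records by score, descending
top_decile = actual[order][:100]             # top 10% of the population

baseline_rate = actual.mean()                # response rate without a model
lift = top_decile.mean() / baseline_rate
print("lift in top decile:", round(lift, 2))
```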


