classification problems in marketing
• Who are the target segments?
• Who are the profitable consumers?
• Will this person like this movie (movie recommendation, e.g., Netflix)?
• Others: Is this email spam? Which potential customers will open a bank account?
• Commonly used models: Linear Discriminant Analysis, Naïve Bayes, Decision Trees, Random Forest, Neural Networks, Support Vector Machines, etc.
machine learning
explores the construction of algorithms that can learn from and make predictions on data
i.e., how the response variable varies with the values of the given predictors
supervised learning
machine learning task of inferring a function from LABELED training data
classification: inputs are divided into two or more categories, and the learner assigns unseen inputs to a category using a model. This is typically tackled in a supervised way. Example: spam filtering, where the inputs are emails and the categories are "spam" and "not spam".
supervised learning classification methods
Decision Trees, Ensembles (Bagging, Boosting, Random Forest), Logistic Regression, Support Vector Machine.
unsupervised learning
NO LABELS are given to the learning algorithm, leaving it on its own to find structure in its input, i.e., discovering hidden patterns (latent structure) in the data (wiki).
clustering
clustering
unsupervised learning
a set of inputs is to be divided into groups (or segments). Unlike classification, the groups are not known beforehand (i.e., no label information), making this typically an unsupervised task: searching for hidden structure.
Hierarchical Clustering, K-means method, Model-based Clustering, etc.
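A minimal R sketch of two of these methods on made-up unlabeled data (the data and the choice k = 3 are purely illustrative):

```r
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)   # hypothetical unlabeled inputs

km <- kmeans(X, centers = 3)        # K-means: partition into 3 segments (k chosen in advance)
km$cluster                          # segment assignment for each row

hc <- hclust(dist(X))               # hierarchical clustering on Euclidean distances
cutree(hc, k = 3)                   # cut the dendrogram into 3 groups
```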
Classification and Regression Tree (CART) Advantages
a single decision tree
• Computationally simple and quick to fit, even for large problems.
• Automatic variable selection.
• Very easy to interpret (if the tree is small).
• Tree picture provides valuable insights (intuitive).
• Terminal nodes suggest a clustering of the data.
Classification and Regression Tree (CART) Disadvantages
• Accuracy – relatively lower than ensemble tree methods (e.g., Random Forest).
• Instability – if the data change a little, the tree picture can change a lot (especially if the first split changes).
• Thus, in practice, boosting or ensemble models are used more often.
CART
There is a target Y (the DV to be classified) and related predictors X used as classifiers.
• We denote the feature space by X.
• Tree structured classifiers are constructed by repeated splits of the space X into smaller and smaller subsets, beginning with X itself.
• Definitions: parent node, child node, terminal (leaf) node – see next slides.
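A minimal rpart sketch of this repeated-splitting construction (the data frame df, target Y, and predictors are hypothetical placeholders):

```r
library(rpart)

# Y is the target to classify; the other columns of df form the feature space X
fit <- rpart(Y ~ ., data = df, method = "class")

plot(fit)
text(fit)   # tree picture: parent/child splits down to the terminal (leaf) nodes
```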
impurity
Impurity measures how mixed a node is.
Pure node → all observations in one class
Impure node → mixed classes
Goal: maximize purity after each split

entropy
common way to measure impurity
Key cases:
Entropy = 0 → perfectly pure
Entropy = 1 → 50/50 split (max impurity for two classes)
Used to evaluate how good a split is
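A small R sketch of the entropy computation and the two key cases above (the helper function is illustrative):

```r
# entropy of a node: -sum(p_i * log2(p_i)) over the class proportions p_i
entropy <- function(p) {
  p <- p[p > 0]            # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

entropy(c(1, 0))      # 0 -> perfectly pure node
entropy(c(0.5, 0.5))  # 1 -> 50/50 split, maximum impurity (two classes)
```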

information gain
used to determine which attribute is most useful for discriminating between the classes to be learned
tells us how important a given attribute of the feature vectors is
information gain formula
entropy(parent) - [weighted average entropy of the children]
![entropy(parent) - weighted average entropy of the children](https://knowt-user-attachments.s3.amazonaws.com/1fa74742-3b31-4570-9450-11f08475baf4.png)
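A worked sketch in R with made-up counts: a 5/5 parent split into a 5/1 child and a 0/4 child (all numbers hypothetical):

```r
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

parent <- entropy(c(5, 5) / 10)   # 1 (50/50 parent)
child1 <- entropy(c(5, 1) / 6)    # ~0.650 (6 observations)
child2 <- entropy(c(0, 4) / 4)    # 0 (4 observations, pure)

# average entropy of the children, weighted by child size
gain <- parent - (6/10 * child1 + 4/10 * child2)
gain                              # ~0.61 -> a fairly informative split
```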
CART selects what variables?
CART automatically selects variables that:
Reduce impurity the most (highest information gain)
Improve classification accuracy
ex: duration, previous
prediction hit ratio in CART
Look at the root node error.
ex: root node error = 3488/30891 = 0.11291, i.e., error rate = 0.11291
HIT RATIO (accuracy) = 1 - error rate = 1 - 0.11291 = 0.88709
→ 88.7% prediction hit ratio
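A hedged rpart sketch of reading the hit ratio this way (the data frame bank and target deposit are hypothetical names; the numbers are the ones from the example above):

```r
library(rpart)

fit <- rpart(deposit ~ ., data = bank, method = "class")   # deposit: binary factor target
printcp(fit)           # output includes a line like "Root node error: 3488/30891 = 0.11291"

err <- 3488 / 30891    # root node error read off the printcp output
1 - err                # 0.88709 -> 88.7% prediction hit ratio (accuracy)
```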

best attribute CART
• The one which will result in the smallest tree
• Heuristic: choose the attribute that produces the “purest” nodes
linear discriminant analysis
[a classic method] As a possible example, a predictor (X1) such as a "Business Application" attribute might be able to classify students between the "Business Analytics" group and the "Statistics/Engineering" group
LDA goal
to find the discriminant function Z (e.g., a linear combination of the predictors) that leads to an optimal division of the groups
LDA model assumptions
• Predictor variables are normally distributed (i.e., Multivariate Normal Distribution) – if not, we can consider other methods (e.g., logistic regression, tree models).
• That is, predictors for LDA should be continuous variables; usually people transform/standardize the data (e.g., the scale function in R).
LDA function
derives the linear combination of 2 (or more) independent variables that will discriminate between discrete groups
The linear combination (a.k.a. discriminant function or axis) takes the following form:
Z = b1X1 + b2X2 + ... + bkXk

linear combination formula
The discriminant weights (bi) are chosen to maximize the ratio of the between-group variance to the within-group variance of Z.

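A minimal MASS::lda sketch (the data frame students, its group label, and predictors X1, X2 are hypothetical):

```r
library(MASS)

fit <- lda(group ~ X1 + X2, data = students)
fit$scaling              # the discriminant weights b_i of Z = b1*X1 + b2*X2

pred <- predict(fit)
head(pred$class)         # predicted group membership for each student
```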
quadratic discriminant analysis
provides a non-linear quadratic decision boundary. When the decision boundary is moderately non-linear, QDA may give better results.
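A short variation on the LDA sketch above, again with hypothetical names (MASS::qda fits the quadratic boundary):

```r
library(MASS)

qfit <- qda(group ~ X1 + X2, data = students)   # quadratic decision boundary
predict(qfit)$class                             # predicted groups
```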