Classification Models I


22 Terms

1

classification problems in marketing

• Who is the target segment?

• Who are the profitable consumers?

• Will this person like this movie? (movie recommendation, e.g., Netflix)

• Others: Is this email spam? Which potential customers will open a bank account?

• Commonly used models: Linear Discriminant Analysis, Naïve Bayes, Decision Trees, Random Forest, Neural Networks, Support Vector Machines, etc.

2

machine learning

  • explores the construction of algorithms that can learn from and make predictions on data

  • models how the response variable varies depending on the values of the given predictors

3

supervised learning

  • machine learning task of inferring a function from LABELED training data

  • classification: inputs are divided into two or more categories, and the learner assigns unseen inputs to one of the categories using a model. This is typically tackled in a supervised way. Example – spam filtering: the inputs are emails and the categories are "spam" and "not spam".

4

supervised learning classification methods

Decision Trees, Ensembles (Bagging, Boosting, Random Forest), Logistic Regression, Support Vector Machine.

5

unsupervised learning

  • NO LABELS are given to the learning algorithm, leaving it on its own to find structure in its input -- discovering hidden patterns (latent structure) in data (wiki).

  • e.g., clustering

6

clustering

  • unsupervised learning

  • a set of inputs is to be divided into groups (or segments). Unlike classification, the groups are not known beforehand (i.e., no label information), making this typically an unsupervised task – searching for hidden structure.

  • Hierarchical Clustering, K-means method, Model-based Clustering, etc.

7

Classification and Regression Tree (CART) Advantages

  • a single decision tree

  • Computationally simple and quick to fit, even for large problems.

  • Automatic variable selection.

  • Very easy to interpret (if the tree is small).

  • Tree picture provides valuable insights (Intuitive).

  • Terminal nodes suggest a clustering of the data.

8

Classification and Regression Tree (CART) Disadvantages

• Accuracy – relatively lower than other tree methods (e.g., Random Forest).

• Instability – if the data change a little, the tree can change a lot (especially if the first split changes).

• Thus, in practice, boosting or ensemble models are used more often.

9

CART

There is a Y (the target DV to be classified) and related X variables as classifiers.

• We denote the feature space by X.

• Tree structured classifiers are constructed by repeated splits of the space X into smaller and smaller subsets, beginning with X itself.

• Definitions: parent node, child node, terminal (leaf) node – see next slides.
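
A minimal R sketch of fitting a single classification tree with the rpart package; the data frame `bank`, its binary target `y`, and the predictor names are hypothetical stand-ins, not from the original slides:

```r
# Minimal CART sketch in R (rpart); `bank` and `y` are hypothetical names
library(rpart)

fit <- rpart(y ~ ., data = bank, method = "class")  # "class" -> classification tree
printcp(fit)          # complexity table; also reports the root node error
plot(fit); text(fit)  # draw the tree and label its splits
```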

10

impurity

Impurity measures how mixed a node is.

  • Pure node → all observations in one class

  • Impure node → mixed classes

Goal: maximize purity after each split

11

entropy

common way to measure impurity

Key cases:

  • Entropy = 0 → perfectly pure

  • Entropy = 1 → 50/50 split (max impurity)

Used to evaluate how good a split is
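
For a two-class node with class proportion p, entropy = −p·log2(p) − (1−p)·log2(1−p). A quick R check of the key cases above (the helper function name is my own):

```r
# Binary entropy; 0 * log2(0) is taken as 0 by convention
entropy <- function(p) {
  ifelse(p == 0 | p == 1, 0, -p * log2(p) - (1 - p) * log2(1 - p))
}
entropy(1.0)  # 0 -> perfectly pure node
entropy(0.5)  # 1 -> 50/50 split, maximum impurity
```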

12

information gain

  • want to determine which attribute is most useful for discriminating between the classes to be learned

  • tells us how important a given attribute of the feature vectors is

13

information gain formula

entropy(parent) - [weighted average entropy(children)], where each child's entropy is weighted by the fraction of observations it receives
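
A worked example with hypothetical counts, reusing the entropy() helper sketched in the previous card: a 10/10 parent is split into two children of 10 observations each, with 8/2 and 2/8 class counts:

```r
parent   <- entropy(10 / 20)                    # 50/50 parent -> entropy = 1
children <- (10/20) * entropy(8/10) +           # each child weighted by its
            (10/20) * entropy(2/10)             # share of observations ~ 0.722
parent - children                               # information gain ~ 0.278
```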

14

CART selects what variables?

CART automatically selects variables that:

  • Reduce impurity the most (highest information gain)

  • Improve classification accuracy

ex: duration, previous

15

prediction hit ratio in CART

look at root node error

ex: root node error = 3488/30891 = 0.11291

error rate = 0.11291

HIT RATIO (accuracy) = 1 - error rate

1 - 0.11291 = 0.88709

→ 88.7% prediction hit ratio
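
In R, printcp(fit) on an rpart tree reports this root node error directly. A sketch of computing it from the class counts (reusing the hypothetical `bank` data from the CART sketch above); the root node error is simply the share of observations outside the majority class:

```r
tab        <- table(bank$y)             # class counts at the root node
root_error <- 1 - max(tab) / sum(tab)   # e.g., 3488/30891 = 0.11291
hit_ratio  <- 1 - root_error            # e.g., 0.88709 -> 88.7%
```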

16

best attribute CART

• The one which will result in the smallest tree

• Heuristic: choose the attribute that produces the “purest” nodes

17

linear discriminant analysis

[a classic method] As a possible example, a predictor (X1) such as a "Business Application" attribute might be able to classify students into the "Business Analytics" group versus the "Statistics/Engineering" group.

18

LDA goal

to find the discriminant function Z (e.g., a linear combination of the predictors) that leads to an optimal division of the groups

19

LDA models

• Predictor variables are assumed to be normally distributed (i.e., multivariate normal distribution) – if not, we can consider other methods (e.g., logistic regression, tree models).

• That is, predictors for LDA should be continuous variables – usually the data are transformed/standardized (e.g., the scale function in R).
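
A minimal MASS::lda sketch, again using the hypothetical `bank` data and variable names from the CART sketch; predictors are standardized with scale() first, as the card suggests:

```r
# LDA sketch (MASS); data frame and variable names are hypothetical
library(MASS)

bank_std <- bank
bank_std[c("duration", "previous")] <- scale(bank[c("duration", "previous")])

lfit <- lda(y ~ duration + previous, data = bank_std)
lfit$scaling               # the discriminant weights (b_i)
head(predict(lfit)$class)  # predicted group for the first observations
```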

20

LDA function

derives the linear combination of 2 (or more) independent variables that will discriminate between discrete groups

  • the linear combination (aka the discriminant function or axis) takes the form shown below
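
(The slide's image is not reproduced in this export; the standard form, consistent with the discriminant weights b_i in the next card, is:)

Z = b_1·X_1 + b_2·X_2 + … + b_n·X_n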

21

linear combination

formula

The discriminant weights (b_i) are chosen to maximize the ratio of the between-group variance relative to the within-group variance.

22

quadratic discriminant analysis

Provides a non-linear (quadratic) decision boundary. When the decision boundary is moderately non-linear, QDA may give better results than LDA.
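
MASS::qda uses the same formula interface as lda, so a QDA sketch is a one-line variation on the LDA example above (same hypothetical data and names):

```r
# QDA: same interface as lda(), but fits a quadratic decision boundary
library(MASS)
qfit <- qda(y ~ duration + previous, data = bank_std)
head(predict(qfit)$class)  # predicted group for the first observations
```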