Intro to ML: Regression & Decision Trees


60 Terms

1

Supervised Learning- Input

(x^(1), y^(1)), …, (x^(m), y^(m)), where x^(i) is the i-th data point and y^(i) is the i-th label associated with this data point.

2

Supervised Learning- Want

Find a function f such that f(x^(i)) is a good approximation to y^(i).

3

Supervised Learning- Goal

Use the function f to predict labels for unseen data points x.

4

Supervised Learning- X

attributes, predictor variables, or independent variables

5

Supervised Learning- Y

classes, target variable, or dependent variable

concept to be learned

6

hypothesis space

set of functions f:X→Y

7

Classification

  • Learning how to separate/differentiate two or more classes (discrete or categorical target).

  • Learning the separating boundary

8

Classification Examples

Fraud/Not-Fraud, Spam/Not-Spam, Sentiment Analysis, Image Classification (fruits, cars, etc.)

9

Regression

Learning how to predict a continuous numerical target

10

Regression Examples

predicting temperature or house/stock prices

11

Linear Regression- Input

(x^(1), y^(1)), …, (x^(m), y^(m)), where x^(i) ∈ ℝ and y^(i) ∈ ℝ.

12

Linear Regression- Hypothesis Space

The set of linear functions f(x) = ax + b, where a ∈ ℝ and b ∈ ℝ.

13

Linear Regression- Goal

Find a function f such that f(X) ≈ Y, minimizing the error metric.

14

Linear Regression- Error

Computed from the difference between predicted and actual values; e.g., the mean squared error (1/m) Σᵢ (f(x^(i)) − y^(i))².

15

optimal linear hypothesis

Minimizes the sum of squared differences between predicted and actual values (least squares).
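
A minimal closed-form sketch of this least-squares fit for one feature; the toy data and variable names are illustrative:

```python
import numpy as np

# Toy 1-D regression data, roughly y = 2x + 1 plus noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Least-squares solution for f(x) = a*x + b:
# a = cov(x, y) / var(x), b = mean(y) - a * mean(x)
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

mse = np.mean((a * x + b - y) ** 2)  # the squared-error objective being minimized
print(f"a={a:.3f}, b={b:.3f}, MSE={mse:.4f}")
```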

16

Gradient Descent

Iteratively update the parameter θ in the direction of the negative gradient until reaching the minimum value of L(θ).
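
A minimal gradient-descent sketch for the same one-feature squared loss; the toy data, learning rate, and iteration count are illustrative choices:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

a, b = 0.0, 0.0   # initial parameters (theta)
lr = 0.02         # learning rate
for _ in range(2000):
    err = a * x + b - y            # residuals f(x) - y
    grad_a = 2 * np.mean(err * x)  # dL/da for L = mean squared error
    grad_b = 2 * np.mean(err)      # dL/db
    a -= lr * grad_a               # step opposite the gradient
    b -= lr * grad_b
print(f"a={a:.3f}, b={b:.3f}")
```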

17

Gradient

The derivative of the function at a point; it gives the direction of steepest increase, so gradient descent steps in the opposite direction.

18

Gradient Descent Convergence

Gradient descent will converge for convex functions, but the choice of learning rate is crucial: too large and it can overshoot or diverge, too small and convergence is slow.

19

Non-convex functions

Gradient descent can get stuck in a local minimum.

20

evaluate the learned hypothesis on

Test data, not training data, so that the performance estimate is not inflated by overfitting.

21

Classification Input

(x^(1), y^(1)), …, (x^(m), y^(m)), where x^(i) ∈ ℝ and y^(i) ∈ ℤ.

22

Classification

Want to learn a function f to predict a categorical or discrete target

23

What is decision tree?

A representation of concepts and hypotheses based on a set of Boolean/if-then rules. The Boolean conditions determine the outcomes.

24

Example of Decision Tree

(Outlook = “Sunny”) AND (Humidity = “Normal”) → Yes (you should play tennis), but

(Outlook = “Sunny”) AND (Humidity = “High”) → No (you shouldn’t play tennis)
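
The same two rules written as if-then code; a minimal sketch covering only the Outlook and Humidity tests from this example:

```python
def play_tennis(outlook: str, humidity: str) -> str:
    # Each root-to-leaf path is a conjunction (AND) of attribute tests.
    if outlook == "Sunny":
        if humidity == "Normal":
            return "Yes"  # (Outlook=Sunny) AND (Humidity=Normal) -> Yes
        return "No"       # (Outlook=Sunny) AND (Humidity=High)   -> No
    return "?"            # other Outlook values would need further rules
```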

25

internal (non-leaf) node

tests the value of a particular feature (Outlook, Humidity, Wind)

26

leaf node

specifies a class label / outcome (in this case whether or not you should play tennis)

27

DT’s classify instances..

by sorting them down the tree from the root to some leaf node, which provides the classification of the instance

28

Each node in the tree...

specifies a test of some attribute of the instance

29

Each branch descending from that node

corresponds to one of the possible values for this attribute

30

Classification process

An instance is classified by starting at the root of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute in the given example. This process is then repeated for the subtree rooted at the new node
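
A minimal sketch of this root-to-leaf process, assuming the tree is stored as nested dicts of the form {attribute: {value: subtree}} with class labels at the leaves:

```python
tree = {"Outlook": {
    "Sunny":    {"Humidity": {"Normal": "Yes", "High": "No"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Weak": "Yes", "Strong": "No"}},
}}

def classify(node, instance):
    if not isinstance(node, dict):  # leaf node: return its class label
        return node
    attribute = next(iter(node))    # attribute tested at this node
    branch = node[attribute][instance[attribute]]  # branch matching the value
    return classify(branch, instance)              # recurse on the subtree

print(classify(tree, {"Outlook": "Rain", "Wind": "Weak"}))  # -> Yes
```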

31

Induction Process

Inferring general rules from specific data

32

Deductive Process

Using general rules to reach specific conclusions

33

Decision trees divide

the feature space into axis-parallel rectangles

34

DT Algorithm

Recursive and Greedy Algorithm

35

Basic DT Algorithm

(i) Select an attribute

(ii) Split the data based on the values of this attribute

(iii) Recurse over each new partition (see the sketch below)
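
A hedged sketch of this recursive, greedy construction (ID3-style) for categorical attributes; `best_attribute` is a placeholder whose real criterion (information gain) is covered in the next cards:

```python
from collections import Counter

def best_attribute(rows, attributes, target):
    # Placeholder criterion: in practice, pick the attribute with the largest
    # information gain (see the entropy/IG cards below). Here: the attribute
    # whose split leaves the most examples in their partition's majority class.
    def purity(attr):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row[target])
        return sum(Counter(g).most_common(1)[0][1] for g in groups.values())
    return max(attributes, key=purity)

def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:                        # base case: pure node
        return labels[0]
    if not attributes:                               # base case: out of attributes
        return Counter(labels).most_common(1)[0][0]  # majority vote
    attr = best_attribute(rows, attributes, target)  # (i) select an attribute
    children = {}
    for value in {row[attr] for row in rows}:        # (ii) split on its values
        subset = [row for row in rows if row[attr] == value]
        rest = [a for a in attributes if a != attr]
        children[value] = build_tree(subset, rest, target)  # (iii) recurse
    return {attr: children}
```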

36

How to decide which attribute?

Informally, we want the split that gives maximum purity at each node, i.e., a split such that all (or most) instances in each resulting node belong to a single class.

37

Entropy

  • Entropy measures the degree of disorder, impurity, or randomness in the system.

  • Equivalently, entropy measures how homogeneous a decision tree node is (low entropy = homogeneous).

  • Entropy is maximized for uniform distributions.

  • Entropy is minimized for distributions that place all their probability on a single outcome.

H(X) = −Σₓ p(X = x) log₂ p(X = x)

38

for a binary random variable

Entropy is maximal at p = 0.5 and zero at p = 0 or p = 1: H(p) = −p log₂ p − (1 − p) log₂(1 − p).

39

Conditional Entropy

Conditional entropy is the amount of information needed to describe the random variable Y given the random variable X: H(Y|X) = Σₓ p(X = x) H(Y | X = x). In practice, it tells you the remaining uncertainty/impurity about a variable Y when another variable X is known.

40

Information Gain

Informally, IG measures how good an attribute is for splitting a heterogeneous group into homogeneous groups: IG(Y, X) = H(Y) − H(Y|X). Larger information gain corresponds to less uncertainty about Y given X (see the sketch below).
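
A minimal sketch computing entropy, conditional entropy, and information gain for categorical data; the function names are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    # H(Y) = -sum_y p(y) * log2 p(y)
    m = len(labels)
    return -sum((c / m) * math.log2(c / m) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    # H(Y|X) = sum_x p(X = x) * H(Y | X = x)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    return sum(len(g) / len(xs) * entropy(g) for g in groups.values())

def information_gain(xs, ys):
    # IG(Y, X) = H(Y) - H(Y|X)
    return entropy(ys) - conditional_entropy(xs, ys)

# A perfectly informative attribute: IG equals H(Y) = 1 bit
print(information_gain(["a", "a", "b", "b"], ["yes", "yes", "no", "no"]))  # 1.0
```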

41

select an attribute..

with largest IG

42

Where to stop?

  • When the node is pure

  • When you run out of the attributes to recurse on

  • When there are no more data points in the node (empty)

The class corresponding to a leaf is determined by majority vote

43

Discretization

Form an ordinal categorical attribute from a continuous one (see the sketch below).

  • Equal interval bucketing

  • Equal frequency bucketing (percentiles)
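
A minimal numpy sketch of both bucketing strategies; the data and bucket count are illustrative:

```python
import numpy as np

values = np.array([1.0, 2.0, 2.5, 3.0, 7.0, 8.0, 9.5, 10.0])
k = 4  # number of buckets

# Equal interval bucketing: k bins of equal width over [min, max]
interval_edges = np.linspace(values.min(), values.max(), k + 1)
interval_bins = np.digitize(values, interval_edges[1:-1])

# Equal frequency bucketing: bin edges at percentiles, ~equal counts per bin
freq_edges = np.percentile(values, np.linspace(0, 100, k + 1))
freq_bins = np.digitize(values, freq_edges[1:-1])

print(interval_bins)  # bucket index (0..k-1) per value, equal-width bins
print(freq_bins)      # bucket index per value, equal-frequency bins
```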

44

Binary Decision (x_k < t or x_k ≥ t)

Consider all possible splits and find the best threshold t.

  • More computationally intensive, but there is no need to test every threshold (just take the splits in between consecutive pairs of data points with different class labels; see the sketch below)

  • The learning algorithm maximizes over all attributes and all possible thresholds of the real-valued attributes
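
A hedged sketch of the threshold search for one real-valued attribute, reusing the `information_gain` helper from the information-gain card above; it scores the midpoints between consecutive sorted values:

```python
def best_threshold(values, labels):
    # Candidate thresholds: midpoints between consecutive sorted points.
    # (Only midpoints where the class changes can improve IG.)
    pairs = sorted(zip(values, labels))
    best_t, best_ig = None, -1.0
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        split = ["below" if v < t else "above" for v, _ in pairs]
        ig = information_gain(split, [y for _, y in pairs])
        if ig > best_ig:
            best_t, best_ig = t, ig
    return best_t, best_ig

print(best_threshold([1.0, 2.0, 3.0, 4.0], ["no", "no", "yes", "yes"]))  # (2.5, 1.0)
```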

45

Handling training examples with missing attribute values

  • Assign the most common value in the overall data

  • Assign the most common value among other examples with the same target value

46

Overfitting

On small datasets, some splits may fit the training data perfectly but will not generalize well to unseen data.

47

Sources of Overfitting

Noise, Small number of examples in leaves

48

DT model should generalize well

it should work well on examples it has not seen before

49

Train / Test Split

Use training part to learn the model.

Use the test part to evaluate performance and validate the model (see the sketch below).
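
A minimal sketch using scikit-learn (assuming it is available); the toy data, split ratio, and seed are illustrative:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 10  # toy features
y = [0, 1, 1, 0] * 10                      # toy labels

# Hold out 25% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)  # learn on train
print("test accuracy:", model.score(X_test, y_test))    # validate on test
```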

50

Pre-pruning or Early Stopping

Stop the algorithm (recursion) before it generates a fully grown tree, or before it reaches a base case (pure or empty node)

51

Criteria for early stopping

  • Minimum number of samples in a parent node (before splitting)

  • Minimum number of samples on leaf nodes

  • Maximum depth of the tree

  • Minimum IG from a split (see the sketch below)
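
These criteria map onto scikit-learn's pre-pruning parameters; a hedged sketch with illustrative values (min_impurity_decrease plays the role of a minimum-IG threshold):

```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    min_samples_split=10,        # min samples in a parent before splitting
    min_samples_leaf=5,          # min samples allowed on a leaf
    max_depth=4,                 # maximum depth of the tree
    min_impurity_decrease=0.01,  # min impurity decrease required to split
)
```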

52

Post Pruning

Grow the full tree, then remove branches or nodes to improve the model’s ability to generalize

53

Reduced error pruning

Remove branches that do not affect the performance on test/validation data: greedily remove the node whose removal most improves validation-set accuracy.

54

Minimum impurity decrease

remove nodes if the decrease in impurity is lower than a given threshold

55

Handling attributes with high cardinality (many values)

penalize the attribute A by incorporating a term called “Split Information”

56

Split Information

Sensitive to how broadly and uniformly the attribute splits the data:

SplitInformation(X) = −Σₓ p(X = x) log₂ p(X = x), where p(X = x) is the proportion of the data for which attribute X takes value x.

57

Gain Ratio

Used instead of IG to penalize high-cardinality attributes: GainRatio(Y, X) = IG(Y, X) / SplitInformation(X).
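
A minimal sketch reusing the `entropy` and `information_gain` helpers from the information-gain card above; SplitInformation(X) is simply the entropy of the attribute's own value distribution:

```python
def gain_ratio(xs, ys):
    # GainRatio(Y, X) = IG(Y, X) / SplitInformation(X)
    split_information = entropy(xs)  # H(X): penalizes many-valued attributes
    return information_gain(xs, ys) / split_information if split_information else 0.0
```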

58

binary tree properties

  • A complete binary tree of height h has 2^(h+1) − 1 nodes

  • The number of nodes at depth d is 2^d

59

Decision Tree Properties

  • Due to the complexity inherent in decision tree structures, decision trees usually overfit (pruning is always a good idea)

  • Simple models that explain the data are usually preferred over more complicated ones

  • Finding the smallest tree that explains the concept to be learned as well as a larger one is known to be an NP-hard problem

  • Greedy heuristic: pick the best attribute (highest IG) at each stage

60

Common DT methods

ID3 (Iterative Dichotomiser 3)

CART (Classification And Regression Trees)

C4.5 (improvements over ID3)

CHAID (Chi Squared Automatic Interaction Detection)