True or false : pre-processing is a standardized procedure that is independent of the model that will be used afterwards
False
True or False : One-hot encoding a categorical feature with originally 3 separate categories results in 3 new columns
True
DT root (node)
the node containing all our data (before any split); it has no parent node
DT branches
connections between nodes (collections of data)
DT internal node
has both a parent node and child nodes
Tree-based segmentation strategy
the sample is iteratively split into smaller samples
each time, all possible splits are evaluated
the best possible split is selected
the splitting decision : which two effects of a split are taken into account
the reduction of impurity
the number of observations in the resulting subsamples
the splitting decision : the reduction of impurity
do we improve the homogeneity of the resulting subsamples compared to the homogeneity of the initial sample that is split
how to evaluate impurity ? entropy (used in C4.5), Gini (used in CART)
how to compare impurity of child nodes with impurity of parent node ?
information gain
weighted decrease in impurity
= impurity(parent) − Σ P(child) × impurity(child)
P(child) = proportion of the parent node's observations that fall into that child node
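The gain formula above can be sketched in a few lines of Python, using Gini as the impurity measure (the function names are illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Weighted decrease in impurity:
    impurity(parent) - sum(P_child * impurity(child))."""
    n = len(parent)
    return gini(parent) - sum(len(ch) / n * gini(ch) for ch in children)

# A perfectly pure split of a 50/50 parent gives the maximum gain: 0.5
parent = ["good"] * 5 + ["bad"] * 5
gain = information_gain(parent, [["good"] * 5, ["bad"] * 5])
print(round(gain, 2))  # 0.5
```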
what is the goal of a predictive model
generalization
generalization
refers to a model’s ability to accurately predict unseen data
goal = finding patterns that generalize
(memorizing the training set data does not make a good model; memorization ≠ learning a representative pattern)
how do we test if a model is a good model
with the test set→ measures generalization performance
Q : will a neural network always lead to better generalization performance than a linear regression ?
we don’t know; a neural network can fit almost anything, but sometimes the linear model generalizes better → we need to test
model capacity / complexity
ability to remember information about its training data.
usually not a formal term but corresponds roughly to the number of trainable parameters, splits …
→ more complex isn’t always better
overfitting
too high complexity leads to memorization, i.e. the model learns the correct answer for every training example, but the learned pattern doesn’t generalize to novel examples / instances → no generalization
too complex models may fit the noise in the dataset
general concept of too high capacity / complexity
but how to regulate complexity to optimize generalization ?
use holdout data
= creating a “lab test” of generalization performance
how accurate do we expect a model to be
it depends; this is an engineering discipline, so it depends on the environment and the context → a model will rarely perform better on the test set than on the training set
the distribution of the training set might not be the same as that of new data (e.g. predictions over time)
we assume the distribution stays roughly the same, but if it changes, performance can degrade
examples of modeling decisions
choosing the best performing model (eg best tree)
choose optimal k in KNN
select features
setting hyperparameters = parameter whose value is used to control the learning process
hyperparameter
= parameter whose value is used to control the learning process
modeling decisions are made on the … data
validation set
training data → overfitting
test data → no longer unbiased estimate of generalization error
golden rule
test set must stay a representative sample for out-of-sample evaluation
= lock away the test set until after all modeling decisions have been taken
training data
to train the model → largest subset of total data
validation data
to make modeling decisions
test data
to obtain an unbiased estimate of out-of-sample performance
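The three-way split described in the cards above can be sketched with the standard library; the 15%/15% fractions below are illustrative, not from the source:

```python
import random

def three_way_split(data, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle and split into training (largest), validation, and test sets."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = [data[i] for i in idx[:n_test]]
    val = [data[i] for i in idx[n_test:n_test + n_val]]
    train = [data[i] for i in idx[n_test + n_val:]]
    return train, val, test

train, val, test = three_way_split(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```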
splitting decision : possible splits that are evaluated
continuous variables : all possible values (<)
categorical variables :
→ nominal : all possible combinations of values (=)
→ ordinal : all possible values (<)
missing values : can be added to the child node to maximize information gain (separate group)
why do we have to stop growing the tree before we reach minimum impurity ?
impurity and information gain are based on the training set only, so if you add too much complexity, you start overfitting and the tree no longer fits new data
stopping decision : possible approaches
information gain above a minimum threshold
maximum size of the tree :
number of ‘levels’
number of leaves = number of rules = number of segments
minimum number of observations per leaf
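The stopping criteria above map directly onto tree hyperparameters; a sketch using scikit-learn's parameter names (assuming scikit-learn is available; the specific values and dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each stopping criterion corresponds to one hyperparameter:
tree = DecisionTreeClassifier(
    max_depth=3,                 # maximum number of 'levels'
    max_leaf_nodes=8,            # maximum number of leaves (= rules = segments)
    min_samples_leaf=5,          # minimum number of observations per leaf
    min_impurity_decrease=0.01,  # information gain above a minimum threshold
    random_state=0,
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```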
the tree should generalize well, i.e. it should accurately predict new, unseen observations BUT
too large a tree results in a complex decision boundary = overfitted tree
to decide when to stop growing the tree
we want to maximize generalization performance
using the validation set : random subsample of the training data (typically 30%)
which data set do we use to decide when to stop in our decision tree
the validation set
evaluation of the stopping decision
classification accuracy : % of correctly predicted observations (also called PCC : percentage correctly classified)
misclassification error : % of incorrectly predicted observations
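Both evaluation metrics can be computed in a few lines of Python (the labels below are made up for the example):

```python
def accuracy(y_true, y_pred):
    """Percentage correctly classified (PCC)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = ["good", "bad", "good", "good", "bad"]
y_pred = ["good", "good", "good", "bad", "bad"]
acc = accuracy(y_true, y_pred)
print(acc, 1 - acc)  # accuracy 0.6, misclassification error 0.4
```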
potential problem with early-stopping
non convex curve
→ solution : pruning
pruning :
grow full tree, with accuracy on the training set = 100%
cut branches to optimize performance on the validation set
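A sketch of "grow the full tree, then prune to optimize validation performance": scikit-learn exposes pruning via cost-complexity pruning (`ccp_alpha`) rather than direct branch cutting, so here the pruning strength is chosen on a held-out validation set (dataset and split are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Grow the full tree, then get the candidate pruning strengths
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Pick the pruned tree that maximizes validation accuracy
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in alphas[:-1]),  # the last alpha prunes down to a single leaf
    key=lambda t: t.score(X_val, y_val),
)
print(best.get_n_leaves(), "<=", full.get_n_leaves())
```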
validation set for deciding on the size of the decision tree :
= tuning the decision tree size
remember :
often used in data science : tuning hyperparameters
e.g. number of neurons in an artificial neural network, number of trees in an ensemble
assignment decision
simple : majority voting
better : probability to be good / bad
majority voting
class of most of the observations in a leaf
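A minimal sketch of the assignment decision in a single leaf, combining majority voting with class probabilities (labels are illustrative):

```python
from collections import Counter

def leaf_prediction(leaf_labels):
    """Majority vote plus class probabilities for one leaf."""
    counts = Counter(leaf_labels)
    majority = counts.most_common(1)[0][0]          # simple: majority voting
    probs = {cls: n / len(leaf_labels)               # better: class probabilities
             for cls, n in counts.items()}
    return majority, probs

label, probs = leaf_prediction(["good", "good", "bad", "good", "bad"])
print(label, probs)  # good {'good': 0.6, 'bad': 0.4}
```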
interaction effects
when the relation between predictor A and the target variable depends on the value of another predictor B
→ then there is an interaction between predictors A and B
meaning : for different subsamples, different variables explain the outcome in terms of the target variable
CHAID : Chi-squared Automatic Interaction Detection
advantages of decision trees
interpretable : rules
non-parametric
robust with respect to input data
missing values,
outliers,
variable selection
categorical and continuous variables
disadvantages of decision trees
sensitive to changes in the training data : weak classifier
a different split into training and validation sets possibly yields a different tree; derived relations are unstable
sensitive to imbalanced class distributions
predictive power
uses of decision trees
as a predictive model : not recommended
data exploration : no preprocessing required
variable selection in a preprocessing step using information gain
segmentation prior to development of the final predictive models
segments are found based on interaction, so meaningful segmentation
coarse classification / binning