kNN
algorithm used for classification (of a categorical outcome) or prediction (of a numerical response)
data-driven, not model-driven (no need to fit a model like linear regression)
relies on finding "similar" records in the training data
makes no assumptions about the data
basic idea: categorical outcome
for a given record to be classified, identify nearby records; each record consists of an outcome (class membership) and predictor values
"near" means records with similar predictor values x1, x2, …, xp
classify the record as whatever the predominant class is among the nearby records (the "neighbors")
euclidean distance
the most popular distance measure
two records with predictor values (x1, x2) and (u1, u2) have a euclidean distance
sometimes predictors need to be standardized to equalize scales before computing distances
euclidean distance formula
sqrt((x1 - u1)² + (x2 - u2)² + … + (xp - up)²)
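The distance formula and the standardization step above can be sketched in a few lines of NumPy (the function names here are my own, not from the course):

```python
import numpy as np

def euclidean_distance(x, u):
    """Euclidean distance between two records of p predictor values."""
    x, u = np.asarray(x, dtype=float), np.asarray(u, dtype=float)
    return float(np.sqrt(np.sum((x - u) ** 2)))

def standardize(X):
    """Rescale each predictor column to mean 0, std 1, to equalize scales
    before computing distances."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# classic 3-4-5 example: distance between (0, 0) and (3, 4)
print(euclidean_distance([0, 0], [3, 4]))  # → 5.0
```

Standardizing matters because a predictor measured in large units (e.g. income in dollars) would otherwise dominate the distance.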
how to select k
simplest case: k = 1 (one-nearest neighbor)
classify a new record as belonging to the same class as its nearest neighbor (lowest euclidean distance)
can be extended to k > 1 as follows:
find the nearest k neighbors
use a majority decision rule to classify a record, where the record is classified as a member of the majority class of the k neighbors
k can be as large as the training sample size
low k
low values of k (1,3,…) capture local structure in data (but also noise)
high k
high values of k provide more smoothing, less noise, but may miss local structure
the extreme case of k = n (i.e. the entire training data set)
the same as the "naive rule" (classify all records according to the majority class)
data-driven way to select k
prepare a grid of values for k
usually 1, 2, …, n/10 (up to 10% of training data)
for 2-class classification, skip even values (which can produce ties) if n is not so large; use 1, 3, 5, …, n/10 instead
select the k that performs best on the validation data
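The grid search above can be sketched with scikit-learn; this is a minimal illustration, assuming synthetic data stands in for a real training set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# synthetic 2-class data stands in for a real training set
X, y = make_classification(n_samples=400, n_features=5, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=1)

# standardize predictors using training-set statistics only
scaler = StandardScaler().fit(X_train)
X_train, X_valid = scaler.transform(X_train), scaler.transform(X_valid)

# grid of odd k values up to roughly 10% of the training sample
best_k, best_acc = 1, 0.0
for k in range(1, len(X_train) // 10 + 1, 2):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = model.score(X_valid, y_valid)  # accuracy on validation data
    if acc > best_acc:
        best_k, best_acc = k, acc
```

Note that the scaler is fit on the training set only, so the validation data does not leak into the distance computation.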
cut-off value
classification based on majority decision rule
"majority" is linked closely to the cut-off value
the proportion of the k neighbors that belong to a class is an estimate of the probability that a new record belongs to that class
using kNN for prediction (for a numerical outcome)
instead of "majority vote determines class," use the average or weighted average of the response values of the k nearest neighbors to determine the predicted value in the validation set
often use a weighted average, with weight decreasing with distance
RMSE or another average error metric is usually used instead of the misclassification error rate to choose k
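A minimal sketch of kNN prediction with distance weighting, using scikit-learn's `KNeighborsRegressor` on made-up numeric data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# tiny made-up numeric example: y = 2 * x
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# weighted average of the k nearest responses, weight decreasing with distance
knn = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(X, y)
pred = float(knn.predict([[2.5]])[0])

# RMSE (root mean squared error) is the usual metric for choosing k here;
# on the training points themselves a zero-distance neighbor dominates,
# so the training RMSE is 0
rmse = float(np.sqrt(np.mean((knn.predict(X) - y) ** 2)))
```

In practice the RMSE would be computed on validation data, for each candidate k, just as accuracy was for classification.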
advantages of kNN
simple and intuitive
no assumptions required about the data
can be very powerful with a large training set
kNN shortcomings
required size of the training set increases exponentially with the number of predictors, p
in a large training set, it takes a long time to find distances to all the neighbors and then identify the nearest one(s)
trees and rules
goal: classify or predict an outcome based on a set of predictors (classification and regression trees)
the output is a set of rules that divides observations into subgroups based on predictor values (tree diagram)
data-driven, flexible, and no model needed
computationally cheap to deploy on large data, robust to outliers, and can handle missing values
recursive partitioning
repeatedly split the records into two parts based on predictor values so as to achieve maximum homogeneity within the new parts
pruning the trees
simplify the tree by pruning peripheral branches to avoid overfitting
recursive partitioning steps
pick one of the predictor variables, xi
pick a value of xi, say si, that divides the training data into two (not necessarily equal) portions
measure how "pure" or homogeneous each of the resulting portions is
"pure" = containing records of mostly one class
the algorithm tries different values of xi and si to maximize purity in the initial split
after you get a "maximum purity" split, repeat the process for a second split, and so on
measuring impurity
introduce metrics for determining the homogeneity of the resulting subgroups of observations
there are a number of ways to measure impurity, popular measures being
gini index
entropy measure
impurity measures
gini index and entropy are both based on the proportion of cases in a rectangle that belong to each class
minimum value = 0 when all cases belong to the same class (most pure)
maximum value when all classes are equally represented (= 0.50 in the binary case for gini; log2(m) for entropy, where m is the total number of classes of y)
overall impurity measure
weighted average of the impurity of the individual rectangles, the weights being the proportion of cases in each rectangle
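The two impurity measures and their weighted average can be written directly from the definitions above (function names are my own):

```python
import numpy as np

def gini(p):
    """Gini index for class proportions p: 0 when pure,
    0.50 at a 50/50 split in the binary case."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

def entropy(p):
    """Entropy in bits: 0 when pure, log2(m) when all m classes
    are equally represented."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log2(0) as 0
    return float(-np.sum(p * np.log2(p)))

def overall_impurity(groups, measure=gini):
    """Weighted average impurity over the partitions ("rectangles") of a
    split, weights being the proportion of cases in each partition.
    Each group is a list of integer class labels."""
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * measure(np.bincount(g) / len(g)) for g in groups)
```

A split that separates the classes perfectly, e.g. `overall_impurity([[0, 0, 0], [1, 1, 1]])`, scores 0; the algorithm chooses the split that lowers this number the most.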
impurity and recursive partitioning
at each successive stage, compare these measures across all possible splits in all variables
choose the split that reduces impurity the most
chosen split points become nodes on the tree
tree structure
split points become nodes on the tree (circle with split value in center)
rectangles represent "leaves" (terminal points, no further splits, classification value noted)
numbers on the lines between nodes indicate # of cases
determining leaf node label
each leaf node label is determined by “voting” of the records within it and by the cutoff value
records within each leaf node are from the training data
default cutoff=0.5 means that the leaf node’s label is the majority class
cutoff = 0.75: requires a majority of 75% or more "1" records in the leaf to label it a "1" node
classifying a new observation
a new observation is "dropped" down the tree by comparing its predictor values with the split points
when it has dropped to a terminal leaf, assign it to a class
majority voting uses the training data that belonged to the node during tree construction
example: a new observation reaching the rightmost leaf node will be classified as "owner"
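A small sketch of fitting a tree and dropping a new observation down it, using scikit-learn; the income / lot-size values below are made up to echo the owner vs. non-owner example:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# made-up income and lot-size values; 1 = owner, 0 = non-owner
X = [[60, 18], [85, 20], [65, 21], [50, 16], [45, 17], [90, 22]]
y = [0, 1, 1, 0, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# print the split rules, then "drop" a new observation down the tree:
# it is compared against each split point until it reaches a leaf
print(export_text(tree, feature_names=["income", "lot_size"]))
print(tree.predict([[80, 21]]))  # lands in an "owner" leaf
```

`export_text` shows the tree as the set of if/then rules the notes describe.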
performance evaluation
partition the data into training and validation sets
training set used to grow the tree
validation set used to assess classification performance
classification matrices, lift charts
more than 2 classes (m>2)
same structure, except that the terminal nodes would take one of the m class labels
similar impurity measures
overfitting: stopping tree growth
natural end of process is 100% purity in each leaf
this overfits the data: the tree ends up fitting noise in the data
overfitting leads to low predictive accuracy on new data
past a certain point, the error rate for the validation data starts to increase
CHAID (chi squared automatic interaction detection)
uses the chi-square test for independence (from statistics) to limit tree growth
splitting stops when the purity improvement is not statistically significant (measured by p-value)
more suitable for categorical variables, but can be adapted to continuous predictors by binning
pruning
CART lets the tree grow to full extent, then prunes it back
the idea is to find the point at which the validation error begins to rise
helps recognize a very large tree that is likely to overfit the data; the weakest branches (which hardly reduce the error rate and hence capture just noise) should be removed
trade-off between misclassification error in the validation set and the number of decision nodes in the pruned tree
generate successively smaller trees by pruning leaves
at each pruning stage, multiple trees are possible
minimum error tree
has the lowest error rate on validation data
best pruned tree
smallest tree within one standard error of the minimum error tree
this adds a bonus for simplicity/parsimony, since it has fewer decision nodes than the minimum error tree
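scikit-learn exposes CART-style pruning as cost-complexity pruning, which yields exactly the sequence of successively smaller trees described above; a sketch of finding the minimum error tree on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic data stands in for a real training/validation partition
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=0)

# grow the tree to full extent, then enumerate successively smaller trees
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
         for a in path.ccp_alphas]

# minimum error tree: lowest error (highest accuracy) on validation data
best = max(trees, key=lambda t: t.score(X_valid, y_valid))
```

The "best pruned tree" of the notes would be a further filter on this list: the smallest tree whose validation error is within one standard error of `best`'s.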
regression trees for prediction
used with a continuous outcome variable
procedure similar to a classification tree
many splits attempted; choose the one that minimizes impurity
the prediction is computed as the average of the numerical target variable in the rectangle (in a classification tree it is a majority vote)
impurity measured by the sum of squared deviations from the leaf mean
performance measured by RMSE (root mean squared error)
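A minimal regression-tree sketch on made-up data, showing leaf-mean prediction and the RMSE metric:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# made-up continuous outcome: y is roughly 3x plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0.0, 1.0, size=200)

# each leaf predicts the average of the training responses that fall in it
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# RMSE replaces the misclassification rate as the performance measure
rmse = float(np.sqrt(np.mean((tree.predict(X) - y) ** 2)))
```

In practice the RMSE would be computed on validation data when comparing trees of different sizes; training RMSE only falls as the tree grows.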
regression tree advantages
easy to use and understand, and computationally cheap to deploy on large samples
produce rules that are easy to interpret and implement
variable selection and reduction is automatic
do not require the assumptions of a statistical model
regression tree disadvantages
usually require large amounts of data
may not perform well where there is structure in the data that is not well captured by horizontal or vertical splits
logistic regression
powerful model-based classification tool
extends the idea of linear regression to situations where the outcome variable is categorical
the model relates the predictors to the outcome
we focus on binary classification (y = 0, y = 1), but predictors can be categorical and continuous
widely used, particularly where a structured model is useful
odds of an event formula
p/(1-p)
p=odds/(1+odds)
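The two conversions above are one-liners (function names are my own):

```python
def odds(p):
    """Odds of an event that has probability p."""
    return p / (1 - p)

def prob(o):
    """Invert: recover the probability from the odds."""
    return o / (1 + o)

# an event with probability 0.75 has odds of 3 to 1
print(odds(0.75))  # → 3.0
```

The two functions are inverses, so `prob(odds(p))` returns `p` for any probability strictly between 0 and 1.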
maximum likelihood estimation
estimates of beta are derived through an iterative process
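A sketch of fitting a logistic regression with scikit-learn, which runs that iterative maximum likelihood estimation internally (here with its default lbfgs solver, on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic binary-outcome data stands in for a real training set
X, y = make_classification(n_samples=300, n_features=4, random_state=2)

# coefficients (the betas) are estimated iteratively by maximum likelihood
model = LogisticRegression().fit(X, y)

# estimated P(y=0) and P(y=1) for the first 3 records
probs = model.predict_proba(X[:3])
```

`predict_proba` returns the estimated class probabilities, which a cut-off value then turns into 0/1 classifications.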
variable selection problems
as in linear regression, correlated predictors introduce bias into the method
overly complex models run the danger of overfitting
variable selection solutions
remove extreme redundancies by dropping predictors via automated selection of variable subsets (as in linear regression) or by data reduction methods
P-values of predictors
useful for review to determine whether to include a variable in the model
key in profiling tasks, but less important in predictive classification