kNN
algorithm used for classification (of a categorical outcome) or prediction (of a numerical response)
data-driven, not model-driven (no need to fit a model like linear regression)
relies on finding "similar" records in the training data
makes no assumptions about the data
basic idea: categorical outcome
for a given record to be classified, identify nearby records; each record consists of an outcome (class membership) and predictor values
"near" means records with similar predictor values x1, x2, …, xp
classify the record as whatever the predominant class is among the nearby records (the "neighbors")
euclidean distance
the most popular distance measure
two records with predictor values (x1, x2) and (u1, u2) have a euclidean distance
sometimes predictors need to be standardized to equalize scales before computing distances
euclidean distance formula
sqrt((x1 - u1)² + (x2 - u2)² + … + (xp - up)²)
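The distance formula and the standardization step above can be sketched in a few lines of NumPy (the function names here are my own, not from the course):

```python
import numpy as np

def euclidean_distance(x, u):
    """Euclidean distance between two records of p predictor values."""
    x, u = np.asarray(x, dtype=float), np.asarray(u, dtype=float)
    return float(np.sqrt(np.sum((x - u) ** 2)))

def standardize(X):
    """Rescale each predictor column to mean 0, std 1, to equalize scales
    before computing distances."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# classic 3-4-5 example: distance between (0, 0) and (3, 4)
print(euclidean_distance([0, 0], [3, 4]))  # → 5.0
```

Standardizing matters because a predictor measured in large units (e.g. income in dollars) would otherwise dominate the distance.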
how to select k
simplest case: k = 1 (one-nearest neighbor)
classify a new record as belonging to the same class as its nearest neighbor (lowest euclidean distance)
can be extended to k > 1 as follows:
find the nearest k neighbors
use a majority decision rule to classify a record, where the record is classified as a member of the majority class of the k neighbors
k can be as large as the training sample size
low k
low values of k (1,3,…) capture local structure in data (but also noise)
high k
high values of k provide more smoothing, less noise, but may miss local structure
the extreme case of k = n (i.e. the entire training data set)
the same as the "naive rule" (classify all records according to the majority class)
data-driven way to select k
prepare a grid of values for k
usually 1, 2, …, n/10 (up to 10% of training data)
for 2-class classification, skip even values (which can produce ties) if n is not so large; use 1, 3, 5, …, n/10 instead
select the k that performs best on the validation data
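The grid search above can be sketched with scikit-learn; this is a minimal illustration, assuming synthetic data stands in for a real training set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# synthetic 2-class data stands in for a real training set
X, y = make_classification(n_samples=400, n_features=5, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=1)

# standardize predictors using training-set statistics only
scaler = StandardScaler().fit(X_train)
X_train, X_valid = scaler.transform(X_train), scaler.transform(X_valid)

# grid of odd k values up to roughly 10% of the training sample
best_k, best_acc = 1, 0.0
for k in range(1, len(X_train) // 10 + 1, 2):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = model.score(X_valid, y_valid)  # accuracy on validation data
    if acc > best_acc:
        best_k, best_acc = k, acc
```

Note that the scaler is fit on the training set only, so the validation data does not leak into the distance computation.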
cut-off value
classification based on majority decision rule
"majority" is linked closely to the cut-off value
the proportion of the k neighbors that belong to a class is an estimate of the probability that a new record belongs to that class
using kNN for prediction (for a numerical outcome)
instead of "majority vote determines class," use the average or weighted average of the response values of the k nearest neighbors to determine the predicted value in the validation set
often use a weighted average, with weight decreasing with distance
RMSE or another average error metric is usually used instead of the misclassification error rate to choose k
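A minimal sketch of kNN prediction with distance weighting, using scikit-learn's `KNeighborsRegressor` on made-up numeric data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# tiny made-up numeric example: y = 2 * x
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# weighted average of the k nearest responses, weight decreasing with distance
knn = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(X, y)
pred = float(knn.predict([[2.5]])[0])

# RMSE (root mean squared error) is the usual metric for choosing k here;
# on the training points themselves a zero-distance neighbor dominates,
# so the training RMSE is 0
rmse = float(np.sqrt(np.mean((knn.predict(X) - y) ** 2)))
```

In practice the RMSE would be computed on validation data, for each candidate k, just as accuracy was for classification.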
advantages of kNN
simple and intuitive
no assumptions required about the data
can be very powerful with a large training set
kNN shortcomings
required size of the training set increases exponentially with the number of predictors, p
in a large training set, it takes a long time to find distances to all the neighbors and then identify the nearest one(s)
trees and rules
goal: classify or predict an outcome based on a set of predictors (classification and regression trees)
the output is a set of rules that divides observations into subgroups based on predictor values (tree diagram)
data-driven, flexible, and no model needed
computationally cheap to deploy on large data, robust to outliers, and can handle missing values
recursive partitioning
repeatedly split the records into two parts based on predictor values so as to achieve maximum homogeneity within the new parts
pruning the trees
simplify the tree by pruning peripheral branches to avoid overfitting
recursive partitioning steps
pick one of the predictor variables, xi
pick a value of xi, say si, that divides the training data into two (not necessarily equal) portions
measure how "pure" or homogeneous each of the resulting portions is
"pure" = containing records of mostly one class
the algorithm tries different values of xi and si to maximize purity in the initial split
after you get a "maximum purity" split, repeat the process for a second split, and so on
measuring impurity
introduce metrics for determining the homogeneity of the resulting subgroups of observations
there are a number of ways to measure impurity, popular measures being
gini index
entropy measure
impurity measures
gini index and entropy are both based on the proportion of cases in a rectangle that belong to each class
minimum value = 0 when all cases belong to the same class (most pure)
maximum value when all classes are equally represented (= 0.50 in the binary case for gini; log2(m) for entropy, where m is the total number of classes of y)
overall impurity measure
weighted average of the impurity of the individual rectangles, the weights being the proportion of cases in each rectangle
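The two impurity measures and their weighted average can be written directly from the definitions above (function names are my own):

```python
import numpy as np

def gini(p):
    """Gini index for class proportions p: 0 when pure,
    0.50 at a 50/50 split in the binary case."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

def entropy(p):
    """Entropy in bits: 0 when pure, log2(m) when all m classes
    are equally represented."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log2(0) as 0
    return float(-np.sum(p * np.log2(p)))

def overall_impurity(groups, measure=gini):
    """Weighted average impurity over the partitions ("rectangles") of a
    split, weights being the proportion of cases in each partition.
    Each group is a list of integer class labels."""
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * measure(np.bincount(g) / len(g)) for g in groups)
```

A split that separates the classes perfectly, e.g. `overall_impurity([[0, 0, 0], [1, 1, 1]])`, scores 0; the algorithm chooses the split that lowers this number the most.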
impurity and recursive partitioning
at each successive stage, compare these measures across all possible splits in all variables
choose the split that reduces impurity the most
chosen split points become nodes on the tree
tree structure
split points become nodes on the tree (circle with split value in center)
rectangles represent "leaves" (terminal points, no further splits, classification value noted)
numbers on the lines between nodes indicate # of cases
determining leaf node label
each leaf node label is determined by “voting” of the records within it and by the cutoff value
records within each leaf node are from the training data
default cutoff=0.5 means that the leaf node’s label is the majority class
cutoff = 0.75: requires a majority of 75% or more "1" records in the leaf to label it a "1" node
classifying a new observation
a new observation is "dropped" down the tree by comparing its predictor values with the split points
when it has dropped to a terminal leaf, assign it to a class
majority voting uses the training data that belonged to the node during tree construction
example: a new observation reaching the rightmost leaf node will be classified as "owner"
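A small sketch of fitting a tree and dropping a new observation down it, using scikit-learn; the income / lot-size values below are made up to echo the owner vs. non-owner example:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# made-up income and lot-size values; 1 = owner, 0 = non-owner
X = [[60, 18], [85, 20], [65, 21], [50, 16], [45, 17], [90, 22]]
y = [0, 1, 1, 0, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# print the split rules, then "drop" a new observation down the tree:
# it is compared against each split point until it reaches a leaf
print(export_text(tree, feature_names=["income", "lot_size"]))
print(tree.predict([[80, 21]]))  # lands in an "owner" leaf
```

`export_text` shows the tree as the set of if/then rules the notes describe.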
performance evaluation
partition the data into training and validation sets
training set used to grow the tree
validation set used to assess classification performance
classification matrices, lift charts
more than 2 classes (m>2)
same structure, except that the terminal nodes would take one of the m class labels
similar impurity measures
overfitting: stopping tree growth
natural end of process is 100% purity in each leaf
this overfits the data: the tree ends up fitting noise in the data
overfitting leads to low predictive accuracy on new data
past a certain point, the error rate for the validation data starts to increase
CHAID (chi squared automatic interaction detection)
uses the chi-square test for independence (from statistics) to limit tree growth
splitting stops when the purity improvement is not statistically significant (measured by p-value)
more suitable for categorical variables, but can be adapted to continuous predictors by binning
pruning
CART lets the tree grow to full extent, then prunes it back
the idea is to find the point at which the validation error begins to rise
helps recognize a very large tree that is likely to overfit the data; the weakest branches (which hardly reduce the error rate and hence capture just noise) should be removed
trade-off between misclassification error in the validation set and the number of decision nodes in the pruned tree
generate successively smaller trees by pruning leaves
at each pruning stage, multiple trees are possible
minimum error tree
has the lowest error rate on validation data
best pruned tree
smallest tree within one standard error of the minimum error tree
this adds a bonus for simplicity/parsimony, since it has fewer decision nodes than the minimum error tree
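scikit-learn exposes CART-style pruning as cost-complexity pruning, which yields exactly the sequence of successively smaller trees described above; a sketch of finding the minimum error tree on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic data stands in for a real training/validation partition
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=0)

# grow the tree to full extent, then enumerate successively smaller trees
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
         for a in path.ccp_alphas]

# minimum error tree: lowest error (highest accuracy) on validation data
best = max(trees, key=lambda t: t.score(X_valid, y_valid))
```

The "best pruned tree" of the notes would be a further filter on this list: the smallest tree whose validation error is within one standard error of `best`'s.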
regression trees for prediction
used with a continuous outcome variable
procedure similar to a classification tree
many splits attempted; choose the one that minimizes impurity
the prediction is computed as the average of the numerical target variable in the rectangle (in a classification tree it is a majority vote)
impurity measured by the sum of squared deviations from the leaf mean
performance measured by RMSE (root mean squared error)
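A minimal regression-tree sketch on made-up data, showing leaf-mean prediction and the RMSE metric:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# made-up continuous outcome: y is roughly 3x plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0.0, 1.0, size=200)

# each leaf predicts the average of the training responses that fall in it
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# RMSE replaces the misclassification rate as the performance measure
rmse = float(np.sqrt(np.mean((tree.predict(X) - y) ** 2)))
```

In practice the RMSE would be computed on validation data when comparing trees of different sizes; training RMSE only falls as the tree grows.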
regression tree advantages
easy to use and understand, and computationally cheap to deploy on large samples
produce rules that are easy to interpret and implement
variable selection and reduction is automatic
do not require the assumptions of a statistical model
regression tree disadvantages
usually require large amounts of data
may not perform well where there is structure in the data that is not well captured by horizontal or vertical splits
logistic regression
powerful model-based classification tool
extends the idea of linear regression to situations where the outcome variable is categorical
the model relates the predictors to the outcome
we focus on binary classification (y = 0, y = 1), but predictors can be categorical and continuous
widely used, particularly where a structured model is useful
odds of an event formula
p/(1-p)
p=odds/(1+odds)
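The two conversions above are one-liners (function names are my own):

```python
def odds(p):
    """Odds of an event that has probability p."""
    return p / (1 - p)

def prob(o):
    """Invert: recover the probability from the odds."""
    return o / (1 + o)

# an event with probability 0.75 has odds of 3 to 1
print(odds(0.75))  # → 3.0
```

The two functions are inverses, so `prob(odds(p))` returns `p` for any probability strictly between 0 and 1.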
maximum likelihood estimation
estimates of beta are derived through an iterative process
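A sketch of fitting a logistic regression with scikit-learn, which runs that iterative maximum likelihood estimation internally (here with its default lbfgs solver, on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic binary-outcome data stands in for a real training set
X, y = make_classification(n_samples=300, n_features=4, random_state=2)

# coefficients (the betas) are estimated iteratively by maximum likelihood
model = LogisticRegression().fit(X, y)

# estimated P(y=0) and P(y=1) for the first 3 records
probs = model.predict_proba(X[:3])
```

`predict_proba` returns the estimated class probabilities, which a cut-off value then turns into 0/1 classifications.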
variable selection problems
as in linear regression, correlated predictors introduce bias into the method
overly complex models run the danger of overfitting
variable selection solutions
remove extreme redundancies by dropping predictors via automated selection of variable subsets (as in linear regression) or by data reduction methods
P-values of predictors
useful for review to determine whether to include a variable in the model
key in profiling tasks, but less important in predictive classification