ISDS EXAM 2 PREP

Last updated 6:27 AM on 3/18/26

39 Terms

1
New cards

kNN

  • algorithm used for classification (of a categorical outcome) or prediction (of a numerical response)

  • data-driven, not model-driven (no need to fit a model like linear regression)

    • relies on finding “similar” records in the training data

  • makes no assumptions about the data

2
New cards

basic idea: categorical outcome

  • for a given record to be classified, identify nearby records; each record consists of an outcome (class membership) and predictor values

  • “near” means records with similar predictor values x1, x2, …, xp

  • classify the record as whatever the predominant class is among the nearby records (the “neighbors”)

3
New cards

euclidean distance

  • the most popular distance measure

    • two records with predictor values (x1, x2) and (u1, u2) have a Euclidean distance

  • sometimes predictors need to be standardized to equalize scales before computing distances

4
New cards

euclidean distance formula

√((x1 − u1)² + (x2 − u2)² + … + (xp − up)²)
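
A minimal Python sketch of this formula (the two example records are made up for illustration):

```python
import math

def euclidean(x, u):
    # Euclidean distance between two records with p predictor values each
    return math.sqrt(sum((xi - ui) ** 2 for xi, ui in zip(x, u)))

# Records with predictor values (3, 4) and (0, 0): sqrt(9 + 16) = 5
print(euclidean((3.0, 4.0), (0.0, 0.0)))  # 5.0
```

To standardize first, replace each value by its z-score ((value − mean) / standard deviation) before calling `euclidean`.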

5
New cards

how to select k

  1. simplest case: k=1 (one-nearest neighbor)

    1. classify a new record as belonging to the same class as its nearest neighbor (lowest Euclidean distance)

  2. can be extended to k>1 as follows:

    1. find the k nearest neighbors

    2. use a majority decision rule to classify a record, where the record is classified as a member of the majority class of the k neighbors

  3. k can be as large as the training sample size
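
The majority decision rule above can be sketched in plain Python (the small training set and class labels are made up for illustration):

```python
import math
from collections import Counter

def knn_classify(train, new_x, k=3):
    # train: list of (predictor-tuple, class-label) records
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    # Find the k nearest neighbors by Euclidean distance
    neighbors = sorted(train, key=lambda rec: dist(rec[0], new_x))[:k]
    # Majority decision rule: the most common class among the k neighbors
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((1, 1), "owner"), ((1, 2), "owner"), ((2, 1), "owner"),
         ((8, 8), "nonowner"), ((9, 8), "nonowner")]
print(knn_classify(train, (1.5, 1.5), k=3))  # owner
```

With k=1 this reduces to the one-nearest-neighbor rule.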

6
New cards

low k

low values of k (1, 3, …) capture local structure in the data (but also noise)

7
New cards

high k

  • high values of k provide more smoothing and less noise but may miss local structure

  • the extreme case of k=N (i.e., the entire training data set)

    • the same as the “naive rule” (classify all records according to the majority class)

8
New cards

data-driven way to select k

  1. prepare a grid of values for k

  2. usually 1, 2, …, n/10 (up to 10% of training data)

  3. for 2-class classification, skip even values if n is not so large; use 1, 3, 5, …, n/10 instead

  4. select the k that performs best on the validation data
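
The steps above can be sketched as follows (tiny made-up training and validation sets, and a short grid of odd k values):

```python
import math
from collections import Counter

def knn_classify(train, x, k):
    d = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nbrs = sorted(train, key=lambda rec: d(rec[0], x))[:k]
    return Counter(c for _, c in nbrs).most_common(1)[0][0]

train = [((1, 1), 0), ((2, 1), 0), ((1, 2), 0),
         ((8, 8), 1), ((9, 8), 1), ((8, 9), 1)]
valid = [((1.5, 1.5), 0), ((8.5, 8.5), 1), ((2, 2), 0)]

def accuracy(k):
    # Fraction of validation records classified correctly with this k
    return sum(knn_classify(train, x, k) == y for x, y in valid) / len(valid)

# Grid of odd k values; keep the k with the best validation accuracy
best_k = max([1, 3, 5], key=accuracy)
print(best_k)  # 1
```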

9
New cards

cut-off value

  • classification based on the majority decision rule

    • “majority” is closely linked to the cut-off value

  • the proportion of the k neighbors that belong to a class is an estimate of the probability that a new record belongs to that class

10
New cards

using k-NN for prediction (numerical outcome)

  • instead of “majority vote determines class,” use the average or weighted average of response values from the k nearest neighbors to determine the predicted value in the validation set

    • often use a weighted average, with weight decreasing with distance

    • RMSE or an average error metric is usually used instead of the misclassification error rate to choose k
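
A sketch of the weighted-average version with inverse-distance weights (the training data are made up):

```python
import math

def knn_predict(train, x, k=3):
    # train: list of (predictor-tuple, numeric response) records
    d = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nbrs = sorted(train, key=lambda rec: d(rec[0], x))[:k]
    # Inverse-distance weights: closer neighbors count more
    weights = [1.0 / (d(rec, x) + 1e-9) for rec, _ in nbrs]
    return sum(w * y for w, (_, y) in zip(weights, nbrs)) / sum(weights)

train = [((0.0,), 10.0), ((1.0,), 20.0), ((10.0,), 100.0)]
# Midway between the first two records, equal weights -> average of 10 and 20
print(knn_predict(train, (0.5,), k=2))
```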

11
New cards

advantages of k-NN

  1. simple and intuitive

  2. no assumptions required about the data

  3. can be very powerful with a large training set

12
New cards

k-NN shortcomings

  1. required size of the training set increases exponentially with the number of predictors, p

  2. in a large training set, it takes a long time to find distances to all the neighbors and then identify the nearest one(s)

13
New cards

trees and rules

  • goal: classify or predict an outcome based on a set of predictors (classification and regression trees)

    • the output is a set of rules that divides observations into subgroups based on predictor values (tree diagram)

    • data-driven, flexible, and no model needed

    • computationally cheap to deploy on large data, robust to outliers, and can handle missing values

14
New cards

recursive partitioning

repeatedly split the records into two parts based on predictor values so as to achieve maximum homogeneity within the new parts

15
New cards

pruning the trees

simplify the tree by pruning peripheral branches to avoid overfitting

16
New cards

recursive partitioning steps

  1. pick one of the predictor variables, xi

  2. pick a value of xi, say si, that divides the training data into two (not necessarily equal) portions

  3. measure how “pure” or homogeneous each of the resulting portions is

    1. “pure” = containing records of mostly one class

  4. the algorithm tries different values of xi and si to maximize purity in the initial split

  5. after you get a “maximum purity” split, repeat the process for a second split, and so on
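
Steps 2–4 for a single predictor can be sketched as a scan over candidate split values, scoring each split by the weighted gini impurity of the two portions (toy data, invented for illustration):

```python
def gini(labels):
    # Gini impurity of a set of 0/1 class labels; 0 means pure
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1 - p1) ** 2

def best_split(xs, ys):
    # Try each candidate split value s; keep the split with the lowest
    # overall (weighted) impurity across the two resulting portions
    best = None
    for s in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= s]
        right = [y for x, y in zip(xs, ys) if x > s]
        n = len(ys)
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or score < best[0]:
            best = (score, s)
    return best

xs = [1, 2, 3, 8, 9, 10]
ys = [0, 0, 0, 1, 1, 1]
print(best_split(xs, ys))  # (0.0, 3): splitting at x <= 3 yields two pure portions
```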

17
New cards

measuring impurity

  • introduce metrics for determining the homogeneity of the resulting subgroups of observations

  • there are a number of ways to measure impurity, popular measures being

    • gini index

    • entropy measure

18
New cards

impurity measures

  • gini index and entropy are both based on the proportion of cases in a rectangle that belong to each class

    • minimum value=0 when all cases belong to the same class (most pure)

    • maximum value when all classes are equally represented (= 0.50 in the binary case for gini; log2(m) for entropy, where m is the total number of classes of y)
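
Both measures can be computed directly from the class proportions (a minimal sketch):

```python
import math

def gini(proportions):
    # Gini index: 1 minus the sum of squared class proportions
    return 1.0 - sum(p ** 2 for p in proportions)

def entropy(proportions):
    # Entropy: -sum(p * log2(p)), skipping empty classes
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(gini([1.0, 0.0]))     # 0.0  (pure: all cases in one class)
print(gini([0.5, 0.5]))     # 0.5  (maximum for the binary case)
print(entropy([0.5, 0.5]))  # 1.0  (= log2(2), the maximum for m = 2)
```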

19
New cards

overall impurity measure

  • weighted average of impurity from individual rectangles, the weights being the proportion of cases in each rectangle

20
New cards

impurity and recursive partitioning

  • at each successive stage, compare this measure across all possible splits in all variables

  • choose the split that reduces impurity the most

  • chosen split points become nodes on the tree

21
New cards

tree structure

  • split points become nodes on the tree (circle with split value in center)

  • rectangles represent “leaves” (terminal points, no further splits, classification value noted)

  • numbers on the lines between nodes indicate the number of cases

22
New cards

determining leaf node label

  • each leaf node label is determined by “voting” of the records within it and by the cutoff value

  • records within each leaf node are from the training data

  • default cutoff=0.5 means that the leaf node’s label is the majority class

  • cutoff=0.75: requires a majority of 75% or more “1” records in the leaf to label it a “1” node
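
A sketch of the voting rule with a cutoff (the leaf's training labels are made up):

```python
def leaf_label(leaf_records, cutoff=0.5):
    # leaf_records: 0/1 class labels of the training records in this leaf
    prop_ones = sum(leaf_records) / len(leaf_records)
    return 1 if prop_ones >= cutoff else 0

leaf = [1, 1, 1, 0, 0]         # 60% "1" records in the leaf
print(leaf_label(leaf))        # 1: majority class with the default cutoff 0.5
print(leaf_label(leaf, 0.75))  # 0: 60% falls short of the 75% required
```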

23
New cards

classifying a new observation

  • a new observation is “dropped” down the tree by comparing its predictor values with the split points

  • when it has dropped to a terminal leaf, assign it to a class

    • the majority voting method uses the training data that belonged to the node during tree construction

    • example: a new observation reaching the rightmost leaf node will be classified as “owner”

24
New cards

performance evaluation

  • partition the data into training and validation sets

    • training set used to grow the tree

    • validation set used to assess classification performance

    • classification matrices, lift charts

  • more than 2 classes (m>2)

    • same structure except that the terminal nodes would take one of the m class labels

    • similar impurity measures

25
New cards

overfitting: stopping tree growth

  • natural end of process is 100% purity in each leaf

  • this overfits the data: the tree ends up fitting noise in the data

  • overfitting leads to low predictive accuracy on new data

  • past a certain point, the error rate for the validation data starts to increase

26
New cards

CHAID (chi squared automatic interaction detection)

  • uses the chi-square test for independence (from statistics) to limit tree growth

    • splitting stops when the purity improvement is not statistically significant (measured by p-value)

    • more suitable for categorical variables but can be adapted to continuous predictors by binning

27
New cards

pruning

  • CART lets the tree grow to full extent, then prunes it back

  • the idea is to find the point at which the validation error begins to rise

    • helps recognize that a very large tree is likely to overfit the data; the weakest branches (which hardly reduce the error rate and hence capture just noise) should be removed

    • trade-off between misclassification error on the validation set and the number of decision nodes in the pruned tree

  • generate successively smaller trees by pruning leaves

  • at each pruning stage, multiple trees are possible

28
New cards

minimum error tree

has the lowest error rate on the validation data

29
New cards

best pruned tree

  • smallest tree within one standard error of the minimum error tree

  • this adds a bonus for simplicity/parsimony, since it has fewer decision nodes than the minimum error tree
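
The two rules can be sketched over a made-up table of candidate pruned trees (the node counts, validation error rates, and standard errors are invented for illustration):

```python
# Candidate pruned trees: (decision nodes, validation error, standard error)
trees = [(15, 0.100, 0.012), (9, 0.095, 0.011),
         (5, 0.104, 0.012), (3, 0.150, 0.014)]

# Minimum error tree: lowest error rate on the validation data
min_err = min(trees, key=lambda t: t[1])
threshold = min_err[1] + min_err[2]  # one standard error above the minimum

# Best pruned tree: smallest tree within one SE of the minimum error tree
best_pruned = min((t for t in trees if t[1] <= threshold), key=lambda t: t[0])
print(min_err[0], best_pruned[0])  # 9 5
```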

30
New cards

regression trees for prediction

  • used with a continuous outcome variable

  • procedure similar to a classification tree

    • many splits attempted; choose the one that minimizes impurity

  • the prediction is computed as the average of the numerical target variable in the rectangle (in a classification tree it is a majority vote)

  • impurity measured by the sum of squared deviations from the leaf mean

  • performance measured by RMSE (root mean squared error)
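
A sketch of these leaf-level computations (toy response values):

```python
import math

def leaf_prediction(y_values):
    # Regression tree prediction: average of the numerical target in the leaf
    return sum(y_values) / len(y_values)

def leaf_impurity(y_values):
    # Impurity: sum of squared deviations from the leaf mean
    mean = leaf_prediction(y_values)
    return sum((y - mean) ** 2 for y in y_values)

def rmse(actual, predicted):
    # Root mean squared error over validation records
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

leaf = [10.0, 12.0, 14.0]
print(leaf_prediction(leaf))  # 12.0
print(leaf_impurity(leaf))    # 8.0
```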

31
New cards

regression tree advantages

  • easy to use and understand, and computationally cheap to deploy on large samples

  • produce rules that are easy to interpret and implement

  • variable selection and reduction is automatic

  • do not require the assumptions of a statistical model

32
New cards

regression tree disadvantages

  • usually require large amounts of data

  • may not perform well where there is structure in the data that is not well captured by horizontal or vertical splits

33
New cards

logistic regression

  • powerful model-based classification tool

  • extends the idea of linear regression to situations where the outcome variable is categorical

    • the model relates predictors to the outcome

  • we focus on binary classification (y=0, y=1), but predictors can be categorical or continuous

  • widely used, particularly where a structured model is useful
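
The model form can be sketched as the logistic function applied to a linear combination of predictors (the coefficients b0 and b1 here are hypothetical, not fitted):

```python
import math

def logistic_prob(x, b0, b1):
    # Logistic regression with one predictor: the log-odds are linear in x,
    # so p = 1 / (1 + exp(-(b0 + b1 * x))), always strictly between 0 and 1
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Hypothetical coefficients b0 = 0, b1 = 1, chosen only for illustration
print(logistic_prob(0.0, 0.0, 1.0))  # 0.5: at log-odds 0 the classes are even
```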

34
New cards

odds of an event formula

  • odds = p/(1 − p)

  • p = odds/(1 + odds)
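
The two formulas are inverses of each other, which a quick check confirms:

```python
def odds(p):
    # Odds of an event with probability p
    return p / (1 - p)

def prob(o):
    # Recover the probability from the odds
    return o / (1 + o)

print(odds(0.75))        # 3.0: a 75% event has 3-to-1 odds
print(prob(odds(0.75)))  # 0.75: converting back recovers p
```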

35
New cards

maximum likelihood estimation

  • estimates of the beta coefficients are derived through an iterative process

36
New cards

variable selection problems

  • as in linear regression, correlated predictors introduce bias into the model

    • overly complex models have the danger of overfitting

37
New cards

variable selection solution

remove extreme redundancies by dropping predictors via automated selection of variable subsets (as in linear regression) or by data reduction methods

38
New cards

P-values of predictors

  • useful for review to determine whether to include a variable in the model

  • key in profiling tasks but less important in predictive classification

39
New cards
