Which method directly gives odds ratios?
Logistic regression gives odds ratios directly by exponentiating its coefficients (the odds ratio for a predictor is exp(beta)). KNN and CART do not naturally provide odds ratios.
What is KNN?
a data-driven, nonparametric method used for classification and prediction. It predicts a new record by finding similar records in the training data. For classification, it uses a vote among nearby records; for numerical prediction, it averages the outcomes of nearby records.
Why is KNN called data-driven instead of model-driven?
does not build an explicit equation like linear or logistic regression; the training data itself acts as the model. When a new record appears, KNN compares that record to the stored training records and uses the nearest ones to make the prediction.
What is the main idea behind KNN classification?
records with similar predictor values are likely to have similar class membership. A new record is assigned to the class that appears most often among its K nearest neighbors.
What is the "locality assumption" in KNN?
The locality assumption is that nearby points in predictor space tend to have similar outcomes. In other words, if two records are close based on their predictor values, they are expected to belong to similar classes or have similar numerical values.
What are the two key questions in KNN?
how do we measure "nearby," and how do we choose the best value of K? "Nearby" is usually measured with distance, and K is chosen by comparing performance on validation data.
Why is standardization important before KNN?
because KNN uses distance. If one variable is measured on a much larger scale, it can dominate the distance calculation and make other predictors matter less. Standardizing puts variables on a comparable scale.
How should new records be standardized in KNN?
using the mean and standard deviation from the training data. The new records should not be included when calculating those values because that would leak information and change the measurement scale.
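A minimal sketch of this in Python; the training values are made up for illustration:

```python
def standardize(values, mean, sd):
    """Scale values using a fixed mean and standard deviation."""
    return [(v - mean) / sd for v in values]

# Training-set statistics for one predictor (illustrative numbers).
train = [10.0, 12.0, 14.0, 16.0, 18.0]
mean = sum(train) / len(train)
sd = (sum((v - mean) ** 2 for v in train) / len(train)) ** 0.5

# A new record is scaled with the TRAINING mean and sd; it is never
# pooled into those statistics, which would leak information.
print(standardize([20.0], mean, sd))
```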
What happens when K = 1?
the new record is assigned the class of its single closest training record. This is simple and very flexible, but it can be sensitive to noise, outliers, or unusual records in the training data.
How does KNN work when K > 1?
the algorithm finds the K closest training records. For classification, it uses majority vote. For example, if K = 3 and two of the three nearest neighbors are owners, the new record is classified as owner.
How is an estimated class probability calculated in KNN?
An estimated class probability is calculated by dividing the number of neighbors in the target class by K. If K = 3 and 2 neighbors are owners, the estimated probability of owner is 2/3 ≈ 0.67.
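A toy sketch of the vote and probability calculation, with hypothetical neighbor distances and labels:

```python
from collections import Counter

def knn_classify(neighbors, k, target_class, cutoff=0.5):
    """Majority-vote KNN on precomputed (distance, label) pairs."""
    nearest = sorted(neighbors)[:k]
    votes = Counter(label for _, label in nearest)
    prob = votes[target_class] / k              # estimated class probability
    label = target_class if prob > cutoff else "other"
    return prob, label

# Hypothetical neighbors: (distance to the new record, class label)
neighbors = [(0.4, "owner"), (0.7, "owner"), (0.9, "nonowner"), (1.5, "nonowner")]
prob, label = knn_classify(neighbors, k=3, target_class="owner")
print(prob, label)   # 2 of the 3 nearest are owners -> 2/3, classified "owner"
```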
What role does the cutoff value play in KNN classification?
converts the estimated probability into a class. With a 0.50 cutoff, a probability above 0.50 is classified as class 1. The cutoff can be adjusted to balance false positives and false negatives.
What is the default cutoff value usually used in classification?
The default cutoff value is usually 0.50. That means the estimated probability must be greater than 50% for the record to be classified as class 1.
Why might the cutoff value be changed?
if one type of error is more costly than another. For example, lowering the cutoff can classify more records as class 1, which may reduce false negatives but increase false positives.
How do you choose the best K?
Split the data into training and validation sets, test several K values, calculate the validation error or misclassification rate for each K, and choose the K with the lowest validation error or highest accuracy.
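A minimal 1-D sketch of this selection loop; the toy data and the `knn_predict` helper are illustrative, not a library API:

```python
def knn_predict(x, k, train):
    """Classify x by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)

def best_k(train, validation, candidates=(1, 3, 5)):
    """Pick the K with the lowest validation misclassification rate;
    ties go to the smallest K (the more local model)."""
    def err(k):
        wrong = sum(knn_predict(x, k, train) != y for x, y in validation)
        return wrong / len(validation)
    return min(candidates, key=lambda k: (err(k), k))

# Toy 1-D data: class "a" clusters near 0, class "b" near 10.
train = [(0, "a"), (1, "a"), (2, "a"), (9, "b"), (10, "b"), (11, "b")]
validation = [(1.5, "a"), (9.5, "b"), (0.5, "a"), (10.5, "b")]
print(best_k(train, validation))
```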
What should you do if multiple K values tie for the lowest validation error?
choose the lowest K if several K values have the same lowest error rate. This keeps the model more local while still matching the best validation performance.
What is the risk of using a very small K?
creates flexible decision boundaries that follow individual records closely. This can overfit the training data because the model may react too strongly to outliers or noise.
What is the risk of using a very large K?
smooths out the decision boundary too much. It can miss local patterns and, if K approaches the full training set size, may simply predict the most common class for almost everything.
What are the main advantages of KNN?
simple, easy to understand, makes no distribution assumptions, and can capture complex interactions because it does not force the relationship into a specific equation.
What are the main shortcomings of KNN?
can require a large training set, becomes weaker in high-dimensional data, suffers from the curse of dimensionality, and can be computationally expensive because it calculates distances to training records for each prediction.
What is the curse of dimensionality in KNN?
As the number of predictors increases, distances become less meaningful because most points become far apart from each other. This makes it harder for KNN to identify truly similar neighbors.
Why can KNN be computationally expensive?
must calculate and sort distances between the new record and many training records each time a prediction is made. With large datasets, this can take significant time and processing power.
What is the basic KNN workflow?
Standardize predictors, split into training and validation sets, compute distances from validation or new records to training records, identify the K nearest neighbors, apply majority vote or averaging, and choose K based on validation performance.
What is logistic regression?
a classification method used when the outcome is categorical, especially binary outcomes coded as 0 and 1. It predicts the probability that a record belongs to class 1.
How is logistic regression related to linear regression?
Logistic regression extends the idea of linear regression but uses it for categorical outcomes. Instead of predicting Y directly, it models a transformation of the probability of Y = 1.
Why can't ordinary linear regression be used directly for binary classification?
can produce predicted values below 0 or above 1, but probabilities must stay between 0 and 1. A binary outcome also does not match the unlimited range of a linear equation.
What does logistic regression predict first?
the estimated probability that the outcome equals 1. That probability is then converted into a class using a cutoff value.
Why does logistic regression use the logit?
allows the left side of the model to have the same unlimited range as the right side linear predictor. This makes it possible to model a binary outcome using a linear combination of predictors.
What are odds?
compare the probability that an event happens to the probability that it does not happen. The formula is odds = p / (1 - p).
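A one-line illustration of the formula:

```python
def odds(p):
    """Odds of an event: probability it happens over probability it does not."""
    return p / (1 - p)

print(odds(0.75))   # 0.75 / 0.25 = 3.0, i.e. 3-to-1 odds
```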
What is the logistic regression equation in logit form?
logit(p) = log(odds) = beta0 + beta1*x1 + beta2*x2 + ... + betan*xn. The right side is a linear combination of predictors.
What is the sigmoid function doing in logistic regression?
converts the linear predictor into a probability between 0 and 1. This creates the S-shaped curve that keeps predictions within the valid probability range.
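A small sketch of the sigmoid and the cutoff step, assuming hypothetical coefficients beta0 = -1.0 and beta1 = 0.8:

```python
import math

def sigmoid(z):
    """Map the linear predictor to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

def classify(z, cutoff=0.5):
    """Convert the estimated probability into a class via the cutoff."""
    return 1 if sigmoid(z) > cutoff else 0

# Hypothetical model: logit(p) = -1.0 + 0.8 * x
beta0, beta1 = -1.0, 0.8
for x in (0.0, 2.0):
    z = beta0 + beta1 * x
    print(x, round(sigmoid(z), 3), classify(z))
```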
What preprocessing steps are used before logistic regression?
partitioning the data into 60% training and 40% validation, creating dummy variables for categorical predictors, and selecting relevant demographic and banking relationship variables.
How does logistic regression turn probability into class membership?
After estimating the probability of class 1, the model compares it to a cutoff value. If the probability is greater than the cutoff, the record is classified as 1; otherwise it is classified as 0.
What is the default classification cutoff?
usually 0.50. If the predicted probability is greater than 50%, the model predicts class 1.
Why might the cutoff be optimized instead of staying at 0.50?
may improve classification accuracy or better match business goals. For example, if missing potential loan acceptors is costly, a lower cutoff may be used to classify more customers as likely acceptors.
What does statistical significance test in logistic regression?
whether a coefficient is statistically different from zero. The null hypothesis is H0: beta = 0, meaning the predictor has no relationship with the outcome probability.
Why can correlated predictors be a problem?
can bias or destabilize coefficient estimates, similar to multicollinearity in linear regression. This can make it harder to interpret individual predictor effects.
What is overfitting in logistic regression?
when the model includes too many predictors or captures random patterns in the training data. It may look good on training data but perform poorly on validation or new data.
How can logistic regression be improved?
through variable selection, removing redundant predictors, checking statistical significance, using validation data, and considering dimension reduction when predictors are highly correlated.
What are the key takeaways of logistic regression?
Logistic regression adapts linear regression for categorical outcomes, uses the logit function, predicts probabilities, converts probabilities into classes with a cutoff, and can be used for both profiling and prediction.
What does CART stand for?
Classification and Regression Trees. It includes classification trees for categorical outcomes and regression trees for continuous numerical outcomes.
What is a classification tree?
predicts a categorical outcome by splitting records into groups based on predictor values. Each terminal leaf assigns a class label.
What is a regression tree?
predicts a continuous outcome. Each terminal leaf predicts a numerical value, usually the average outcome value of records in that leaf.
What is the main goal of a tree model?
classify or predict an outcome using a set of predictors. The tree divides observations into subgroups and creates rules that can be followed to make predictions.
Why are trees easy to interpret?
their output is a set of IF-THEN rules. You can read down the tree from the root to a leaf and turn that path into a classification or prediction rule.
What is recursive partitioning?
the process of repeatedly splitting the dataset into smaller parts. Each split tries to make the resulting groups more homogeneous with respect to the outcome.
How does the tree choose the first split?
The algorithm considers possible splits across predictor variables and split values, measures the impurity after each split, and chooses the split that produces the lowest impurity or greatest impurity reduction.
What are the basic recursive partitioning steps?
Repeatedly divide the data into subsets, applying the same splitting criterion to each new partition. The process continues until a stopping criterion is met, such as a minimum node size or no further reduction in impurity.
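A minimal sketch of one splitting step for a single numerical predictor, using Gini impurity and made-up records:

```python
def node_gini(labels):
    """Gini impurity of one node, from its list of class labels."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(records):
    """Test each midpoint split of one predictor and keep the split
    with the lowest weighted impurity of the two child nodes."""
    values = sorted({v for v, _ in records})
    best = None
    for a, b in zip(values, values[1:]):
        s = (a + b) / 2
        left = [lab for v, lab in records if v <= s]
        right = [lab for v, lab in records if v > s]
        score = (len(left) * node_gini(left)
                 + len(right) * node_gini(right)) / len(records)
        if best is None or score < best[1]:
            best = (s, score)
    return best

# Made-up records: (predictor value, class label)
data = [(1, "a"), (2, "a"), (3, "a"), (8, "b"), (9, "b")]
print(best_split(data))   # the split at 5.5 separates the classes perfectly
```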
What is impurity?
measures how mixed a node is. A node with many classes evenly represented has high impurity, while a node with mostly one class has low impurity.
What are the two impurity measures from the slides?
Gini index and entropy. Both are based on the proportions of cases in a node that belong to each class.
What is the maximum Gini value for two equally represented classes?
0.5 when the classes are equally represented. This means the node is highly mixed.
What is the maximum entropy value?
log2(m), where m is the total number of classes. Entropy is highest when the classes are evenly represented.
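Both measures are easy to compute from the class proportions in a node; a small illustration:

```python
import math

def gini(proportions):
    """Gini index: 1 minus the sum of squared class proportions."""
    return 1 - sum(p ** 2 for p in proportions)

def entropy(proportions):
    """Entropy in bits: sum of -p * log2(p) over classes with p > 0."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)

print(gini([0.5, 0.5]))     # 0.5, the maximum for two classes
print(entropy([0.5, 0.5]))  # 1.0 = log2(2), the maximum for two classes
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))   # a pure node has 0 impurity
```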
How are split points represented in a tree?
become nodes in the tree. The branches show the result of the split, and the terminal endpoints are called leaves.
What are leaves in a tree?
are terminal nodes where no further split is made. A leaf contains the final predicted class for classification or final predicted average for regression.
How do you classify a new observation using a tree?
Drop the new observation down the tree, compare its predictor values with each split point, follow the matching branches, and assign the class shown at the terminal leaf.
Why can rules sometimes be simplified?
Some conditions may become redundant. If a later rule already implies an earlier condition, the earlier condition can sometimes be removed without changing the meaning of the rule.
How are possible split values found for numerical variables?
Records are ordered by a predictor, and midpoints between successive values are tested as possible split points. For example, if lot sizes are 14.0 and 14.8, the midpoint is 14.4.
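A minimal sketch of generating candidate split points; the lot sizes are illustrative, echoing the 14.0/14.8 example:

```python
def candidate_splits(values):
    """Midpoints between successive sorted distinct values of one predictor."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

lot_sizes = [14.8, 14.0, 16.0, 14.0]
print(candidate_splits(lot_sizes))   # midpoints 14.4 and 15.4
```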
What problem occurs with categorical variables that have many categories?
The number of possible category splits becomes very large, which increases computation. The slides note that XLMiner supports only binary categorical variables.
Why do we need training and validation sets for trees?
The training set is used to grow the tree, while the validation set is used to evaluate how well the tree predicts new data. This helps detect overfitting.
What is overfitting in a tree?
occurs when the tree grows too detailed and captures noise in the training data. It may achieve high purity on training records but poor predictive accuracy on validation data.
How does validation error reveal overfitting?
As the tree grows, training error usually decreases, but validation error eventually starts increasing. That increase shows the tree is becoming too complex and less generalizable.
What is pruning?
growing the tree first and then cutting back peripheral branches to avoid overfitting. It balances lower misclassification error against a simpler tree with fewer nodes.
What is the minimum error tree?
the tree size that has the lowest error rate on the validation data.
What is the best pruned tree?
the smallest tree whose validation error is within one standard error of the minimum error tree. It favors a simpler model with nearly the same performance.
What is CHAID?
Chi-Squared Automatic Interaction Detection. It is a decision tree algorithm that uses chi-square tests to find optimal splits, especially for categorical predictors.
How is CHAID different from standard CART?
can create multiway splits instead of only binary splits. It uses statistical significance and p-values rather than impurity measures like Gini or entropy.
Why is CHAID useful?
for categorical data, market segmentation, and survey analysis. It automatically merges categories that are not significantly different and naturally stops growing through statistical tests.
How are regression tree predictions calculated?
the prediction at a leaf is usually the average outcome value of the training records in that leaf.
How is regression tree performance measured?
is often measured by RMSE, or root mean squared error, which shows the typical size of prediction errors for continuous outcomes.
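A small illustration with made-up actual and predicted values:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: the typical size of prediction errors."""
    sq = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(sq) / len(sq))

# Illustrative actual values and leaf predictions from a regression tree.
actual    = [10.0, 12.0, 9.0, 15.0]
predicted = [11.0, 11.0, 11.0, 14.0]
print(rmse(actual, predicted))   # errors -1, 1, -2, 1 -> sqrt(7/4)
```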
What are the main advantages of trees?
easy to use and understand, rules are easy to interpret, they perform automatic variable selection, make no strong assumptions, are computationally cheap to deploy, and can be robust to outliers and missing values.
What are the main disadvantages of trees?
often require large amounts of data and may not perform well with complex data. A single tree can also overfit if it is not pruned.
Which method is most sensitive to variable scaling?
KNN is most sensitive to scaling because distance calculations can be dominated by large-scale variables. Logistic regression can also benefit from scaling in some settings, but the KNN slides emphasize standardization as essential.
Which method uses pruning?
CART. The tree is grown and then branches are cut back to reduce overfitting and improve validation performance.