CAP4630 - Midterm Review - Flashcard Set

From Linear Regression to Dimension Reduction

207 Terms

1
New cards

What is supervised learning?

The basic supervised learning framework, y = f(x), where y = output, f = the mapping function, and x = the input.

2
New cards

What is the goal of supervised learning in linear regression?

Given a training set of labeled examples {(x1, y1), …, (xn, yn)}, estimate the parameters of the prediction function f.

3
New cards

What is the inference of supervised learning in linear regression?

Apply f to a never-before-seen test example x and output the predicted value y = f(x).

4
New cards

What is the learning goal of supervised learning for linear regression?

The goal is to approximate the mapping function so well that, given new input data (x), you can accurately predict the output variable (y) for that data.

5
New cards

What can supervised learning problems be grouped into?

Supervised learning problems can be further grouped into Regression and Classification problems.

6
New cards

What is the difference between regression problems and classification problems for supervised learning problems?

Output or predictive value: numerical for regression (real or continuous value), categorical for classification.

7
New cards

What is a classification problem in supervised learning?

A classification problem is when the output variable is a category (discrete class label), such as “red” or “blue” or “disease” and “no disease”.

8
New cards

What form does a continuous value prediction take?

A classification algorithm may predict a continuous value, but the continuous value is in the form of a probability for a class label.

9
New cards

What are the two types of classification?

Binary classification: when there are only two classes to predict, usually 1 or 0 values. Multi-class classification: when there are more than two class labels to predict, e.g. image classification problems with thousands of classes (cat, dog, fish, car, …).

10
New cards

What are some algorithms for classification?

Decision Trees, Logistic Regression, Naive Bayes, K Nearest Neighbors, SVM, Neural Network

11
New cards

What are some algorithms for regression?

Linear Regression, Regression Trees (e.g. Random Forest), Support Vector Regression (SVR)

12
New cards

What are we trying to do in linear regression?

In linear regression, we try to explain or predict one variable using one or more other variables.

13
New cards

What variable do we want to predict or explain in linear regression? What variables are we using as inputs or predictors?

The variable we want to predict or explain is called the dependent variable (often denoted as y). The variables we use as inputs or predictors are called the independent variables (often denoted as x_1, x_2, …, x_n).

14
New cards

What are the types of regression models, generally speaking? What makes them specific?

There are simple models, which have 1 explanatory variable (independent (input) variables or predictors) and there are multiple models with 2+ explanatory variables (independent (input) variables or predictors).

15
New cards

What are the types of regression models, specifically? What makes them specific?

Both simple and multiple regression models come in linear and non-linear variants: a simple model can be linear or non-linear, and so can a multiple model.

16
New cards

What is the regression problem we’re solving with linear regression?

In the regression problem, the task is: approximate a mapping function (h) from input variables (x) to continuous output variables.

17
New cards

What type of approach does linear regression have to supervised learning? What is the assumption it makes? 

Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X1, X2, …, Xp is linear. 

18
New cards

List the hypothesis equation for linear regression and define each parameter.

y = B0 + B1x1 + … + Bkxk, where y is the response (dependent) variable, B0 is the intercept, the Bj are the weights (coefficients), the xj are the predictors (independent variables), and the right-hand side is a linear combination of the weights and inputs.

19
New cards

What makes a regression model linear?

A regression model is linear when all terms in the model are one of the following: the constant (B0), or a parameter multiplied by an independent variable (Bkxk).

20
New cards

What types of models are the following:

1. theta1 * X^(theta2)

2. theta1 * cos(X + theta4) + theta2 * cos(2X + theta4) + theta3

3. Y = b0 + b1X1 + b2X1^2

1. Nonlinear. 2. Nonlinear. 3. Linear (it is linear in the parameters, even though it is quadratic in X1).

21
New cards

What does the term “linear” refer to?

The term “linear” in linear regression refers to linearity in the parameters (coefficients), not necessarily in the input variables.

22
New cards

Are the true regression functions always linear?

True regression functions are not necessarily linear.

23
New cards

What part of a linear regression equation must be linear? What part can you transform?

While the equation must be linear in the parameters, you can transform the predictor (or independent) variables in ways that produce curvature.

24
New cards

What type of curve does the function Y = b0 + b1X1 + b2X1^2 make?

Y = b0 + b1X1 + b2X1^2 makes a U-shaped curve.

25
New cards

Why should we assume that relationships between variables are linear in linear regression?

Because linear relationships are the simplest non-trivial relationships that can be imagined (hence the easiest to work with)

Because the "true" relationships between our variables are often at least approximately linear over the range of values that are of interest to us

Because even if they're not, we can often transform the variables in such a way as to linearize the relationships.

26
New cards

Can you give an example of an equation transformed to be linear (linearized relationships)?

y = a * e^(bx) * e^u, with parameters a and b and error term u, becomes ln(y) = ln(a) + bx + u.

27
New cards

How do we represent the mapping function h using linear regression with one variable (x)?

Linear regression with one variable (x) – simple linear regression. Linear function h(x) = theta 0 + theta 1 * x, where theta 0 and theta 1 are parameters.

28
New cards

How do you choose the parameters in linear regression?

In linear regression, the parameters are the coefficients that determine how much each input feature contributes to the predicted output. The goal is to choose these parameters so that the model’s predictions are as close as possible to the actual values in the data.

This is done by defining a cost function, usually the mean squared error, which measures the average squared difference between predicted and actual values. Smaller values of this cost function mean the model is making more accurate predictions.

To find the best parameters, the model minimizes the cost function. This can be done using an analytical approach, which calculates the exact solution using a formula, or using an iterative approach called gradient descent, which gradually adjusts the parameters in the direction that reduces the error.

Optionally, regularization techniques can be added to the cost function to prevent overfitting by penalizing large parameter values, leading to more generalizable predictions.

29
New cards

What is the cost function for linear regression? What is the goal of linear regression?

J(theta0, theta1) = 1/(2m) * summation from i = 1 to m of (h_theta(x^(i)) - y^(i))^2. The goal of linear regression is to minimize J(theta0, theta1) (the cost function).

30
New cards

How to find the minimum point of cost function for linear regression? What is its algorithm? List its steps.

Gradient descent. Gradient descent starts with some theta0, theta1, then keeps changing theta0, theta1 to reduce J(theta0, theta1) until we hopefully end up at a minimum.

31
New cards

What is the formal definition of gradient descent? What do gradient and alpha mean in the context of the equation?

theta_j = theta_j - alpha * (partial derivative of J(theta0, theta1) with respect to theta_j) (for j = 0 and j = 1). The gradient is the vector of partial derivatives, and alpha is the learning rate (a positive constant).
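As a concrete illustration, here is a minimal NumPy sketch of this update rule for simple linear regression. The data arrays, alpha, and the iteration count are made-up values for illustration, not from the course.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical inputs
y = np.array([2.1, 3.9, 6.2, 7.8])   # hypothetical targets
m = len(y)

theta0, theta1 = 0.0, 0.0            # start with some theta0, theta1
alpha = 0.05                         # learning rate

for _ in range(1000):
    h = theta0 + theta1 * x          # h_theta(x)
    # partial derivatives of J(theta0, theta1) = 1/(2m) * sum (h - y)^2
    grad0 = (1.0 / m) * np.sum(h - y)
    grad1 = (1.0 / m) * np.sum((h - y) * x)
    # simultaneous update, stepping against the gradient
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)                # approaches the least-squares fit
```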

32
New cards

Why “ - ” the gradient in parameter update?

Note that a gradient is a vector, so it has both a direction and a magnitude. The gradient always points in the direction of steepest increase in the loss function. Example: if you’re standing on a hill, the gradient tells you which way is “uphill” the steepest. So accordingly, the gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible.

33
New cards

What are risks of the learning rate’s possible values? How to set a proper value for the learning rate ⍺ ?

If the learning rate is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge. If the learning rate is too small, gradient descent may require many updates before reaching the minimum point. Only a well-chosen learning rate swiftly reaches the minimum point. A proper value for the learning rate is usually set by trial and error, e.g. trying values from 0.1 down to 0.001, or by parameter tuning via grid search.

34
New cards

How can you see if gradient descent is working?

To see if gradient descent is working, print out J(theta) at each iteration. The value should decrease at each iteration; if it doesn't, adjust alpha.

35
New cards

Judge these statements as true or false.

If the learning rate is too small, then gradient descent may take a very long time to converge.

If theta 0 and theta 1 are initialized at a local minimum, then one iteration will not change their values.

Even if the learning rate alpha is very large, every iteration of gradient descent will decrease the value of f(theta 0, theta 1).

If theta 0 and theta 1 are initialized so that theta 0 = theta 1, then by symmetry (because we do simultaneous updates to the two parameters), after one iteration of gradient descent, we will still have theta 0 = theta 1.

True, True, False, False.

36
New cards

Explain why this is true:

If the learning rate is too small, then gradient descent may take a long time to converge.

Gradient descent updates parameters by moving them in the direction opposite to the gradient of the cost function. The size of each step is determined by the learning rate alpha. If alpha is very small, each update moves the parameters only slightly, meaning it will require many iterations to reach the minimum. While a small learning rate is safe (it avoids overshooting the minimum), it can make the algorithm very slow, especially for high-dimensional data or complex cost surfaces. Therefore, convergence may take a long time when the learning rate is too small.

37
New cards

Explain why this is true:

If theta_0​ and theta_1​ are initialized at a local minimum, then one iteration will not change their values.

At a local minimum, the gradient of the cost function with respect to all parameters is zero. Gradient descent updates parameters using the formula:

theta_j := theta_j - alpha (partial J / partial theta_j)

If (partial J / partial theta_j) =0 for all j, then theta_j does not change in that iteration. Since theta_0 and theta_1 are at a local minimum, the gradients are zero, and the parameters remain exactly the same after one iteration.

38
New cards

Explain why this is false:

Even if the learning rate alpha is very large, every iteration of gradient descent will decrease the value of f(theta 0, theta 1)

A large learning rate can cause gradient descent to overshoot the minimum. Instead of decreasing the cost function, the updates may move the parameters past the minimum, or even to regions where the cost is higher. In extreme cases, gradient descent may fail to converge entirely and cause the cost to oscillate or diverge. Therefore, a very large learning rate does not guarantee that the cost decreases in every iteration.

39
New cards

Explain why this is false:

If theta_0​ and theta_1​ are initialized so that theta_0 = theta_1, then by symmetry (because we do simultaneous updates to the two parameters), after one iteration of gradient descent, we will still have theta_0 = theta_1​.

False. Gradient descent updates each parameter based on the partial derivative of the cost function with respect to that parameter. Even if theta_0​ and theta_1​ start out equal, their gradients, (partial J / partial theta_0) and (partial J / partial theta_1)​, are generally not equal unless the cost function is perfectly symmetric in those parameters. Since the updates subtract the gradient times the learning rate from each parameter, differing gradients will cause theta_0​ and theta_1 to change by different amounts. As a result, the equality theta_0 = theta_1​ will usually be broken after one iteration.

40
New cards

In linear regression with multiple variables, what are rows and what are columns?

Each row represents a single data point (or sample). Since each data point can have multiple features, each row is an n-dimensional vector, where n is the number of features. So if you have 100 data points and 3 features, your design matrix X has 100 rows, each with 3 values (or 4 if you include the bias term). Each column corresponds to a single feature (input variable) across all data points. For example, the first column might be “age,” the second “height,” and so on. Each column contains all the values of that feature for every sample.

41
New cards

What is the hypothesis/mapping function before with a single variable? What is it now for linear regression with multiple variables?

Before: h_theta(x) = theta0 + theta1 * x. Now: h_theta(x) = theta0 + theta1 * x1 + theta2 * x2 + … + theta_n * xn.

42
New cards

What do we define x0 as for convenience of notation? What is the size of each data point/feature x now?

For convenience of notation, we define x0 = 1. Each data point x is now an (n + 1)-dimensional vector.

43
New cards

Define, for linear regression with multiple variables, the hypothesis, the parameters, the cost function, the gradient descent, and the updated gradient descent with the cost function included.

Hypothesis: h_theta(x) = theta^T x = theta0 + theta1 * x1 + theta2 * x2 + … + theta_n * xn. Parameters: theta0, theta1, …, theta_n. Cost function: J(theta0, theta1, …, theta_n) = 1/(2m) * summation from i = 1 to m of (h_theta(x^(i)) - y^(i))^2. Gradient descent: theta_j = theta_j - alpha * (partial / partial theta_j) J(theta0, …, theta_n). With the cost's gradient written out: theta_j = theta_j - alpha * 1/m * summation from i = 1 to m of (h_theta(x^(i)) - y^(i)) * x_j^(i).

44
New cards

What is the stopping criterion for linear regression with multiple variables?

Assume convergence when ||theta_new - theta_old||_2 < epsilon, where epsilon is a small constant called the threshold. Set a maximum number of iterations as well.

45
New cards

What is the closed-form solution for linear regression?

The closed-form solution for linear regression, also known as the normal equation, is X^T X theta = X^T y. In linear regression, the closed-form solution refers to a direct, non-iterative way to compute the parameters theta that minimize the cost function (mean squared error). You don't need gradient descent or any iterative optimization; you just calculate it using a formula derived from setting the derivative of the cost function to zero: theta = (X^T X)^(-1) X^T y.
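A minimal NumPy sketch of the normal equation, with a hypothetical design matrix whose first column is the bias feature x0 = 1; the pinv line shows the pseudo-inverse fallback discussed in the next card.

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])            # design matrix with x0 = 1
y = np.array([2.1, 3.9, 6.2, 7.8])

theta = np.linalg.inv(X.T @ X) @ X.T @ y       # theta = (X^T X)^(-1) X^T y
# if X^T X is singular (not invertible), use the pseudo-inverse instead:
theta_pinv = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)
```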

46
New cards

What do you do if X^T * X is not invertible (singular)?

Use the pseudo-inverse instead of the inverse, or remove the redundant (not linearly independent) features.

47
New cards

What is the difference between gradient descent and the normal equation/the closed form solution for theta? (Given that there are m training examples, n features)

For gradient descent, you need to choose a learning rate alpha and you need many iterations of parameter updates, but it works well even when n is large. For the normal equation (the closed-form solution), there is no need to choose a learning rate alpha and no need to iterate; you find theta directly through theta = (X^T X)^(-1) X^T y. However, the normal equation calculation is slow if n is very large.

48
New cards

What is overfitting in linear regression?

Overfitting in linear regression is when the learned hypothesis fits the training set very well (J(theta) ≈ 0, so the predictions match the training labels almost exactly) but fails to generalize to new examples.

49
New cards

How do we solve overfitting?

Regularization, which is a method for automatically controlling the complexity of the learned hypothesis. Regularization penalizes large values of theta_j and can be incorporated into the cost function; this works well when we have a lot of features, each of which contributes a bit to predicting the label y. Overfitting can also be addressed by eliminating features (either manually or via model selection).

50
New cards

How do we work regularization into the linear regression model? What does the linear regression with L2 regularization (what we use here) measure? What is a note to make about theta 0 here?

We add regularization to the objective function, i.e. the cost function (mean squared error), so that J(theta) = 1/(2m) * [summation from i = 1 to m of (h_theta(x^(i)) - y^(i))^2 + lambda * summation from j = 1 to n of (theta_j)^2]. The L2 penalty used here measures the magnitude of the parameter vector. Lambda is the regularization parameter (lambda >= 0). Note that theta_0 is not regularized, as that is the bias term.

51
New cards

What do large weights mean for regularization in linear regression?

In linear regression, h_theta(x) = theta0 + theta1 * x1 + theta2 * x2 + … + theta_n * xn. If one coefficient theta_j is very large, it means the model is relying heavily on feature xj. That makes the model very sensitive: a small change in xj can cause a large change in the prediction.

52
New cards

What does regularization do?

Regularization adds a “cost” for large weights: lambda / 2m * summation from j = 1 to n of (theta j)^2. During training, gradient descent tries to reduce both the error term and the penalty term. Result: the optimizer prefers smaller, balanced weights that still explain the data.

53
New cards

What do we add regularization to in linear regression and why?

In linear regression, we add regularization to the cost function. The cost function measures how far the model’s predictions are from the actual values, and regularization adds an extra penalty for large parameter values. The purpose of this is to prevent the model from overfitting the training data by keeping the coefficients smaller and more stable. By adding this penalty, the model balances fitting the data well with maintaining simpler, more generalizable parameters, which helps it perform better on new, unseen data

54
New cards

What does regularized linear regression’s closed form solution (solution for theta, also known as the normal equations) look like?

theta = (X^T X + lambda * L)^(-1) X^T y, where L is the identity matrix with its top-left element set to 0, of the same size as X^T X, i.e. (n + 1) by (n + 1), where n is the number of features. You can derive the normal equation with regularization the same way as before, by solving (partial / partial theta) J(theta) = 0 for the regularized J(theta); the solution is the regularized normal equation above.
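A sketch of this regularized normal equation in NumPy; X, y, and lambda are hypothetical values for illustration.

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])            # first column is the bias feature x0 = 1
y = np.array([2.1, 3.9, 6.2, 7.8])
lam = 0.1                             # regularization parameter lambda

L = np.eye(X.shape[1])
L[0, 0] = 0.0                         # do not penalize the bias term theta_0
theta = np.linalg.inv(X.T @ X + lam * L) @ X.T @ y   # (X^T X + lambda L)^(-1) X^T y
print(theta)
```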

55
New cards

How to choose the regularization parameter?

The regularization parameter lambda controls how strongly the model penalizes large coefficients. Choosing the right lambda is important because a value that is too small may allow overfitting, while a value that is too large may cause underfitting. A common and systematic approach is k-fold cross-validation. The dataset is split into K folds, and the model is trained on K-1 folds while tested on the remaining fold. This process is repeated for each fold and for a set of candidate lambda values. The average validation error across all folds is computed for each lambda, and the value that minimizes this error is selected. Alternatively, if cross-validation is not feasible, lambda can be chosen heuristically by testing a reasonable range of values, such as 0.001 to 0.3, and selecting the one that produces the best performance on a validation set. While less systematic, this approach can work when resources or data are limited.
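A hedged sketch of choosing lambda by k-fold cross-validation, reusing the regularized normal equation from the previous card. The synthetic data, the candidate lambdas, and k = 5 are illustrative choices, not prescribed values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
X = np.column_stack([np.ones(10), x])           # bias column x0 = 1
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, 10)    # made-up noisy line

def ridge_fit(X, y, lam):
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0                               # don't penalize the bias
    return np.linalg.inv(X.T @ X + lam * L) @ X.T @ y

def cv_error(X, y, lam, k=5):
    idx = rng.permutation(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)         # train on k-1 folds
        theta = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((X[fold] @ theta - y[fold]) ** 2))  # validation MSE
    return np.mean(errs)                        # average over the k folds

candidates = [0.001, 0.01, 0.1, 0.3]
best_lam = min(candidates, key=lambda lam: cv_error(X, y, lam))
print(best_lam)
```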

56
New cards

What is the main objective of logistic regression?

Logistic regression is a model used for binary classification, where the goal is to predict the probability that a given instance belongs to a particular class (usually labeled 0 or 1). Instead of directly outputting a class label, logistic regression estimates a probability value between 0 and 1 using the sigmoid (logistic) function applied to a linear combination of the input features. After computing this probability, you can assign the instance to a class by choosing a threshold, typically 0.5, where probabilities above the threshold are classified as one class and below as the other.

57
New cards

What is the hypothesis function of logistic regression? What is the sigmoid function and what does it do?

The hypothesis function for logistic regression is h_theta(x) = g(theta^T * x), where theta^T * x is the linear combination of the input features and model parameters.

The function g(z) is called the sigmoid function, defined as g(z) = 1/(1 + e^(-z)). The sigmoid function takes any real-valued input z and maps it to a value between 0 and 1. This allows the hypothesis function to output a probability that the input belongs to the positive class. By applying the sigmoid function, logistic regression converts the linear combination of features into a probability that can then be used for binary classification.

58
New cards

What is the final hypothesis function for logistic regression with the sigmoid function included?

The hypothesis for logistic regression is h_theta(x) = 1/(1 + e^(-theta^T x)).
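A tiny sketch of this hypothesis in NumPy; the values of theta and x are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 2.0])         # [theta_0, theta_1], made up
x = np.array([1.0, 0.8])              # [x0 = 1, x1], made up

p = sigmoid(theta @ x)                # estimated P(y = 1 | x; theta)
y_hat = 1 if p >= 0.5 else 0          # default 0.5 threshold (next card)
print(p, y_hat)
```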

59
New cards

Where do we place the default threshold for logistic regression and why?

The default threshold for logistic regression is placed at 0.5. This means that if the predicted probability h_theta(x) is greater than or equal to 0.5, the instance is classified as belonging to the positive class (1); if it is less than 0.5, it is classified as belonging to the negative class (0). The threshold is set at 0.5 because it represents the midpoint of the probability range between 0 and 1, which makes it a natural choice when both classes are equally important and the data is relatively balanced. However, the threshold can be adjusted depending on the desired trade-off between precision and recall.

60
New cards

When might you adjust the threshold/the trade-off between precision and recall?

You might adjust the logistic regression threshold when the application calls for a different trade-off between precision and recall: for instance, lower it below 0.5 to detect more positive cases when missing a positive case is costly, as in medical diagnosis (higher recall = more true positives found), or raise it above 0.5 to be more conservative and reduce false positives when positive predictions carry a high cost, as in fraud detection (higher precision = fewer false positives).

61
New cards

Is logistic regression a linear classifier?

Yes, standard logistic regression (with raw features only) has a linear decision boundary: theta^T * x = 0 is always a hyperplane. So if you use the original features x1, x2, …, xd, the separation between classes is linear. Nonlinear decision boundaries arise if you transform the input features before applying logistic regression. For example, if you add polynomials (e.g., x1^2, x1x2) or kernel features, then the decision boundary becomes linear in that transformed space, but appears nonlinear in the original space.

62
New cards

What are the steps for logistic regression?

Given m training samples {(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m))}, where x^(i) is in R^n and y^(i) is in {0, 1}, you model the hypothesis function h_theta(x) = g(theta^T x), where g(z) = 1/(1 + e^(-z)) scales theta^T x to [0, 1]. Here theta = [theta_0, …, theta_n]^T (a column vector) and x = [1, x_1, …, x_n]^T.

63
New cards

What is the cost function for logistic regression?

You find the parameters using the logistic regression cost function J(theta) = 1/m * summation from i = 1 to m of Cost(h_theta(x^(i)), y^(i)) = -1/m * [summation from i = 1 to m of y^(i) * log(h_theta(x^(i))) + (1 - y^(i)) * log(1 - h_theta(x^(i)))], where m is the number of training examples.

64
New cards

What is the difference between log loss and MSE loss?

Log loss penalizes large distances between the predicted value and the actual value more heavily than MSE loss does.

65
New cards

What is the gradient descent algorithm for Logistic Regression?

Gradient descent iteratively updates the parameters theta in the direction that reduces J(theta), continuing until the cost function converges to a minimum value. You repeat theta_j = theta_j - alpha * 1/m * summation from i = 1 to m of (h_theta(x^(i)) - y^(i)) * x_j^(i) until the thetas converge and the model predicts successfully.
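A minimal sketch of this training loop in NumPy, vectorized over all parameters. The toy dataset and the hyperparameters alpha and the iteration count are invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1, 0.5], [1, 1.5], [1, 2.5], [1, 3.5]], dtype=float)  # x0 = 1
y = np.array([0, 0, 1, 1], dtype=float)
m = len(y)

theta = np.zeros(X.shape[1])
alpha = 0.5
for _ in range(2000):
    h = sigmoid(X @ theta)                        # h_theta(x) for every sample
    theta -= alpha * (1.0 / m) * (X.T @ (h - y))  # the update rule above
    # to check progress, the log loss should decrease each iteration:
    # J = -(1/m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

print(theta)
```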

66
New cards

What can we use gradient descent for in logistic regression?

We can use gradient descent to learn the parameter values, and hence compute the prediction for a new input.

67
New cards

How do we make a prediction given a new x?

Output h_theta(x) = 1/(1 + e^(-theta^T * x)) = estimated probability that y = 1 on input x.

68
New cards

How do you use logistic regression for multi-class classification?

One vs all: train a logistic regression classifier h_theta^(i)(x) for each class i to predict the probability that y = i. On a new input x, to make a prediction, pick the class i that maximizes h_theta^(i)(x); each classifier gives a probability score, and you output the class the input has the highest probability of belonging to, as in the sketch below.
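A hedged one-vs-all sketch building on the logistic regression loop above; the three-class toy data and the train_logistic helper are invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.5, iters=2000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        theta -= alpha * (X.T @ (h - y)) / len(y)
    return theta

X = np.array([[1, 0.2], [1, 1.1], [1, 2.0], [1, 3.2]], dtype=float)  # x0 = 1
y = np.array([0, 0, 1, 2])                 # three classes: 0, 1, 2

# one binary classifier per class: predict P(y = i) vs. everything else
thetas = [train_logistic(X, (y == c).astype(float)) for c in np.unique(y)]

x_new = np.array([1.0, 2.8])
scores = [sigmoid(t @ x_new) for t in thetas]   # h_theta^(i)(x) for each class
print(int(np.argmax(scores)))                   # pick the class with the max score
```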

69
New cards

What approach does logistic regression take to learning discriminative functions (i.e., a classifier)?

Logistic regression takes a probabilistic approach to learning discriminative functions, where the hypothesis function h_theta(x) should give p(y = 1 | x; theta) or the probability of y = 1 given x and the parameters. We want 0 <= h_theta(x) <= 1.

70
New cards

What is the cost function for logistic regression?

Cost(h_theta(x), y) = -[y * log(h_theta(x)) + (1 - y) * log(1 - h_theta(x))]

71
New cards

Define overfitting. What will training and testing on the same data create?

A classifier that performs well on the training examples, but poorly on new examples. Training and testing on the same data will generally produce a good classifier (for this dataset) with high overfitting. (Never do this!)

72
New cards

How can we avoid overfitting? What is cross-validation?

Use cross-validation and use the simplest model possible (Regularization). Cross-validation is a method used to estimate how well a model will perform on unseen data by systematically splitting the dataset into different training and testing sets. In k-fold cross-validation, the data is divided into K equal parts called folds. The model is trained on K minus one folds and tested on the remaining fold, and this process is repeated K times so that each fold serves as the test set once. The results are then averaged to give a more reliable measure of model performance. Unlike a simple split between training and testing sets, cross-validation rotates which data are used for training and testing, reducing bias from a single split and providing a better estimate of how the model will generalize to new, unseen data.

73
New cards

What is regularization in logistic regression?

Regularization in logistic regression means adding a penalty term to the cost function: J_regularized(theta) = J(theta) + lambda/(2m) * summation from j = 1 to n of (theta_j)^2, where n is the number of features and theta_0 is not penalized.

74
New cards

How would we write the regularization cost function for logistic regression if it were L1 regularization? What does this do?

J_regularized(theta) = J(theta) + lambda * summation from j = 1 to n of |theta_j|. L1 regularization encourages sparsity in the logistic regression model and this is good because it can reduce overfitting, simplify the model by effectively removing irrelevant features, and make the model easier to interpret.

75
New cards

List every equation written in the “Summary (logistic regression): things to remember” part of the PDF.

Hypothesis Function = h_theta(x) = 1/(1 + e^(-theta^T x))

Cost Function = Cost(h_theta(x), y) = {-log(h_theta(x)) if y = 1 or -log(1 - h_theta(x)) if y = 0}

Logistic Regression with gradient descent = theta_j = theta_j - alpha * 1/m * summation from 1 to m of (h_theta(x^(i)) - y^(i)) * x_j^(i)

Logistic regression with gradient descent and regularization = theta_j = theta_j * (1 - alpha (lambda/m)) - alpha * 1/m * summation from 1 to m of (h_theta(x^(i)) - y^(i)) * x_j^(i)

Multi-class classification: predict the class i that maximizes h_theta^(i)(x).

76
New cards

What is accuracy for logistic regression?

Accuracy is the percent of correct classifications. Accuracy = Correct Predictions / Total Predictions. Error rate is the percent of incorrect classifications. Accuracy = 1 – Error rate.

77
New cards

What are some problems with using accuracy for evaluation metrics?

Problems with accuracy are that it assumes equal costs for misclassification and assumes a relatively uniform class distribution. E.g., in an imbalanced dataset with 95 negative samples and 5 positive samples, classifying all samples as negative gives a 0.95 accuracy score.

78
New cards

What is the formula for recall, or the true positive rate? What is the formula for specificity, or the true negative rate? What is the formula for precision, or the positive predictive value?

TPR = True Positives/All Positives = True Positives/(True Positives + False Negatives)

TNR = True Negatives/All Negatives = True Negatives/(True Negatives + False Positives)

PPV = True Positives / (True Positives + False Positives)

79
New cards

What is the formula for F1 score and what is the F1 score?

The formula for F1 score is F1 = 2 * (Precision * Recall) / (Precision + Recall) and the F1 score is the harmonic mean of precision and recall.
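A worked arithmetic sketch of these four metrics from hypothetical confusion-matrix counts (the counts are made up).

```python
tp, fn, fp, tn = 40, 10, 5, 45        # hypothetical counts

recall = tp / (tp + fn)               # TPR = 40/50 = 0.80
specificity = tn / (tn + fp)          # TNR = 45/50 = 0.90
precision = tp / (tp + fp)            # PPV = 40/45 ≈ 0.89
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean ≈ 0.84
print(recall, specificity, precision, f1)
```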

80
New cards

How does a classifier such as logistic regression function? How does it make a classification decision?

A classifier such as logistic regression outputs a score or probability p_hat in [0, 1]. To make a classification decision, we apply a threshold: predict y_hat = 1 if p_hat >= the threshold, and y_hat = 0 otherwise.

81
New cards

What may change as you refine the logistic regression classifier for different applications? What helps about the ROC curve?

Different applications may require different trade-offs between detecting positives (high recall) and avoiding false alarms (high precision). An ROC (Receiver Operating Characteristic) curve visualizes performance across all thresholds, rather than fixing one. The Area Under the ROC Curve (AUC) is a scalar summary: 1.0 is perfect, 0.5 is random guessing. The larger the AUC, the better the classifier’s overall performance across all classification thresholds.
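Assuming scikit-learn is available, a short sketch of computing the ROC curve and AUC from predicted probabilities; the labels and scores are made up.

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]                 # hypothetical labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]    # hypothetical P(y = 1 | x)

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one point per threshold
auc = roc_auc_score(y_true, y_score)               # 1.0 perfect, 0.5 random
print(auc)
```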

82
New cards

What is the connection between precision and recall?

In practice, one always needs to make a compromise between these two metrics: increasing Recall typically decreases Precision, and vice versa.

83
New cards

What are solutions to imbalanced data?

Oversampling: re-sample data from the minority class. Under-sampling: randomly eliminate samples from the majority class. Synthesizing new data points for the minority class (e.g., taking averages of minority-class samples, or adding small noise to minority-class samples). Adjusting class weights.

84
New cards

What are the details about a discriminative model?

A discriminative model aims to learn p(y | x), the probability of class y (e.g., malignant tumor or benign) given features x (the observation). The focus is: given the features X, what is the probability of class Y? Examples: logistic regression, SVM, neural networks. It doesn't model how the data is generated, only the decision boundary.

85
New cards

What are the details about a generative model?

A generative model learns p(x | y) (the probability of the features conditioned on class y) and p(y) (the class prior, e.g. y = 0 is benign, y = 1 is malignant, without any information about the features). It learns the joint distribution P(X, Y), usually decomposed as P(X, Y) = P(Y) * P(X | Y). That means: learn the class prior P(Y), learn the class-conditional likelihood P(X | Y), then apply Bayes' rule to get P(Y | X).

86
New cards

What are examples of generative models?

Examples: Naïve Bayes, Gaussian mixture models, Hidden Markov Models

87
New cards

What are examples of discriminative models?

Examples: logistic regression, SVM, neural networks.

88
New cards

What is the Naive Bayes Classifier?

The Naive Bayes Classifier is a probabilistic model based on Bayes’ Theorem, used mainly for classification tasks. It assumes that all features in the input are conditionally independent given the class label — an assumption that is often unrealistic but works surprisingly well in practice. The classifier calculates the probability of each class given the input features and predicts the class with the highest probability. It is widely used for text classification, spam detection, and sentiment analysis because it is simple, efficient, and performs well with large datasets.

89
New cards

Define the Bayes Rule.

Bayes’ Rule describes how to update the probability of a hypothesis as more evidence becomes available. It is defined as:

P(A|B)= P(A) * P(B|A) / P(B)

In words, this means the probability of event A given event B equals the probability of A times the probability of B given A, divided by the probability of B. In classification, Bayes’ Rule allows us to compute the probability of a class given the observed features, forming the foundation of the Naive Bayes Classifier.

90
New cards

What theorem is used in Bayes Classifiers?

The Bayes Theorem.

91
New cards

Define the parameters of Bayes’ Theorem. Define the full Bayes’ Theorem in words.

P(A|B) is Posterior probability: the probability of event A (the hypothesis or class) being true after seeing the evidence B. P(B|A) is Likelihood: the probability of observing the evidence B if the hypothesis A is true. P(A) is prior probability: what we believed about A before seeing any evidence. P(B) is evidence or marginal probability: the overall probability of observing B, regardless of which hypothesis is true.

Bayes’ Theorem = P(A|B) = P(B|A) * P(A)/ P(B) = Posterior Prob. = Likelihood * Prior / Evidence (or normalization factor).

92
New cards

Define posterior probability.

“Posterior” means after seeing the evidence. It is the probability of the class, given that we have observed features. Example: probability of being sick after seeing symptoms. Posterior (p(c_j | d)) means the probability of instance d being in class c_j.

93
New cards

Define likelihood.

This is how likely the evidence is if the class (e.g. c) is true. Example: probability of these symptoms appearing if the patient has the disease. It is not a probability of the hypothesis itself, but of the data under that hypothesis. Likelihood (p(d | c_j)) means the probability of generating instance d given class c_j.

94
New cards

Define prior probability.

“Prior” means before seeing the evidence. It reflects what we believed about the class (e.g. c) in advance. Example: the fraction of the population that has the disease. Prior (p(c_j)) means the probability of occurrence of class c_j.

95
New cards

Define evidence.

Ensures the posterior probabilities across all classes sum to 1. Example: overall probability of observing those symptoms in the population. Evidence (p(d)) means the probability of instance d occurring.

96
New cards

List how the Bayes theorem would define itself if the posterior probability was the probability that you were male given your name was Drew, with classes male or female.

Bayes Theorem = Posterior Probability = Likelihood * Prior Probability / Evidence

Posterior Probability is the probability that you were male given your name was Drew (p(male | drew))

Likelihood is the probability of being called “Drew” given that you are male (p(drew | male))

Prior is the probability of being male (p(male))

Evidence is the probability of being named “Drew” (p(drew))
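A worked arithmetic example of this calculation; the numbers below are hypothetical (not from the course PDF) and only illustrate the mechanics.

```python
p_male = 0.5                     # prior P(male), made up
p_drew_given_male = 0.01         # likelihood P(drew | male), made up
p_drew_given_female = 0.04       # likelihood P(drew | female), made up

# evidence P(drew) via the law of total probability
p_drew = p_drew_given_male * p_male + p_drew_given_female * (1 - p_male)

# Bayes' rule: posterior = likelihood * prior / evidence
p_male_given_drew = p_drew_given_male * p_male / p_drew
print(p_male_given_drew)         # 0.005 / 0.025 = 0.2
```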

97
New cards

Practice Naive Bayes/Bayes Theorem using a random table. Use the one in the PDF.

Done.

98
New cards

Practice Naive Bayes/Bayes Theorem using a random table. Use the one in the PDF.

Done.

99
New cards

Practice Naive Bayes/Bayes Theorem using a random table. Use the one in the PDF.

Done.

100
New cards

How do you use Naive Bayes if there are many features so you can use all features?

Use the conditional independence assumption to factor the joint likelihood across features into a product of single-feature likelihoods.

Estimate class priors by counting how many training examples belong to each class and dividing by the total. For each feature and class, estimate the conditional probability of the feature given the class: for categorical features, count how many class examples have that feature value, divide by the class count, and apply Laplace smoothing (add one to the numerator, add the number of feature values to the denominator). For text or count features, use the multinomial variant based on term frequencies; for binary features, use the Bernoulli variant; for continuous features, fit a Gaussian with class-specific mean and variance.

To classify, compute for each class the prior times the product of its feature likelihoods for the input (or equivalently, the log prior plus the sum of log likelihoods to avoid underflow), and predict the class with the highest resulting score, as in the sketch below.
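A minimal sketch of categorical Naive Bayes with Laplace smoothing following this recipe; the tiny weather-style dataset is invented for illustration.

```python
import numpy as np

X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
y = ["no", "no", "yes", "yes"]          # made-up labels

classes = sorted(set(y))
priors = {c: y.count(c) / len(y) for c in classes}                 # class priors P(c)

n_features = len(X[0])
values = [sorted({xi[j] for xi in X}) for j in range(n_features)]  # per-feature vocab

def likelihood(j, v, c):
    # P(x_j = v | c) with Laplace smoothing: (count + 1) / (class count + #values)
    count = sum(1 for xi, yi in zip(X, y) if yi == c and xi[j] == v)
    return (count + 1) / (y.count(c) + len(values[j]))

def predict(x):
    # log prior + sum of log likelihoods, to avoid underflow
    scores = {c: np.log(priors[c])
                 + sum(np.log(likelihood(j, v, c)) for j, v in enumerate(x))
              for c in classes}
    return max(scores, key=scores.get)

print(predict(("rainy", "hot")))        # -> "yes" on this toy data
```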