Class Introduction

  • Instructor: Ferdi Eruysal

  • Attendance: An attendance sheet is circulating; students must sign before leaving.

  • Assignment: An assignment was posted yesterday.

  • Questions: Instructor asks students about their understanding of the assignment.

Assignment Details

  • Task Objective: Build two models: 1) a Decision Tree (DT) and 2) Logistic Regression (LogReg).

  • Dataset: Telecom data focused on churn.

  • Hyperparameters Optimization:

    • Students need to optimize specific hyperparameters mentioned by the instructor.

    • Determine which metric to optimize for: accuracy, precision, recall, or F1 Score.

  • Comparison:

    • Evaluate performance of Decision Tree against Logistic Regression.

    • Students should work on the assignment for about 2 hours.

  • Feature Engineering:

    • Students encouraged to create new columns from existing ones for potential bonus points.

    • Basic assignments use existing columns; engineering may lead to improved model scores.
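The feature-engineering idea above can be sketched in pandas. This is a hypothetical illustration: the column names (`total_charges`, `tenure_months`) are placeholders, not the actual assignment dataset, and the course tooling may differ.

```python
import pandas as pd

# Hypothetical telecom-style data; column names are illustrative only.
df = pd.DataFrame({
    "total_charges": [600.0, 1200.0, 450.0],
    "tenure_months": [12, 24, 9],
})

# Feature engineering: derive a new column from existing ones.
df["charge_per_month"] = df["total_charges"] / df["tenure_months"]
```

A derived ratio like this can expose patterns (e.g. high monthly spend with short tenure) that the raw columns hide, which is why it may improve model scores.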

  • Report Requirements: Students must submit a brief report detailing their processes, findings, and include screenshots if necessary.
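If the assignment is done in Python, the whole workflow above (two models, hyperparameter optimization on a chosen metric, comparison) can be sketched with scikit-learn. This is an assumption about tooling: the dataset here is synthetic stand-in data, and the hyperparameter grids are illustrative, not the specific ones the instructor listed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Stand-in for the telecom churn data (synthetic, for illustration).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tune each model on the metric chosen for the problem (recall, here).
tree_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5, 20]},
    scoring="recall",
)
logreg_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="recall",
)
tree_search.fit(X_train, y_train)
logreg_search.fit(X_train, y_train)

# Compare the two tuned models on held-out data.
tree_score = tree_search.score(X_test, y_test)
logreg_score = logreg_search.score(X_test, y_test)
```

Whichever model scores higher on the held-out set, along with the best parameters found, is the kind of result the report should document.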

Logistic Regression Overview

  • Classification Problems: Logistic regression is used to tackle classification problems, as opposed to linear regression which is used for regression-type problems.

  • Mechanism:

    • The approach replaces the straight line of linear regression with an S-shaped logistic (sigmoid) curve, so that predictions can be read as probabilities of the 0 or 1 outcomes.

    • This transition breaks the assumptions behind linear regression's squared-error loss, so a new loss function is required.
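The S-shaped curve mentioned above is the sigmoid function, which squeezes any real-valued input into the (0, 1) range. A minimal sketch in Python (assuming Python is the course language):

```python
import math

def sigmoid(z):
    """The logistic (sigmoid) curve: maps any real number into (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Outputs can be interpreted as probabilities of the positive class.
mid = sigmoid(0)       # exactly 0.5 at z = 0
low = sigmoid(-5)      # close to 0 for large negative inputs
high = sigmoid(5)      # close to 1 for large positive inputs
```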

  • Loss Function:

    • The loss function measures the model's total error on the training data and is central to measuring performance; logistic regression minimizes log loss (cross-entropy) rather than squared error.
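As a concrete example, the log loss (binary cross-entropy) commonly used with logistic regression can be computed by hand and checked against scikit-learn. The labels and predicted probabilities below are made up for illustration.

```python
import math
from sklearn.metrics import log_loss

# Made-up true labels and predicted probabilities of the positive class.
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.7, 0.6]

# Log loss: average of -[t*log(p) + (1-t)*log(1-p)] over all samples.
manual = -sum(
    t * math.log(p) + (1 - t) * math.log(1 - p)
    for t, p in zip(y_true, y_prob)
) / len(y_true)
```

Confident predictions that are wrong are penalized heavily, which is what pushes the fitted curve toward well-calibrated probabilities.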

Overfitting and Regularization

  • Overfitting:

    • Occurs when the model is too complex and captures noise rather than the intended signal.

  • Regularization:

    • A technique used to prevent overfitting, through methods like pruning (in decision trees) and imposing penalties on model complexity in logistic regression.

    • Key Parameters:

      • Lambda (λ) and Alpha (α) control model complexity.

      • Alpha must always be between 0 and 1.

      • Lambda can be any positive number.
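One way to see these two parameters in practice is scikit-learn's elastic-net logistic regression. Note the naming assumption: in scikit-learn, `C` is the inverse of lambda (larger lambda means smaller `C` and a stronger penalty), and `l1_ratio` plays the role of alpha, mixing the L1 and L2 penalties.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

model = LogisticRegression(
    penalty="elasticnet",
    solver="saga",       # the solver that supports the elastic-net penalty
    C=1.0,               # C = 1 / lambda; lambda can be any positive number
    l1_ratio=0.5,        # alpha in [0, 1]: 0 = pure L2, 1 = pure L1
    max_iter=5000,
)
model.fit(X, y)
```

Shrinking `C` (raising lambda) pulls coefficients toward zero, simplifying the model and reducing overfitting, analogous to pruning a decision tree.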

K-Nearest Neighbors (KNN) Algorithm

  • Introduction: Introduction to KNN as a simple yet powerful machine learning algorithm.

  • Concept:

    • The algorithm predicts a class based on the closest points.

    • K Value: The number of neighboring data points considered for prediction. Odd values for K are preferred to avoid ties in votes for class prediction.

  • Distance Metric:

    • Euclidean Distance Calculation: The methodology to measure the distance between data points:
      d = sqrt((a1 - b1)^2 + (a2 - b2)^2 + … + (an - bn)^2)
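The distance formula above translates directly into a few lines of Python (assuming Python is the course language):

```python
import math

def euclidean_distance(a, b):
    """d = sqrt((a1-b1)^2 + (a2-b2)^2 + ... + (an-bn)^2)"""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d = euclidean_distance((0, 0), (3, 4))  # → 5.0 (the classic 3-4-5 triangle)
```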

  • Data Types: Categorical variables must be one-hot encoded before KNN can compute distances over them (some tools do this automatically), and KNN may struggle when there are many categorical variables.
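A minimal sketch of one-hot encoding a categorical column before fitting KNN, assuming pandas and scikit-learn; the column names and tiny dataset are illustrative only.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical churn-style data with one categorical column ("plan").
df = pd.DataFrame({
    "plan": ["basic", "premium", "basic", "premium"],
    "minutes": [120, 340, 150, 300],
    "churn": [0, 1, 0, 1],
})

# One-hot encode the categorical column so distances can be computed.
X = pd.get_dummies(df[["plan", "minutes"]])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, df["churn"])
predictions = knn.predict(X)
```

With many categorical columns, the one-hot columns multiply quickly and can dominate the distance calculation, which is why KNN struggles in that setting.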

Choosing the Optimal K Value

  • Small K Value Risks:

    • May lead to overfitting since it closely follows the training data points.

  • Large K Value Issues:

    • Results in a simplistic model that averages predictions across many neighbors, potentially missing nuanced insights.

  • Best K Value: Aim to find the K value that yields the highest performance across accuracy, precision, or recall depending on the problem context.
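The K-selection advice above can be sketched as a cross-validated search over odd K values, assuming scikit-learn; the data is synthetic and the scoring metric should match the problem's goal.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data for illustration.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Try odd K values (to avoid ties) and keep the best cross-validated score.
# Swap scoring= for "precision" or "recall" depending on the problem context.
scores = {
    k: cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X, y, cv=5, scoring="accuracy"
    ).mean()
    for k in range(1, 22, 2)
}
best_k = max(scores, key=scores.get)
```

Very small K tends to overfit (each prediction follows a single nearby point), while very large K underfits by averaging over too many neighbors; the cross-validated sweep finds the balance.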

Practical Example and Exercise Guide

  • Hands-on Exercise:

    • Assignments involve building decision trees, logistic regression, and KNN models.

    • Each student must determine the best hyperparameters based on optimization goals set by the instructor.

    • Focus on recall for churn prediction.

  • Weekly Materials:

    • Datasets and model files are available for download in the weekly materials.

  • Common Errors and Debugging:

    • Address typical errors (e.g., Java-related program crashes) encountered during model building and parameter optimization.

  • Performance Assessment: Regularly check performance metrics (accuracy, precision, recall) to guide model refinement and ensure alignment with business needs.

Conclusion and Wrap-up

  • Each student is tasked to find optimal hyperparameters for their individual assignments before the next class.

  • Instructor emphasized the importance of hands-on practice to reinforce understanding and mastery of machine learning concepts.

  • Questions from students encouraged throughout the learning process, with direct support provided by the instructor.