Machine Learning Regression Techniques
Lecture Overview
The lecture will be recorded for students.
Assignment Details
Assignment 1
Project Group:
Groups of 3 to 4 (preferably 3)
Finalize groups by: 10/07
Format:
Proposal
Presentation
Final Report in IEEE Format (Consider using LaTeX)
Data Requirements:
Choose a dataset with 3000+ samples
Focus on either Classification or Regression
Requirements:
Full Data Pipeline
Multiple models compared
Model Saving
Saving Models
Models can be saved using Pickle in Python:
To save the model to disk:
```python
import pickle

# Predictions from the trained model, kept for later comparison
result_original = model.predict(X_test)

# Save the trained model to disk
pickle.dump(model, open('model_pickle.sav', 'wb'))
```
To load the model from disk:
```python
# Load the model from disk and verify it predicts identically
loaded_model = pickle.load(open('model_pickle.sav', 'rb'))
result_loaded = loaded_model.predict(X_test)
print(result_original)
print(result_loaded)
```
Example output, showing that the original and loaded models produce identical predictions:
Original results:
[13.5757191, 17.09145661, 18.13667587, 14.90599816, 13.48069917]
Loaded results:
[13.5757191, 17.09145661, 18.13667587, 14.90599816, 13.48069917]
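As an alternative to pickle, scikit-learn models are often saved with joblib, which handles objects containing large NumPy arrays more efficiently. A minimal sketch, reusing the `model` and `X_test` from above:

```python
import joblib

# Save the trained model, then reload it and verify predictions
joblib.dump(model, 'model_joblib.sav')
loaded_model = joblib.load('model_joblib.sav')
print(loaded_model.predict(X_test))
```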
Advanced Linear Regression and Feature Selection
Course: SEIS 763-03, Spring 2025
Multiple Linear Regression
Multiple linear regression involves:
More than one independent variable.
Predicting a dependent variable.
Example: (Hours of Study, Attendance) → Exam Performance.
Evaluate impacts: quantifies how the dependent variable changes in response to each independent variable while holding all other variables constant.
General formula for the regression model:
$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$
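As an illustration, a minimal scikit-learn sketch of multiple linear regression for the study example above; the numbers are made-up toy values, not from the lecture:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: each row is (hours of study, attendance %); target is exam score
X = np.array([[2.0, 60.0], [5.0, 80.0], [8.0, 90.0], [10.0, 95.0]])
y = np.array([55.0, 70.0, 85.0, 92.0])

model = LinearRegression().fit(X, y)
print(model.intercept_)  # w0
print(model.coef_)       # w1, w2
```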
Employee Dataset Example
Sample data from employee.csv:

| Department  | WorkedHours | Certification | YearsExperience | Salary |
|-------------|-------------|---------------|-----------------|--------|
| Development | 2300        | 0             | 1.1             | 39343  |
| Testing     | 2100        | 1             | 1.3             | 46205  |
| Development | 2104        | 2             | 1.5             | 37731  |
| UX Designer | 1200        | 1             | 2               | 43525  |
| …           | …           | …             | …               | …      |
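A minimal sketch of loading this dataset with pandas, assuming employee.csv has exactly the columns shown above; the one-hot encoding of Department is an illustrative preprocessing choice, not prescribed by the lecture:

```python
import pandas as pd

df = pd.read_csv('employee.csv')

# One-hot encode the categorical Department column so all
# features are numeric, then separate features from the target
X = pd.get_dummies(df.drop(columns='Salary'), columns=['Department'])
y = df['Salary']
print(X.head())
```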
Statistical Significance and Independent Variables
You can either use all independent variables in the model or only the statistically significant ones.
Statistical Significance Measurement:
p-value: The probability of observing results at least as extreme as the data, assuming the null hypothesis (no effect) holds. A variable is statistically significant if:
Its p-value is less than a predetermined significance level (usually $0.05$).
The lower the p-value, the stronger the evidence against the null hypothesis.
Strategies for Including Variables
Backward Elimination
Forward Selection
Hybrid (Stepwise)
All Combinations
Backward Elimination Steps
1. Select an exit significance level threshold (e.g., $0.05$).
2. Fit the model with the remaining variables.
3. Select the variable with the highest p-value: if $p\text{-value} > \text{significance threshold}$, go to Step 4; otherwise, conclude with the current model.
4. Remove that variable.
5. Return to Step 2.
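A minimal sketch of these steps using statsmodels; the function name and column handling are illustrative assumptions, and `X` is taken to be a 2D NumPy array:

```python
import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, threshold=0.05):
    cols = list(range(X.shape[1]))
    while cols:
        model = sm.OLS(endog=y, exog=X[:, cols]).fit()  # Step 2: fit remaining variables
        worst = int(np.argmax(model.pvalues))           # Step 3: highest p-value
        if model.pvalues[worst] > threshold:
            cols.pop(worst)                             # Step 4: remove the variable
        else:
            return model, cols                          # all remaining are significant
    return None, cols
```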
Forward Selection Steps
1. Select an entry significance level threshold (e.g., $0.05$).
2. Create a model of the dependent variable $y$ with each independent variable $x_i$ separately, and select the model with the lowest p-value.
3. Keep the selected variable(s) and create new models, adding one of the remaining variables at a time.
4. If the best new variable's p-value is below the significance threshold, keep it and return to Step 3; otherwise, select the previous model as final.
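A corresponding sketch of forward selection, under the same illustrative assumptions:

```python
import statsmodels.api as sm

def forward_selection(X, y, threshold=0.05):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        # Fit one candidate model per remaining variable
        pvals = {}
        for j in remaining:
            model = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
            pvals[j] = model.pvalues[-1]  # p-value of the newly added variable
        best = min(pvals, key=pvals.get)
        if pvals[best] < threshold:
            selected.append(best)         # keep the most significant addition
            remaining.remove(best)
        else:
            break                         # no remaining variable is significant
    return selected
```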
Stepwise Selection Steps
1. Select entry and exit significance level thresholds (e.g., $0.05$).
2. Perform one step of Forward Selection (adding only significant variables).
3. Perform all steps of Backward Elimination (retaining only significant variables).
4. Stop when no variables can be added or removed.
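A compact sketch combining the two directions; again illustrative, with the same assumptions about `X` and `y`:

```python
import numpy as np
import statsmodels.api as sm

def stepwise_selection(X, y, enter_threshold=0.05, exit_threshold=0.05):
    selected, remaining = [], list(range(X.shape[1]))
    changed = True
    while changed:
        changed = False
        # Forward step: add the most significant remaining variable, if any
        pvals = {j: sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().pvalues[-1]
                 for j in remaining}
        if pvals:
            best = min(pvals, key=pvals.get)
            if pvals[best] < enter_threshold:
                selected.append(best)
                remaining.remove(best)
                changed = True
        # Backward step: drop any selected variable that lost significance
        while selected:
            pv = sm.OLS(y, sm.add_constant(X[:, selected])).fit().pvalues[1:]
            worst = int(np.argmax(pv))
            if pv[worst] > exit_threshold:
                remaining.append(selected.pop(worst))
                changed = True
            else:
                break
    return selected
```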
Comparison of Selection Strategies
Forward Feature Selection:
Best for simplicity and computational efficiency but may miss the best model.
Backward Elimination:
Best for thoroughness, ensuring no important feature is missed initially but computationally expensive.
Stepwise Selection:
Balanced approach combining strengths of both methods, but complex and potentially intensive.
All Combinations Approach
1. Choose a goodness-of-fit criterion (e.g., $R^2$).
2. Construct all possible models ($2^N - 1$ in total for $N$ variables).
3. Select the best model under the chosen criterion.
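A brute-force sketch with itertools; it uses adjusted $R^2$ rather than plain $R^2$, since plain $R^2$ never decreases as variables are added and would always pick the full model:

```python
from itertools import combinations
import statsmodels.api as sm

def best_subset(X, y):
    n = X.shape[1]
    best_score, best_cols = float('-inf'), None
    # Evaluate all 2^N - 1 non-empty subsets of the N variables
    for k in range(1, n + 1):
        for cols in combinations(range(n), k):
            fit = sm.OLS(y, sm.add_constant(X[:, list(cols)])).fit()
            if fit.rsquared_adj > best_score:
                best_score, best_cols = fit.rsquared_adj, cols
    return best_cols, best_score
```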
Example OLS Regression Results
The significance analysis is conducted using Ordinary Least Squares (OLS):
```python
import statsmodels.api as sm

# Fit OLS on the selected columns and inspect the summary
X_sig = x[:, [0, 1, 2, 3, 4]]
obj_OLS = sm.OLS(endog=y, exog=X_sig).fit()
obj_OLS.summary()  # Outputs the OLS regression summary
```
Example output (summary metrics):
Dependent Variable: $y$
$R^2$: 0.961
Adjusted $R^2$: 0.954
F-statistic: 152.6
Probability (F-statistic): $3.54 \times 10^{-17}$
Log-Likelihood: -300.09
Observations and Degrees of Freedom…
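Continuing the snippet above, the per-variable p-values used by the selection strategies can also be read directly from the fitted model:

```python
# One p-value per column of X_sig, in order
print(obj_OLS.pvalues)
```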
Polynomial Regression
Polynomial regression fits a model of the form:
$y = w_0 + w_1 x_1 + w_2 x_1^2 + \dots + w_n x_1^n$
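A minimal sketch with scikit-learn's PolynomialFeatures; the degree and the toy data are illustrative:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Toy one-dimensional data with a roughly quadratic trend
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 4.1, 9.3, 16.2, 24.9])

# Expand x into [1, x, x^2], then fit an ordinary linear model on it
X_poly = PolynomialFeatures(degree=2).fit_transform(x)
model = LinearRegression().fit(X_poly, y)
print(model.coef_)
```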
Decision Tree Regression
Example decision tree data (e.g., outlook conditions affecting hours played):

| Outlook | Temperature | Humidity | Windy | Hours Played |
|---------|-------------|----------|-------|--------------|
| Rainy   | Hot         | High     | False | 26           |
| Rainy   | Hot         | High     | True  | 30           |
| …       | …           | …        | …     | …            |

(More data shown in the lecture.)
Tree Construction Methodology
Decision trees are constructed top-down, partitioning the data into subsets whose target values are as similar as possible (homogeneous).
Standard deviation is used to gauge homogeneity.
An example using the hours-played data: for count $= 14$ and mean $= 39.80$, the standard deviation works out to $S = 9.32$.
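Here $S$ is the standard deviation of the target values (in the population form, consistent with the numbers above):

$$S = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$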
Standard Deviation Reduction
Key insight: at each split, the tree chooses the attribute that most reduces the standard deviation of the target in the resulting subsets.
Standard deviation reductions are calculated for every candidate split as part of constructing the tree, as formalized below.
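Concretely, the standard deviation reduction (SDR) for a split on attribute $A$ is the drop in the target's standard deviation after partitioning on $A$; the attribute with the largest SDR is chosen at each node:

$$\mathrm{SDR}(T, A) = S(T) - \sum_{v \in A} \frac{|T_v|}{|T|}\, S(T_v)$$

where $T$ is the set of records at the node and $T_v$ is the subset with value $v$ for attribute $A$.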
Final Remarks and Code Example
Closing notes covered plotting decision tree predictions and assessing how well different regression models capture non-linear relationships.
Example code to create and fit a decision tree with the scikit-learn library:

```python
from sklearn.tree import DecisionTreeRegressor

# Create and fit a decision tree regressor
model = DecisionTreeRegressor()
model.fit(x, y)
```
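A hedged usage sketch for plotting the tree's piecewise-constant predictions, assuming the one-dimensional numeric `x` and fitted `model` from the snippet above (matplotlib conventions):

```python
import numpy as np
import matplotlib.pyplot as plt

# A dense grid over the feature range reveals the step-like fit of the tree
x_grid = np.arange(x.min(), x.max(), 0.01).reshape(-1, 1)
plt.scatter(x, y, color='red', label='data')
plt.plot(x_grid, model.predict(x_grid), color='blue', label='tree prediction')
plt.legend()
plt.show()
```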
Q&A Segment
Open floor for questions at the end of the lecture.