Machine Learning Regression Techniques
Lecture Overview
The lecture will be recorded for students.
Assignment Details
Assignment 1
Project Group:
Groups of 3 to 4 (preferably 3)
Finalize groups by: 10/07
Format:
Proposal
Presentation
Final Report in IEEE Format (Consider using LaTeX)
Data Requirements:
Choose a dataset with 3000+ samples
Focus on either Classification or Regression
Requirements:
Full Data Pipeline
Multiple models compared
Model Saving
Saving Models
Models can be saved using Pickle in Python:
To save the model to disk:
```python
import pickle

# Predictions from the trained model, kept for later comparison
result_original = model.predict(X_test)

# Save the trained model to disk
pickle.dump(model, open('model_pickle.sav', 'wb'))
```
To load the model from disk:
```python
# Load the model from disk and verify it predicts identically
loaded_model = pickle.load(open('model_pickle.sav', 'rb'))
result_loaded = loaded_model.predict(X_test)
print(result_original)
print(result_loaded)
```
Example output, showing that the original and loaded models produce identical predictions:
Original results:
[13.5757191, 17.09145661, 18.13667587, 14.90599816, 13.48069917]
Loaded results:
[13.5757191, 17.09145661, 18.13667587, 14.90599816, 13.48069917]
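As an alternative to pickle, scikit-learn models are often saved with joblib, which handles objects containing large NumPy arrays more efficiently. A minimal sketch, reusing the `model` and `X_test` from above:

```python
import joblib

# Save the trained model, then reload it and verify predictions
joblib.dump(model, 'model_joblib.sav')
loaded_model = joblib.load('model_joblib.sav')
print(loaded_model.predict(X_test))
```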
Advanced Linear Regression and Feature Selection
Course: SEIS 763-03, Spring 2025
Multiple Linear Regression
Multiple linear regression involves:
More than one independent variable.
Predicting a dependent variable.
Example: (Hours of Study, Attendance) → Exam Performance.
Evaluate impacts: quantifies how the dependent variable changes in response to each independent variable while holding all other variables constant.
General formula for the regression model:
$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$
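As an illustration, a minimal scikit-learn sketch of multiple linear regression for the study example above; the numbers are made-up toy values, not from the lecture:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: each row is (hours of study, attendance %); target is exam score
X = np.array([[2.0, 60.0], [5.0, 80.0], [8.0, 90.0], [10.0, 95.0]])
y = np.array([55.0, 70.0, 85.0, 92.0])

model = LinearRegression().fit(X, y)
print(model.intercept_)  # w0
print(model.coef_)       # w1, w2
```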
Employee Dataset Example
Sample data from employee.csv:

| Department  | WorkedHours | Certification | YearsExperience | Salary |
|-------------|-------------|---------------|-----------------|--------|
| Development | 2300        | 0             | 1.1             | 39343  |
| Testing     | 2100        | 1             | 1.3             | 46205  |
| Development | 2104        | 2             | 1.5             | 37731  |
| UX Designer | 1200        | 1             | 2               | 43525  |
| …           | …           | …             | …               | …      |
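A minimal sketch of loading this dataset with pandas, assuming employee.csv has exactly the columns shown above; the one-hot encoding of Department is an illustrative preprocessing choice, not prescribed by the lecture:

```python
import pandas as pd

df = pd.read_csv('employee.csv')

# One-hot encode the categorical Department column so all
# features are numeric, then separate features from the target
X = pd.get_dummies(df.drop(columns='Salary'), columns=['Department'])
y = df['Salary']
print(X.head())
```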
Statistical Significance and Independent Variables
You can either use all independent variables in the model or only the statistically significant ones.
Statistical Significance Measurement:
p-value: The probability of observing results at least as extreme as the data, assuming the null hypothesis (no effect) holds. A variable is statistically significant if:
Its p-value is less than a predetermined significance level (usually $0.05$).
The lower the p-value, the stronger the evidence against the null hypothesis.
Strategies for Including Variables
Backward Elimination
Forward Selection
Hybrid (Stepwise)
All Combinations
Backward Elimination Steps
1. Select an exit significance level threshold (e.g., $0.05$).
2. Fit the model with the remaining variables.
3. Select the variable with the highest p-value: if $p\text{-value} > \text{significance threshold}$, go to Step 4; otherwise, conclude with the current model.
4. Remove that variable.
5. Return to Step 2.
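A minimal sketch of these steps using statsmodels; the function name and column handling are illustrative assumptions, and `X` is taken to be a 2D NumPy array:

```python
import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, threshold=0.05):
    cols = list(range(X.shape[1]))
    while cols:
        model = sm.OLS(endog=y, exog=X[:, cols]).fit()  # Step 2: fit remaining variables
        worst = int(np.argmax(model.pvalues))           # Step 3: highest p-value
        if model.pvalues[worst] > threshold:
            cols.pop(worst)                             # Step 4: remove the variable
        else:
            return model, cols                          # all remaining are significant
    return None, cols
```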
Forward Selection Steps
1. Select an entry significance level threshold (e.g., $0.05$).
2. Create a model of the dependent variable $y$ with each independent variable $x_i$ separately, and select the model with the lowest p-value.
3. Keep the selected variable(s) and create new models, adding one of the remaining variables at a time.
4. If the best new variable's p-value is below the significance threshold, keep it and return to Step 3; otherwise, select the previous model as final.
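A corresponding sketch of forward selection, under the same illustrative assumptions:

```python
import statsmodels.api as sm

def forward_selection(X, y, threshold=0.05):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        # Fit one candidate model per remaining variable
        pvals = {}
        for j in remaining:
            model = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
            pvals[j] = model.pvalues[-1]  # p-value of the newly added variable
        best = min(pvals, key=pvals.get)
        if pvals[best] < threshold:
            selected.append(best)         # keep the most significant addition
            remaining.remove(best)
        else:
            break                         # no remaining variable is significant
    return selected
```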
Stepwise Selection Steps
1. Select entry and exit significance level thresholds (e.g., $0.05$).
2. Perform one step of Forward Selection (adding only significant variables).
3. Perform all steps of Backward Elimination (retaining only significant variables).
4. Stop when no variables can be added or removed.
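A compact sketch combining the two directions; again illustrative, with the same assumptions about `X` and `y`:

```python
import numpy as np
import statsmodels.api as sm

def stepwise_selection(X, y, enter_threshold=0.05, exit_threshold=0.05):
    selected, remaining = [], list(range(X.shape[1]))
    changed = True
    while changed:
        changed = False
        # Forward step: add the most significant remaining variable, if any
        pvals = {j: sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().pvalues[-1]
                 for j in remaining}
        if pvals:
            best = min(pvals, key=pvals.get)
            if pvals[best] < enter_threshold:
                selected.append(best)
                remaining.remove(best)
                changed = True
        # Backward step: drop any selected variable that lost significance
        while selected:
            pv = sm.OLS(y, sm.add_constant(X[:, selected])).fit().pvalues[1:]
            worst = int(np.argmax(pv))
            if pv[worst] > exit_threshold:
                remaining.append(selected.pop(worst))
                changed = True
            else:
                break
    return selected
```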
Comparison of Selection Strategies
Forward Feature Selection:
Best for simplicity and computational efficiency but may miss the best model.
Backward Elimination:
Best for thoroughness, ensuring no important feature is missed initially but computationally expensive.
Stepwise Selection:
Balanced approach combining strengths of both methods, but complex and potentially intensive.
All Combinations Approach
1. Choose a goodness-of-fit criterion (e.g., $R^2$).
2. Construct all possible models ($2^N - 1$ in total for $N$ variables).
3. Select the best model under the chosen criterion.
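A brute-force sketch with itertools; it uses adjusted $R^2$ rather than plain $R^2$, since plain $R^2$ never decreases as variables are added and would always pick the full model:

```python
from itertools import combinations
import statsmodels.api as sm

def best_subset(X, y):
    n = X.shape[1]
    best_score, best_cols = float('-inf'), None
    # Evaluate all 2^N - 1 non-empty subsets of the N variables
    for k in range(1, n + 1):
        for cols in combinations(range(n), k):
            fit = sm.OLS(y, sm.add_constant(X[:, list(cols)])).fit()
            if fit.rsquared_adj > best_score:
                best_score, best_cols = fit.rsquared_adj, cols
    return best_cols, best_score
```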
Example OLS Regression Results
The significance analysis is conducted using Ordinary Least Squares (OLS):
```python
import statsmodels.api as sm

# Fit OLS on the selected columns and inspect the summary
X_sig = x[:, [0, 1, 2, 3, 4]]
obj_OLS = sm.OLS(endog=y, exog=X_sig).fit()
obj_OLS.summary()  # Outputs the OLS regression summary
```
Example output (summary metrics):
Dependent Variable: $y$
$R^2$: 0.961
Adjusted $R^2$: 0.954
F-statistic: 152.6
Probability (F-statistic): $3.54 \times 10^{-17}$
Log-Likelihood: -300.09
Observations and Degrees of Freedom…
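Continuing the snippet above, the per-variable p-values used by the selection strategies can also be read directly from the fitted model:

```python
# One p-value per column of X_sig, in order
print(obj_OLS.pvalues)
```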
Polynomial Regression
Polynomial regression fits a model of the form:
$y = w_0 + w_1 x_1 + w_2 x_1^2 + \dots + w_n x_1^n$
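A minimal sketch with scikit-learn's PolynomialFeatures; the degree and the toy data are illustrative:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Toy one-dimensional data with a roughly quadratic trend
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 4.1, 9.3, 16.2, 24.9])

# Expand x into [1, x, x^2], then fit an ordinary linear model on it
X_poly = PolynomialFeatures(degree=2).fit_transform(x)
model = LinearRegression().fit(X_poly, y)
print(model.coef_)
```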
Decision Tree Regression
Example decision tree data (e.g., outlook conditions affecting hours played):

| Outlook | Temperature | Humidity | Windy | Hours Played |
|---------|-------------|----------|-------|--------------|
| Rainy   | Hot         | High     | False | 26           |
| Rainy   | Hot         | High     | True  | 30           |
| …       | …           | …        | …     | …            |

(More data shown in the lecture.)
Tree Construction Methodology
Decision trees are constructed top-down, partitioning the data into subsets whose target values are as similar as possible (homogeneous).
Standard deviation is used to gauge homogeneity.
An example using the hours-played data: for count $= 14$ and mean $= 39.80$, the standard deviation works out to $S = 9.32$.
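Here $S$ is the standard deviation of the target values (in the population form, consistent with the numbers above):

$$S = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$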
Standard Deviation Reduction
Key insight: at each split, the tree chooses the attribute that most reduces the standard deviation of the target in the resulting subsets.
Standard deviation reductions are calculated for every candidate split as part of constructing the tree, as formalized below.
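Concretely, the standard deviation reduction (SDR) for a split on attribute $A$ is the drop in the target's standard deviation after partitioning on $A$; the attribute with the largest SDR is chosen at each node:

$$\mathrm{SDR}(T, A) = S(T) - \sum_{v \in A} \frac{|T_v|}{|T|}\, S(T_v)$$

where $T$ is the set of records at the node and $T_v$ is the subset with value $v$ for attribute $A$.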
Final Remarks and Code Example
Closing notes covered plotting decision tree predictions and assessing how well different regression models capture non-linear relationships.
Example code to create and fit a decision tree with the scikit-learn library:

```python
from sklearn.tree import DecisionTreeRegressor

# Create and fit a decision tree regressor
model = DecisionTreeRegressor()
model.fit(x, y)
```
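A hedged usage sketch for plotting the tree's piecewise-constant predictions, assuming the one-dimensional numeric `x` and fitted `model` from the snippet above (matplotlib conventions):

```python
import numpy as np
import matplotlib.pyplot as plt

# A dense grid over the feature range reveals the step-like fit of the tree
x_grid = np.arange(x.min(), x.max(), 0.01).reshape(-1, 1)
plt.scatter(x, y, color='red', label='data')
plt.plot(x_grid, model.predict(x_grid), color='blue', label='tree prediction')
plt.legend()
plt.show()
```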
Q&A Segment
Open floor for questions at the end of the lecture.