
Machine Learning Regression Techniques

Lecture Overview

  • The lecture will be recorded for students.

Assignment Details

  • Assignment 1

  • Project Group:

    • Groups of 3 to 4 (preferably 3)

    • Finalize groups by: 10/07

    • Format:

      • Proposal

      • Presentation

      • Final Report in IEEE format (consider using LaTeX)

    • Data Requirements:

      • Choose a dataset with 3000+ samples

      • Focus on either classification or regression

    • Requirements:

      • Full data pipeline

      • Multiple models compared

Model Saving


  • Models can be saved using Pickle in Python:

    • To save the model to disk:

        import pickle

        result_original = model.predict(X_test)   # predictions before saving, for comparison
        pickle.dump(model, open('model_pickle.sav', 'wb'))

    • To load the model from disk:

        loaded_model = pickle.load(open('model_pickle.sav', 'rb'))
        result_loaded = loaded_model.predict(X_test)
        print(result_original)
        print(result_loaded)

  • Example output shown (results):

    • Original results: [13.5757191, 17.09145661, 18.13667587, 14.90599816, 13.48069917]

    • Loaded results: [13.5757191, 17.09145661, 18.13667587, 14.90599816, 13.48069917]

Advanced Linear Regression and Feature Selection

  • Course: SEIS 763-03, Spring 2025

Multiple Linear Regression

  • Multiple linear regression involves:

    • More than one independent variable.

    • Predicting a dependent variable.

    • Example: (Hours of Study, Attendance) → Exam Performance.

    • Evaluate impacts: quantifies how the dependent variable changes with respect to a change in one independent variable while holding all other variables constant.

  • General formula for the regression model:
    $y = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n$
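
  • A minimal sketch of fitting such a model with scikit-learn (the data values here are illustrative, not from the lecture):

      import numpy as np
      from sklearn.linear_model import LinearRegression

      # Illustrative features: (hours of study, attendance rate) -> exam score
      X = np.array([[2.0, 0.6], [5.0, 0.8], [8.0, 0.9], [10.0, 1.0]])
      y = np.array([55.0, 70.0, 85.0, 92.0])

      model = LinearRegression().fit(X, y)
      print(model.intercept_)   # w0
      print(model.coef_)        # [w1, w2]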

Employee Dataset Example


  • Sample data from employee.csv:

    Department    WorkedHours  Certification  YearsExperience  Salary
    Development   2300         0              1.1              39343
    Testing       2100         1              1.3              46205
    Development   2104         2              1.5              37731
    UX Designer   1200         1              2                43525
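
  • Department is a categorical column, so it needs to be encoded before it can enter a regression. A minimal sketch using pandas one-hot encoding (column names taken from the table above):

      import pandas as pd

      df = pd.read_csv('employee.csv')
      # One-hot encode Department; drop_first avoids the dummy-variable trap
      df_encoded = pd.get_dummies(df, columns=['Department'], drop_first=True)
      X = df_encoded.drop(columns=['Salary'])
      y = df_encoded['Salary']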


Statistical Significance and Independent Variables

  • The decision can be made to use all independent variables or only the statistically significant ones.

  • Statistical significance measurement:

    • p-value: the probability of observing the data purely by chance under the null hypothesis.

    • A variable is statistically significant if its p-value is less than a predetermined significance level (usually $0.05$).

    • The lower the p-value, the stronger the evidence against the null hypothesis.
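
  • Fitted statsmodels results expose per-variable p-values directly; a minimal sketch with synthetic data (only the first predictor actually matters here):

      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(0)
      X = rng.normal(size=(100, 3))               # three candidate predictors
      y = 2.0 * X[:, 0] + rng.normal(size=100)    # y depends only on the first

      results = sm.OLS(y, sm.add_constant(X)).fit()
      print(results.pvalues)           # one p-value per coefficient
      print(results.pvalues < 0.05)    # significance at the 0.05 level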

Strategies for Including Variables

  1. Backward Elimination

  2. Forward Selection

  3. Hybrid (Stepwise)

  4. All Combinations

Backward Elimination Steps

  1. Select an exit significance level threshold (e.g., $0.05$).

  2. Fit the model with the remaining variables.

  3. Select the variable with the highest p-value:

     • If its p-value exceeds the significance threshold, go to Step 4.

     • Else, conclude the model.

  4. Remove the variable.

  5. Return to Step 2.
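
  • A minimal sketch of this loop with statsmodels (X is assumed to be a numeric feature matrix whose first column is the intercept, as in the OLS example later in these notes):

      import statsmodels.api as sm

      def backward_elimination(X, y, threshold=0.05):
          cols = list(range(X.shape[1]))
          while cols:
              results = sm.OLS(y, X[:, cols]).fit()    # Step 2: fit remaining variables
              worst = results.pvalues.argmax()         # Step 3: highest p-value
              if results.pvalues[worst] > threshold:
                  cols.pop(worst)                      # Step 4: remove it and repeat
              else:
                  break                                # all remaining are significant
          return cols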

Forward Selection Steps

  1. Select an entry significance level threshold (e.g., $0.05$).

  2. Create models of the dependent variable $y$ with each independent variable $x_i$ and select the model whose variable has the lowest p-value.

  3. Keep the selected variable, then continue creating models, adding one new candidate variable each time.

  4. If the best new variable's p-value is less than the significance threshold, return to Step 3. Else, select the previous model as final.
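
  • A minimal sketch of forward selection under the same assumptions:

      import statsmodels.api as sm

      def forward_selection(X, y, threshold=0.05):
          selected, remaining = [], list(range(X.shape[1]))
          while remaining:
              # Fit one candidate model per remaining variable (Steps 2-3)
              pvals = {j: sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().pvalues[-1]
                       for j in remaining}
              best = min(pvals, key=pvals.get)
              if pvals[best] < threshold:     # Step 4: keep it and continue
                  selected.append(best)
                  remaining.remove(best)
              else:                           # no significant candidate: stop
                  break
          return selected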

Stepwise Selection Steps

  1. Select entry and exit significance level thresholds (e.g., $0.05$).

  2. Perform one step of Forward Selection (only add significant variables).

  3. Perform all steps of Backward Elimination (only retaining significant variables).

  4. Stop when no variables can be added or removed.
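
  • A self-contained sketch combining the two moves (same assumptions as the previous sketches; a simplified interpretation of the steps above):

      import statsmodels.api as sm

      def stepwise_selection(X, y, enter=0.05, leave=0.05):
          selected, remaining = [], list(range(X.shape[1]))
          changed = True
          while changed:
              changed = False
              # Forward step: add the most significant candidate, if any
              pvals = {j: sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().pvalues[-1]
                       for j in remaining}
              if pvals:
                  best = min(pvals, key=pvals.get)
                  if pvals[best] < enter:
                      selected.append(best)
                      remaining.remove(best)
                      changed = True
              # Backward step: drop selected variables that lost significance
              while len(selected) > 1:
                  results = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
                  worst = results.pvalues[1:].argmax()      # skip the intercept
                  if results.pvalues[1:][worst] > leave:
                      remaining.append(selected.pop(worst))
                      changed = True
                  else:
                      break
          return selected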

Comparison of Selection Strategies

  • Forward Feature Selection:

    • Best for simplicity and computational efficiency, but may miss the best model.

  • Backward Elimination:

    • Best for thoroughness, ensuring no important feature is missed initially, but computationally expensive.

  • Stepwise Selection:

    • A balanced approach combining the strengths of both methods, but more complex and potentially intensive.

All Combinations Approach

  1. Choose a goodness-of-fit criterion (e.g., $R^2$).

  2. Construct all possible models (totaling $2^N - 1$ models for $N$ variables).

  3. Select the best model.
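
  • A minimal exhaustive-search sketch with itertools; adjusted $R^2$ is used here as the criterion (a common substitute for plain $R^2$, since it penalizes larger models):

      from itertools import combinations
      import statsmodels.api as sm

      def best_subset(X, y):
          n = X.shape[1]
          best_score, best_vars = -float('inf'), None
          # Enumerate all 2^N - 1 non-empty subsets of the N variables
          for k in range(1, n + 1):
              for subset in combinations(range(n), k):
                  results = sm.OLS(y, sm.add_constant(X[:, list(subset)])).fit()
                  if results.rsquared_adj > best_score:
                      best_score, best_vars = results.rsquared_adj, subset
          return best_vars, best_score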

Example OLS Regression Results

  • Conducted using Ordinary Least Squares (OLS) from statsmodels:

      import statsmodels.api as sm

      X_sig = x[:, [0, 1, 2, 3, 4]]                # keep the candidate columns
      obj_OLS = sm.OLS(endog=y, exog=X_sig).fit()
      obj_OLS.summary()                            # outputs the OLS regression summary
    
  • Example output (summary metrics):

    • Dependent Variable: $y$

    • $R^2$: 0.961

    • Adjusted $R^2$: 0.954

    • F-statistic: 152.6

    • Probability (F-statistic): $3.54 \times 10^{-17}$

    • Log-Likelihood: -300.09

    • Observations and degrees of freedom…

Polynomial Regression

  • Polynomial regression fits a model that is polynomial in a single input variable:
    $y = w_0 + w_1x_1 + w_2x_1^2 + \dots + w_nx_1^n$
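
  • A minimal sketch with scikit-learn, which expands the polynomial terms and then fits an ordinary linear regression on them (degree 3 is an arbitrary choice; the data are illustrative):

      import numpy as np
      from sklearn.linear_model import LinearRegression
      from sklearn.preprocessing import PolynomialFeatures

      x = np.linspace(0, 5, 50).reshape(-1, 1)
      y = 1 + 2 * x.ravel() - 0.5 * x.ravel() ** 2      # curved, noiseless data

      X_poly = PolynomialFeatures(degree=3).fit_transform(x)   # columns: 1, x, x^2, x^3
      model = LinearRegression(fit_intercept=False).fit(X_poly, y)
      print(model.coef_)    # recovers approximately [1, 2, -0.5, 0]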

Decision Tree Regression

  • Example decision tree data (outlook conditions affecting hours played):

    Outlook  Temperature  Humidity  Windy  Hours Played
    Rainy    Hot          High      False  26
    Rainy    Hot          High      True   30
    … (more data shown)

Tree Construction Methodology

  • Decision trees are constructed top-down, partitioning the data into subsets containing similar (homogeneous) values.

  • Standard deviation is used to gauge the homogeneity of a subset.

  • An example was given using the hours-played data to compute the standard deviation:

    • For count = 14 and mean = 39.80, the computed standard deviation is $S = 9.32$.

Standard Deviation Reduction

  • Key insight: the goal in constructing the tree is to choose splits that most reduce the standard deviation of the target within the resulting subsets.

  • Example standard deviation reductions were calculated as part of the split-selection process; a sketch of the computation follows.
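
  • A minimal sketch of the standard deviation reduction (SDR) for one candidate split; the numbers are illustrative, not the lecture's table:

      import numpy as np

      def sdr(y, groups):
          """SDR = S(parent) - size-weighted sum of S(child) over the split."""
          y = np.asarray(y, dtype=float)
          weighted = sum(len(g) / len(y) * np.std(g) for g in groups)
          return np.std(y) - weighted

      # Hours played, split by a hypothetical Windy attribute
      windy = [30, 25, 43, 48]
      calm = [26, 36, 38, 46, 48, 52]
      print(sdr(windy + calm, [windy, calm]))   # larger SDR -> better split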

Final Remarks and Code Example

  • Closing notes included plotting results from decision trees and assessing various models for non-linear relationships using regression analysis.

  • Example code provided to create and fit a decision tree using the scikit-learn library:

      from sklearn.tree import DecisionTreeRegressor

      model = DecisionTreeRegressor()   # default settings grow a full, unpruned tree
      model.fit(x, y)                   # x: feature matrix, y: target values
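
  • A quick usage check (same hypothetical x and y as above); each prediction is the mean target of the leaf the sample lands in:

      y_pred = model.predict(x)   # piecewise-constant predictions
      print(y_pred[:5])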
      

Q&A Segment

  • Open floor for questions at the end of the lecture.