Machine Learning Concepts and Data Preparation

Data Representation

  • Structured vs Unstructured Data (Page 3)

    • Structured Data: Data organized in a tabular format (rows and columns) with a predefined schema. Examples include databases and spreadsheets (like loans.csv). It is easy to store, access, and process for machine learning models.

    • Unstructured Data: Data that does not have a predefined format or organization. Examples include text documents, images, audio files, and videos. This often requires pre-processing to extract structured features before it can be used in many traditional ML models.

  • Numeric representation of data

    • Feature Matrix (X): This is a matrix of size n × d, where n is the number of samples (observations/rows) in the dataset and d is the number of features (independent variables/columns). Each row corresponds to a single data point, and each column represents a specific feature. In supervised learning, the model learns patterns from X to predict the target variable.

    • Label Vector (y): This is a vector of size n × 1 (or just (n,) in numpy shape notation) containing the dependent variable (target/label). Each element is the label for the corresponding sample in the feature matrix X. For classification, y contains categories; for regression, it contains continuous numeric values. A short numpy sketch at the end of this section illustrates these shapes.

  • Independent vs Dependent variables

    • Independent Variable(s) (X): These are the input features or attributes that are used to predict or explain the changes in the dependent variable. In the context of loans.csv, City, Age, and Salary are independent variables.

    • Dependent Variable (y): This is the output or target variable that we aim to predict or explain. Its value depends on the independent variables. In loans.csv, Approved is the dependent variable.

  • Python indexing starts from 0 (Page 6 reminder)

    • This is a fundamental concept in Python and many other programming languages: the first element in a sequence (such as a list, array, or DataFrame column) is accessed with index 0. The sketch below shows this alongside the X/y shapes.

  • Data representation terms in slides

    • Feature Matrix (X): As described above, the input data with n samples and d features.

    • Label Vector (y): As described above, the target variable with n samples and 1 target per sample.

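The definitions above can be made concrete with a few lines of numpy. The values below are illustrative stand-ins, not the slides' data:

```python
import numpy as np

# Feature matrix X: n = 4 samples (rows), d = 3 features (columns).
# The numbers are made up for illustration.
X = np.array([
    [25, 50000, 1],
    [32, 64000, 0],
    [47, 81000, 1],
    [51, 58000, 0],
])

# Label vector y: one label per sample (0/1 for a binary target).
y = np.array([0, 1, 1, 0])

print(X.shape)  # (4, 3) -> n x d
print(y.shape)  # (4,)   -> n labels

# Zero-based indexing: the first sample is X[0], not X[1].
print(X[0])     # first row of the feature matrix
print(y[0])     # label of the first sample
```
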
Data Preparation

  • Typical data preparation steps (Page 9, 12): These steps are crucial for transforming raw data into a format suitable for machine learning models, ensuring data quality and improving model performance. A combined, runnable sketch of the steps follows at the end of this section.

    • Getting the necessary Python libraries: Importing libraries like pandas for data manipulation, numpy for numerical operations, and sklearn for machine learning tools (e.g., SimpleImputer, StandardScaler, train_test_split).

    • Loading the dataset: Reading your raw data (e.g., from a CSV file into a pandas DataFrame) to make it accessible for processing.

    • Extracting the independent (X) and dependent (y) variables: Separating the input features (X) from the target variable (y), which is a prerequisite for supervised learning models.

    • Dealing with categorical features with One Hot Encoding (OHE): Converting categorical values (text labels or numeric codes) into a numerical format that machine learning algorithms can consume, without imposing an artificial order. OHE creates a new binary feature for each category.

    • Splitting the data into Training and Testing sets: Dividing the dataset into two distinct subsets: a training set (used to train the model) and a testing set (used to evaluate the model's performance on unseen data). This helps assess the model's generalization ability and detect overfitting.

    • Dealing with Missing values: Identifying and handling null or missing entries in the dataset. This is essential because many ML algorithms cannot process missing data, and unaddressed gaps can bias results or produce errors.

    • Feature Scaling: Adjusting the scale of numerical features so they contribute equally to the model's learning process. This is particularly important for algorithms sensitive to feature magnitudes, like gradient descent-based methods or distance-based algorithms.

    • Note: The order of these steps can vary. For example, missing value imputation might occur before or after splitting, but typically imputation parameters are fit on training data only to prevent data leakage.

  • Dataset example used in slides: loans.csv

    • Columns: City, Age, Salary, Approved.

    • Sample rows include missing values and a binary target 'Approved' (Yes/No): This dataset serves as a practical example of common data preparation challenges, such as handling NaN values, encoding the categorical feature (City), and preparing a binary target variable (Approved).

    • Purpose: Illustrate the necessity of handling categorical data, missing values, and encoding target variables for a typical classification problem.

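To tie the steps above together, here is a minimal end-to-end sketch. The column names mirror the loans.csv example; the in-memory rows, the 'mean' imputation strategy, the 80/20 split, and the use of pd.get_dummies for OHE are illustrative assumptions, not the slides' exact code:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Steps 1-2: libraries imported above; here an in-memory stand-in
# for pd.read_csv("loans.csv") with made-up rows.
df = pd.DataFrame({
    "City":     ["Cairo", "Giza", "Cairo", "Luxor", "Giza"],
    "Age":      [25, np.nan, 47, 51, 35],
    "Salary":   [50000, 64000, np.nan, 58000, 61000],
    "Approved": ["Yes", "No", "Yes", "No", "Yes"],
})

# Step 3: extract the independent (X) and dependent (y) variables.
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Step 4: One-Hot Encode the categorical 'City' column.
X = pd.get_dummies(X, columns=["City"])

# Step 5: split into training and testing sets before fitting
# any transformers, so nothing leaks from test to train.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Step 6: impute missing numeric values; fit on the training set only.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_train[["Age", "Salary"]] = imputer.fit_transform(X_train[["Age", "Salary"]])
X_test[["Age", "Salary"]] = imputer.transform(X_test[["Age", "Salary"]])

# Step 7: scale numeric features with statistics learned from training data.
scaler = StandardScaler()
X_train[["Age", "Salary"]] = scaler.fit_transform(X_train[["Age", "Salary"]])
X_test[["Age", "Salary"]] = scaler.transform(X_test[["Age", "Salary"]])
```
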
Independent and Dependent Variables (Example)

  • Typical setup from slides (Page 13): For the loans.csv dataset, the goal is to predict loan approval based on individual characteristics.

    • Independent/Features: City, Age, Salary. These are the attributes used as inputs to the model.

    • Dependent/Labels: Approved. This is the binary outcome (Yes/No) that the model will try to predict.

  • Code sketch from Page 14 (a runnable version appears after this section) shows:

    • X = df.iloc[:,:-1] (all columns except the last one as features): This pandas command selects all rows (:) and all columns from the beginning up to, but not including, the last column (:-1) to form the feature matrix X.

    • y = df.iloc[:, -1] (last column as label): This selects all rows (:) and only the very last column (-1) to form the label vector y.

  • Shapes observed in example:

    • Shape of Independent X matrix: (10, 3). This signifies that there are 10 samples (rows) and 3 features (columns) in our loans.csv example before specific preprocessing like One-Hot Encoding.

    • Shape of Dependent y vector: (10,) (or (10, 1)). This indicates there are 10 corresponding labels for the 10 samples.

  • Example of y values (Yes/No) and potential NaNs in features (Pages 14–15): The raw loans.csv data contains categorical values like 'Yes'/'No' in Approved and potentially NaN (Not a Number) values in numerical columns like Age or Salary, or even in City, which all require careful preprocessing.

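A runnable version of the Page 14 sketch, assuming loans.csv (with columns City, Age, Salary, Approved) sits in the working directory:

```python
import pandas as pd

df = pd.read_csv("loans.csv")  # the slides' 10-row example file

X = df.iloc[:, :-1]  # all rows, every column except the last -> features
y = df.iloc[:, -1]   # all rows, last column only -> labels

print(X.shape)  # (10, 3) with the slides' data
print(y.shape)  # (10,)
```
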
Handling Missing Values

  • Typical approach: impute missing values using SimpleImputer: Missing values can lead to errors or biased models. Imputation replaces these missing entries with estimated values, making the dataset complete and usable. SimpleImputer is a common scikit-learn tool for this task.

  • Example workflow (Page 50):

    • `imputer = SimpleImputer(missing_values=np.nan, strategy='mean')` creates the imputer; 'mean' (the scikit-learn default) replaces each NaN with its column's mean, and 'median' or 'most_frequent' are common alternatives. The imputer is then fit on the relevant columns and used to transform them, as in the sketch below.
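
A minimal sketch of the full imputation workflow, assuming the NaNs live in the numeric Age and Salary columns (an assumption for illustration; the slides may target different columns):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("loans.csv")

# Learn each column's mean from the data, then replace NaNs with it.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
df[["Age", "Salary"]] = imputer.fit_transform(df[["Age", "Salary"]])

# With a train/test split, fit the imputer on the training portion only
# and call transform() on both portions to avoid data leakage.
```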