20250415 - part 2

TOPICS (Cues and Questions)

  • EDA

    • Imbalanced Datasets

      • Random Oversampling

        • Increase the minority class to be equal to the majority class

      • Random Undersampling

        • Resample all classes but the minority

        • Make the majority class equal to the minority class

          • When binary data values are outnumbered?

      • ? imbalanced-learn library? (What’s the purpose of this?) → RUS ROS?

      • ? What's the importance of having a balanced dataset vs. an imbalanced one?

    • pd.to_numeric() → Force a column to a numeric data type, coercing values that cause issues to NaN

    • ? What does this syntax do? print(so_survey_df['RawSalary'][idx])?

    • Method Chaining

    • ? What is Seaborn? import seaborn as sns → matplotlib wrapper? → What's a wrapper?

    • ? What is a parquet file?

    • .astype('timedelta64[s]')

    • ? Correlation Matrix → Provides a comprehensive overview of the pairwise correlations between variables in a dataset, allowing practitioners to quantify and visualize the strength and direction of the associations.

    • ? What is this code doing? [c for c in df.columns if "Time" in c]

    • Summarizing data with queries

    • Labels vs columns vs indexes?

    • Series name=?

    • reviews = pd.read_csv("../input/wine-reviews/winemag-data_first150k.csv", index_col=0) → what does index_col=0 do?

    • Multicollinearity → two or more independent variables in a model are highly correlated.

    • Pairplots (AKA scatterplot matrices) → visualizations of pairwise relationships between multiple variables simultaneously.

  • SQL

    • What is a query?

    • COUNT()

    • SQL is not processed in the order it’s written

      • FROM

      • SELECT

      • LIMIT

Code

# Random Over Sampling
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(sampling_strategy=1) # Float: desired minority/majority ratio after resampling
ros = RandomOverSampler(sampling_strategy="not majority") # String: resample all classes but the majority
X_res, y_res = ros.fit_resample(X, y)
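
A matching undersampling sketch (assuming the same imbalanced-learn API, X, and y as above):

# Random Under Sampling
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(sampling_strategy=1) # Float: desired minority/majority ratio after resampling
rus = RandomUnderSampler(sampling_strategy="not minority") # String: resample all classes but the minority
X_res, y_res = rus.fit_resample(X, y)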

# FILLING MISSING VALUES
mean_value = df['ConvertedSalary'].mean()
df['ConvertedSalary'] = df['ConvertedSalary'].fillna(mean_value).astype(int)

# METHOD CHAINING
df['column'] = df['column'].method1().method2().method3()

# FEATURE ENGINEER A TAXI DURATION (MINS) FROM PICKUP AND DROPOFF TIMES
# (.dt.total_seconds() converts the datetime difference to seconds; .astype('timedelta64[s]') did this in older pandas)
df["duration"] = (df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]).dt.total_seconds() / 60

# A DATAFRAME WITH INDEXES
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])

SQL

SELECT COUNT(names) AS count_names, COUNT(birthdate) AS count_birthdate
FROM people;

SELECT COUNT(DISTINCT birthdate) AS count_distinct_birthdate
FROM people;

pd.isna()

Detects missing values (NaN, None, or NaT) in a DataFrame or Series. Returns a DataFrame or Series of boolean values where True indicates a missing value.

df.astype()

Casts a DataFrame or Series to a specified data type, e.g. df['ConvertedSalary'].astype(int).

df.sample()

Pulls a random subset of rows from the dataset. You can provide either a number of rows (n) or a fraction (frac).

df.loc[]

Filters the DataFrame by rows and columns using labels: loc[] selects by the names of the index (rows) and columns.

df.iloc[]

Filters the DataFrame by rows and columns using positions: iloc[] selects by integer index locations.
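
A small sketch tying these pandas helpers together (the toy DataFrame below is made up for illustration):

import pandas as pd
import numpy as np

df_demo = pd.DataFrame({'age': [25, 32, np.nan], 'city': ['NYC', 'LA', 'CHI']},
                       index=['a', 'b', 'c'])

print(df_demo['age'].isna())            # pd.isna()/.isna(): True where the value is missing
print(df_demo['age'].astype('float'))   # .astype(): cast a column to a given dtype
print(df_demo.sample(n=2))              # .sample(): random subset of rows (n or frac)
print(df_demo.loc['a', 'city'])         # .loc[]: select by row/column names -> 'NYC'
print(df_demo.iloc[0, 1])               # .iloc[]: select by integer positions -> 'NYC'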


Notes

  • Predictive Models: When working with train and test datasets, calculate the mean or median on the train set and apply it to both datasets to avoid data leakage.

  • One approach to finding these values is to force the column to the desired data type using pd.to_numeric(), coercing any values that cause issues to NaN, then filtering the DataFrame to just the rows containing the NaN values (see the sketch after these notes).

  • Exploratory data analysis serves as the initial step in any data-driven investigation, offering a comprehensive understanding of the dataset's structure, distributions, and relationships between variables.

  • Feature engineering is a critical process in machine learning that involves selecting, transforming, and creating features from raw data to improve the performance and accuracy of predictive models.

  • Correlation Matrix:

    1. Relationship Assessment: The correlation matrix provides a quantitative measure of the relationship between variables. By calculating correlation coefficients, typically using methods like Pearson's correlation coefficient, analysts can determine if variables are positively correlated (increase together), negatively correlated (one increases as the other decreases), or uncorrelated (no systematic relationship). These measures offer insights into the direction and strength of the relationships, helping to identify important variables that may influence the target variable.

    2. Feature Selection: Correlation matrices are invaluable in feature selection, which involves identifying the most relevant variables for building predictive models. By examining the correlations between the target variable and other features, analysts can identify highly correlated variables that may be strong predictors. This knowledge enables informed decisions regarding which variables to include in the model, potentially reducing dimensionality, enhancing model efficiency, and preventing issues such as multicollinearity.

    3. Multicollinearity Detection: Multicollinearity occurs when two or more independent variables in a model are highly correlated. This can lead to problems such as instability in coefficient estimates, difficulty in interpreting feature importance, and reduced model robustness. By examining the correlation matrix, analysts can identify highly correlated variables and make informed decisions about which ones to include or exclude to mitigate multicollinearity. Removing redundant variables improves model interpretability and generalization.

  • The syntax for declaring a new DataFrame is a dictionary whose keys are the column names (Bob and Sue in this example) and whose values are lists of entries.
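
pd.to_numeric() sketch (a minimal illustration of the note above; the column name RawSalary comes from the cue, the dirty values are invented):

import pandas as pd

so_survey_df = pd.DataFrame({'RawSalary': ['50,000', '42000', 'n/a']})

# Coerce problem values to NaN, then look at only the rows that failed to convert
numeric_vals = pd.to_numeric(so_survey_df['RawSalary'], errors='coerce')
idx = numeric_vals.isna()
print(so_survey_df['RawSalary'][idx])  # the cue's syntax: filter the column by the boolean mask idx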

Correlation Matrix and Heatmap

import seaborn as sns
import matplotlib.pyplot as plt

df2 = df[["trip_distance", "fare_amount", "tip_amount", "total_amount", "duration"]]
correlation_matrix = df2.corr()

# Create a heatmap of the correlation matrix
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.show()

  • Variable Relationships: Pairplots enable analysts to explore the relationships between variables, revealing patterns such as linear or nonlinear correlations, clusters, or other associations. These visual cues guide further analysis, feature selection, or modeling decisions. (A minimal pairplot example follows this list.)

  • Multivariate Analysis: Pairplots help identify multivariate dependencies and interactions, highlighting how different variables jointly influence one another. This is particularly valuable in identifying potential confounding factors or discovering hidden interactions that may not be apparent when considering variables in isolation.

  • Outlier Detection: Pairplots can reveal potential outliers by showing data points that deviate significantly from the general pattern observed between variables. Outliers can indicate data anomalies or influential observations that may impact model performance.

  • Feature Importance: Pairplots provide an intuitive representation of the relative importance of different features. Variables exhibiting strong correlations or clear patterns may be more relevant for predictive modeling or feature selection.

  • Data Quality: Pairplots can help identify data quality issues, such as data entry errors or measurement inconsistencies. Patterns that do not align with expectations or exhibit unusual trends may signal data problems that require further investigation or preprocessing.
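
A minimal pairplot sketch (assuming df2 is the numeric subset built for the correlation matrix above):

import seaborn as sns
import matplotlib.pyplot as plt

# Scatterplot matrix of every pairwise combination, with distributions on the diagonal
sns.pairplot(df2)
plt.show()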

A count plot is a type of categorical plot that displays the number of occurrences of each category in a dataset. It is particularly useful for visualizing the distribution and frequency of categorical variables. Here are some key uses and benefits of count plots (a minimal example follows the list):

  • Categorical Variable Exploration: Count plots provide a quick and concise summary of the distribution of categorical variables. They allow analysts to understand the frequency or count of each category, revealing the proportions and imbalances within the dataset. This information is crucial for gaining insights into the composition and characteristics of categorical variables.

  • Class Imbalance Assessment: In classification tasks, count plots help assess the balance or imbalance of different target classes. They provide a visual representation of the distribution of classes, highlighting any significant discrepancies in the sample sizes across categories. Identifying imbalanced classes is important in machine learning as it can affect model performance and bias the predictions towards the majority class.

  • Data Quality Inspection: Count plots can be utilized to detect data quality issues in categorical variables. They allow analysts to identify unexpected or erroneous categories that may indicate data entry errors, missing data, or inconsistencies in the dataset. By observing the counts for each category, anomalies or discrepancies can be easily spotted, enabling data cleaning or further investigation if necessary.

  • Feature Importance Evaluation: Count plots can provide insights into the importance or relevance of different categorical features in relation to the target variable. By visualizing the distribution of categories within each class or target level, analysts can determine which categories are more prevalent or have higher frequencies for specific outcomes. This understanding helps in assessing the discriminatory power of categorical features and their potential impact on predictive models.
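
A minimal count plot sketch (the device_type column is a made-up example):

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

df_cat = pd.DataFrame({'device_type': ['mobile', 'desktop', 'mobile', 'tablet', 'mobile']})

# One bar per category; bar height = number of occurrences
sns.countplot(x='device_type', data=df_cat)
plt.show()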

A box plot, also known as a box-and-whisker plot, is a powerful visualization tool for displaying the distribution, variability, and outliers within a numerical dataset. It provides a concise summary of key statistical measures and offers several important uses (a minimal example follows the list):

  • Data Distribution and Skewness: Box plots offer a visual representation of the distribution of numerical data, providing insights into its central tendency, spread, and skewness. The box represents the interquartile range (IQR), which contains the middle 50% of the data, with the median indicated by a horizontal line within the box. By observing the length and symmetry of the box, analysts can assess whether the data is skewed or symmetrically distributed.

  • Outlier Detection: Box plots are highly effective in identifying outliers, which are data points that deviate significantly from the rest of the distribution. The whiskers of the plot extend to the minimum and maximum non-outlier values, with any data points beyond the whiskers considered as potential outliers. Outliers can indicate data errors, anomalies, or important observations requiring further investigation.

  • Comparing Groups or Categories: Box plots are useful for comparing the distributions of numerical data across different groups or categories. By creating side-by-side or grouped box plots, analysts can easily compare the central tendencies, spreads, and shapes of distributions between different groups. This allows for the identification of differences, similarities, or patterns within the data.

  • Variability and Spread: Box plots provide insights into the variability and spread of the data. The length of the box indicates the spread of the middle 50% of the data, while the whiskers show the range of non-outlier values. By comparing the lengths of the boxes and whiskers, analysts can assess the relative variability between different groups or categories, aiding in the understanding of the data's dispersion.

  • Skewness and Symmetry: Box plots offer a visual assessment of the skewness or symmetry of the data distribution. A symmetrical distribution is represented by a box plot with an equal length on both sides of the median, while a skewed distribution is indicated by a longer box on one side. This visual cue helps in understanding the shape and characteristics of the data, assisting in further analysis and modeling decisions.

  • Data Range and Quartiles: Box plots display the quartiles of the data distribution. The lower quartile (Q1) represents the 25th percentile, the upper quartile (Q3) represents the 75th percentile, and the interquartile range (IQR) is the range between Q1 and Q3. These quartiles provide a summary of the range and spread of the central portion of the data, aiding in the understanding of the data's variability and dispersion.
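
A minimal box plot sketch (reusing the taxi df assumed earlier; payment_type is a hypothetical grouping column):

import seaborn as sns
import matplotlib.pyplot as plt

# Box = IQR, line = median, points beyond the whiskers = potential outliers
sns.boxplot(x=df['fare_amount'])
plt.show()

# Side-by-side boxes to compare the distribution across groups
sns.boxplot(x='payment_type', y='fare_amount', data=df)
plt.show()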

A histogram is a graphical representation that displays the distribution of a continuous or discrete numerical variable. It provides insights into the underlying data distribution and helps uncover patterns, frequencies, and ranges within the dataset. Here are some key uses and benefits of histogram plots (a minimal example follows the list):

  • Data Distribution: Histograms allow analysts to visualize the shape, central tendency, and spread of the data. They provide an overview of the data distribution, helping to identify if it follows a particular pattern, such as a normal distribution, skewed distribution, bimodal distribution, or multimodal distribution. Understanding the data distribution aids in selecting appropriate analysis techniques and understanding the characteristics of the data.

  • Frequency Analysis: Histograms display the frequency or count of data points within predefined bins or intervals along the x-axis. By observing the height or count of each bin, analysts can identify the frequency of occurrence for different values or value ranges. This information helps assess the concentration of data points and identify peaks or modes in the distribution.

  • Outlier Detection: Histograms can assist in identifying outliers, which are data points that significantly deviate from the rest of the distribution. Outliers may indicate data errors, unusual observations, or important anomalies requiring further investigation. By examining the tails or extreme values in the histogram, analysts can identify potential outliers that may require additional scrutiny.

  • Data Range and Spread: Histograms provide insights into the range and spread of the data. The x-axis represents the variable's values, while the y-axis represents the frequency or count of occurrences. By observing the width and span of the histogram, analysts can assess the data's range and variability. This information helps understand the data's spread and aids in subsequent analysis or decision-making processes.

  • Feature Engineering: Histograms can guide feature engineering processes by informing appropriate transformations, binning strategies, or encoding techniques for numerical variables. They assist in identifying nonlinear relationships or determining optimal cut-off points for converting continuous variables into categorical ones. Histograms also help identify data skewness and guide transformation methods to address the skew if necessary.

  • Data Quality Inspection: Histograms can be useful in detecting data quality issues or anomalies. Unusual spikes, gaps, or unexpected patterns in the histogram may indicate data entry errors, measurement inconsistencies, or missing data. By observing the histogram, analysts can identify potential data quality issues that require further investigation or preprocessing.
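
A minimal histogram sketch (assuming a recent seaborn with histplot and the taxi df from above):

import seaborn as sns
import matplotlib.pyplot as plt

# Bin trip_distance into 30 intervals; the shape reveals skew, spread, and long tails/outliers
sns.histplot(df['trip_distance'], bins=30)
plt.show()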


Feynman Technique Summary:

  1. Summary 1

  2. Summary 2


AI Assistance

🧠 Key Highlights (5–7 concepts distilled)

  • Class imbalance can mislead models; use imbalanced-learn to oversample or undersample so predictions aren't biased toward the majority class.

  • Method chaining allows you to streamline multiple transformations in a single expression.

  • pd.to_numeric(..., errors='coerce') is a tool for cleaning dirty numeric columns, converting problematic entries to NaN.

  • Correlation matrices help detect feature relationships and multicollinearity.

  • Visual tools (heatmaps, boxplots, histograms, pairplots) give you quick feedback on distribution, relationships, and outliers.

  • .astype('timedelta64[s]') (older pandas) or .dt.total_seconds() converts a datetime difference to seconds, which is useful for calculating durations.

  • SQL queries are not read top-down — the logical order matters: FROM → WHERE → GROUP BY → SELECT → ORDER BY → LIMIT.


3 Critical Edge Cases

  1. Oversampling before splitting your data can cause data leakage.

  2. Correlation ≠ Causation — just because two variables are correlated doesn’t mean one causes the other.

  3. Dropping missing values (dropna) in imbalanced datasets can make minority classes vanish entirely.


3 Common Errors

  1. Using df.drop() without specifying axis or labels.

  2. Applying fillna() with inplace=True and forgetting it overwrites the original DataFrame.

  3. Assuming iloc[] accepts column names — it only takes integer positions.
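
A quick sketch of the correct forms for these three (column and row names are made up):

# 1. df.drop(): be explicit about what is dropped
df = df.drop(columns=['unused_col'])   # drop a column
df = df.drop(index=[0, 1])             # drop rows by label

# 2. Prefer reassignment over inplace=True so the original DataFrame isn't silently overwritten
df_clean = df.copy()
df_clean['income'] = df_clean['income'].fillna(0)

# 3. iloc[] takes integer positions only; use loc[] for names
first_row = df.iloc[0]                 # by position
income_col = df.loc[:, 'income']       # by label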


Quiz

  1. What’s the difference between Random Oversampling and Random Undersampling? When would you use each?

  • My Answer:

    • The simplest answer here is that Random undersampling is when we bring the minority of data up to match the majority of the data. Random oversampling is the opposite. We might want to use it when we notice a skewed distribution? I'm not really sure about when I would want to use it or what problem it solves.

  • CGPT:

    • Random Oversampling: Duplicates samples from the minority class to balance the dataset.

    • Random Undersampling: Removes samples from the majority class to achieve balance.

    • Oversampling is better to use when data is scarce and you don’t want to lose valuable information.

    • Undersampling is useful when you have a large dataset, but risk losing variety or introducing bias if overused.

  2. Why is it risky to use fillna() with inplace=True during EDA or data cleaning?

  • My Answer:

    • inplace=True changes the original DataFrame so when we do any sort of data cleaning, adding or removing to the original dataset, it can affect the effectiveness of the model.

  • CGPT:

    • The danger isn’t directly about model effectiveness, but more about losing the original, unmodified data. If you overwrite it with changes and later realize there was a mistake, you can’t easily go back unless you made a backup.

    • It’s specifically difficult to fix changed data within a Jupyter notebook or interactive session. You may transform data step by step without reloading, meaning that you’d have to go back, earlier in the notebook, and reload or re-import the original dataset manually, which wastes time and risks inconsistency if the source changes.

  3. What's the purpose of a correlation matrix in EDA, and how does it relate to feature selection?

  • My Answer:

    • A correlation matrix can be used to show data analysts which features might be correlated strongly either positively, negatively, or not at all. Plotting this into something visual can allow us to see and better understand which features we choose will affect the performance of the model, perhaps bias.

  • CGPT:

    •  Well done — your answer captures the core idea of a correlation matrix and how it relates to feature selection!

    • Correlation matrices quantify how strongly two variables move together (from -1 to 1).

    • High correlation between features (not just with the target) can lead to multicollinearity, which makes it harder for models to assign importance to each feature.

    • Visualizing it with a heatmap (like sns.heatmap) helps quickly spot patterns, such as redundant or highly influential variables.


  4. 💻 Code Question:

You have a DataFrame df with a column "income" that contains dirty values like “50,000” or “42000”.

Write one line of code to clean the income column by:

  • Removing commas and spaces

  • Converting it to numeric

  • Coercing errors to NaN

My Answer:

df["income"].str.replace(",", "").to_numeric(errors="coerce") 

CGPT:

  • Verdict: Wrong.

    • The spaces aren't removed: .str.strip() (or an extra replace) is still needed, and the cleaned result is never assigned back to df["income"].

    • pd.to_numeric() is not a string method, so it must be called as pd.to_numeric(...), not chained directly after .str.replace(...).

# CORRECT CODE
df["income"] = pd.to_numeric(df["income"].str.replace(",", "").str.strip(), errors="coerce")
  • Removes commas → .str.replace(",", "")

  • Removes leading/trailing spaces → .str.strip()

  • Converts to numeric and handles errors → pd.to_numeric(..., errors="coerce")


5. 💻 Code Question:

You want to create a new column in your DataFrame called duration by subtracting pickup_time and dropoff_time, and convert it to minutes.

Write a one-liner to create the duration column.

My Answer:

df["duration"] = df["dropoff_time"] - df["pickup_time"] / 60).astype('timedelta64[s]')

CGPT:

  • Verdict: Wrong.

    • df["dropoff_time"] - df["pickup_time"] / 60) — missing an open paren before the subtraction.

    • You can’t divide a timedelta by 60 before converting it to seconds — it’s already a time difference.

    • .astype('timedelta64[s]') is used on datetime differences, not scalar divisions.

# CORRECT CODE
df["duration"] = (df["dropoff_time"] - df["pickup_time"]).dt.total_seconds() / 60
  • dt.total_seconds() gives you the full difference in seconds.

  • Then you divide by 60 to get minutes.


  6. 🌍 Real-World Use Question:

You're working with a dataset of customer transactions. The "purchase_time" and "delivery_time" columns are both datetime objects. However, some "delivery_time" values are missing due to failed deliveries.

You want to:

  1. Create a "delivery_duration" column in minutes.

  2. Keep rows even if "delivery_time" is missing. (Having NaN or NaT as the result is allowed.)

  3. Later analyze which rows are missing delivery data.

What would your approach be?

Answer in 2–3 sentences.

My Answer:

  • N/A

CGPT:

  1. Calculate the delivery duration:

df["delivery_duration"] = (df["delivery_time"] - df["purchase_time"]).dt.total_seconds() / 60

This will preserve rows with missing "delivery_time" — those entries in "delivery_duration" will just be NaN.

  2. To analyze missing deliveries later:

missing_deliveries = df[df["delivery_time"].isna()]

This way, no data is lost, and you can still explore incomplete deliveries.