🔥 Real ML Case Study Coding Drill – End-to-End
Problem Scenario:
You are given a customer dataset to predict whether a customer will churn.
The data is raw, messy, and contains:
- Numerical, categorical, and date columns
- Missing values
- Outliers
📊 Sample Dataset
```python
import pandas as pd
import numpy as np

# Simulated dataset
data = {
    "customer_id": [1, 2, 3, 4, 5],
    "signup_date": ["2021-01-01", "2020-05-10", "2019-08-15", "2021-02-20", "2020-12-01"],
    "last_login": ["2022-12-01", "2021-06-01", "2022-01-01", None, "2022-11-20"],
    "plan": ["basic", "premium", "basic", "basic", "premium"],
    "monthly_charges": [29, 79, 35, None, 79],
    "total_charges": [400, 1200, 700, 800, None],
    "churn": [0, 1, 0, 1, 0],
}
df = pd.DataFrame(data)
```
🟩 Step 1: Data Loading & Initial Inspection
```python
# Convert date columns to datetime
df["signup_date"] = pd.to_datetime(df["signup_date"])
df["last_login"] = pd.to_datetime(df["last_login"])

# Quick look at the data (info() prints directly and returns None,
# so it should not be wrapped in print())
df.info()
print(df.describe(include="all"))
```
🟩 Step 2: Handling Missing Values
```python
# Fill missing numeric columns with the median.
# Plain assignment is used instead of inplace=True on a column,
# a pattern deprecated in newer pandas versions.
df["monthly_charges"] = df["monthly_charges"].fillna(df["monthly_charges"].median())
df["total_charges"] = df["total_charges"].fillna(df["total_charges"].median())

# Fill missing last_login with today's date
df["last_login"] = df["last_login"].fillna(pd.Timestamp.today())
```
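Filling columns one by one is fine for a small drill; on a wider dataset, scikit-learn's `SimpleImputer` applies the same median strategy to every numeric column in one pass. A minimal sketch, using just the two numeric columns from the sample data:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "monthly_charges": [29, 79, 35, None, 79],
    "total_charges": [400, 1200, 700, 800, None],
})

# Median imputation for all numeric columns at once
imputer = SimpleImputer(strategy="median")
df[["monthly_charges", "total_charges"]] = imputer.fit_transform(
    df[["monthly_charges", "total_charges"]]
)
print(df.isna().sum().sum())  # 0 — no missing values remain
```

A fitted imputer also remembers the training medians, so the same values are reused at inference time instead of leaking statistics from test data.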
🟩 Step 3: Feature Engineering
3.1 Create tenure feature (days since signup)
```python
df["tenure_days"] = (pd.Timestamp.today() - df["signup_date"]).dt.days
```
3.2 Days since last login
```python
df["days_since_last_login"] = (pd.Timestamp.today() - df["last_login"]).dt.days
```
3.3 Average monthly charge feature
```python
# Approximate months as tenure_days / 30; a guard would be needed
# if a customer could sign up today (tenure_days == 0)
df["avg_monthly"] = df["total_charges"] / (df["tenure_days"] / 30.0)
```
3.4 High-value customer flag (lambda example)
```python
df["high_value"] = df["monthly_charges"].apply(lambda x: 1 if x > 50 else 0)
```
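The `apply`/lambda version is a good drill, but the same flag can be computed without `apply` using a vectorized comparison, which is much faster on large frames:

```python
import pandas as pd

df = pd.DataFrame({"monthly_charges": [29, 79, 35, 57, 79]})

# Boolean mask cast to 0/1 — equivalent to the apply/lambda version
df["high_value"] = (df["monthly_charges"] > 50).astype(int)
print(df["high_value"].tolist())  # [0, 1, 0, 1, 1]
```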
🟩 Step 4: Categorical Encoding
```python
# One-hot encode 'plan' (drop_first avoids a redundant column)
df = pd.get_dummies(df, columns=["plan"], drop_first=True)
```
🟩 Step 5: Feature Selection
```python
features = [
    "monthly_charges", "total_charges", "tenure_days",
    "days_since_last_login", "avg_monthly", "high_value", "plan_premium",
]
X = df[features]
y = df["churn"]
```
🟩 Step 6: Train-Test Split
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```
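With only 5 rows this split is purely illustrative. On a real churn dataset the target is usually imbalanced, and passing `stratify=y` keeps the churn rate identical in both splits. A sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)  # imbalanced target, like churn

# stratify=y preserves the 80/20 class ratio in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(y_test.mean())  # 0.2 — same churn rate as the full data
```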
🟩 Step 7: Model Training
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
```
🟩 Step 8: Evaluation
```python
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```
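A classification report on two test rows is not meaningful; on a realistically sized test set, pairing it with a confusion matrix makes the error types explicit. A small illustration with hypothetical labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 0, 1])  # actual churn labels (hypothetical)
y_pred = np.array([0, 1, 1, 1, 0, 0])  # model predictions (hypothetical)

# Rows = actual class, columns = predicted class
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]
#  [1 2]]
```

For churn, the bottom-left cell (churners predicted as non-churners) is usually the costliest, which is why recall on the positive class matters more than raw accuracy.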
🟩 Step 9 (Optional): Feature Importance Check
```python
import matplotlib.pyplot as plt

feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.sort_values().plot(kind="barh")
plt.show()
```
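Impurity-based importances from a random forest can be biased toward high-cardinality or continuous features. As a cross-check, `permutation_importance` measures how much the score drops when each feature is shuffled. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=200, n_features=5,
                           n_informative=2, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Shuffle each feature in turn and record the drop in accuracy
result = permutation_importance(model, X, y, n_repeats=5, random_state=42)
print(result.importances_mean.round(3))
```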