🔥 Real ML Case Study Coding Drill – End-to-End
Problem Scenario:
You are given a customer dataset to predict whether a customer will churn.
The data is raw, messy, and contains:
- Numerical, categorical, and date columns
- Missing values
- Outliers
📊 Sample Dataset
```python
import pandas as pd
import numpy as np

# Simulated dataset
data = {
    "customer_id": [1, 2, 3, 4, 5],
    "signup_date": ["2021-01-01", "2020-05-10", "2019-08-15", "2021-02-20", "2020-12-01"],
    "last_login": ["2022-12-01", "2021-06-01", "2022-01-01", None, "2022-11-20"],
    "plan": ["basic", "premium", "basic", "basic", "premium"],
    "monthly_charges": [29, 79, 35, None, 79],
    "total_charges": [400, 1200, 700, 800, None],
    "churn": [0, 1, 0, 1, 0],
}
df = pd.DataFrame(data)
```
🟩 Step 1: Data Loading & Initial Inspection
```python
# Convert date columns to datetime
df["signup_date"] = pd.to_datetime(df["signup_date"])
df["last_login"] = pd.to_datetime(df["last_login"])

# Quick look at the data (info() prints directly and returns None,
# so it should not be wrapped in print())
df.info()
print(df.describe(include="all"))
```
🟩 Step 2: Handling Missing Values
```python
# Fill missing numeric columns with the median.
# Plain assignment is used instead of inplace=True on a column,
# a pattern deprecated in newer pandas versions.
df["monthly_charges"] = df["monthly_charges"].fillna(df["monthly_charges"].median())
df["total_charges"] = df["total_charges"].fillna(df["total_charges"].median())

# Fill missing last_login with today's date
df["last_login"] = df["last_login"].fillna(pd.Timestamp.today())
```
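Filling columns one by one is fine for a small drill; on a wider dataset, scikit-learn's `SimpleImputer` applies the same median strategy to every numeric column in one pass. A minimal sketch, using just the two numeric columns from the sample data:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "monthly_charges": [29, 79, 35, None, 79],
    "total_charges": [400, 1200, 700, 800, None],
})

# Median imputation for all numeric columns at once
imputer = SimpleImputer(strategy="median")
df[["monthly_charges", "total_charges"]] = imputer.fit_transform(
    df[["monthly_charges", "total_charges"]]
)
print(df.isna().sum().sum())  # 0 — no missing values remain
```

A fitted imputer also remembers the training medians, so the same values are reused at inference time instead of leaking statistics from test data.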
🟩 Step 3: Feature Engineering
3.1 Create tenure feature (days since signup)
```python
df["tenure_days"] = (pd.Timestamp.today() - df["signup_date"]).dt.days
```
3.2 Days since last login
```python
df["days_since_last_login"] = (pd.Timestamp.today() - df["last_login"]).dt.days
```
3.3 Average monthly charge feature
```python
# Approximate months as tenure_days / 30; a guard would be needed
# if a customer could sign up today (tenure_days == 0)
df["avg_monthly"] = df["total_charges"] / (df["tenure_days"] / 30.0)
```
3.4 High-value customer flag (lambda example)
```python
df["high_value"] = df["monthly_charges"].apply(lambda x: 1 if x > 50 else 0)
```
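The `apply`/lambda version is a good drill, but the same flag can be computed without `apply` using a vectorized comparison, which is much faster on large frames:

```python
import pandas as pd

df = pd.DataFrame({"monthly_charges": [29, 79, 35, 57, 79]})

# Boolean mask cast to 0/1 — equivalent to the apply/lambda version
df["high_value"] = (df["monthly_charges"] > 50).astype(int)
print(df["high_value"].tolist())  # [0, 1, 0, 1, 1]
```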
🟩 Step 4: Categorical Encoding
```python
# One-hot encode 'plan' (drop_first avoids a redundant column)
df = pd.get_dummies(df, columns=["plan"], drop_first=True)
```
🟩 Step 5: Feature Selection
```python
features = [
    "monthly_charges", "total_charges", "tenure_days",
    "days_since_last_login", "avg_monthly", "high_value", "plan_premium",
]
X = df[features]
y = df["churn"]
```
🟩 Step 6: Train-Test Split
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```
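With only 5 rows this split is purely illustrative. On a real churn dataset the target is usually imbalanced, and passing `stratify=y` keeps the churn rate identical in both splits. A sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)  # imbalanced target, like churn

# stratify=y preserves the 80/20 class ratio in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(y_test.mean())  # 0.2 — same churn rate as the full data
```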
🟩 Step 7: Model Training
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
```
🟩 Step 8: Evaluation
```python
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```
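A classification report on two test rows is not meaningful; on a realistically sized test set, pairing it with a confusion matrix makes the error types explicit. A small illustration with hypothetical labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 0, 1])  # actual churn labels (hypothetical)
y_pred = np.array([0, 1, 1, 1, 0, 0])  # model predictions (hypothetical)

# Rows = actual class, columns = predicted class
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]
#  [1 2]]
```

For churn, the bottom-left cell (churners predicted as non-churners) is usually the costliest, which is why recall on the positive class matters more than raw accuracy.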
🟩 Step 9 (Optional): Feature Importance Check
```python
import matplotlib.pyplot as plt

feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.sort_values().plot(kind="barh")
plt.show()
```
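Impurity-based importances from a random forest can be biased toward high-cardinality or continuous features. As a cross-check, `permutation_importance` measures how much the score drops when each feature is shuffled. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=200, n_features=5,
                           n_informative=2, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Shuffle each feature in turn and record the drop in accuracy
result = permutation_importance(model, X, y, n_repeats=5, random_state=42)
print(result.importances_mean.round(3))
```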