Project Proposal

Introduction

  • Project on credit card fraud detection using machine learning.

  • Credit card fraud is a growing issue with significant consequences for consumers and businesses.

  • Team members: Sammriti, Krishi Doshi, Gus Wang, Jake Moskowitz.

Problem Statement

  • Financial Impact: Billions lost yearly due to credit card fraud, affecting consumers and financial institutions.

  • Rarity of Fraud: Less than 1% of transactions are fraudulent, making detection challenging.

  • Inadequate Traditional Systems: Existing rule-based systems can't keep up with evolving fraud tactics.

  • Operational Challenges:

    • High false positives frustrate customers and lead to legitimate transaction declines.

    • False negatives result in significant financial losses.

    • Real-time detection is difficult due to high transaction volumes needing quick processing.

Objectives

  • Accurate Detection Models: Develop machine learning models to minimize false positives and negatives.

  • Address Class Imbalance: Tackle the dataset's class imbalance so that models detect the rare fraud class instead of defaulting to the majority class.

  • Enhance Real-Time Detection: Ensure models can score high transaction volumes with low latency while minimizing customer disruption.

Dataset Description

  • Dataset comprises credit card transactions made by European cardholders over two days in September 2013.

  • Transaction features are anonymized and represented as principal components obtained with PCA (columns V1 to V28).

  • Key Columns:

    • Time: Seconds elapsed between each transaction and the first transaction in the dataset (starts at zero).

    • Amount: Value of the transaction.

    • Class: Label indicating whether the transaction was fraudulent (1) or legitimate (0).

  • Total of 284,807 transactions; only 492 (0.172%) were fraudulent.
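
Since Time is an offset from the first transaction and not a clock time, one simple way to use it is to fold it into an approximate hour-of-day bucket. The sketch below assumes Time is measured in seconds and treats the first transaction as hour 0, which the dataset itself does not confirm. The helper names `hour_of_day` and `day_index` are hypothetical.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24


def hour_of_day(time_seconds: float) -> int:
    """Map seconds since the first transaction to an hour bucket in 0..23.

    Assumption: the first transaction is treated as hour 0. The real clock
    time of that transaction is not part of the dataset.
    """
    return int(time_seconds // SECONDS_PER_HOUR) % HOURS_PER_DAY


def day_index(time_seconds: float) -> int:
    """Return which day (0, 1, ...) the offset falls on, counted from the start."""
    return int(time_seconds // (SECONDS_PER_HOUR * HOURS_PER_DAY))


# Illustrative offsets only, not rows taken from the dataset.
for t in (0, 3599, 3600, 90000):
    print(t, hour_of_day(t), day_index(t))
```

A cyclic encoding (sine and cosine of the hour) could be derived from the same bucket if a model needs hour 23 and hour 0 to be treated as adjacent.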

Prior Work

  • Genetic Algorithms for Feature Selection: Prior studies report that selecting features with genetic algorithms improves machine learning performance in fraud detection.

  • Synthetic Minority Oversampling Technique (SMOTE): Used to generate synthetic samples for the minority class to balance datasets.

  • Ensemble Learning Techniques: Combining multiple classifiers (like AdaBoost, Gradient Boosting, XGBoost) significantly enhances fraud detection performance.
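
The core idea behind SMOTE can be sketched in a few lines: pick a minority point, pick one of its nearest minority neighbors, and create a new point somewhere on the line segment between them. The code below is a simplified illustration under that description, not the reference algorithm or the implementation this project will use. The function name `smote_samples`, its parameters, and the toy points are all hypothetical.

```python
import math
import random


def smote_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Made-up 2-feature fraud points, for illustration only.
fraud = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_samples(fraud, n_new=6)
print(len(new_points))
```

In practice a library implementation such as the `SMOTE` class in imbalanced-learn would likely be preferable. Whichever version is used, oversampling should be applied to the training split only, so that synthetic points never leak into the test set.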

Methodology

  • Supervised and Unsupervised Learning: Multiple methods will be applied to train models and compare their results.

  • Data Processing: Load the dataset, check for and handle missing values, and derive useful features from the Time column.

  • Model Selection: Primary models include:

    • Logistic Regression: Baseline model, interpretable, assumes a linear relationship between the features and the log-odds of fraud.

    • Decision Tree: Classifies transactions by recursively splitting the data on feature values, which also indicates which features are most important.

    • K Nearest Neighbors: Classifies a new transaction by majority vote among the k closest transactions in the training data.

    • Isolation Forest: Unsupervised technique that identifies outliers, useful for detecting anomalies in transactions.
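
To make the K Nearest Neighbors idea concrete, the sketch below labels a query point by majority vote among its k nearest training points, using Euclidean distance. It is a toy illustration in plain Python, not the project's implementation. The function name `knn_predict` and the two-feature data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Made-up data for illustration: 0 = legitimate, 1 = fraud.
train_X = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (3.0, 3.1), (3.2, 2.9), (2.9, 3.0)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (0.1, 0.1)))  # nearest points are all class 0
print(knn_predict(train_X, train_y, (3.0, 3.0)))  # nearest points are all class 1
```

Because the vote depends on distances, features on very different scales (for example Amount next to the principal components) would probably need to be standardized first. With so few fraud cases, a plain majority vote will also tend to favor the legitimate class.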

Model Evaluation

  • Data Split: 80% training set, 20% testing set.

  • Ensemble Model: Combines outputs of individual classifiers to improve overall detection accuracy.

  • Metrics Used: Precision and recall are emphasized, as accuracy can be misleading with imbalanced datasets.
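
The point about accuracy can be checked with the dataset's own counts: a model that labels every transaction as legitimate is right almost all the time yet catches no fraud. The sketch below works through that arithmetic. The helper `precision_recall` is a hypothetical name, and the second set of confusion counts is made up for illustration.

```python
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492


def precision_recall(tp, fp, fn):
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# A "model" that never flags fraud: every fraud case becomes a false negative.
always_legit_accuracy = (TOTAL_TRANSACTIONS - FRAUD_TRANSACTIONS) / TOTAL_TRANSACTIONS
baseline_precision, baseline_recall = precision_recall(tp=0, fp=0, fn=FRAUD_TRANSACTIONS)
print(f"accuracy={always_legit_accuracy:.4f}",
      f"precision={baseline_precision}", f"recall={baseline_recall}")

# Made-up confusion counts for a hypothetical classifier.
example_precision, example_recall = precision_recall(tp=80, fp=20, fn=20)
print(example_precision, example_recall)
```

Accuracy for the trivial model comes out above 99.8% while recall is zero, which is why precision and recall are the more informative metrics here.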

Challenges

  • Highly Imbalanced Dataset: Fraud cases are rare, which can bias models toward predicting the majority (legitimate) class.

  • Feature Engineering: Most features are anonymized principal components, so extracting interpretable, domain-driven features is challenging.

  • Balancing False Positives and Negatives: Important to maximize fraud detection without impacting legitimate transactions.
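
For models that output a fraud score, the balance between false positives and false negatives is usually set by the decision threshold. The sketch below counts both kinds of error at a few thresholds. The scores, labels, and the helper name `confusion_at` are made up for illustration and do not come from any trained model.

```python
def confusion_at(scores, labels, threshold):
    """Count (TP, FP, FN) when every score >= threshold is flagged as fraud."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return tp, fp, fn


# Hypothetical fraud scores and true labels (1 = fraud, 0 = legitimate).
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

for threshold in (0.3, 0.5, 0.75):
    tp, fp, fn = confusion_at(scores, labels, threshold)
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn}")
```

On this toy data, raising the threshold lowers the number of false positives but lets more fraud slip through as false negatives, so the threshold would need to be chosen against the relative cost of each error.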

Conclusion

  • Credit card fraud detection poses significant challenges in the financial sector.

  • This project explores multiple machine learning models to effectively identify fraudulent transactions.