Project Proposal

Introduction

Project on credit card fraud detection using machine learning.
Credit card fraud is a growing issue with significant consequences for consumers and businesses.
Team members: Sammriti, Krishi Doshi, Gus Wang, Jake Moskowitz.

Financial Impact: Billions lost yearly due to credit card fraud, affecting consumers and financial institutions.
Rarity of Fraud: Fewer than 1% of transactions are fraudulent, so models have few fraud examples to learn from.
Inadequate Traditional Systems: Existing rule-based systems can't keep up with evolving fraud tactics.
Operational Challenges:
- False positives frustrate customers because legitimate transactions are declined.
- False negatives result in significant financial losses.
- Real-time detection is difficult because high transaction volumes must be processed quickly.

Accurate Detection Models: Develop machine learning models to minimize false positives and negatives.
Address Class Imbalance: Tackle the dataset's class imbalance so that models learn to detect the rare fraud class.
Enhance Real-Time Detection: Ensure models can score high transaction volumes with low latency while minimizing customer disruption.

Dataset comprises transactions made by European cardholders over two days in September 2013.
Transaction features are anonymized and provided as principal components from a PCA transformation (columns V1 to V28).
Key Columns:
- Time: Seconds elapsed since the first transaction in the dataset (starts at zero).
- Amount: Value of the transaction.
- Class: Label indicating whether the transaction was fraudulent (1) or legitimate (0).
Total of 284,807 transactions; only 492 (0.172%) were fraudulent.
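
As a quick check of the figures above, the fraud rate and what it implies can be worked out from the two counts quoted. This is only an arithmetic sketch; the variable names are illustrative and not part of the dataset.

```python
# Counts quoted in the dataset description above.
total_transactions = 284_807
fraud_transactions = 492

legit_transactions = total_transactions - fraud_transactions

# Share of transactions that are fraudulent, as a percentage.
fraud_rate_pct = 100 * fraud_transactions / total_transactions

# Number of legitimate transactions for every fraudulent one.
imbalance_ratio = legit_transactions / fraud_transactions

# Accuracy of a trivial model that labels every transaction as legitimate.
trivial_accuracy_pct = 100 * legit_transactions / total_transactions

print(f"Fraud rate: {fraud_rate_pct:.2f}%")
print(f"Legitimate transactions per fraud: {imbalance_ratio:.0f}")
print(f"Always-legitimate accuracy: {trivial_accuracy_pct:.2f}%")
```

The last figure shows why accuracy alone says little here: a model that never flags fraud would still score above 99.8%.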

Genetic Algorithms for Feature Selection: Prior work reports that selecting features with genetic algorithms improves machine learning performance in fraud detection.
Synthetic Minority Oversampling Technique (SMOTE): Generates synthetic samples for the minority class to balance the dataset.
Ensemble Learning Techniques: Boosting methods that combine many weak learners (such as AdaBoost, Gradient Boosting, XGBoost) are reported to improve fraud detection performance.
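
The core idea behind SMOTE, mentioned above, can be sketched in a few lines: each synthetic sample is a random point on the line segment between a minority sample and one of its nearest minority neighbors. This is a simplified illustration using only the standard library, not a library implementation; the function name `smote_sketch`, its parameters, and the sample points are all hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length feature lists (fraud cases only).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical two-feature fraud samples, for illustration only.
fraud_points = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(fraud_points, n_new=6, k=2)
print(len(new_points), "synthetic samples")
```

Oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the data used for evaluation.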

Supervised and Unsupervised Learning: Both kinds of methods will be trained on the dataset and their results compared.
Data Processing: Load dataset, handle missing values, extract useful time features.
Model Selection: Primary models include:
- Logistic Regression: Baseline model, interpretable, assumes a linear relationship between the features and the log-odds of fraud.
- Decision Tree: Classifies and identifies important features by splitting data recursively.
- K Nearest Neighbors: Classifies a new transaction by the labels of its closest neighbors among the training transactions.
- Isolation Forest: Unsupervised technique that identifies outliers, useful for detecting anomalies in transactions.
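
Of the models listed above, K Nearest Neighbors is the simplest to show end to end. The sketch below is a minimal standard-library version, assuming Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy points are hypothetical, and a real run would use the dataset's features instead.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Hypothetical two-feature transactions; 1 = fraud, 0 = legitimate.
train_X = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.3, 0.2],
           [3.0, 3.1], [3.2, 2.9], [2.9, 3.3]]
train_y = [0, 0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, [0.15, 0.25]))
print(knn_predict(train_X, train_y, [3.1, 3.0]))
```

Because the method relies on distances, features on very different scales (for example Amount next to the principal components) usually need to be scaled before neighbors are compared.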

Data Split: 80% training set, 20% testing set.
Ensemble Model: Combines outputs of individual classifiers to improve overall detection accuracy.
Metrics Used: Precision and recall are emphasized, as accuracy can be misleading with imbalanced datasets.
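
With so few fraud cases, one way to carry out the 80/20 split above is to stratify it, so that the training and testing sets keep the same fraud proportion. The proposal does not state whether the split is stratified, so this is an assumption; the function name `stratified_split` and the toy labels are hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 990 legitimate (0) and 10 fraudulent (1) transactions.
labels = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split(labels)

test_frauds = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_frauds)
```

Splitting each class separately guarantees that some fraud cases land in the testing set, which a purely random split of a very small minority class does not.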

Highly Imbalanced Dataset: Fraud cases are rare, which can bias models toward predicting the legitimate class.
Feature Engineering: Features are anonymized principal components, so extracting additional useful features is challenging.
Balancing False Positives and Negatives: Important to maximize fraud detection without impacting legitimate transactions.
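
The trade-off between false positives and false negatives can be illustrated by moving the decision threshold applied to a model's fraud scores and recomputing precision and recall. The labels and scores below are made up for illustration, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are flagged as fraud."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for y, f in zip(y_true, flagged) if y == 1 and f)
    fp = sum(1 for y, f in zip(y_true, flagged) if y == 0 and f)
    fn = sum(1 for y, f in zip(y_true, flagged) if y == 1 and not f)
    # By convention here, a metric with an empty denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels (1 = fraud) and model scores, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90]

for threshold in (0.3, 0.5, 0.8):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, a low threshold catches every fraud case but flags several legitimate transactions, while a high threshold flags only fraud but misses most of it. Choosing the threshold is therefore a business decision as much as a modeling one.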

Credit card fraud detection poses significant challenges in the financial sector.
This project explores multiple machine learning models to effectively identify fraudulent transactions.