Paper 2
Detection of Phishing Websites by Using Machine Learning-Based URL Analysis
Authors and Affiliations
Mehmet Korkmaz, Yildiz Technical University, Computer Engineering Department, Istanbul/Turkey, mkorkmazzz@gmail.com
Ozgur Koray Sahingoz, Istanbul Kultur University, Computer Engineering Department, Istanbul/Turkey, sahingoz@gmail.com
Banu Diri, Yildiz Technical University, Computer Engineering Department, Istanbul/Turkey, diri@yildiz.edu.tr
Abstract
Growing mobile device usage is driving a shift of real-world operations into cyberspace.
Security breaches are rising, in part because of the anonymous nature of the Internet.
Existing security measures (antivirus, firewalls) are often ineffective against sophisticated phishing attacks.
Phishing attacks target user weaknesses by imitating legitimate sites to steal sensitive information (user IDs, passwords, bank details).
Current solutions include blacklists, rule-based detection, and increasingly, machine learning-based anomaly detection.
This study proposes a machine learning-based phishing detection system that analyzes URLs with eight different algorithms on three distinct datasets, with Random Forest reaching up to 94.59% accuracy.
Introduction
Reliance on digital platforms has increased in various sectors such as trade, education, and banking.
However, this reliance creates significant vulnerabilities regarding information security.
Cyber-attacks, including phishing, have evolved, exploiting users' vulnerabilities instead of system weaknesses.
Characteristics of Phishing Attacks
Perpetrators impersonate trustworthy entities through emails and posts to deceive users into providing personal information.
The financial cost of phishing breaches is substantial; in 2019, the average cost per attack ranged from $108K to $1.4 billion.
Reports show a dramatic increase in phishing attack methods and sophistication, including the use of HTTPS.
Phishing Attack Trends
Phishing attacks have evolved from simple email schemes in the 1990s to complex processes including fake websites.
Fake websites often resemble real ones closely to deceive users effectively.
Regular reports (e.g. from APWG) indicate trends and increases in phishing sites and attacks.
Literature Survey
Detection Systems Overview
List-Based Detection: Utilizes whitelists (trusted sites) and blacklists (phishing sites) for classification.
Issue: Vulnerable to minor URL changes and ineffective against zero-day attacks.
Rule-Based Systems: Employ feature mining to identify phishing URLs using established rules.
Visual Similarity Systems: Compare the visual appearance of websites using image processing to identify potential phishing sites.
Machine Learning-Based Systems: Classify URLs based on extracted features, leveraging various AI techniques for high accuracy and real-time detection of zero-day attacks.
Notable systems include CANTINA and PhishWHO.
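The weakness of list-based detection noted above can be illustrated with a minimal sketch. The blacklist entry, the URLs, and the helper names `normalize` and `is_blacklisted` are hypothetical, invented for illustration; they are not taken from the paper or from any real blacklist service.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist entry, invented for illustration only.
BLACKLIST = {
    "http://paypa1-login.example.com/verify",
}

def normalize(url: str) -> str:
    """Keep only scheme, host, and path; lowercase scheme and host; strip a trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"

def is_blacklisted(url: str) -> bool:
    """Exact-match lookup against the blacklist after normalization."""
    return normalize(url) in BLACKLIST

known = "http://paypa1-login.example.com/verify"
variant = "http://paypa1-login.example.com/verify2"  # one extra character

print(is_blacklisted(known))    # True
print(is_blacklisted(variant))  # False
```

A single added character is enough to evade the exact-match lookup, which is why a previously unseen (zero-day) phishing URL passes a list-based check until the list is updated.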
Proposed System
Objectives
Develop a phishing detection system analyzing URLs directly without third-party services or external content, focusing on efficiency and speed of classification.
Datasets Used
Three datasets were used, each combining legitimate URLs and phishing URLs drawn from reputable sources such as Alexa and PhishTank.
Dataset Details:
Dataset-1: 40,668 phishing and 43,189 legitimate URLs.
Dataset-2: 40,668 phishing and 42,220 legitimate URLs.
Dataset-3: 40,668 phishing and 85,409 legitimate URLs.
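The class balance of the three datasets matters when reading accuracy figures, so a short calculation may help. The sketch below uses only the counts listed above; the function name `majority_baseline` is an assumption introduced here, not something from the paper.

```python
# URL counts as listed above: (phishing, legitimate)
DATASETS = {
    "Dataset-1": (40_668, 43_189),
    "Dataset-2": (40_668, 42_220),
    "Dataset-3": (40_668, 85_409),
}

def majority_baseline(phishing: int, legitimate: int) -> float:
    """Accuracy obtained by always predicting the larger class."""
    return max(phishing, legitimate) / (phishing + legitimate)

for name, (phishing, legitimate) in DATASETS.items():
    total = phishing + legitimate
    print(
        f"{name}: total={total}, "
        f"phishing share={phishing / total:.2%}, "
        f"majority baseline={majority_baseline(phishing, legitimate):.2%}"
    )
```

Dataset-1 and Dataset-2 are close to balanced, while Dataset-3 is roughly two-thirds legitimate, so a classifier that always answers "legitimate" would already score about 68% on it. The legitimate count of Dataset-3 also equals the sum of the other two (43,189 + 42,220 = 85,409), which suggests, though the notes do not state it, that it merges both legitimate sets.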
Feature Extraction
Identified 58 URL features drawn from an extensive literature review, extracted them with Python scripts, and ranked them by importance using a Random Forest classifier.
Kept the 48 most important features to optimize performance without sacrificing accuracy.
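To make URL-only feature extraction concrete, here is a minimal sketch using the Python standard library. The notes do not enumerate the paper's 58 features, so the feature names and the keyword list below are assumptions chosen for illustration, not the authors' actual feature set.

```python
import re
from urllib.parse import urlsplit

# Hypothetical keyword list, chosen only for illustration.
SUSPICIOUS_WORDS = ("login", "verify", "secure", "account", "update")

def extract_features(url: str) -> dict:
    """Compute a few lexical features from the URL string alone (no network access)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "dot_count": host.count("."),
        "hyphen_count": host.count("-"),
        "digit_count": sum(ch.isdigit() for ch in url),
        "has_at_symbol": int("@" in url),
        # 1 if the host looks like a dotted IPv4 address instead of a domain name
        "is_ip_host": int(bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host))),
        "uses_https": int(parts.scheme == "https"),
        "suspicious_word_count": sum(w in url.lower() for w in SUSPICIOUS_WORDS),
    }

features = extract_features("http://192.168.0.1/secure-login/update.php")
for name, value in features.items():
    print(f"{name}: {value}")
```

Because every feature is computed from the string itself, extraction needs no third-party service or page download, which matches the stated objective of fast classification.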
System Implementation
Machine Learning Algorithms Used
Logistic Regression (LR): Effective with binary outcomes but sensitive to feature repetition and noise.
K-Nearest Neighbors (KNN): Fast to train but memory-intensive; sensitive to feature selection.
Support Vector Machine (SVM): Good for larger datasets but handles noisy data poorly.
Decision Tree (DT): Easy to interpret but prone to overfitting.
Naive Bayes (NB): Quick and straightforward; struggles with correlated features.
XGBoost: Optimized for performance but prone to overfitting.
Random Forest (RF): Resilient to noise; requires significant computational resources.
Artificial Neural Network (ANN): Can generalize well; power and memory consumption can be high, depending on the architecture.
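As one concrete instance of the classifiers listed above, the following is a minimal pure-Python sketch of k-nearest neighbors. It is not the paper's implementation; the function name `knn_predict`, the two toy features, and all data values are invented for illustration.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Rank every training point by Euclidean distance to the query.
    ranked = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy feature vectors: (url_length, dot_count). Values are invented.
train_X = [(20, 1), (25, 2), (22, 1), (90, 5), (110, 6), (95, 4)]
train_y = ["legitimate", "legitimate", "legitimate",
           "phishing", "phishing", "phishing"]

print(knn_predict(train_X, train_y, (100, 5)))  # phishing
print(knn_predict(train_X, train_y, (24, 1)))   # legitimate
```

The sketch keeps the whole training set in memory and scans it for every query, which illustrates why KNN is described as memory-intensive, and the result depends directly on which features define the distance.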
Experimental Results
Accuracy and Training Time Analysis
Dataset-1: RF achieved highest accuracy at 94.59%.
Dataset-2: RF achieved 90.50% accuracy.
Dataset-3: RF achieved 91.26% accuracy.
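The two quantities compared in this section, accuracy and running time, can be sketched as follows. The helper names `accuracy` and `timed` and the label lists are assumptions made for illustration; the labels are invented and unrelated to the reported results.

```python
import time

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Invented labels: 1 = phishing, 0 = legitimate.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

score, seconds = timed(accuracy, y_true, y_pred)
print(f"accuracy={score:.2%}")  # accuracy=75.00%
```

In a real experiment the same `timed` wrapper could be placed around a model's training call to record training time alongside its accuracy.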
Comparative Analysis of Results
The study found that the proposed method achieved higher accuracy than previous implementations, especially with the RF classifier, corroborating the effectiveness of the designed system.
Conclusion and Future Works
Addressing phishing threats through advanced machine learning techniques is crucial in an increasingly digital landscape.
A focus on developing a larger dataset for improved classification is vital.
Future work will explore hybrid algorithms and integrate deep learning models for enhanced phishing detection capabilities.