Paper 2

Detection of Phishing Websites by Using Machine Learning-Based URL Analysis

Authors and Affiliations

  • Mehmet Korkmaz, Yildiz Technical University, Computer Engineering Department, Istanbul/Turkey, mkorkmazzz@gmail.com

  • Ozgur Koray Sahingoz, Istanbul Kultur University, Computer Engineering Department, Istanbul/Turkey, sahingoz@gmail.com

  • Banu Diri, Yildiz Technical University, Computer Engineering Department, Istanbul/Turkey, diri@yildiz.edu.tr


Abstract

  • Growing mobile device usage is driving a shift of real-world operations to cyberspace.

  • Security breaches are rising, in part because of the anonymous nature of the Internet.

  • Existing security measures (antivirus, firewalls) are often ineffective against sophisticated phishing attacks.

  • Phishing attacks target user weaknesses by imitating legitimate sites to steal sensitive information (user IDs, passwords, bank details).

  • Current solutions include blacklists, rule-based detection, and increasingly, machine learning-based anomaly detection.

  • This study proposes a machine learning-based phishing detection system analyzing URLs using eight different algorithms across three distinct datasets, demonstrating high accuracy.


Introduction

  • Reliance on digital platforms has increased in various sectors such as trade, education, and banking.

  • However, this reliance creates significant vulnerabilities regarding information security.

  • Cyber-attacks, including phishing, have evolved, exploiting users' vulnerabilities instead of system weaknesses.

Characteristics of Phishing Attacks
  • Perpetrators impersonate trustworthy entities through emails and posts to deceive users into providing personal information.

  • The financial cost of phishing breaches is substantial; in 2019, the average cost per attack ranged from $108K to $1.4 billion, depending on the size of the attack.

  • Reports show a dramatic increase in the variety and sophistication of phishing attacks, including phishing sites served over HTTPS.


Phishing Attack Trends

  • Phishing attacks have evolved from simple email schemes in the 1990s to complex processes including fake websites.

  • Fake websites often resemble real ones closely to deceive users effectively.

  • Regular reports (e.g. from APWG) indicate trends and increases in phishing sites and attacks.


Literature Survey

Detection Systems Overview
  • List-Based Detection: Utilizes whitelists (trusted sites) and blacklists (phishing sites) for classification.

    • Issue: Vulnerability to minor URL changes and ineffective against zero-day attacks.

  • Rule-Based Systems: Employ feature mining to identify phishing URLs using established rules.

  • Visual Similarity Systems: Focus on visual differences between websites using image processing to identify potential phishing sites.

  • Machine Learning-Based Systems: Classify URLs from extracted features, leveraging various AI techniques for high accuracy and real-time detection, including of zero-day attacks.

    • Notable systems include CANTINA and PhishWHO.
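The weakness of list-based detection noted above can be illustrated with a minimal sketch. The blacklist entry and URLs are hypothetical placeholders, and both lookup functions are illustrative assumptions rather than any system described in the paper.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist entry, for illustration only.
BLACKLIST = {"http://paypa1-login.example.com/signin"}


def is_blacklisted_exact(url: str) -> bool:
    """Exact string match, as a naive list-based detector might do."""
    return url in BLACKLIST


def is_blacklisted_by_host(url: str) -> bool:
    """Slightly more robust: compare only the (lowercased) hostname."""
    hosts = {urlsplit(entry).hostname for entry in BLACKLIST}
    return urlsplit(url).hostname in hosts


known = "http://paypa1-login.example.com/signin"
variant = "http://paypa1-login.example.com/signin?id=7"  # same site, extra query
fresh = "http://paypa1-secure.example.net/signin"        # unseen (zero-day) host

for url in (known, variant, fresh):
    print(url, is_blacklisted_exact(url), is_blacklisted_by_host(url))
```

The exact match misses the variant with an added query string, and neither lookup catches the previously unseen host. That gap is the one feature-based classification is meant to close.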


Proposed System

Objectives
  • Develop a phishing detection system analyzing URLs directly without third-party services or external content, focusing on efficiency and speed of classification.

Datasets Used
  • Utilized data from three datasets combining legitimate sites and phishing sites from reputable sources like Alexa and PhishTank.

    • Dataset Details:

      • Dataset-1: 40,668 phishing and 43,189 legitimate URLs.

      • Dataset-2: 40,668 phishing and 42,220 legitimate URLs.

      • Dataset-3: 40,668 phishing and 85,409 legitimate URLs.
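The class balance differs across the three datasets, which matters when reading accuracy figures. The sketch below uses only the counts listed above. The "majority baseline" (the accuracy of always predicting the larger class) is an illustrative yardstick added here, not a metric reported in the source.

```python
# (phishing, legitimate) URL counts as listed for each dataset.
DATASETS = {
    "Dataset-1": (40_668, 43_189),
    "Dataset-2": (40_668, 42_220),
    "Dataset-3": (40_668, 85_409),
}


def majority_baseline(phishing: int, legitimate: int) -> float:
    """Accuracy obtained by always predicting the more frequent class."""
    total = phishing + legitimate
    return max(phishing, legitimate) / total


for name, (phishing, legitimate) in DATASETS.items():
    total = phishing + legitimate
    baseline = majority_baseline(phishing, legitimate)
    print(f"{name}: total={total}, majority baseline={baseline:.4f}")
```

Dataset-1 and Dataset-2 are close to balanced, while roughly two thirds of Dataset-3 is legitimate URLs, so the same accuracy figure represents a smaller gain over the trivial baseline there.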

Feature Extraction

  • Identified 58 features relevant to URL analysis from an extensive literature review; Python scripts extract the features, which are then ranked with a Random Forest classifier.

  • Reduced the set to 48 critical features to optimize performance without sacrificing accuracy.
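As a rough illustration of URL-only feature extraction, the sketch below computes a handful of lexical features with the Python standard library. The summary does not enumerate the paper's 58 features, so every feature name here is a hypothetical example of the kind commonly used in the literature, and the Random Forest ranking step is not shown.

```python
import re
from urllib.parse import urlsplit

# Matches a dotted-quad hostname such as 192.168.0.1 (no range validation).
IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def extract_features(url: str) -> dict:
    """Return illustrative lexical features computed from the URL string alone."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "dot_count": host.count("."),
        "hyphen_count": host.count("-"),
        "digit_count": sum(ch.isdigit() for ch in url),
        "has_at_symbol": int("@" in url),
        "host_is_ip": int(bool(IPV4_RE.match(host))),
        "uses_https": int(parts.scheme == "https"),
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
    }


print(extract_features("http://192.168.0.1/login/verify"))
```

Because nothing is fetched from the network, each URL can be turned into a feature vector in well under a millisecond, which fits the stated goal of classifying without third-party services.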


System Implementation

Machine Learning Algorithms Used
  1. Logistic Regression (LR): Effective with binary outcomes but sensitive to feature repetition and noise.

  2. K-Nearest Neighbors (KNN): Fast but memory-intensive, sensitive to feature selection.

  3. Support Vector Machine (SVM): Good for larger datasets but handles noisy data poorly.

  4. Decision Tree (DT): Easy to interpret but prone to overfitting.

  5. Naive Bayes (NB): Quick and straightforward; struggles with correlated features.

  6. XGBoost: Focuses on performance but can overfit.

  7. Random Forest (RF): Resilient to noise; requires significant computational resources.

  8. Artificial Neural Network (ANN): Can generalize well; compute and memory requirements can be high, depending on the architecture.
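To make one of the listed trade-offs concrete, here is a minimal pure-Python sketch of KNN classification. The toy feature vectors, labels, and the `knn_predict` helper are hypothetical; the study's actual implementation and hyperparameters are not given in this summary.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the majority label among the k training points nearest to query."""
    # Every training point is kept and compared at prediction time,
    # which is why KNN's memory use grows with the training set.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    top_labels = [label for _, label in distances[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Toy vectors: (url_length, dot_count); label 1 = phishing, 0 = legitimate.
train_X = [(20, 1), (22, 1), (25, 2), (90, 5), (95, 6), (110, 7)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (100, 6)))
print(knn_predict(train_X, train_y, (21, 1)))
```

In this toy data the URL length dominates the Euclidean distance because it has the larger numeric range, which hints at why KNN is sensitive to which features are selected and how they are scaled.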


Experimental Results

Accuracy and Training Time Analysis
  • Dataset-1: RF achieved the highest accuracy at 94.59%.

  • Dataset-2: RF achieved 90.50% accuracy.

  • Dataset-3: RF achieved 91.26% accuracy.
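Accuracy and training time can be measured together with a small harness like the one sketched below. `MajorityClassifier` is a toy stand-in model and `evaluate` is a hypothetical helper; any object exposing `fit` and `predict` could be passed instead. This is an assumption about how such a comparison might be set up, not the authors' code.

```python
import time


class MajorityClassifier:
    """Toy stand-in model: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def evaluate(model, X_train, y_train, X_test, y_test):
    """Return (accuracy on the test split, training time in seconds)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start

    predictions = model.predict(X_test)
    correct = sum(p == t for p, t in zip(predictions, y_test))
    return correct / len(y_test), train_seconds


accuracy, seconds = evaluate(
    MajorityClassifier(),
    [[0], [1], [2]], [1, 1, 0],           # toy training split
    [[5], [6], [7], [8]], [1, 0, 1, 1],   # toy test split
)
print(f"accuracy={accuracy:.2%}, training time={seconds:.6f}s")
```

Timing only the `fit` call keeps the training cost separate from prediction cost, so that models with cheap training but expensive inference are not mischaracterized.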

Comparative Analysis of Results
  • The study found that the proposed method achieved higher accuracy than previous implementations, especially with the RF classifier, supporting the effectiveness of the designed system.


Conclusion and Future Works

  • Addressing phishing threats through advanced machine learning techniques is crucial in an increasingly digital landscape.

  • A focus on developing a larger dataset for improved classification is vital.

  • Future work will explore hybrid algorithms and integrate deep learning models for enhanced phishing detection capabilities.