A Deep Learning Methodology for Predicting Cybersecurity Attacks on the Internet of Things

Introduction to Cybersecurity in IoT

The Internet of Things (IoT) connects devices with intelligent machines and sensors to the internet.
IoT systems integrate apps, data storage, and services, creating cyberattack entry points.
Continuous monitoring is crucial for IoT system security.
Predicting attack types is vital for defense analysis and IoT device tracking.
Adapting to unexpected events, ensuring data protection and stability, and minimizing risks are key benefits of attack prediction.
Traditional attack prediction methods struggle with the volume and variety of attacks.
Machine learning (ML) and deep learning (DL) are now popular for prediction-based tasks.
AI algorithms, like ML and DL, can efficiently use data to forecast and identify cybersecurity threats in the IoT.
Deep learning is increasingly used for cyberattack identification and efficient mitigation because it processes complex, nonlinear patterns for predictions.
Deep learning models are essential for defending against IoT attacks, helping to detect, respond to, and prevent threats.
As IoT devices become more interconnected, deep learning aids in detecting and mitigating harmful attacks and preventing future ones.

Main Contributions of the Work

An AI model that combines DL with several machine learning and ensemble learning classifiers is proposed to detect cyberattacks on the IoT, with SMOTE (Synthetic Minority Over-sampling Technique) applied to handle class imbalance.
Improve the accuracy and confidence of cybersecurity attack detection in IoT environments compared to current works.
Produce more accurate and reliable predictions, leading to improved IoT security by preventing unauthorized access, data breaches, and service interruptions.
Enhance the generalization capabilities of the developed models by addressing the class imbalance issues commonly observed in IoT cybersecurity datasets through the application of SMOTE.
Bring an understanding of the optimal application of DL and ensemble learning models as cybersecurity attack prediction classifiers.

Literature Review Summary

ML Method for Malware Detection [19]: Proposes an ML method for malware detection in IoT networks without feature engineering, speeding up IoT edge with minimal power consumption.
DL Algorithms for DDoS Detection [20, 21]: Suggests merging RNN, LSTM-RNN, and CNN to create a bidirectional CNN-BiLSTM DDoS detection model. Achieved high accuracy rates, with CNN-BiLSTM reaching 99.76% accuracy and 98.90% precision using the CICIDS2017 dataset.
HIDE Scheme for Autonomous Vehicle Validation [22]: Introduces a heuristic distributed scheme (HIDE) to validate mobility patterns of autonomous vehicles in the Internet of Vehicles (IoV), improving traffic management systems.
DL Model for Cybersecurity Assaults [23]: Implemented a DL model to forecast prevalent cybersecurity assaults, achieving an efficacy of 0.99% with a test duration of 2.29 ms.
Federated DL for IoT Traffic Privacy [24]: Explored federated DL using several DL techniques, examining the efficacy of three IoT traffic databases in ensuring data privacy and enhancing attack detection accuracy.
FDL for Zero-Day Attack Detection [25]: Suggested FDL for detecting zero-day attacks, classifying network traffic using an ideal DNN architecture and the Federated Averaging (FedAvg) method.
LSTM Autoencoder for Feature Dimensionality Reduction [26]: Proposed using the encoding phase of the LSTM Autoencoder (LAE) to reduce the feature dimensionality of large-scale IoT network traffic data, requiring 91.89% less memory.
LGBA-NN for Botnet Attack Detection [27]: Presented a Local–Global best Bat Algorithm for Neural Networks (LGBA-NN) for effective detection of botnet assaults, achieving 90% accuracy.
HDRaNN for Cyberattack Detection in IIoT [28]: Introduces a unique hybrid deep random NN (HDRaNN) for cyberattack detection in the Industrial Internet of Things (IIoT), classifying sixteen distinct categories of cyberattacks with an accuracy of 0.98 to 0.99.
ML-Based Security Technique for RPL Loophole Attack [29]: A security technique based on ML was described for the RPL loophole attack. The evaluation of the gathered data revealed that the machine learning-based algorithms identified the loophole attack correctly.
Deep Learning for Cyber Assaults Identification [30]: Developed a technique using deep learning to identify cyber assaults directed against IoT equipment, achieving an accuracy rate of over 99% using a Modbus dataset.
ML Techniques for Cybersecurity Threats Prediction [32]: Proposed a model based on a variety of ML techniques for predicting many cybersecurity threats. Using an initial number according to efficiency and the ROC AUC result, the optimal algorithm was determined.

Proposed Model

The work introduces an automated detection model for IoT networks in which flow data collected by sensors are passed to feature engineering techniques. These techniques cover feature selection and class imbalance handling, followed by model training:

Feature Selection: Techniques like Recursive Feature Elimination and Principal Component Analysis address data problems like overfitting and training time.
SMOTE Approach: Used for balancing data and addressing class imbalance.
Deep Learning Models: Executed to determine performance and time complexity.

BoT-IoT Dataset

A recent dataset of simulated attacks on an IoT network, used in the experiments to evaluate attack identification.
Includes IoT data collected from the Cyber Range Lab of UNSW Canberra, comprising ordinary traffic flows as well as botnet traffic flows generated by various types of attacks.
A realistic testbed was used to create a valuable dataset with comprehensive traffic information.
Additional features were added and labeled to improve the machine learning models’ performance.
Three subcomponents contributed to the extraction of characteristics: simulated IoT services, networking structure, and investigative analyses.
The IoT system can gather real-time meteorological data and utilize them to adjust settings. A smart cooling fridge communicates cooling and temperature details, while a smart device manages lighting.
These lights function as motion detectors and turn on automatically when motion is detected. The list also includes an IoT smart door with probabilistic input and an intelligent thermostat that can adjust the temperature autonomously.

Table 1. BoT-IoT dataset

Type	Target	Count
BENIGN	Benign	9543
DDoS TCP	Attack	19,547,603
DDoS UDP	Attack	18,965,106
DDoS HTTP	Attack	19,771
DoS TCP	Attack	12,315,997
DoS UDP	Attack	20,659,491
DoS HTTP	Attack	29,706
Keylogging	Keylogging	1469
Data theft	Data theft	118
Total	-	73,370,443
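
The counts in Table 1 make the class imbalance concrete. The sketch below is illustrative only: it copies the counts listed in the table and computes each listed class's share of the stated total. The names TABLE_1_COUNTS, TABLE_1_TOTAL, and class_shares are assumptions introduced here, not part of the original work.

```python
# Counts copied from Table 1; all names here are illustrative assumptions.
TABLE_1_COUNTS = {
    "BENIGN": 9_543,
    "DDoS TCP": 19_547_603,
    "DDoS UDP": 18_965_106,
    "DDoS HTTP": 19_771,
    "DoS TCP": 12_315_997,
    "DoS UDP": 20_659_491,
    "DoS HTTP": 29_706,
    "Keylogging": 1_469,
    "Data theft": 118,
}
TABLE_1_TOTAL = 73_370_443  # total as stated in Table 1


def class_shares(counts, total):
    """Return each class's fraction of the total number of records."""
    return {name: count / total for name, count in counts.items()}


shares = class_shares(TABLE_1_COUNTS, TABLE_1_TOTAL)
# Print the classes from rarest to most common.
for name, share in sorted(shares.items(), key=lambda item: item[1]):
    print(f"{name:<11} {share:.6%}")
```

The rarest classes (data theft, keylogging, benign) each account for a tiny fraction of the records, which is the imbalance that oversampling is meant to address.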

Target Categories:

Benign: Normal, legitimate IoT network activity without malicious intent.
DDoS TCP attacks: Flood a network with TCP requests.
UDP-focused DDoS attacks: Flood networks with packets.
DDoS HTTP attacks: Flood web servers with HTTP requests.
TCP DoS attacks: Exploit TCP stack vulnerabilities.
UDP DoS attacks: Flood the target with many packets.
HTTP-based DoS attacks: Overload web servers with excessive requests.
Keylogging: Covert monitoring and recording of keystrokes.
Data theft: Unauthorized capture and exfiltration of information.

Data Pre-Processing

Data pre-processing is an essential component of model development.
Data cleansing comprises data filtration, data transformation, and checking for missing data.
In the data filtration phase, null and duplicate values are identified and eliminated.
In the data transformation phase, the data are converted into the appropriate format, such as from categorical to numerical.
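
The filtration and transformation steps described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' pipeline; the function clean_and_encode and the column names protocol and bytes are hypothetical.

```python
def clean_and_encode(rows, categorical_columns):
    """Drop rows with missing values or exact duplicates, then label-encode."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # data filtration: discard rows with null values
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # data filtration: discard exact duplicates
        seen.add(key)
        cleaned.append(dict(row))

    # Data transformation: map each category to an integer code.
    encoders = {}
    for column in categorical_columns:
        categories = sorted({row[column] for row in cleaned})
        encoders[column] = {cat: code for code, cat in enumerate(categories)}
        for row in cleaned:
            row[column] = encoders[column][row[column]]
    return cleaned, encoders


# Toy records with hypothetical column names, not real BoT-IoT rows.
raw = [
    {"protocol": "tcp", "bytes": 120},
    {"protocol": "udp", "bytes": 80},
    {"protocol": "tcp", "bytes": 120},  # exact duplicate of the first row
    {"protocol": None, "bytes": 64},    # missing value
]
cleaned, encoders = clean_and_encode(raw, ["protocol"])
print(cleaned)
print(encoders)
```

Integer label encoding is only one possible categorical-to-numerical conversion; the source does not state which encoding was used.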

Feature Engineering Techniques

Correlation Coefficient

The correlation coefficient measures the relationship between two factors in a given dataset.
Analyzing the correlation coefficient can provide valuable insights into the interdependencies and associations between different variables, enhancing comprehension of the dataset and its potential patterns.
The BoT-IoT variables for which the correlation coefficient is computed may include device type, communication protocols, network traffic patterns, and any other pertinent factors present in the dataset.
A high positive correlation shows that as one factor rises, the other usually rises as well, while a negative correlation indicates that as one factor rises, the other tends to fall.
A correlation coefficient near 0 indicates little or no linear relationship between the variables.
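
To make the interpretation above concrete, the following sketch computes the Pearson correlation coefficient for two toy feature pairs. It assumes the Pearson form of the coefficient, which the source does not name explicitly; the function pearson_r and the feature names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy values with hypothetical feature names.
packets = [10, 20, 30, 40, 50]
bytes_sent = [100, 210, 290, 405, 500]  # rises as packets rise
idle_time = [50, 40, 30, 20, 10]        # falls as packets rise

print(pearson_r(packets, bytes_sent))  # close to +1: strong positive correlation
print(pearson_r(packets, idle_time))   # close to -1: strong negative correlation
```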

Feature Importance using Random Forest

Feature importance analysis utilizing Random Forest is an effective method for determining the significance of various features of the BoT-IoT dataset.
This analysis reveals which characteristics have the greatest impact on the dependent variable.
The BoT-IoT dataset is divided into subsets for training and testing.
The attributes ‘pkSeqID’, ‘proto’, ‘saddr’, ‘sport’, ‘daddr’, ‘dport’, and ‘category’, which have low importance in the BoT-IoT dataset, were dropped.

SMOTE Approach

An enhanced approach for handling unbalanced data.
The SMOTE algorithm generates new samples by performing random linear interpolation between a selected minority sample and its nearby minority-class neighbors.
To improve classification on the unbalanced dataset, a given number of synthetic minority samples are generated, which reduces the imbalance between the classes.
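
The interpolation idea behind SMOTE can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the implementation used in the study or in any particular library; the function smote_sketch, its parameters, and the toy points are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by linear interpolation between a
    minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy two-feature minority class.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
print(new_points)
```

Each synthetic point lies on the line segment between two existing minority samples, so the new samples stay inside the region already occupied by the minority class.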

Ensemble Learning

Extra Trees Classifier: A variant of the Random Forest algorithm that adds extra randomness to the construction of decision trees. It mitigates overfitting and improves generalization accuracy by aggregating the results of its trees.
Histogram-based Gradient Boosting Classifier: Employs histograms to enhance both computational efficiency and predictive accuracy.
Adaptive Boosting Classifier: Combines weak learners iteratively to produce a robust classifier, improving classification accuracy compared to a single weak learner.
LGBM Classifier: Uses a histogram-based approach for binning continuous features, which significantly reduces the memory footprint and speeds up the training process.
CatBoost Classifier: A robust gradient boosting technique used here for classification. CatBoost builds symmetric trees and takes the statistical characteristics of the dataset into consideration.
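
Two of the classifiers above (histogram-based gradient boosting and LGBM) rely on binning continuous features. The sketch below shows only that binning step, using equal-width bins as a simplifying assumption; production libraries choose bin edges with more elaborate, data-dependent rules. The function equal_width_bins and the sample values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to an integer bin index in [0, n_bins)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # a constant feature falls into one bin
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Toy flow durations in seconds (hypothetical values).
durations = [0.0, 0.1, 0.5, 2.5, 9.9, 10.0]
binned = equal_width_bins(durations, n_bins=4)
print(binned)
```

After binning, a tree learner only needs to consider one candidate split per bin boundary instead of one per distinct value, which is where the memory and speed savings come from.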

Evaluation Metrics

Metrics: precision, recall, computation time, accuracy, and F1-score
True positives (TP): number of samples that are in fact positive and are predicted to be positive
False positives (FP): number of samples that are truly negative but are predicted to be positive
False negatives (FN): number of samples that are in fact positive but are predicted to be negative
True negatives (TN): number of samples that are negative and are predicted to be negative
Precision: The system’s ability to accurately detect the existence of an attack or security breach; it is the fraction of predicted attacks that are actual attacks
Precision = \frac{TP}{TP + FP}
Recall: The system’s ability to correctly recognize a botnet attack when it occurs on a network; it is the fraction of actual attacks that are detected
Recall = \frac{TP}{TP + FN}
Accuracy: The system’s ability to effectively classify attack and non-attack packets; it represents the percentage of accurate predictions relative to the total number of samples
Accuracy = \frac{TP + TN}{TP + FN + FP + TN}
F1-score: Harmonic mean of recall and precision; it summarizes how accurately normal and attacking flow samples are predicted in the testing sample
F1\text{-Score} = 2 \times \frac{Recall \times Precision}{Recall + Precision}
Time complexity: How an algorithm’s running time grows in relation to the amount of data.
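
As a worked illustration of these metrics, the sketch below computes precision, recall, accuracy, and F1-score from the four confusion-matrix counts (true positives, false positives, false negatives, true negatives) using their standard definitions. The function name and the example counts are hypothetical and are not results from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, accuracy and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical counts: 120 actual attacks, 880 benign flows.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=870)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is high even though a quarter of the attacks are missed, which is why recall and F1-score are reported alongside accuracy on imbalanced data.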

Experimental Settings

The experiments were implemented in Python using several AI and deep learning frameworks and packages, including the TensorFlow and Keras libraries, and were run in the Google Colab GPU environment. The dataset was initially partitioned into three parts: 70% for training, 20% for validation, and 10% for testing.
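
The 70/20/10 partition can be sketched as follows. This is a minimal illustration: the source does not say whether the split was shuffled or stratified, so the random shuffle, the seed, and the function name split_70_20_10 are assumptions.

```python
import random


def split_70_20_10(records, seed=42):
    """Shuffle and split records into 70% train, 20% validation, 10% test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = len(shuffled) * 7 // 10
    n_val = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Stand-in record identifiers instead of real dataset rows.
train, val, test = split_70_20_10(range(1000))
print(len(train), len(val), len(test))
```

For a dataset as imbalanced as this one, a stratified split that preserves class proportions in each part may be preferable, though that is a design choice not confirmed by the source.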

Experimental Results

This section gives an overview of the experimental outcomes of the study, which evaluated the performance of ten separate ML models for detecting malware. These models consist of two single classifiers, ensemble classifiers, and four deep learning architectures. Their efficacy is compared with and without the SMOTE algorithm for managing imbalanced data.

Experiments without Using the SMOTE Algorithm

The performance results for the models on the BoT-IoT dataset, obtained without the SMOTE algorithm, reveal varying levels of accuracy, precision, recall, and F1-score. Random Forest, Extra Trees, and KNN achieved competitive performance on all four metrics. These models were able to classify instances in the dataset effectively without the need for oversampling techniques.

Experiments Using the SMOTE Algorithm

The results provide a thorough review of several machine learning models based on their precision, recall, F1-score, CPU time, and model size. The CatBoost and XGBoost models demonstrated superior performance in detecting IoT network attacks on these metrics.

Discussion

The performance results provide information on the efficacy of various classifiers in detecting IoT network intrusions on the BoT-IoT dataset. A comparison reveals the effect that the SMOTE algorithm has on the performance metrics.

Conclusions

The objective is to implement an intelligent system for protecting IoT devices using a novel deep learning-based model to manage extremely complex datasets. The proposed models combine deep learning approaches with feature engineering to overcome obstacles such as overfitting, extended training times, and low model accuracy. CatBoost and XGBoost outperform the deep learning models, which learn from experience, especially when identifying future cyberattacks against IoT networks. The real-time BoT-IoT dataset represents enormous volumes of traffic affected by multiple types of attacks. The CatBoost and XGBoost classifiers attained accuracy rates of 98.19% and 98.50%, respectively. The best classifiers are consistent and dependable across the BoT-IoT dataset, making them viable options for detecting IoT network attacks regardless of whether the SMOTE algorithm is applied.