Web services are widely used, bringing convenience but also security threats.
Side-channel attacks (SCAs) are covert and dangerous, exploiting non-explicit information like network traffic patterns and response times to steal sensitive user data.
Traditional detection methods using rule-based feature engineering and statistical analysis have limitations with complex attack patterns and large-scale network traffic data.
A side-channel leakage detection method based on SSA-ResNet-SAN is proposed.
SSA (sparrow search algorithm) optimizes feature subset selection.
Deep residual networks (ResNet) and a signature aggregation network (SAN) analyze single-attribute and aggregated-attribute features, respectively.
This improves model accuracy and robustness.
Experiments show SSA-ResNet-SAN significantly outperforms existing methods.
On the Google dataset, it achieves 93% accuracy using aggregated attribute features, substantially higher than other models.
In multi-class tasks on Baidu and Bing datasets, it shows strong robustness and applicability.
It provides an efficient and reliable solution for Web security.
Web services have penetrated various aspects of social life, bringing convenience.
Cyberattacks, especially data theft and privacy leakage through Web service vulnerabilities, have increased.
Side-channel attacks (SCAs) exploit non-explicit information from system operations (network traffic patterns, response times, packet sizes, etc.) to extract sensitive user data.
SCAs have low technical barriers and significant potential for harm, making them an important hidden risk.
Modern Web systems that handle high-frequency user requests, such as search engines and recommendation systems, are particularly vulnerable.
SCAs can steal private data or infer sensitive system information without disrupting system operation, causing immeasurable damage.
Efficient detection and defense against Web side-channel attacks is a research focus in cybersecurity.
Traditional detection methods primarily rely on rule-based feature engineering and statistical analysis techniques.
These methods show limitations when facing large-scale, high-dimensional network traffic due to the increasing complexity of attack technologies and data scales.
Deep learning can automatically extract complex feature representations from massive data and capture hidden relationships through nonlinear modeling.
Deep learning has shown significant advantages in network traffic analysis, pattern recognition, and covert attack detection.
Convolutional neural networks (CNNs) extract spatial features from traffic data.
Recurrent neural networks (RNNs) capture temporal dependencies in traffic sequences.
Deep learning-based quantification techniques can assess the severity of side-channel leaks.
Existing methods often focus on analyzing single-dimensional or single-attribute features, neglecting the interaction between different attributes, which limits detection effectiveness.
There might be a high correlation between packet size, response time, and traffic sequence features, but traditional methods often model these features independently, ignoring their interrelationships.
Most research designs detection algorithms for general scenarios, but model performance often fails to meet expectations in specific application scenarios (such as search engine autocomplete functions).
Side-channel attacks in search engine autocomplete scenarios are characterized by frequent data interactions and complex traffic patterns, which place higher demands on the real-time performance and accuracy of detection algorithms.
Due to the high dimensionality, strong nonlinearity, and noise interference of network traffic data, existing deep learning methods often face issues of low computational efficiency and insufficient model robustness when dealing with large-scale network traffic data.
The SSA-ResNet-SAN method analyzes the attribute features of network traffic packets.
It constructs both single-attribute and aggregated-attribute feature vectors from filtered traffic files to comprehensively capture the interaction among multidimensional features.
It integrates ResNet with SSA to facilitate accurate modeling and feature extraction of network traffic, thereby significantly enhancing detection accuracy and robustness.
SSA-ResNet-SAN incorporates targeted optimization strategies tailored to the specific requirements of the search engine autocomplete scenario.
By effectively capturing pattern features and temporal dynamics within data traffic, it substantially improves the detection of side-channel vulnerabilities.
Detection methods range from early traditional techniques based on rules and statistical analysis to modern machine learning and deep learning algorithms.
Early approaches utilized round trip time (RTT) between networks as an attack feature, inferring users’ browsing behavior by monitoring delay patterns when accessing target websites.
Attackers identified specific websites or Web applications a user was visiting by comparing differences in RTT.
Machine learning techniques were incorporated to improve detection performance by modeling and classifying coarse-grained features of network traffic.
Classical classifiers such as support vector machines (SVM), random forests (RF), and K-nearest neighbors (KNN) have been widely applied to side-channel attack detection tasks.
These methods are capable of uncovering pattern features from large amounts of network traffic, thus effectively distinguishing normal traffic from abnormal traffic.
However, traditional machine learning methods heavily rely on feature engineering, often requiring manual design of data features, and they struggle to capture potential attack patterns in high-dimensional and complex scenarios.
With the rise of deep learning, researchers have begun exploring more intelligent detection methods, utilizing deep neural networks to automatically extract complex pattern features.
Stacked denoising autoencoders (SDAE) extract key features from data with significant noise interference through unsupervised learning of network traffic.
Convolutional neural networks (CNNs) excel at capturing spatial features of traffic data and can identify local patterns within side-channel data.
Long short-term memory (LSTM) networks are adept at handling time-series features and enhance detection accuracy by capturing temporal correlations in the traffic.
Sidebuster is an early black-box analysis tool capable of detecting side-channel vulnerabilities in Web applications.
It primarily relies on automated fuzz testing combined with behavior analysis of Web traffic to identify potentially vulnerable modules.
However, Sidebuster has drawbacks, such as poor adaptability to complex application scenarios, especially in real-time environments (e.g., search engine auto-suggestions), where its performance is limited.
Methods based on Google Web Toolkit (GWT) have been proposed, which perform preliminary exploration of side-channel attacks through fast frontend feature extraction.
These methods generally suffer from insufficient support for specific scenarios, poor real-time performance, or weak model generalization.
Most methods struggle to comprehensively analyze the interactions between multi-dimensional features, especially since the high-dimensional nature of network traffic data is often overlooked.
Existing methods are typically designed for general detection scenarios and lack optimization for specific contexts (e.g., search engine auto-suggestions), resulting in poor performance in complex scenarios.
Due to the dynamic changes in traffic data and noise interference, traditional methods still face challenges in terms of robustness and real-time processing capabilities in large-scale data environments.
A novel detection framework based on SSA-ResNet-SAN is introduced.
It integrates ResNet, SAN, and the SSA to analyze network traffic feature information from multiple perspectives.
ResNet extracts the core features of individual attributes within the network traffic, capturing intricate spatial patterns and filtering out key influencing factors through successive residual layers.
The SAN module then aggregates attribute feature vectors, fusing multi-dimensional traffic features to establish global dependencies, thereby enhancing the model’s capability to model complex interactions between these features.
The SSA module leverages the sparrow search algorithm (SSA) to dynamically optimize feature selection and parameter configuration, ensuring globally optimal feature selection and efficient model training.
The “Discoverer” and “Follower” mechanisms of SSA, in conjunction with the introduced “Sentinel” dynamic adjustment function, significantly bolster the model’s robustness and efficiency.
Through this architecture, the proposed method offers an accurate means of identifying side-channel vulnerabilities in search engine auto-suggestion scenarios, thus providing a robust technical foundation for side-channel leakage detection in Web applications.
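To make the pipeline concrete, the following is a minimal PyTorch sketch of how a ResNet-style extractor could feed an attention-based aggregation head; the module names (ResidualBlock, TrafficDetector), layer sizes, and the use of nn.MultiheadAttention as the aggregation step are illustrative assumptions rather than the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """1-D residual block for per-attribute traffic feature extraction (illustrative)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)          # skip connection

class TrafficDetector(nn.Module):
    """ResNet-style extractor followed by attention-based aggregation and a classifier."""
    def __init__(self, in_channels, hidden, num_classes, num_heads=4):
        super().__init__()
        self.stem = nn.Conv1d(in_channels, hidden, kernel_size=3, padding=1)
        self.resnet = nn.Sequential(ResidualBlock(hidden), ResidualBlock(hidden))
        # Self-attention over the sequence of block-level features (SAN-like aggregation).
        self.attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):                  # x: (batch, in_channels, seq_len)
        h = self.resnet(self.stem(x))      # per-attribute feature maps
        h = h.transpose(1, 2)              # (batch, seq_len, hidden)
        z, _ = self.attn(h, h, h)          # aggregated features
        return self.head(z.mean(dim=1))    # class logits over secret values
```

A forward pass on a dummy batch, e.g. TrafficDetector(in_channels=5, hidden=64, num_classes=10)(torch.randn(8, 5, 32)), yields logits over ten secret-value classes; the SSA step described below would then be used to tune the feature subset and hyperparameters.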
SSA is a global optimization algorithm inspired by the cooperative behavior of sparrows in foraging, particularly the “discoverer–follower” collaboration strategy and the “sentinel” mechanism.
SSA achieves a balance between global exploration and local exploitation for solving complex optimization problems.
Each sparrow individual X_i = [x_{i1}, x_{i2}, \ldots, x_{id}] represents a d-dimensional solution vector.
The goal is to optimize the fitness function f(X) through iterative updates.
The population is initialized by randomly generating initial solutions, described as:
X_i^{(0)} = X_{\min} + \text{rand} \cdot (X_{\max} - X_{\min})
where X_{\min} and X_{\max} represent the lower and upper bounds of the search space, and rand is a matrix of random numbers uniformly distributed in [0, 1].
Discoverers are responsible for global exploration. Their position update strategy uses exponential decay and directional adjustments to balance exploration and exploitation:
X_i^{t+1} = X_i^t \cdot \exp\left(-\frac{i}{\alpha \cdot T}\right) + S \cdot (X_i^t - X_j^t) \cdot I[R_1 \ge P_d]
where \alpha is the scaling factor, T is the maximum number of iterations, R_1 \sim U(0, 1) is a uniformly distributed random number, P_d is the alert threshold for discoverers, and I[\cdot] is an indicator function that determines whether the directional adjustment is triggered.
Followers update their positions based on the behavior of the discoverers for local exploitation, described by:
X_i^{t+1} = X_i^t + F \cdot (X_{\text{best}}^t - X_i^t) + \sigma \cdot \text{rand}
where F is the learning rate, X_{\text{best}}^t represents the best solution in the current population, \sigma \sim N(0, \sigma^2) is Gaussian noise, and rand introduces random perturbations to enhance diversity.
The sentinel mechanism dynamically adjusts the positions of some individuals when the population is trapped in a local optimum. The sentinel updates are described as:
X_i^{t+1} = X_i^t + \beta \cdot \text{sign}\bigl(f(X_{\text{worst}}^t) - f(X_i^t)\bigr) \cdot \text{rand}
where \beta is the adjustment coefficient, and f(X_{\text{worst}}^t) is the fitness value of the worst individual in the population.
The fitness optimization goal minimizes the distance between individuals and the global optimal solution, expressed in integral form as:
f(X_i^{t+1}) = \int_{\Omega} \bigl(X_i^{t+1} - X_{\text{opt}}\bigr)^2 \, d\Omega
where X_{\text{opt}} represents the global optimal solution at the current iteration, and \Omega denotes the search space.
To determine convergence, the mean change in the fitness of the population is used as the stopping criterion. The algorithm terminates when:
\frac{1}{N} \sum_{i=1}^{N} \bigl| f(X_i^t) - f(X_i^{t-1}) \bigr| < \epsilon
where N is the population size, and \epsilon is the predefined threshold.
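The update rules above can be summarized in a compact NumPy sketch; the population size, the fractions of discoverers and sentinels, and the coefficients S, F, \beta, and P_d below are placeholder assumptions, not tuned values from the paper.

```python
import numpy as np

def ssa_optimize(fitness, dim, x_min, x_max, pop=30, max_iter=100,
                 alpha=0.8, S=0.5, F=0.8, beta=0.3, P_d=0.8, eps=1e-6):
    """Sparrow search sketch following the update rules above (coefficients are assumptions)."""
    X = x_min + np.random.rand(pop, dim) * (x_max - x_min)   # random initialization in bounds
    prev_fit = np.array([fitness(x) for x in X])
    n_disc = max(1, pop // 5)                                # assume ~20% of the flock are discoverers
    for t in range(1, max_iter + 1):
        fit = np.array([fitness(x) for x in X])
        order = np.argsort(fit)
        best, worst = X[order[0]].copy(), X[order[-1]].copy()
        for rank, i in enumerate(order):
            if rank < n_disc:                                # discoverers: global exploration
                X[i] = X[i] * np.exp(-rank / (alpha * max_iter))
                if np.random.rand() >= P_d:                  # directional adjustment triggered
                    j = np.random.randint(pop)
                    X[i] = X[i] + S * (X[i] - X[j])
            else:                                            # followers: local exploitation toward best
                X[i] = X[i] + F * (best - X[i]) + 0.1 * np.random.randn(dim)
        # sentinels: perturb a few individuals to escape local optima
        for i in np.random.choice(pop, max(1, pop // 10), replace=False):
            X[i] = X[i] + beta * np.sign(fitness(worst) - fitness(X[i])) * np.random.rand(dim)
        X = np.clip(X, x_min, x_max)
        new_fit = np.array([fitness(x) for x in X])
        if np.mean(np.abs(new_fit - prev_fit)) < eps:        # mean fitness change below threshold
            break
        prev_fit = new_fit
    return X[np.argmin([fitness(x) for x in X])]

# Example: minimize a simple quadratic fitness in 8 dimensions
x_star = ssa_optimize(lambda x: float(np.sum(x ** 2)), dim=8, x_min=-5.0, x_max=5.0)
```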
The signature aggregation network (SAN) is a key module used to model and aggregate multidimensional features of network traffic.
Its primary function is to allocate weights to features and produce global feature representations through the attention mechanism.
Given input data X \in R^{n \times d}, where n is the number of samples and d is the feature dimension, the SAN module first applies linear transformations to generate queries Q, keys K, and values V:
Q = XW_Q, \quad K = XW_K, \quad V = XW_V
where W_Q, W_K, W_V \in R^{d \times d'} are learnable weight matrices and d' is the transformed feature dimension.
The queries Q and keys K are then used to compute the attention weight matrix A through dot product, followed by normalization:
A_{ij} = \frac{\exp\left(\frac{Q_i K_j^\top}{\sqrt{d'}}\right)}{\sum_{k=1}^{n} \exp\left(\frac{Q_i K_k^\top}{\sqrt{d'}}\right)}, \quad \forall i, j \in \{1, \ldots, n\}
where \sqrt{d'} is a scaling factor used to stabilize gradients.
Using the weight matrix A, the value vectors V are aggregated through a weighted sum to generate the aggregated feature representation Z, expressed as:
Z_i = \sum_{j=1}^{n} A_{ij} V_j, \quad Z \in R^{n \times d'}.
Finally, the aggregated features are mapped through a linear transformation and an activation function to produce the output Y \in R^{n \times d}, represented as:
Y = \sigma(ZW_O + b_O), \quad W_O \in R^{d' \times d}, \; b_O \in R^d
where \sigma(\cdot) denotes the activation function, such as ReLU or Sigmoid.
Combining the above steps, the complete SAN process can be formulated as:
Y = \sigma \left( \text{Softmax} \left( \frac{XWQ (XWK)^\top}{\sqrt{d'}} \right) XWV WO + b_O \right).
SAN uses this mechanism to perform weighted aggregation and global modeling of features, effectively capturing complex interactions between features and providing richer feature representations for subsequent tasks.
In this paper, the SAN module is used to further process the single-attribute features extracted by ResNet, producing aggregated features that significantly enhance the detection capability for side-channel leakage.
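A minimal PyTorch sketch of this aggregation mechanism is shown below; the class name SANLayer, the single-head formulation, and the ReLU activation are assumptions made for illustration rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SANLayer(nn.Module):
    """Single-head attention aggregation following the SAN equations (illustrative sketch)."""
    def __init__(self, d, d_prime):
        super().__init__()
        self.W_q = nn.Linear(d, d_prime, bias=False)   # Q = X W_Q
        self.W_k = nn.Linear(d, d_prime, bias=False)   # K = X W_K
        self.W_v = nn.Linear(d, d_prime, bias=False)   # V = X W_V
        self.W_o = nn.Linear(d_prime, d)               # output projection W_O, b_O
        self.scale = d_prime ** 0.5                    # sqrt(d') scaling factor

    def forward(self, x):                              # x: (n, d)
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)
        A = F.softmax(Q @ K.transpose(-2, -1) / self.scale, dim=-1)  # attention weights A_ij
        Z = A @ V                                      # weighted aggregation Z = A V
        return torch.relu(self.W_o(Z))                 # Y = sigma(Z W_O + b_O)

# Example: aggregate n = 50 packet-feature vectors of dimension d = 16
y = SANLayer(d=16, d_prime=32)(torch.randn(50, 16))
```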
A high-quality dataset for side-channel leakage detection is constructed using a Docker container-based experimental environment and the tcpdump tool for real-time network traffic capture.
This approach ensures precise recording of traffic characteristics during client–server interactions.
Realistic search engine usage scenarios generate highly precise and complete network traffic data.
Lightweight Docker containers offer excellent isolation, fast startup, and minimal resource consumption.
The Firefox browser running inside the container simulates user search behavior.
During the process of inputting a secret value, each character entered triggers an interaction between the client and the server, resulting in corresponding network traffic.
The tcpdump tool is started before the input operation, monitoring the network interfaces inside the container.
Each secret value W consists of multiple characters and can be represented as:
W = l_1 l_2 \cdots l_r, \quad l_j \in \{A\text{-}Z, a\text{-}z, 0\text{-}9, \ldots\}, \quad j \in [1, r]
where r is the number of characters in the secret value.
Every character entry triggers one or more client–server interactions, generating one or more data packets.
Once the secret value has been fully entered, tcpdump generates a corresponding network traffic file f, named after the secret value, which serves as a label for its associated traffic.
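As an illustration, a capture session of this kind could be driven from Python roughly as follows; the interface name, output directory, and typing rhythm are assumptions about the setup, and the type_char callback is a hypothetical stand-in for whatever harness actually drives the Firefox search box.

```python
import subprocess
import time

def capture_secret(secret, type_char, iface="eth0", out_dir="/data/pcaps"):
    """Start tcpdump, replay one secret value through the supplied typing callback, stop capture.

    `type_char` is a caller-provided function that sends one character to the
    browser search box (e.g., via a UI-automation tool); it is a placeholder here.
    """
    pcap_path = f"{out_dir}/{secret.replace(' ', '_')}.pcap"   # traffic file labeled by the secret
    proc = subprocess.Popen(["tcpdump", "-i", iface, "-w", pcap_path],
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    time.sleep(1)                    # give tcpdump time to attach to the interface
    for ch in secret:
        type_char(ch)                # each character triggers client-server interactions
        time.sleep(0.3)              # fixed inter-character rhythm (configurable)
    time.sleep(2)                    # wait for trailing autocomplete responses
    proc.terminate()                 # flush and close the .pcap file
    proc.wait()
    return pcap_path
```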
Each generated traffic file f consists of multiple packets p, each recording attributes such as arrival time (time), destination address (Destination), source address (Source), protocol type, and packet size (length).
A complete network traffic file f can be represented as:
f_i = \{p_1, p_2, \ldots, p_m\}, \quad i \in [1, n].
The entire network traffic dataset can be represented as:
F = \{f_1, f_2, \ldots, f_n\}, where F is generated by entering multiple secret values w_i during the experiment.
Multiple sets of traffic data are collected for each secret value to ensure proper functioning of the detection algorithm and enhance the reliability of the data.
The range and format of the secret values can be customized.
The input rhythm (e.g., character input intervals) and environmental variables (e.g., network delay and server response time) can be controlled to further enrich the dataset.
Three noise filtering strategies are proposed: protocol filtering, IP address filtering, and packet size filtering.
Protocol filtering discards all non-TCP traffic (such as UDP, ICMP packets, etc.), retaining only reliable TCP data transmission traffic and encrypted packets that use TLSv1.3 and TLSv1.2.
IP address filtering retains only traffic related to the target server by matching the source and destination IP addresses of the packets, discarding traffic from other unrelated IP addresses.
Packet size filtering cleans up anomalous or invalid packets by analyzing the normal packet size distribution of the target traffic and discarding packets with lengths smaller than L_{\min} or larger than L_{\max}.
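A Scapy-based sketch of the three filtering rules might look as follows; the server_ip argument and the size bounds l_min and l_max are placeholders to be set from the observed traffic distribution. Because TLSv1.2/1.3 records are carried over TCP, keeping TCP packets also retains the encrypted traffic of interest.

```python
from scapy.all import rdpcap, wrpcap, IP, TCP

def filter_pcap(in_path, out_path, server_ip, l_min=60, l_max=1500):
    """Apply protocol, IP-address, and packet-size filtering (thresholds are placeholders)."""
    kept = []
    for pkt in rdpcap(in_path):
        if not pkt.haslayer(TCP) or not pkt.haslayer(IP):
            continue                                      # protocol filtering: keep TCP/TLS only
        if server_ip not in (pkt[IP].src, pkt[IP].dst):
            continue                                      # IP filtering: target-server traffic only
        if not (l_min <= len(pkt) <= l_max):
            continue                                      # size filtering: drop anomalous lengths
        kept.append(pkt)
    wrpcap(out_path, kept)
    return len(kept)
```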
Aggregated attribute feature vectors are constructed to accurately describe the interaction behavior between the client and server.
Raw traffic packets are processed and segmented into blocks, and statistical features of these blocks are extracted.
Each time a user enters a character l_j in the search box, a burst of network activity is triggered between the client and server.
These activities can be analyzed and segmented using the time intervals \Delta = \{\Delta_1, \Delta_2, \ldots, \Delta_{p-1}\}, where each interval is defined as \Delta_i = P_{i+1}.\text{time} - P_i.\text{time}.
When the consecutive intervals \Delta_1, \Delta_2, \ldots, \Delta_{i-1} are all smaller than a threshold t but \Delta_i > t, the sequence [P_1, P_2, \ldots, P_i] is considered a complete data block B_1, while packet P_{i+1} belongs to the next block.
Based on this rule, a network traffic file can be divided into multiple data blocks:
F = \{B_1, B_2, B_3, \ldots, B_s\}.
For each data block B_k, aggregated features can summarize the internal attributes of the block.
For instance, the size feature of a data block can be defined as the total size of all packets within the block:
BS_k = \sum_{j=1}^{i_k} P_j.\text{size}
where i_k is the number of packets in block B_k.
The aggregated features of a network traffic file can be represented as the set of features from all blocks:
\text{Feature}(F) = \{\text{Feature}(B_1), \text{Feature}(B_2), \ldots, \text{Feature}(B_s)\}.
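The segmentation rule and the block-size feature can be sketched in plain Python as follows, assuming each packet is reduced to a (time, size) tuple and using an illustrative threshold t.

```python
def segment_blocks(packets, t=0.05):
    """Split a time-ordered list of (time, size) tuples into blocks wherever the gap exceeds t seconds."""
    blocks, current = [], [packets[0]]
    for prev, cur in zip(packets, packets[1:]):
        if cur[0] - prev[0] > t:          # interval Delta_i > t closes the current block
            blocks.append(current)
            current = []
        current.append(cur)
    blocks.append(current)
    return blocks

def block_features(blocks):
    """Aggregated feature per block: total size BS_k of all packets in the block."""
    return [sum(size for _, size in block) for block in blocks]

# Example: aggregated feature vector for one filtered traffic file
packets = [(0.00, 120), (0.01, 300), (0.02, 90), (0.30, 150), (0.31, 640)]
print(block_features(segment_blocks(packets)))   # -> [510, 790]
```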
Experimental datasets are sourced from multiple search engines, covering various categories and task complexities.
Initial experiments used data from Baidu and Bing search engines, with 10 designed categories (news, games, actors, athletes, concerts, movies, etc.).
Each category contained 5 training samples and 5 testing samples, totaling 50 training samples and 50 testing samples, which were used to construct the basic 10-class classification task.
The classification task was further expanded to include 20-class and 30-class datasets.
The Google dataset consisted of the top 10 global trending search terms of 2017 as secret values.
The software environment is based on the Ubuntu 20.04 LTS operating system, combined with Docker containers to provide an isolated experimental environment.
Firefox browser (version 89.0) was used to simulate user search behavior.
tcpdump 4.9.3 was employed to capture network traffic data.
Data preprocessing and feature extraction were performed using Python 3.8 and its main data processing libraries, such as Scapy, NumPy, and Pandas.
The deep learning framework PyTorch 1.9.0 was utilized to implement the training and testing of the side-channel leakage detection models.
The hardware environment used an Intel Xeon E5-2680 v4 (14 cores, 2.4 GHz) multi-core processor with an NVIDIA Tesla V100 GPU (32 GB HBM2 memory) and 128 GB DDR4 memory.
The side-channel leakage detection task is framed as a multi-class classification problem.
Evaluation metrics employed include accuracy, precision, recall, and F1-score, complemented by three aggregation strategies: macro average, weighted average, and micro average.
Macro average calculates the arithmetic mean of precision, recall, and F1-score across all classes. The formula for the macro-averaged precision is:
\text{Precision}_{\text{macro}} = \frac{\sum_{i=1}^{C} \text{Precision}_i}{C}
where C is the total number of classes, and \text{Precision}_i represents the precision for the i^{th} class. Similarly, the formulas for macro-averaged recall and F1-score are:
\text{Recall}_{\text{macro}} = \frac{\sum_{i=1}^{C} \text{Recall}_i}{C}, \quad F1_{\text{macro}} = \frac{\sum_{i=1}^{C} F1_i}{C}.
Weighted average assigns weights to each class based on the proportion of samples in that class relative to the total number of samples. The formula for weighted precision is:
\text{Precision}_{\text{weighted}} = \frac{\sum_{i=1}^{C} m_i \cdot \text{Precision}_i}{N}
where m_i is the number of samples in the i^{th} class, and N = \sum_{i=1}^{C} m_i is the total number of samples. Similarly, the weighted recall and F1-score are calculated as:
\text{Recall}_{\text{weighted}} = \frac{\sum_{i=1}^{C} m_i \cdot \text{Recall}_i}{N}, \quad F1_{\text{weighted}} = \frac{\sum_{i=1}^{C} m_i \cdot F1_i}{N}.
Micro average calculates the metrics by constructing a global confusion matrix across the entire dataset, ignoring class labels. The micro-averaged precision, recall, and F1-score are:
\text{Precision}_{\text{micro}} = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C}(TP_i + FP_i)}, \quad \text{Recall}_{\text{micro}} = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C}(TP_i + FN_i)}, \quad F1_{\text{micro}} = 2 \cdot \frac{\text{Precision}_{\text{micro}} \cdot \text{Recall}_{\text{micro}}}{\text{Precision}_{\text{micro}} + \text{Recall}_{\text{micro}}}
where TP_i, FP_i, and FN_i are the true positives, false positives, and false negatives for the i^{th} class, respectively.
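For reference, all three averaging strategies can be computed directly with scikit-learn's precision_recall_fscore_support; the labels below are dummy values used purely for illustration.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0, 2, 1]       # dummy ground-truth secret-value classes
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]       # dummy model predictions

print("accuracy:", accuracy_score(y_true, y_pred))
for avg in ("macro", "weighted", "micro"):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg:>8}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```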
Experiments were conducted on Baidu and Bing datasets to analyze the impact of different noise filtering strategies on the performance of SSA-ResNet-SAN.
Experiment 1 utilized an effective noise filtering method and achieved high accuracy across most categories.
Experiment 2 exhibited relatively lower accuracy, indicating that its filtering strategy reduced noise at the cost of losing some critical information.
Experiment 3 adopted a more lenient filtering strategy and included more irrelevant data, experiencing a significant drop in accuracy.
Experiments compared the impact of single-attribute features and aggregated-attribute features on the performance of SSA-ResNet-SAN using the Baidu and Bing datasets.
Detection accuracy of aggregated-attribute features was significantly higher than that of single-attribute features.
Aggregated-attribute features better capture the interactions between multi-dimensional attributes, thereby significantly improving detection performance.
Accuracy on the Google dataset differed significantly between the LSTM and SSA-ResNet-SAN models.
With single-attribute features, SSA-ResNet-SAN achieved 75% accuracy, significantly higher than LSTM's 55%.
With aggregated attribute features, both models improved, but SSA-ResNet-SAN achieved 93% accuracy, 10 percentage points higher than LSTM's 83%.
The SSA-ResNet-SAN achieves perfect precision, recall, and F1 scores for several secret values, such as “Hurricane Irma,” “Las Vegas Shooting,” and “Mayweather vs McGregor Fight,” indicating its ability to effectively handle these categories.
The overall performance highlights the effectiveness of SSA-ResNet-SAN in leveraging aggregated attribute features for side-channel leakage detection, with consistently high precision and strong performance across the majority of the secret values.
This study proposes an efficient detection method based on SSA-ResNet-SAN for side-channel leakage detection in Web security.
It constructs both single-attribute and aggregated-attribute feature vectors and integrates deep residual networks (ResNet) with a signature aggregation network (SAN) and the sparrow search algorithm (SSA).
Tailored optimization strategies are introduced to meet the specific demands of search engine autocomplete scenarios.
Experimental results demonstrate that SSA-ResNet-SAN outperforms existing methods.
Limitations include the computational complexity of the model when processing high-dimensional network traffic data and the limited dataset from search engine scenarios.
Future research will focus on improving the adaptability and efficiency of the model in Web security scenarios by optimizing feature extraction and model architecture, exploring techniques such as joint learning and multi-task learning, and expanding the scale of the datasets.