Software Security - Week 14_2

Web Attacks Detection Using Stacked Generalization Ensemble for LSTMs and Word Embedding

Abstract

  • Web applications are vulnerable due to open accessibility.
  • Traditional web attack prevention methods have limitations:
    • Inability to detect zero-day attacks.
    • Difficulty in analyzing complex attacks.
    • Require constant maintenance by security experts.
  • Recent research focuses on using deep learning for web intrusion detection.
  • High-risk web attacks are often injected into HTTP web requests.
  • Detecting these attacks involves classifying HTTP web requests as normal or anomalous.
  • The paper proposes an approach using Word2vec embedding and a stacked generalization ensemble model for LSTMs to detect malicious HTTP web requests.
  • Evaluated using the HTTP CSIC 2010 dataset.
  • Results indicate that combining different word-level embeddings in a stacked generalization ensemble model for LSTMs yields good performance in terms of classification metrics and training time.

Keywords

  • Web attacks detection
  • Web security
  • Deep learning
  • Word2vec
  • Stacked generalization ensemble
  • LSTM

1. Introduction

  • Web applications are widely used in various sectors like education, healthcare, and finance.
  • Drawbacks include risks to security and to the protection of personal data.
  • A 2019 survey found:
    • 9 out of 10 web applications are vulnerable.
    • Sensitive data breaches are possible on 68% of web applications.
    • 8% of network intrusions are due to unauthorized access to web servers.
  • Traditional web security techniques (static and dynamic code analysis, WAF) have limitations:
    • Cannot detect zero-day attacks.
    • Cannot correlate and analyze events for attack chain detection.
    • WAFs require regular updates with new web vulnerability signatures.
  • Web intrusion detection systems based on machine learning and deep learning models offer a promising solution.
  • The paper proposes a deep learning-based approach for detecting malicious HTTP web requests.
  • The paper is organized as follows:
    • Section 2: Review of related research.
    • Section 3: Background information.
    • Section 4: Proposed approach explanation.
    • Section 5: Presents and discusses experimental results.
    • Conclusion and future work in the last section.

2. Literature Review

  • Review of existing literature on web vulnerabilities detection using deep learning.
  • [19]: Word-level embedding, manually extracted statistical features, CNN, and GRU networks.
    • Achieved 97.56% and 99.00% accuracy on the HTTP CSIC 2010 dataset (without and with features augmentation, respectively).
  • [14]: Combined CNN, LSTM networks, and one-hot encoding.
    • Detected 95% of web attacks in the HTTP CSIC 2010 dataset.
  • [21]: CNN networks and character-level embedding.
    • Achieved 92% accuracy at 0.1% false positives on detecting malicious URLs in a custom dataset.
  • [11]: ResNN and GRU networks.
    • Achieved 99% accuracy on the CW900 public dataset for detecting website fingerprinting attacks.
  • [16]: CNN based model.
    • Achieved 99.50% accuracy on a custom dataset for preventing SQL injection attacks.
  • [20]: Character-level embedding and CNN networks.
    • Achieved 0.02% false positive rate on a custom dataset for detecting SQLI, Directory Traversal, Remote File Inclusion, and XSS.
  • [2]: LSTM networks.
    • Achieved 99.97% accuracy on the HTTP CSIC 2010 dataset for intrusion detection systems.
  • [18]: 2-grams method, DBN, and stacked auto-encoder.
    • Showed that deep neural networks as feature extractors can improve traditional machine learning models' performance.
  • [13]: CNN, LSTM, and DNN with UTF-8 encoding.
    • Used to detect malicious URLs in online real-time data.
  • [15]: URLNet framework.
    • CNN applied to char-level word embedding for malicious URL detection.
    • Outperformed Word-level CNN and character-level CNN.
  • [17]: Textual and statistical features into an ensemble classification model.
    • Ensemble of LSTM, SVM, Random Forest, and Logistic Regression.
    • Ensemble model performed better than individual models in most cases.
  • [1]: Systematic literature review on Deep Learning-based web attacks detection.
    • Compiled research papers published between 2010 and 2021.

3. Preliminaries

3.1. Problem Definition

  • Web applications are susceptible to various web attacks.
  • Each web attack has unique characteristics, making feature extraction challenging.
  • HTTP web requests are the primary channel for critical web attacks.
  • Detecting web attacks requires classifying HTTP web requests as malicious or benign.
  • This work frames web attacks detection as a binary classification problem of HTTP web requests.
  • A malicious HTTP web request indicates a web attack; a benign one indicates its absence.
  • Word embedding transforms HTTP web requests into vectors.
  • A stacked generalization ensemble for deep learning models is used to distinguish between normal and anomalous HTTP web requests.

3.2. Deep Learning Techniques

  • Machine learning aims to create algorithms that learn from data and predict on new data.
  • Deep learning is a subfield of machine learning.
  • Key deep learning models used:
    • Convolutional Neural Networks (CNN):
      • Commonly used in image recognition and NLP.
      • Learn fewer parameters than fully connected networks, resulting in faster training.
      • Extract complex features through convolution and pooling operations.
    • Recurrent Neural Networks (RNN):
      • Network behavior varies over time.
      • Hidden neurons depend on previous layers and neurons at earlier times.
      • Suitable for sequential information processing.
      • LSTM and GRU are common RNN variants that address the vanishing gradient problem.
    • Stacking Ensemble for Deep Neural Networks:
      • A meta-model is trained to combine predictions from different sub-models.
      • Dataset is split into train, validation, and test sets.
      • Sub-models are fit on the train set, the meta-model is fit on the test set, and evaluation is done on the validation set.
      • Stacked generalization (stacking) improves predictive performance over single sub-models.
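
The stacking procedure above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the sub-models are toy threshold classifiers instead of neural networks, the meta-model is a small logistic regression, and all data values and names (fit_threshold, fit_meta, holdout, final) are made up for the example. The only point it shows is the division of labor: sub-models learn from one split, the meta-model learns from their predictions on a second split, and a third split is kept for evaluation.

```python
import math


def fit_threshold(xs, ys):
    """Toy sub-model: cut halfway between the two class means of one feature."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1.0 if x > cut else 0.0


def fit_meta(preds, ys, epochs=500, lr=0.5):
    """Meta-model: logistic regression over the sub-models' predictions."""
    w, b = [0.0] * len(preds[0]), 0.0
    for _ in range(epochs):
        for p, y in zip(preds, ys):
            z = b + sum(wi * pi for wi, pi in zip(w, p))
            err = 1 / (1 + math.exp(-z)) - y  # gradient of log-loss w.r.t. z
            w = [wi - lr * err * pi for wi, pi in zip(w, p)]
            b -= lr * err
    return lambda p: 1 if b + sum(wi * pi for wi, pi in zip(w, p)) > 0 else 0


# Made-up data: (feature_a, feature_b, label); each sub-model sees one feature.
train = [(0.1, 0.2, 0), (0.2, 0.1, 0), (0.9, 0.8, 1), (0.8, 0.9, 1)]
holdout = [(0.3, 0.2, 0), (0.7, 0.9, 1), (0.2, 0.4, 0), (0.8, 0.6, 1)]
final = [(0.15, 0.3, 0), (0.85, 0.7, 1)]

labels = [row[2] for row in train]
sub_models = [fit_threshold([row[i] for row in train], labels) for i in (0, 1)]


def stack_inputs(rows):
    """Run every sub-model on its own feature to build meta-model inputs."""
    return [[m(row[i]) for i, m in enumerate(sub_models)] for row in rows]


meta = fit_meta(stack_inputs(holdout), [row[2] for row in holdout])
predictions = [meta(p) for p in stack_inputs(final)]
```

The meta-model never sees raw features, only the sub-models' outputs, which is what distinguishes stacking from simply training one larger model.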

3.3. Word Embeddings

  • Various word representations exist (Bag Of Words, TF-IDF, GloVe).
  • This paper uses Word2vec embedding to generate vector representations of HTTP Web requests.

4. Proposed Approach

4.1. Overview

  • The proposed web attacks detection method includes:
    1. Relevant feature selection.
    2. Concatenation of selected features into a textual format.
    3. Input pre-processing (see section 4.2).
    4. Using different word2vec models to get different word embeddings of the pre-processed HTTP Web request.
    5. Passing word embeddings to a stacked generalization ensemble for LSTM models.
    6. Classification of HTTP web requests.

4.2. Preprocessing

4.2.1. Feature selection and concatenation
  • HTTP web requests consist of different elements.
  • Selected features:
    • HTTP method (operation to perform).
    • Request or URL (resource path).
    • Payload (data submitted by the client).
  • Concatenate the three elements to have a textual input.
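
A minimal sketch of the concatenation step, assuming a single space as the separator (the notes do not say which separator is used). The function name build_input and the sample request are hypothetical.

```python
def build_input(method, url, payload=""):
    """Join the three selected request elements into one text string."""
    # A single space is an assumed separator; empty elements are skipped.
    parts = [method.strip(), url.strip(), payload.strip()]
    return " ".join(part for part in parts if part)


# Made-up request, for illustration only.
sample = build_input("POST", "http://example.com/shop/add.jsp", "id=2&name=cheese")
```

A GET request without a body simply yields the method followed by the URL.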
4.2.2. Decoding
  • Web attackers bypass WAFs using encoding techniques.
  • URL decode the input data to obtain its original form.
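
URL decoding can be done with Python's standard library. The sketch below uses urllib.parse.unquote_plus, which also turns "+" into a space; treating "+" as a space is an assumption that fits form-encoded payloads and may not match the decoder used in the paper. The function name url_decode and the sample payload are hypothetical.

```python
from urllib.parse import unquote_plus


def url_decode(text):
    """Percent-decode a request string; '+' is treated as an encoded space."""
    return unquote_plus(text)


# Made-up obfuscated payload, for illustration only.
decoded = url_decode("id=1%27+OR+%271%27%3D%271")
```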
4.2.3. Generalization
  • Replace website addresses of the form "http://name-of-website" with a generic placeholder ("http://u" or "https://u").
  • Replace all numbers in the input data with "0".
  • Improves data quality by reducing meaningless information.
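
The two generalization rules can be sketched with regular expressions. Two assumptions are made here that the notes do not confirm: the scheme (http or https) is kept while the host part is replaced by "u", and each run of digits counts as one number and becomes a single "0". The names URL_HOST, NUMBER, and generalize are hypothetical.

```python
import re

# Scheme followed by everything up to the first '/', '?', '#', or whitespace.
URL_HOST = re.compile(r"(https?)://[^/\s?#]+", re.IGNORECASE)
# One or more consecutive digits are treated as a single number.
NUMBER = re.compile(r"\d+")


def generalize(text):
    """Replace website names with 'u' and every number with '0'."""
    text = URL_HOST.sub(lambda m: m.group(1).lower() + "://u", text)
    return NUMBER.sub("0", text)


# Made-up request, for illustration only.
generalized = generalize("GET http://shop.example.com:8080/item.jsp?id=1234&qty=7")
```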
4.2.4. Tokenization
  • Use custom regular expressions to split the input data into a set of tokens.
  • These tokens serve as a text corpus to train word2vec models.
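
The notes do not give the custom regular expressions, so the pattern below is only an assumed example of such a tokenizer: it keeps runs of letters and runs of digits together and emits every other non-space character as its own token, so that characters such as quotes and equals signs survive as separate words. The names TOKEN and tokenize are hypothetical.

```python
import re

# Assumed token pattern: a word, a number, or one single symbol.
TOKEN = re.compile(r"[A-Za-z_]+|\d+|[^\sA-Za-z_\d]")


def tokenize(text):
    """Split a pre-processed request into a list of tokens."""
    return TOKEN.findall(text)


tokens = tokenize("GET http://u/a.jsp?id=0' OR")
```

Lists of tokens like this one, collected over all requests, form the corpus that a word2vec model is trained on.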
4.2.5. Vectorization: Word2vec
  • Word2vec: two-layer neural network that turns a text corpus into a set of vectors.
  • Includes two models: CBOW and Skip-Gram.
  • CBOW trains faster than Skip-Gram and is generally reported to be slightly more accurate for frequent words.
  • Modified training parameters (window size and embedding size) to obtain different vector representations of the same word.
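
To see why the window size changes the resulting vectors, it helps to look at the training examples CBOW is built from: each word is predicted from its neighbors within the window. The sketch below only enumerates those (context, target) pairs and trains nothing. It is a simplification, since real word2vec implementations typically shrink the window at random for each position. The function name cbow_pairs and the token list are hypothetical.

```python
def cbow_pairs(tokens, window):
    """List (context_words, target_word) pairs for a fixed window size."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


tokens = ["id", "=", "0", "'", "OR"]
narrow = cbow_pairs(tokens, window=1)
wide = cbow_pairs(tokens, window=2)
```

With window 1 the token "0" is predicted from two neighbors, with window 2 from four, so models trained with different windows learn from different contexts and end up with different vectors for the same word.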

4.3. Classification: Stacked Generalization Ensemble of LSTM Models

  • Propose a stacked generalization ensemble for LSTMs to detect malicious HTTP web requests.
  • The classification process includes:
    1. Passing the HTTP web request through different word2vec models to get different word embeddings.
    2. Each sub-model (LSTM network) takes one of the resulting word embeddings and predicts whether the HTTP request is normal or malicious.
    3. The meta-model (shallow neural network) combines the predictions from each input sub-model and returns the final prediction.
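
The three classification steps can be wired together as below. Everything in this sketch is a stand-in chosen so the example runs without any deep learning library: the embedders are lookup tables instead of trained word2vec models, toy_sub_model replaces a trained LSTM, and toy_meta_model replaces the shallow neural network. Only the data flow (one request, several embeddings, one prediction per sub-model, one final decision) follows the notes.

```python
import math


def make_embedder(table, dim):
    """Stand-in for one trained word2vec model: token -> vector lookup."""
    zero = [0.0] * dim
    return lambda tokens: [table.get(tok, zero) for tok in tokens]


def toy_sub_model(matrix):
    """Stand-in for one trained LSTM: squashes the mean activation to (0, 1)."""
    flat = [v for row in matrix for v in row]
    return 1 / (1 + math.exp(-sum(flat) / max(len(flat), 1)))


def toy_meta_model(probs):
    """Stand-in for the shallow meta-network: averages sub-model outputs."""
    return "malicious" if sum(probs) / len(probs) > 0.5 else "normal"


def classify(tokens, branches, meta):
    # Steps 1 and 2: each branch embeds the request its own way, then scores it.
    probs = [sub_model(embed(tokens)) for embed, sub_model in branches]
    # Step 3: the meta-model turns the sub-model outputs into one decision.
    return meta(probs)


# Two branches with different embedding dimensions (made-up vectors).
branches = [
    (make_embedder({"'": [1.0, 1.0]}, dim=2), toy_sub_model),
    (make_embedder({"'": [1.0, 1.0, 1.0]}, dim=3), toy_sub_model),
]
```

Calling classify(["id", "=", "'"], branches, toy_meta_model) runs both branches on the same tokenized request and returns a single label.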

5. Experiments and Results

  • Implemented and compared models based on LSTM, CNN, and stacking ensemble.
  • Each experiment is run 5 times, and the average value of each performance metric is reported.
  • Code available in the GitHub repository [10].

5.1. Dataset

  • HTTP CSIC 2010 dataset contains 223,585 samples.
  • Random split into three groups to avoid class imbalance:
    • Training set (67%).
    • Validation set (16%).
    • Test set (17%).
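
A random 67/16/17 split can be sketched with the standard library. The seed and the function name split_dataset are assumptions for the example; the sketch shuffles uniformly and does not stratify by class.

```python
import random


def split_dataset(samples, train=0.67, val=0.16, seed=42):
    """Shuffle, then cut into train, validation, and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # the remainder becomes the test set
    )


train_set, val_set, test_set = split_dataset(range(100))
```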

5.2. Data Pre-processing

5.2.1. Feature Selection and Concatenation
  • The HTTP CSIC 2010 dataset includes redundant and irrelevant features.
  • Kept three features:
    • HTTP method.
    • HTTP request.
    • Payload.
  • Concatenated the values of the three features (Table 1).
  • This allows exploiting vectorization techniques used in NLP problems.
5.2.2. URL Decoding
  • Apply URL decoding to obtain a de-obfuscated string.
5.2.3. Generalization and Tokenization
  • Replace words that have different syntax but the same meaning with a unique word.
  • Tokenize the generalized HTTP web request using a custom function.
5.2.4. Word2vec Based Vectorization
  • Convert each word in the array returned by the tokenizer to a vector of d numerical values, where d is the embedding dimension.
  • If a web request contains s words, the embedding matrix dimension is s × d.
  • Distinguish between static and non-static word embedding.
    • Static word embedding: vector representations of words do not change during the training process.
    • Non-static word embedding: vector representations of words are modified during the training process.
  • Use static word embedding to avoid overfitting the training data.
  • Create an index for words that are part of the training set only.
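
The index and the s × d matrix can be sketched as follows. Reserving id 0 for words that never occurred in the training set, and mapping them to an all-zero vector, is an assumption of this example; the notes only say that the index covers training-set words. The vectors are made up, and the names build_index and embed_request are hypothetical.

```python
def build_index(train_token_lists):
    """Map each training-set word to an integer id; 0 is kept for unknown words."""
    index = {}
    for tokens in train_token_lists:
        for tok in tokens:
            index.setdefault(tok, len(index) + 1)
    return index


def embed_request(tokens, index, vectors, d):
    """Return the s x d matrix of one request using fixed (static) vectors."""
    zero = [0.0] * d
    return [vectors.get(index.get(tok, 0), zero) for tok in tokens]


index = build_index([["GET", "u"], ["GET", "id"]])
# Made-up 2-dimensional vectors keyed by word id.
vectors = {1: [0.1, 0.2], 2: [0.3, 0.4], 3: [0.5, 0.6]}
matrix = embed_request(["GET", "id", "DROP"], index, vectors, d=2)
```

Because the vectors are only looked up and never updated, this corresponds to the static setting described above.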

5.3. Evaluation Metrics

  • Use four performance metrics:
    • Accuracy.
    • Precision.
    • Recall.
    • F1-score.
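
The four metrics follow from the confusion-matrix counts. In the sketch below the label 1 stands for a malicious request and is treated as the positive class; that choice, the function name, and the sample labels are assumptions for the example.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
metrics = classification_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
```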

5.4. Deep Learning Models Architecture

  • Develop four deep learning models:
    • Two models are single networks: one CNN and one LSTM.
    • The other two use stacking ensemble of LSTM or CNN networks.
  • The models take as input different word-level embeddings obtained from various Word2vec models.
  • The models differ in that the first two operate on a single word-level embedding at a time, while the other two operate on multiple word-level embeddings simultaneously.
  • Use layer normalization and dropout for regularization.

5.5. Experimental Environment

  • Implemented using Keras, TensorFlow, gensim, scikit-learn, numpy, and pandas.
  • Run on the Google Colaboratory (Colab) platform.
  • Tables 2 and 3 show parameter settings.

5.6. Results

5.6.1. Experiment 1
  • Implemented a 3-layer CNN and applied different window sizes to train the word2vec model.
  • Embedding dimension equal to the number of feature maps.
  • Performance improves with increasing window sizes.
5.6.2. Experiment 2
  • Implemented a CNN stacking ensemble model.
  • Varied the embedding dimension or the window size.
  • No significant difference in classification results when different window sizes or embedding dimensions are used.
  • Training time increases significantly with larger embedding dimensions and feature maps.
5.6.3. Experiment 3
  • Implemented a 3-layer LSTM model and four different window sizes (2, 3, 4, and 5).
  • Used three different embedding dimensions: 128, 256, and 512.
  • Classification results improve when the window size best suits the chosen embedding dimension.
  • The results improve with increasing window sizes.
5.6.4. Experiment 4
  • Implemented an LSTM stacking ensemble model.
  • Changed the embedding dimension, the window size, or both.
  • Using different window sizes and embedding dimensions is better than using the same window size or embedding dimension.
  • It is important to find a good combination of embedding dimension, window size, and number of LSTM units.

5.7. Discussion

  • Compared LSTM and CNN models used as individual classifiers and as part of a stacked generalization ensemble model.
  • Generated various word embeddings by training word2vec models with different configurations of training parameters.
  • Training stacking ensemble models is faster than training LSTM or CNN models separately.
  • Training LSTM models is faster than training CNN models.
  • The stacking ensemble for LSTM models performs better than the single LSTM model on accuracy, precision, and recall: 78.95% accuracy, 81.54% precision, 78.41% recall, and 77.57% F1-score, against 78.25% accuracy, 80.94% precision, 77.56% recall, and 77.65% F1-score.
  • The stacking ensemble for CNN models is better than the single CNN model: 78.24% accuracy, 81.5% precision, 78.4% recall, and 77.5% F1-score, against 78.00% accuracy, 81.4% precision, 77.17% recall, and 75.6% F1-score.
  • Combining different Word2vec models strengthens the classification results.
  • LSTM model and stacking ensemble for LSTM models outperform both CNN model and stacking ensemble for CNN models.
  • It is important to find the best combination of the embedding dimension, the window size, and the number of LSTM units, or of the kernel size and the number of feature maps in CNN layers.

6. Conclusion

  • Developed a stacked generalization ensemble for LSTMs to detect malicious HTTP requests.
  • Used word2vec models with different training parameter configurations.
  • This model has the best classification results and the shortest training time compared with the CNN model, the LSTM model, and the stacking ensemble for CNN models.
  • Future work:
    • Study how to leverage other sentence classification techniques to enhance the detection of web attacks.
    • Evaluate the performance of the proposed model in other information security contexts.