ML Beginner's Guide to Building a DDoS Attack Detection Model

Sumit Singh

Feb 28, 2024 • 8 min read

Introduction

Consider a popular online gaming platform hosting a highly awaited multiplayer game launch. As soon as the game goes live, players find themselves unable to connect or experience severe lag and disconnections.

This is due to a DDoS attack targeting the game servers, flooding them with bogus traffic, and preventing legitimate players from enjoying the game.

A DDoS (Distributed Denial of Service) attack is a malicious attempt to disrupt an online service by overwhelming it with a flood of traffic from many sources at once.

These attacks render the targeted services unavailable to legitimate users, causing downtime, financial losses, and reputational damage to businesses.

Unlike data breaches or malware infections, DDoS attacks aim to disrupt services rather than steal or compromise data.

Motivations vary. A competitor might launch a DDoS attack against a rival's e-commerce website during a peak sales period to disrupt its operations and gain a competitive advantage.

Cyber criminals may also extort organizations by threatening a DDoS attack unless a ransom is paid.

Traditional Methods for DDoS Detection

Traffic analysis

Traffic analysis involves monitoring network traffic to detect abnormal patterns or anomalies that may indicate a DDoS attack. This can include analyzing factors such as packet rates, packet sizes, and traffic sources.

While traffic analysis can be effective in detecting certain types of DDoS attacks, it may struggle to differentiate between legitimate spikes in traffic (e.g., due to viral content or marketing campaigns) and actual attacks.
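As a rough illustration of the packet-rate side of traffic analysis, the sketch below counts packets per source in fixed time windows. The record layout (a timestamp and a source IP) and the `packets_per_window` helper are assumptions made for this example, not part of any particular monitoring tool.

```python
from collections import Counter, defaultdict


def packets_per_window(packets, window_seconds=1.0):
    """Count packets per (window index, source IP).

    `packets` is an iterable of (timestamp_seconds, source_ip) tuples,
    a simplified stand-in for real capture records.
    """
    counts = defaultdict(Counter)
    for timestamp, source_ip in packets:
        window = int(timestamp // window_seconds)
        counts[window][source_ip] += 1
    return counts


# Hypothetical capture: one chatty source and one quiet source.
sample = [
    (0.1, "10.0.0.5"), (0.2, "10.0.0.5"), (0.4, "10.0.0.9"),
    (0.7, "10.0.0.5"), (1.3, "10.0.0.9"),
]
rates = packets_per_window(sample)
for window, per_source in sorted(rates.items()):
    print(window, dict(per_source))
```

Per-window counts like these are the raw material for spotting unusual sources, but on their own they cannot tell a flash crowd from an attack.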

Signature-based detection

Signature-based detection relies on predefined patterns or signatures of known DDoS attacks. When incoming traffic matches these signatures, it triggers an alert or mitigation response.

This method is limited by its reliance on predefined signatures, making it ineffective against novel or zero-day DDoS attacks that do not match known patterns.
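A toy sketch of signature matching is shown below. The two signatures, the packet fields, and the `match_signatures` helper are invented for illustration; real systems use far richer rule languages and usually combine rules with rate information.

```python
# Hypothetical signatures: each name maps to a predicate over a packet dict.
SIGNATURES = {
    "empty_syn": lambda p: (
        p.get("protocol") == "TCP"
        and p.get("flags") == "SYN"
        and p.get("payload_len", 0) == 0
    ),
    "oversized_dns_response": lambda p: (
        p.get("protocol") == "UDP"
        and p.get("src_port") == 53
        and p.get("payload_len", 0) > 512
    ),
}


def match_signatures(packet):
    """Return the names of all signatures that the packet matches."""
    return [name for name, rule in SIGNATURES.items() if rule(packet)]


known = {"protocol": "UDP", "src_port": 53, "payload_len": 3000}
novel = {"protocol": "UDP", "src_port": 11211, "payload_len": 1400}

print(match_signatures(known))  # matches a predefined rule
print(match_signatures(novel))  # no rule exists for this pattern
```

The second packet slips through because nothing in the rule set describes it, which is the weakness against novel attacks noted above.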

Threshold-based detection

Threshold-based detection sets predefined thresholds for various network metrics, such as bandwidth usage or connection rates. When incoming traffic exceeds these thresholds, it indicates a potential DDoS attack.

Threshold-based detection may generate false positives during legitimate traffic spikes, leading to unnecessary alerts or mitigation actions. Moreover, it may not be adaptive enough to dynamically adjust thresholds in response to changing network conditions or evolving attack techniques.
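One common way to soften the fixed-threshold problem is to derive the limit from recent traffic. The sketch below flags a request rate that exceeds the mean of a sliding window by `k` standard deviations. The window size, the value of `k`, and the sample rates are illustrative assumptions, not tuned recommendations.

```python
from collections import deque
from statistics import mean, pstdev


def adaptive_alerts(rates, window=5, k=3.0):
    """Return the indices whose rate exceeds mean + k * std of the previous window."""
    history = deque(maxlen=window)
    alerts = []
    for i, rate in enumerate(rates):
        if len(history) == window:
            limit = mean(history) + k * pstdev(history)
            if rate > limit:
                alerts.append(i)
        # Simplification: every sample, including a spike, joins the baseline.
        history.append(rate)
    return alerts


# Hypothetical requests-per-second readings with one sudden spike.
requests_per_second = [100, 110, 105, 95, 102, 104, 900, 101]
print(adaptive_alerts(requests_per_second))
```

A production detector would typically keep flagged samples out of the baseline so that an ongoing attack does not raise its own threshold.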

Role of Machine Learning in DDoS Detection

Machine learning algorithms play a crucial role in DDoS detection by analyzing network traffic patterns to identify anomalies indicative of attacks. These algorithms learn from historical data to distinguish between normal and malicious traffic, enabling early detection and mitigation of DDoS attacks.

Machine learning techniques:

Pattern Recognition: Models learn patterns of normal network behavior and flag deviations from these patterns as potential anomalies.
Anomaly Detection: Machine learning models can detect unusual traffic patterns that may indicate DDoS attacks, such as sudden spikes in traffic volume or abnormal communication patterns.
Behavioral Analysis: By analyzing various features of network traffic, such as packet sizes, protocols, and traffic sources, machine learning algorithms can identify suspicious behavior associated with DDoS attacks.
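To make the anomaly detection idea concrete, here is a minimal sketch using scikit-learn's `IsolationForest`, an unsupervised detector that is not used elsewhere in this guide. The two features (packets per second and mean packet size) and all of the values are synthetic assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic "normal" flows: [packets per second, mean packet size in bytes].
normal_flows = np.column_stack([
    rng.normal(200, 20, size=500),
    rng.normal(600, 50, size=500),
])

# Learn what normal traffic looks like; no attack labels are needed.
detector = IsolationForest(n_estimators=100, random_state=0)
detector.fit(normal_flows)

# predict() returns 1 for inliers and -1 for anomalies.
candidates = np.array([
    [205.0, 610.0],    # close to the learned baseline
    [50000.0, 60.0],   # flood of tiny packets
])
labels = detector.predict(candidates)
print(labels)
```

Because the detector only models normal behavior, it can flag traffic it has never seen labeled as an attack, at the cost of more false positives than a supervised classifier trained on good labels.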

The advantages include:

Adaptability: Machine learning models can adapt to evolving attack strategies and new attack vectors when they are retrained on fresh traffic data.
Scalability: Once trained, machine learning models can score large volumes of network traffic in near real time, making them suitable for detecting DDoS attacks in high-speed networks.

Building a DDoS Detection Model

Importing Dataset

The model is trained on the CICIDS2017 dataset, which contains benign traffic and common attack traffic captured over a 5-day period, including DoS and DDoS attacks. Download the dataset and save the DDoS traffic file locally; the code below reads it as DDos.csv.

The dataset provides labeled flows and detailed network flow features extracted with CICFlowMeter, which are essential for training and evaluating machine learning models for DDoS detection.

import numpy as np
import pandas as pd

# Load the DDoS traffic flows from the local CSV file
df = pd.read_csv("DDos.csv")

Data Pre-processing

df.columns = df.columns.str.strip()  # remove leading/trailing whitespace from column names
df['Label'].unique()  # list the distinct class labels

'BENIGN': This represents benign or normal network traffic.

'DDoS': This indicates instances of DDoS attacks in the dataset.

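# Flow-rate columns can contain infinite values, which scikit-learn rejects,
# so treat them as missing and then drop incomplete rows
df = df.replace([np.inf, -np.inf], np.nan).dropna()

df.dtypes == 'object'  # True for any column that still holds non-numeric data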

Encoding the categorical values to numerical:

df['Label'] = df['Label'].map({'BENIGN': 0, 'DDoS': 1})
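One thing to watch with `Series.map` is that any label missing from the dictionary silently becomes `NaN`. The sketch below checks for that on a toy frame; the extra `PortScan` label is a hypothetical value added to trigger the case.

```python
import pandas as pd

# Toy labels, including one value that the mapping does not cover.
toy = pd.DataFrame({"Label": ["BENIGN", "DDoS", "PortScan", "BENIGN"]})
mapping = {"BENIGN": 0, "DDoS": 1}

encoded = toy["Label"].map(mapping)

# Rows whose label was not in the mapping come back as NaN.
unmapped = toy.loc[encoded.isna(), "Label"].unique()
print(unmapped)
```

Running a check like this before training makes it obvious when a file contains classes other than the two expected ones.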

import matplotlib.pyplot as plt

plt.bar([0, 1], df['Label'].value_counts(), edgecolor='black')
plt.xticks([0, 1], labels=['BENIGN=0', 'DDoS=1'])
plt.xlabel('Classes')
plt.ylabel('Count')
plt.title('Distribution of Classes')
plt.show()

The above figure shows that the dataset is fairly balanced.
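Balance can also be checked numerically rather than by eye. This sketch uses `value_counts(normalize=True)` on a small made-up label series; on the real data the same call would be applied to `df['Label']`.

```python
import pandas as pd

# Made-up encoded labels: 0 = BENIGN, 1 = DDoS.
labels = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 1, 1])

# Fraction of rows in each class.
share = labels.value_counts(normalize=True).sort_index()
print(share)

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = share.max() / share.min()
print(round(imbalance_ratio, 2))
```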

Data Splitting

Split the dataset into training and test sets with the train_test_split function from scikit-learn, holding out 30% of the rows for testing.

from sklearn.model_selection import train_test_split

# Separate the features from the target column
X = df.drop('Label', axis=1)
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

print("The train dataset size = ", X_train.shape)
print("The test dataset size = ", X_test.shape)

The train dataset size = (46365, 78)

The test dataset size = (19871, 78)
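When classes are less evenly distributed, passing `stratify` to `train_test_split` keeps the class ratio the same in both splits. The sketch below demonstrates this on synthetic data with an assumed 80/20 class mix; the array names are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 80 of class 0 and 20 of class 1.
features = np.arange(200).reshape(100, 2)
target = np.array([0] * 80 + [1] * 20)

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.30, random_state=42, stratify=target
)

# Share of class 1 in each split.
print(t_train.mean(), t_test.mean())
```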

Random Forest

Random forests are a popular choice for DDoS attack detection due to their ability to handle high-dimensional data, capture complex relationships between features, and reduce overfitting.

By combining predictions from multiple decision trees, random forests can provide robust and accurate classification results.

RandomForestClassifier from scikit-learn is used to create a random forest model with 50 decision trees.

The model is then trained on the training data (X_train, y_train) and used to predict labels for the testing data (X_test), storing the predictions in rf_pred.

from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=50, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)

The following snippet calculates and visualizes the feature importances of the random forest model. It ranks each feature by how much it contributes to the model's predictions and displays the ranking in a horizontal bar plot, showing which features are most influential for DDoS detection.

importances = rf_model.feature_importances_

# Sort ascending so the most important feature is drawn at the top of the plot
indices = np.argsort(importances)
feature_names = X_train.columns[indices]  # label bars with column names, not positions

plt.figure(figsize=(8, 10))
plt.barh(range(X_train.shape[1]), importances[indices], align="center")
plt.yticks(range(X_train.shape[1]), feature_names)
plt.xlabel("Importance")
plt.title("Feature Importances")
plt.show()

[Figure: horizontal bar plot of the random forest feature importances]
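With many features, a sorted table is often easier to read than a plot. The sketch below pairs importances with column names in a pandas Series, using a small synthetic dataset whose column names are made up so that the expected winner is obvious.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic features: only "informative" determines the target.
demo = pd.DataFrame({
    "informative": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
demo_target = (demo["informative"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(demo, demo_target)

# Attach column names to the importances and sort from most to least important.
ranked = pd.Series(forest.feature_importances_, index=demo.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked.head(3))
```

The same two lines applied to `rf_model` and `X_train.columns` would list the top features of the DDoS model.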

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_curve, auc, confusion_matrix

# Evaluate the random forest on the held-out test set
print('\nRandom Forest Metrics:')
print(f'Accuracy: {accuracy_score(y_test, rf_pred):.4f}')
print(f'F1 Score: {f1_score(y_test, rf_pred):.4f}')
print(f'Precision: {precision_score(y_test, rf_pred):.4f}')
print(f'Recall: {recall_score(y_test, rf_pred):.4f}')
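All four metrics come from the same four confusion-matrix counts. The plain-Python sketch below derives them by hand on ten made-up predictions, which can help when interpreting the scikit-learn output; the helper name is an assumption for this example.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = attack)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attacks caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # attacks missed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 4 attacks and 6 benign flows.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

For DDoS detection, recall measures how many attacks are caught, while precision measures how many alerts are real, so the two are usually read together.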

Logistic Regression

Logistic Regression is chosen for its interpretability, computational efficiency, and suitability for datasets with linearly separable classes. It serves as a baseline model for comparison and scales well to high-dimensional data.

from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)

print('\nLogistic Regression Metrics:')
print(f'Accuracy: {accuracy_score(y_test, lr_pred):.4f}')
print(f'F1 Score: {f1_score(y_test, lr_pred):.4f}')
print(f'Precision: {precision_score(y_test, lr_pred):.4f}')
print(f'Recall: {recall_score(y_test, lr_pred):.4f}')
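Logistic Regression is sensitive to feature scale, and network flow features span many orders of magnitude. A common variation, not used in the code above, is to standardize features inside a pipeline. The sketch below shows the pattern on synthetic data; the feature scales and the `max_iter` value are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales.
small = rng.normal(0, 1, size=400)
large = rng.normal(0, 1e6, size=400)
X_scaled_demo = np.column_stack([small, large])
y_scaled_demo = (small + large / 1e6 > 0).astype(int)

# The scaler is fitted on the training data only, then applied before the classifier.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
scaled_lr.fit(X_scaled_demo, y_scaled_demo)
train_accuracy = scaled_lr.score(X_scaled_demo, y_scaled_demo)
print(round(train_accuracy, 3))
```

Wrapping the scaler and the model together means `predict` and `predict_proba` apply the same scaling automatically, which avoids leaking test-set statistics into training.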

Neural Network

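A Multi-Layer Perceptron (MLP) is a feed-forward neural network capable of learning complex, non-linear patterns in data. The model below uses a single hidden layer of 10 units and is capped at 10 training iterations, so scikit-learn may warn that training stopped before convergence.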

from sklearn.neural_network import MLPClassifier

nn_model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=10, random_state=42)
nn_model.fit(X_train, y_train)
nn_pred = nn_model.predict(X_test)

print('\nNeural Network Metrics:')
print(f'Accuracy: {accuracy_score(y_test, nn_pred):.4f}')
print(f'F1 Score: {f1_score(y_test, nn_pred):.4f}')
print(f'Precision: {precision_score(y_test, nn_pred):.4f}')
print(f'Recall: {recall_score(y_test, nn_pred):.4f}')

Model Comparison

To compare the models, compute each model's predicted class probabilities and evaluate them with the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC).

This allows us to assess the models' ability to discriminate between classes and compare their performance in distinguishing between benign and DDoS network traffic.

from sklearn.metrics import roc_curve, auc
rf_proba = rf_model.predict_proba(X_test)
lr_proba = lr_model.predict_proba(X_test)
nn_proba = nn_model.predict_proba(X_test)

rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_proba[:, 1])
rf_auc = auc(rf_fpr, rf_tpr)


lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_proba[:, 1])
lr_auc = auc(lr_fpr, lr_tpr)


nn_fpr, nn_tpr, _ = roc_curve(y_test, nn_proba[:, 1])
nn_auc = auc(nn_fpr, nn_tpr)


plt.figure(figsize=(8, 6))
plt.plot(rf_fpr, rf_tpr, label=f'Random Forest (AUC = {rf_auc:.2f})')
plt.plot(lr_fpr, lr_tpr, label=f'Logistic Regression (AUC = {lr_auc:.2f})')
plt.plot(nn_fpr, nn_tpr, label=f'Neural Network (AUC = {nn_auc:.2f})')


plt.plot([0, 1], [0, 1], linestyle='--', color='black', label='Random Classifier (AUC = 0.50)')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.grid()
plt.show()
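AUC has a useful interpretation: it is the probability that a randomly chosen attack flow receives a higher score than a randomly chosen benign flow. The plain-Python sketch below computes it that way on four made-up scores; the helper name is an assumption for this example.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted attack probabilities.
demo_truth = [0, 0, 1, 1]
demo_scores = [0.1, 0.4, 0.35, 0.8]
demo_auc = auc_by_ranking(demo_truth, demo_scores)
print(demo_auc)
```

Three of the four attack/benign pairs are ordered correctly here, so the AUC is 0.75. An AUC of 0.5 corresponds to the dashed random-classifier line in the plot.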

In this experiment, Random Forest outperforms both Logistic Regression and the Neural Network in detecting DDoS attacks, with higher accuracy, F1 score, precision, and recall.

Its ROC curve reflects this: it encloses a larger area under the curve, indicating better separation of benign and DDoS traffic across decision thresholds.

Real World Implementation

Integrate the detection model with existing network infrastructure, such as firewalls, intrusion detection systems, and network monitoring tools.

Ensure compatibility with network protocols and communication standards to facilitate seamless integration.

Configure the model to generate alerts or notifications when suspicious or anomalous patterns indicative of DDoS attacks are detected.
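One simple way to turn model scores into alerts is to require several consecutive high-probability predictions before notifying anyone, which dampens one-off false positives. The sketch below assumes the model emits one attack probability per time window; the threshold, the streak length, and the function name are illustrative choices.

```python
def first_alert_index(probabilities, threshold=0.9, min_consecutive=3):
    """Return the index at which an alert would fire, or None if it never does.

    An alert fires once `min_consecutive` predictions in a row reach `threshold`.
    """
    streak = 0
    for i, probability in enumerate(probabilities):
        streak = streak + 1 if probability >= threshold else 0
        if streak >= min_consecutive:
            return i
    return None


# Hypothetical per-window attack probabilities from a trained model.
window_scores = [0.2, 0.95, 0.3, 0.92, 0.97, 0.99, 0.4]
print(first_alert_index(window_scores))
```

The isolated high score in the second window is ignored, and the alert fires only after three high scores in a row. The trade-off is a short delay before a genuine attack is reported.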

Limitations of Using Machine Learning

Data Quality and Quantity: Machine learning algorithms require large volumes of high-quality labeled data for effective training. However, obtaining labeled DDoS attack data can be challenging due to the rarity of attacks and the need for accurate ground truth labels.

Imbalanced Data: DDoS attacks are relatively rare compared to normal network traffic, leading to class imbalance in the training dataset. This imbalance can bias the model toward predicting the majority (benign) class, so attacks are missed.
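A standard mitigation is to weight classes inversely to their frequency. The sketch below reproduces the heuristic behind scikit-learn's `class_weight="balanced"` option, `n_samples / (n_classes * class_count)`, on an assumed 95/5 split; the helper name is made up for this example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Assumed imbalanced training labels: 95 benign flows, 5 attack flows.
imbalanced_labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(imbalanced_labels)
print(weights)
```

Each attack sample ends up weighted about 19 times more heavily than a benign one. In practice, passing `class_weight="balanced"` to `RandomForestClassifier` or `LogisticRegression` applies the same weighting without any manual computation.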

Conclusion

In conclusion, machine learning can strengthen DDoS attack detection. Even simple models such as Logistic Regression, Random Forest, and a small Neural Network can learn to separate benign from DDoS traffic using labeled flow data.

Integrated with existing network defenses, such models help organizations detect and mitigate attacks, safeguarding critical online services and infrastructure as threats evolve.

Frequently Asked Questions

How are DDoS attacks detected?

DDoS attacks are detected through traffic analysis, where abnormal patterns are identified. This involves monitoring incoming traffic and looking for sudden increases in volume or unusual behavior. Machine learning algorithms enhance detection accuracy by autonomously learning from historical data.

Which machine learning algorithms are used to detect DDoS attacks in SDN?

In Software-Defined Networking (SDN), machine learning algorithms like decision trees or neural networks are utilized for DDoS attack detection. These algorithms analyze network flow data and classify traffic as normal or malicious, enabling rapid response to mitigate attacks.
