Traditional application security's reliance on strict regular expressions (regex) or signature databases is rapidly becoming a bottleneck. Signature-based scanning is inherently reactive: if a vulnerability signature hasn't been written yet, the scanner is blind to it. Worse, attackers can obfuscate web payloads with small variations, slip past regex patterns, and reach backend databases untouched.
At eOzka, we set out to build a lightweight, self-contained AI-powered vulnerability scanner. Our goal was to classify web payloads (such as query strings, HTTP body arguments, and header parameters) into categories—specifically SQL Injection (SQLi), Cross-Site Scripting (XSS), and Normal Traffic—with high accuracy and under 5ms inference latency, running entirely on-premise without relying on slow, external third-party APIs.
1. Architecting the ML Pipeline
We chose a hybrid machine learning pipeline that leverages Natural Language Processing (NLP) for text representation, paired with a tuned Random Forest classifier. Here is how the data moves through the engine:
- Text Preprocessing: Raw HTTP payloads are decoded, URL-escaped structures are cleaned, and standard HTML tags or SQL punctuation symbols are normalized (a minimal sketch of this step appears after this list).
- Feature Extraction (TF-IDF): We convert payload strings into numerical features using character-level N-grams (range 2 to 5). This captures sub-word patterns (like <script, UNION SELECT, or 1=1) even if obfuscated.
- Classification (Random Forest): The vectorized payloads are processed by an ensemble of decision trees, which calculates per-category probabilities.
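The preprocessing step is intentionally lightweight. The sketch below illustrates the kind of normalization applied before vectorization; the normalize_payload helper and the exact rules shown (iterative URL decoding, HTML-entity unescaping, lowercasing, whitespace collapsing) are illustrative assumptions rather than a verbatim copy of our production code.

from urllib.parse import unquote_plus
import html
import re

def normalize_payload(raw: str) -> str:
    # Undo layered URL encoding (e.g. %2527 -> %27 -> ')
    decoded = raw
    for _ in range(3):  # cap decode depth to avoid pathological inputs
        candidate = unquote_plus(decoded)
        if candidate == decoded:
            break
        decoded = candidate
    # Unescape HTML entities (&lt;script&gt; -> <script>)
    decoded = html.unescape(decoded)
    # Lowercase and collapse whitespace so "UNION   SELECT" matches "union select"
    decoded = decoded.lower()
    return re.sub(r"\s+", " ", decoded).strip()

# Example: a double-encoded SQLi probe collapses to a canonical form
print(normalize_payload("1%2527%2520UNION%2520SELECT%2520password"))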
2. The Source Code Implementation
Below is a high-level Python script demonstrating how our core classifier pipeline is trained and saved. It handles tokenizing character patterns and fitting the Random Forest model:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
import joblib
# Simulated training dataset: Decoded Web payloads
payloads = [
    "SELECT * FROM users WHERE id = 1 UNION SELECT username, password FROM admin",
    "<script>alert('XSS_injection')</script>",
    "normal_query_string_parameter_value",
    "john.doe@company.com",
    "'; DROP TABLE logs; --",
    "javascript:void(0)",
    "const x = document.getElementById('user_input');",
    '{"name": "Harsh", "role": "developer"}'
]
# Labels: 0 = Normal, 1 = SQLi, 2 = XSS
labels = [1, 2, 0, 0, 1, 2, 2, 0]
def build_scanner():
    print("Initializing TF-IDF vectorizer + Random Forest pipeline...")
    # Use character-level N-grams to catch slight payload obfuscations
    vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 5))
    classifier = RandomForestClassifier(n_estimators=100, max_depth=15, random_state=42)
    # Bundle into a single pipeline object
    pipeline = make_pipeline(vectorizer, classifier)
    print("Fitting model to payload dataset...")
    pipeline.fit(payloads, labels)
    # Save the trained pipeline cleanly to disk
    joblib.dump(pipeline, "ai_vulnerability_scanner.pkl")
    print("✅ Pipeline model saved successfully as 'ai_vulnerability_scanner.pkl'")
if __name__ == "__main__":
    build_scanner()
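In production, the saved pipeline is loaded once per worker process and then queried for every incoming payload. The snippet below is a minimal sketch of that inference path, assuming the ai_vulnerability_scanner.pkl file produced above; the classify_payload helper and the label mapping are illustrative, not part of a published API.

import time
import joblib

# Load the trained pipeline once at process startup
scanner = joblib.load("ai_vulnerability_scanner.pkl")
LABELS = {0: "Normal", 1: "SQLi", 2: "XSS"}

def classify_payload(payload: str) -> str:
    # The pipeline vectorizes and classifies in a single call
    return LABELS[scanner.predict([payload])[0]]

start = time.perf_counter()
verdict = classify_payload("1 UNION SELECT username, password FROM admin")
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Verdict: {verdict} ({elapsed_ms:.2f} ms)")

Keeping the model resident in memory is what makes the sub-5ms latency target realistic: the per-request cost is a single vectorize-and-predict call with no network hop.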
3. Performance and Scale Metrics
By tokenizing at the character level rather than the word level, we insulated the scanner from common evasion tactics such as case changes, inserted whitespace, and inline comment padding. When tested against real-world datasets containing over 50,000 requests, the results were stellar.
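We do not reproduce the full benchmark here, but the sketch below shows how such an evaluation can be run with scikit-learn's standard tooling; the evaluate function, the 80/20 split, and the assumption of a large labeled payload corpus are ours for illustration rather than the exact internal harness.

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

def evaluate(payloads, labels):
    # Hold out 20% of the labeled payloads for testing
    X_train, X_test, y_train, y_test = train_test_split(
        payloads, labels, test_size=0.2, random_state=42, stratify=labels)
    pipeline = make_pipeline(
        TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 5)),
        RandomForestClassifier(n_estimators=100, max_depth=15, random_state=42))
    pipeline.fit(X_train, y_train)
    # Per-class precision, recall, and F1 for Normal, SQLi, and XSS
    print(classification_report(y_test, pipeline.predict(X_test),
                                target_names=["Normal", "SQLi", "XSS"]))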
4. Key Takeaways
This experiment demonstrated that running machine learning models locally on web servers is fully viable and adds a strong extra layer of defense. With no external database connections or cloud-processing bottlenecks in the hot path, we achieved real-time application threat mitigation at the edge. We are currently working on packaging the Python scanner as a native C extension so it can be embedded directly into reverse proxies such as Nginx or HAProxy.