Hybrid models in Machine Learning

Sirine Amrane
Jan 21, 2025

What is a hybrid model?

A hybrid model combines multiple types of models to balance accuracy, performance, and resource usage. These approaches typically involve:

  • Lightweight models: Quickly filter simple cases.
  • Heavier models (Transformers, for example): Perform deep analysis on complex cases.

When do you need to use a hybrid approach?

A hybrid approach is ideal when:

Simple Data Dominates:

  • Most of the input data consists of straightforward cases that can be handled efficiently by lightweight models.
  • This saves computational resources by avoiding unnecessary Transformer usage.

Resource Constraints:

  • Transformers require significant GPU and memory resources.
  • A hybrid approach ensures that only high-priority cases are sent to resource-intensive models.

Real-Time Detection:

  • Transformers are slower than lightweight models, making them less suitable on their own for real-time applications.
  • For instance, in real-time cybersecurity log analysis, a lightweight model can act as a first line of defense, while the Transformer focuses on the more complex and ambiguous cases.

Example: combining Random Forest and Transformer for PowerShell command analysis

Workflow:

Step 1: Lightweight model for initial filtering

  • Use a model like Random Forest to quickly classify PowerShell commands.

=> Commands classified as “legitimate” are ignored.

=> Commands marked as “suspicious” are sent to the Transformer for further analysis.

Step 2: Transformer for complex cases

  • Use a Transformer (e.g., BERT or a fine-tuned GPT model) to deeply analyze suspicious commands.

=> The Transformer identifies hidden patterns and relationships, such as obfuscation or malicious intent.
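
To make this routing concrete, here is a minimal sketch of the control flow. The names extract_features, rf_model, and transformer_classify are illustrative placeholders (any trained scikit-learn-style classifier and any Transformer inference function would do), not a specific library API:

```python
def analyze_command(command, extract_features, rf_model, transformer_classify):
    """Route a PowerShell command through the hybrid pipeline.

    extract_features:     callable turning a command string into a feature vector
    rf_model:             trained classifier with a scikit-learn-style .predict()
    transformer_classify: callable running the expensive Transformer analysis
    """
    verdict = rf_model.predict([extract_features(command)])[0]
    if verdict == "legitimate":
        return "legitimate"                # fast path: the Transformer never runs
    return transformer_classify(command)   # slow path: deep contextual analysis
```

The key design point is that the expensive model sits behind the cheap one, so its cost is only paid for the minority of ambiguous commands.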

Let’s look at a practical application:

Real-time cybersecurity with PowerShell log analysis:

  • Simple logs: Commands like Get-Process can be quickly classified as legitimate by the lightweight model (Random Forest).
  • Suspicious logs: Complex or ambiguous commands, such as Invoke-Mimikatz, are forwarded to the Transformer for detailed analysis.

Benefits of the Hybrid Approach in Our Case:

Efficiency:

  • Most cases are processed rapidly by the lightweight model, reducing the load on the Transformer.

Cost-Effectiveness:

  • Reduces the need for high computational power, as only a subset of data is analyzed by the Transformer.

Improved Accuracy:

  • Combines the strengths of both models: the speed of lightweight models and the depth of Transformers.

Stage 1: Lightweight Model for Initial Filtering

A lightweight model, such as a Random Forest, can be employed to analyze predefined features extracted from PowerShell commands. As noted above, the primary goal is to identify simple cases efficiently. Tokenization plays a crucial role in enriching the features provided to the lightweight model.

Role of Tokenization in This Stage:

Breaking Down Commands into Meaningful Tokens:

  • Example Command: Get-Process -Name "Notepad"
  • Tokens: "Get", "-", "Process", "-", "Name", "Notepad"

Feature vector creation for Random Forest:

Examples of features (a combined sketch follows at the end of this stage):

  • Token count: Total number of tokens in the command.
  • Frequency of suspicious words: E.g., terms like “Invoke”, “Download”.
  • Presence of specific characters: Indicators such as -, /, $, ;.
  • Command length: Total number of characters in the command.

Tokenizer options:

  • Simple tokenizers (e.g., space-based or rule-based) can be used to generate features quickly without introducing computational overhead.
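
As a minimal sketch of this stage, the following combines a simple rule-based tokenizer, the features listed above, and a Random Forest. The suspicious-word list and the two training commands are invented here purely for illustration:

```python
import re

from sklearn.ensemble import RandomForestClassifier

SUSPICIOUS_WORDS = {"invoke", "download", "iex", "bypass"}  # illustrative, not exhaustive
SPECIAL_CHARS = set("-/$;")

def extract_features(command: str) -> list:
    # Simple rule-based tokenization: split on whitespace and hyphens
    tokens = [t for t in re.split(r"[\s\-]+", command) if t]
    return [
        len(tokens),                                         # token count
        sum(t.lower() in SUSPICIOUS_WORDS for t in tokens),  # suspicious-word frequency
        sum(c in SPECIAL_CHARS for c in command),            # special-character count
        len(command),                                        # command length
    ]

# Toy training data, purely for illustration
commands = ['Get-Process -Name "Notepad"', 'Invoke-Mimikatz -Command "dump"']
labels = ["legitimate", "suspicious"]

rf_model = RandomForestClassifier(n_estimators=100, random_state=0)
rf_model.fit([extract_features(c) for c in commands], labels)
```

In practice the model would be trained on a large labeled corpus of commands; the point of the sketch is that every feature is cheap to compute.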

Stage 2: Transformer for complex cases

For analyzing more complex or ambiguous PowerShell commands, a Transformer-based model can be used. This approach relies heavily on advanced tokenization methods to extract deep contextual insights.

Role of Tokenization in this stage:

Subword Tokenization:

  • Methods like Byte Pair Encoding (BPE) or WordPiece are used to tokenize commands into subwords. These methods capture contextual relationships even for rare or obfuscated terms.
  • Example with BPE:

=> Command: Invoke-Mimikatz -Command "dump"

=> Tokens: "Invoke", "-", "Mimi", "katz", "-", "Command", "dump"

Handling Ambiguous Commands:

  • Subword tokenization is effective for identifying obfuscated or disguised malicious commands by focusing on critical substrings.

Positional Encoding:

  • Transformers add positional information to tokens, enabling the model to understand the order and syntactic structure of commands.
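
For reference, the sinusoidal positional encoding from the original Transformer paper can be written in a few lines of NumPy. This is the generic formulation; BERT itself uses learned position embeddings, so treat this as an illustration of the idea rather than BERT’s exact mechanism:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    positions = np.arange(seq_len)[:, None]            # shape (seq_len, 1)
    dims = np.arange(d_model)[None, :]                 # shape (1, d_model)
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])              # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])              # odd dimensions
    return pe
```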

Complete Pipeline for the Transformer:

1. Advanced Tokenization:

  • Use a pre-trained tokenizer, such as BERT’s tokenizer:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer("Invoke-Mimikatz -Command dump", return_tensors="pt")
```

2. Embedding Creation:

  • Convert tokens into dense vectors using a Transformer model:

```python
from transformers import BertModel

bert_model = BertModel.from_pretrained("bert-base-uncased")
outputs = bert_model(**tokens)
embeddings = outputs.last_hidden_state
```

3. Contextual Analysis:

  • Use the embeddings to detect hidden patterns or malicious sequences.
  • Example: Feed the embeddings into additional classification layers to identify suspicious patterns (a sketch follows below).
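
As a hedged sketch of such a classification layer, the following pools BERT’s output and feeds it to a linear head. The two-class setup, the use of the [CLS] embedding, and the class name are assumptions made here for illustration; in practice the head would be fine-tuned on labeled commands:

```python
import torch.nn as nn
from transformers import BertModel

class SuspiciousCommandClassifier(nn.Module):
    """BERT encoder plus a small linear head; a binary benign/malicious setup is assumed."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, **tokens):
        outputs = self.bert(**tokens)
        pooled = outputs.last_hidden_state[:, 0]   # [CLS] token embedding
        return self.head(pooled)                   # raw logits per class
```

With the tokens from step 1, SuspiciousCommandClassifier()(**tokens) returns one raw logit per class.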

Feature Integration Across Models:

Compatibility Between Stages:

  • Use the same tokenizer (or a compatible one) for both the Random Forest and Transformer models to ensure consistency in feature extraction.

Feature Enrichment for the Lightweight Model:

  • Include advanced tokenization outputs (e.g., token length, number of subwords) as features in the Random Forest model.
  • Example: a word split into many subwords could indicate obfuscation (see the sketch below).
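
One way to compute such a feature, assuming BERT’s WordPiece tokenizer (which marks subword continuations with a ## prefix); the function name here is invented for illustration:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def max_subwords_per_word(command: str) -> int:
    """Words shattered into many subwords can hint at rare or obfuscated strings."""
    counts = [len(tokenizer.tokenize(word)) for word in command.split()]
    return max(counts) if counts else 0

# A rare token like "Mimikatz" splits into more pieces than an everyday word,
# so the returned count can feed the Random Forest as an obfuscation signal.
```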

Targeting Obfuscation:

Leverage tokenization to identify:

  • Rare or combined terms (e.g., Base64Decode).
  • Non-standard characters (e.g., ;, $, %).

In conclusion, the hybrid approach — combining lightweight models for rapid filtering and Transformers for deep analysis — is highly effective for tasks like real-time log analysis, as demonstrated in the PowerShell command detection example.

However, this is just one application: hybrid models can be adapted to many other domains, such as fraud detection, natural language processing, or predictive maintenance, wherever efficiency and precision need to be balanced :)

Hybrid approaches utilizing tokenization are particularly well-suited for detecting sophisticated attacks involving partially obfuscated or encrypted PowerShell commands.

Sirine Amrane
