Multimodal Transformers vs. Classical Transformers, Part 1: Intro
Imagine a financial analyst who can seamlessly analyze thousands of reports, historical time series, and technical charts to identify a hidden investment opportunity. Or a cybersecurity engineer connecting system logs, network flows, and memory anomalies to detect a sophisticated attack in real time.
This is the power of multimodal transformers: a groundbreaking leap in AI, capable of combining diverse data sources to solve today’s most complex problems.
Classical Transformers:
- Description: Classical transformers handle a single modality, making them suitable for homogeneous data types (e.g., system logs in text format or syscall sequences).
- Example: Using a classical transformer to detect anomalies in endpoint event logs.
- Scenario: Monitoring PowerShell event logs on an endpoint. A classical transformer analyzes sequences of commands to identify suspicious activity, such as obfuscated scripts or commands that download files from malicious URLs.
- Result: The model flags suspicious commands such as Invoke-Expression (New-Object Net.WebClient).DownloadString(...); a minimal sketch of such a detector follows below.
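To make the contrast concrete, here is a minimal PyTorch sketch of such a single-modality detector. The vocabulary size, dimensions, and the idea of feeding pre-tokenized PowerShell commands are illustrative assumptions, not a production pipeline:

```python
# Minimal sketch of a single-modality (classical) transformer classifier for
# tokenized PowerShell command sequences. Sizes and tokenization are assumed.
import torch
import torch.nn as nn

class LogTransformer(nn.Module):
    def __init__(self, vocab_size=5000, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, 2)  # benign vs. suspicious

    def forward(self, token_ids):                 # (batch, seq_len)
        h = self.encoder(self.embed(token_ids))   # self-attention within one modality
        return self.classifier(h.mean(dim=1))     # pooled sequence representation

# Dummy token IDs standing in for one tokenized PowerShell log line:
model = LogTransformer()
logits = model(torch.randint(0, 5000, (1, 32)))  # -> (1, 2) benign/suspicious scores
```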
Multimodal Transformers:
Multimodal transformers combine multiple data sources simultaneously, using cross-attention layers to relate one modality to another (e.g., system logs and network anomalies), as sketched below.
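A minimal sketch of that cross-attention step, assuming each modality has already been encoded into a shared embedding width by its own encoder (not shown):

```python
# Cross-attention between two modalities using PyTorch's built-in attention.
# The log and network-flow embeddings are stand-ins for encoder outputs.
import torch
import torch.nn as nn

d_model = 128
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

log_tokens = torch.randn(1, 32, d_model)  # encoded system-log sequence
net_flows  = torch.randn(1, 16, d_model)  # encoded network-flow features

# Queries come from the logs; keys/values come from the network traffic, so
# each log event attends over the flows that might explain it.
fused, attn_weights = cross_attn(query=log_tokens, key=net_flows, value=net_flows)
```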
Scenario: Detecting a fileless malware attack in an enterprise environment.
- Logs (text): A suspicious PowerShell command triggers a new process.
- Network traffic (numerical data): Anomalous outbound communication with a C2 server.
- Memory artifacts (binary/structured): Injection of a malicious executable segment into a legitimate process.
How It Works: The multimodal transformer integrates these data streams (a fusion sketch follows this list):
- Connects the PowerShell log entry to the memory dump revealing a suspicious injected segment.
- Links this activity to outbound HTTPS traffic towards an unregistered domain.
- Flags the event as a coordinated attack across multiple modalities.
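Putting the pieces together, here is a hedged sketch of what such a fusion model could look like. All names and dimensions are hypothetical, and the per-modality encoders are reduced to linear projections for brevity:

```python
# Illustrative three-modality fusion head: project each stream into a shared
# width, let the log tokens cross-attend to memory and network features, and
# classify the pooled result.
import torch
import torch.nn as nn

class MultimodalDetector(nn.Module):
    def __init__(self, d_log=128, d_mem=64, d_net=32, d_model=128):
        super().__init__()
        self.log_proj = nn.Linear(d_log, d_model)
        self.mem_proj = nn.Linear(d_mem, d_model)
        self.net_proj = nn.Linear(d_net, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(d_model, 2)  # benign vs. coordinated attack

    def forward(self, logs, mem, net):
        q = self.log_proj(logs)                                # (B, L, d_model)
        context = torch.cat([self.mem_proj(mem),
                             self.net_proj(net)], dim=1)       # (B, M+N, d_model)
        fused, _ = self.cross_attn(q, context, context)        # logs attend to evidence
        return self.classifier(fused.mean(dim=1))

model = MultimodalDetector()
logits = model(torch.randn(1, 32, 128),  # PowerShell log embeddings
               torch.randn(1, 8, 64),    # memory-artifact features
               torch.randn(1, 16, 32))   # network-flow features
```

In a real system each projection would be a full per-modality encoder, but the fusion pattern stays the same.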
Challenges
Complex Alignment
Aligning data from different modalities, such as event logs (text) and memory dumps (binary), is challenging because they lack a shared structure. Multimodal transformers must translate them into a shared latent space (a training sketch follows the examples below).
- Example 1: For aligning a cat image with a text caption like “A cat playing in a garden,” the model must identify visual features (the cat) and understand the semantics of the text (the relationship between “cat” and “garden”). Misalignment could occur if the model fails to recognize this link.
- Example 2 (Cybersecurity): Fileless malware modifies a process in memory and executes malicious commands via PowerShell. The model must align the PowerShell logs (text) with the memory-dump data (structured/binary) and detect that the PowerShell command injected malicious code into a legitimate process.
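One common way to build that shared latent space is a CLIP-style contrastive objective: project each modality to the same embedding width and train matching pairs to be close. A minimal sketch, with stand-in encoders and a toy batch:

```python
# Aligning two modalities in a shared latent space with a contrastive loss.
# The projections and random inputs are stand-ins for real encoder outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

log_proj = nn.Linear(128, 64)   # projects log-encoder output
mem_proj = nn.Linear(256, 64)   # projects memory-dump-encoder output

log_vecs = F.normalize(log_proj(torch.randn(8, 128)), dim=-1)
mem_vecs = F.normalize(mem_proj(torch.randn(8, 256)), dim=-1)

# Similarity of every log embedding against every memory embedding; the i-th
# log should match the i-th memory dump, so the target is the diagonal.
sim = log_vecs @ mem_vecs.T / 0.07          # temperature-scaled cosine similarity
targets = torch.arange(8)
loss = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.T, targets)) / 2
```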
Model Size
Multimodal transformers require significantly more parameters to process and integrate multiple modalities, increasing memory usage and inference time.
- Example (Cybersecurity): A multimodal transformer trained to analyze PowerShell logs, memory dumps, and network traffic simultaneously will have far more parameters than a classical transformer analyzing only PowerShell logs; a quick way to measure this is sketched below.
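A rough way to quantify the gap is to count trainable parameters; a small helper you could apply to the sketches above:

```python
# Total trainable parameters: a rough proxy for memory footprint. Every extra
# per-modality encoder and cross-attention fusion layer adds to this count.
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```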
Data Requirements
Multimodal transformers demand large, annotated, and well-aligned datasets for pretraining. For cybersecurity, this could include logs, binary files, and network artifacts labeled with attack-specific metadata.
- Example (Cybersecurity): Training a multimodal transformer to detect fileless malware might require (see the dataset sketch after this list):
- Annotated PowerShell logs with labels for benign vs. malicious commands.
- Memory dumps showing processes with injected code.
- Network logs indicating malicious traffic patterns.
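A hypothetical PyTorch Dataset showing how such aligned samples could be packaged; the field names and shapes are assumptions about upstream preprocessing:

```python
# Hypothetical Dataset returning one *aligned* training sample per incident:
# log tokens, memory features, network features, and a shared label.
import torch
from torch.utils.data import Dataset

class FilelessMalwareDataset(Dataset):
    def __init__(self, samples):
        # samples: list of dicts with pre-extracted, time-aligned features
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        s = self.samples[idx]
        return (torch.as_tensor(s["log_tokens"]),    # tokenized PowerShell log
                torch.as_tensor(s["mem_features"]),  # memory-dump descriptors
                torch.as_tensor(s["net_features"]),  # flow statistics
                torch.as_tensor(s["label"]))         # 0 = benign, 1 = attack
```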
Interpretability
Multimodal transformers are often “black boxes,” meaning their decision-making processes are difficult to interpret. This is problematic in cybersecurity, where understanding why an alert is raised is crucial.
- Example 1: If a model flags an image of a dog as “wild animal,” it might be unclear which feature led to this conclusion.
- Example 2 (Cybersecurity): If a model detects a compromised process, it can be hard to determine whether this decision was based on logs, network anomalies, or both. This lack of transparency complicates incident response (see the probe sketched below).
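One partial mitigation is to inspect the cross-attention weights and estimate how much attention mass each modality received. This is a heuristic probe, not a faithful explanation; the sketch assumes the evidence sequence concatenates memory features first, then network features:

```python
# Rough interpretability probe: average the attention weights returned by
# nn.MultiheadAttention to estimate which evidence stream drove the decision.
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(128, num_heads=4, batch_first=True)
logs = torch.randn(1, 32, 128)
context = torch.randn(1, 24, 128)   # first 8 = memory features, last 16 = network

_, weights = cross_attn(logs, context, context)   # weights: (1, 32, 24)
per_source = weights.mean(dim=1).squeeze(0)       # attention mass per evidence slot
mem_share = per_source[:8].sum().item()
net_share = per_source[8:].sum().item()
print(f"memory: {mem_share:.2f}  network: {net_share:.2f}")
```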
Design Differences Between Classical and Multimodal Transformers
Classical Transformers:
- Focus on a single modality (e.g., text in GPT or BERT).
- Use self-attention layers to extract contextual relationships within a single sequence.
Multimodal Transformers:
- Add mechanisms to combine multiple modalities.
- Often leverage cross-attention layers to enable interaction between text, visual, and numerical data, as the snippet below contrasts with self-attention.
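The design difference fits in two lines of PyTorch: the same attention module acts as self-attention or cross-attention depending on where the queries, keys, and values come from:

```python
# Self-attention relates a sequence to itself; cross-attention relates one
# modality's sequence to another's. Shapes here are illustrative.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
text  = torch.randn(1, 32, 128)   # single-modality sequence
image = torch.randn(1, 49, 128)   # second modality (e.g., image patches)

self_out, _  = attn(text, text, text)     # classical: Q, K, V from one modality
cross_out, _ = attn(text, image, image)   # multimodal: Q from text, K/V from image
```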
Real-World Attack Scenario: Fileless Malware
Step 1: Exploitation
The attacker exploits a vulnerability in an application (e.g., a web server) to inject code directly into memory, bypassing file-based detection systems.
Indicators:
- System logs: Unusual API calls.
- Memory artifacts: A newly created executable segment in a legitimate process, visible in memory dumps.
Step 2: Lateral Movement
The malware executes commands via PowerShell to escalate privileges or move laterally across the network, erasing traces.
Indicators:
- PowerShell logs: Commands like Invoke-WebRequest used for data exfiltration.
- Memory artifacts: Traces of executed commands remain in PowerShell’s memory but are absent from logs.
Step 3: Data Exfiltration
The malware communicates with a C2 server, often using encrypted traffic on non-standard ports.
Indicators:
- Network traffic: Unusual DNS queries to recently registered domains or connections to suspicious IPs over uncommon ports (e.g., 4433).
How a Multimodal Transformer Handles This Scenario
Data Sources:
- System logs: Detect unusual API calls.
- Memory artifacts: Identify injected code segments.
- PowerShell logs: Analyze commands executed and attempts to erase evidence.
- Network traffic: Detect DNS anomalies and suspicious IP communications.
Reasoning (an inference sketch follows this list):
- The transformer links API calls to injected memory segments.
- Correlates these findings with PowerShell commands for exfiltration.
- Identifies network activity matching the exfiltration timeline, confirming a coordinated attack.
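Operationally, this reasoning would be a forward pass over time-aligned feature windows. A hypothetical inference step, reusing the MultimodalDetector sketch from earlier; the 0.9 threshold is an operational choice, not part of the model:

```python
# Hypothetical inference over one time-aligned window of the three evidence
# streams, using the MultimodalDetector class defined earlier in the article.
import torch

log_window = torch.randn(1, 32, 128)  # PowerShell log embeddings for the window
mem_window = torch.randn(1, 8, 64)    # memory-artifact features
net_window = torch.randn(1, 16, 32)   # network-flow features

model = MultimodalDetector()
model.eval()
with torch.no_grad():
    logits = model(log_window, mem_window, net_window)
    p_attack = torch.softmax(logits, dim=-1)[0, 1].item()

if p_attack > 0.9:  # alerting threshold chosen by the operator
    print(f"ALERT: probable coordinated fileless attack (p={p_attack:.2f})")
```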
Conclusion
Multimodal transformers excel in detecting such attacks by analyzing diverse data sources and identifying relationships that would be missed by classical models.
Closing Note:
This article was written with a strictly educational and defensive purpose, aiming to showcase the potential of multimodal transformers in detecting and preventing advanced threats. Any misuse or malicious application of the information contained in this article is strictly prohibited and goes against ethical principles as well as national and international laws. Cybersecurity is about protection, not exploitation. Let’s work together for a safer and more responsible cyberspace.
Sirine Amrane