MIDS Capstone Project Spring 2025

MalwareLens

Team members

Problem & Motivation

In today’s rapidly evolving cybersecurity landscape, the volume and sophistication of malware threats continue to outpace traditional detection methods. Security analysts are overwhelmed, and many organizations, especially small to mid-sized businesses, lack the resources or expertise to identify and respond to malicious files effectively. Meanwhile, existing malware detection tools often require advanced knowledge, leaving non-specialists and entry-level analysts at a significant disadvantage.

The challenge is clear: How can organizations of all sizes accurately detect and understand malware without relying on scarce expert resources? MalwareLens was created to solve this pressing issue, leveraging cutting-edge AI to democratize malware detection, reduce manual effort, and empower users with intuitive, real-time threat insights.

Data Source & Data Science Approach

MalwareLens leverages state-of-the-art AI to transform malware detection into a scalable, intuitive, and user-friendly experience. Here's how we built it:

Data Source:

Our pipeline is trained on a curated dataset of malware executables, primarily sourced from publicly available malware repositories. To simulate real-world diversity, we incorporated malware spanning various families such as ransomware, spyware, and trojans. Executable files were capped at 10 MB to ensure fast processing and optimal cloud performance. The dataset was preprocessed to transform raw binaries into structured formats suitable for deep learning, including hexadecimal representations and frequency-domain transformations.
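The size cap and hexadecimal representation described above could be sketched as follows. This is a minimal illustration, not the project's actual loader: the helper names `load_capped` and `to_hex_pairs` are hypothetical, and reading 10 MB as 10 * 1024 * 1024 bytes is an assumption.

```python
import os
import tempfile

# Assumption: the 10 MB cap is read here as 10 * 1024 * 1024 bytes.
MAX_BYTES = 10 * 1024 * 1024


def load_capped(path, max_bytes=MAX_BYTES):
    """Return the file's bytes, or None if the file exceeds the size cap."""
    if os.path.getsize(path) > max_bytes:
        return None
    with open(path, "rb") as fh:
        return fh.read()


def to_hex_pairs(data):
    """Render bytes as two-character hex strings, one per byte."""
    return [f"{b:02x}" for b in data]


if __name__ == "__main__":
    # Write a tiny stand-in file and run it through both helpers.
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.bin")
        with open(sample, "wb") as fh:
            fh.write(b"MZ\x90\x00")
        print(to_hex_pairs(load_capped(sample)))
```

Files rejected by the cap return `None`, so a caller can skip them without raising an error.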

Model Development:

Our detection engine is powered by a Convolutional Neural Network (CNN) trained to classify malware using a novel image-based approach. Executable files are converted into visual representations through the Discrete Cosine Transform (DCT), enabling the model to detect patterns and anomalies in the frequency domain.
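For reference, the most common form of the two-dimensional DCT (type II with orthonormal scaling) of an N×N image f(x, y) is shown below. Whether the project used this exact variant and normalization is an assumption; the symbols f, F, N, and α are introduced here for illustration.

```latex
\[
F(u,v) = \alpha(u)\,\alpha(v)
\sum_{x=0}^{N-1}\sum_{y=0}^{N-1}
f(x,y)\,
\cos\!\left[\frac{(2x+1)\,u\,\pi}{2N}\right]
\cos\!\left[\frac{(2y+1)\,v\,\pi}{2N}\right],
\qquad
\alpha(k) =
\begin{cases}
\sqrt{1/N}, & k = 0,\\[2pt]
\sqrt{2/N}, & k > 0.
\end{cases}
\]
```

Low indices (u, v) capture slowly varying structure in the image, while high indices capture fine detail.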

Key Steps in the Data Science Workflow:

Preprocessing & Transformation:
- Executable binaries are converted into bigram hex images.
- DCT is applied to map these hex structures into 2D images, enabling the CNN to learn visual patterns of malware.
Model Architecture:
- Best-performing architecture: CNN with 2 convolutional layers, achieving 95% test accuracy.
- Input: Transformed DCT images of binaries.
- Output: Malware classification score, expressed as a percentage.
Loss Function & Training:
- Utilized standard cross-entropy loss with accuracy metrics for classification performance.
- Training was performed on AWS SageMaker with containerized models deployed via ECR for scalability.
Challenges & Tradeoffs:
- Limited access to recent malware datasets due to security restrictions.
- Tradeoff between visual pattern accuracy and raw hexadecimal fidelity in DCT transformation.
- The dataset, which dates from 2021, offered scale but raised concerns about relevance to current threats.
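The preprocessing step in the workflow above (byte bigrams followed by a DCT) can be illustrated with a short, dependency-free sketch. It assumes that a "bigram hex image" means a 256×256 grid whose cell [a][b] holds the relative frequency of byte a being followed by byte b, and that the transform is the orthonormal DCT-II. Both are assumptions rather than confirmed project details, and the function names are hypothetical.

```python
import math


def bigram_image(data, size=256):
    """Cell [a][b] is the relative frequency of byte a followed by byte b."""
    img = [[0.0] * size for _ in range(size)]
    for a, b in zip(data, data[1:]):
        img[a][b] += 1.0
    total = max(len(data) - 1, 1)
    # Normalize so that file length does not dominate the pixel values
    # (the normalization is an assumption of this sketch).
    return [[v / total for v in row] for row in img]


def dct_1d(vec):
    """Orthonormal DCT-II of a sequence (naive O(n^2) loops)."""
    n = len(vec)
    out = []
    for k in range(n):
        s = sum(
            vec[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d(img):
    """Separable 2D DCT-II: transform every row, then every column."""
    rows = [dct_1d(r) for r in img]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    # cols is indexed [v][u]; transpose back to [u][v].
    return [list(r) for r in zip(*cols)]


if __name__ == "__main__":
    # Toy input with byte values below 4 so a 4x4 grid is enough.
    toy = bytes([0, 1, 0, 1, 2])
    coeffs = dct_2d(bigram_image(toy, size=4))
    print(len(coeffs), len(coeffs[0]))
```

The naive loops are slow at the full 256×256 size. In practice a library routine such as `scipy.fft.dctn(image, type=2, norm="ortho")` would likely replace `dct_2d`.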

Production-Grade Deployment:

The full AI pipeline is deployed on AWS Cloud Infrastructure, ensuring seamless, scalable service delivery:

Compute & Storage:
- AWS Fargate for serverless model execution.
- Amazon S3 for storing user-submitted files and model outputs.
Model Serving:
- Inference hosted on AWS SageMaker, enabling real-time predictions.
- API endpoints allow direct interaction between the LLM agent and malware classifiers.
Intelligent Agent Integration:
- A LangChain-powered LLM agent (Gemini) enables users to ask natural language questions about file structure, function calls, or other malware characteristics.
- Integration with tools like VirusTotal and Ghidra enhances static analysis.
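One way the classifier endpoint could be exposed to an agent as a plain function is sketched below. `invoke_endpoint` is the real call on the boto3 `sagemaker-runtime` client, but the endpoint name `malwarelens-cnn`, the content type, and the JSON reply fields are hypothetical placeholders, not details from the project. A small fake client stands in for AWS so the sketch runs offline.

```python
import io
import json


def classify_file(file_bytes, client, endpoint_name="malwarelens-cnn"):
    """Send raw bytes to a SageMaker endpoint and return the parsed JSON reply.

    The endpoint name and the reply shape are hypothetical placeholders.
    """
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=file_bytes,
    )
    # The real client returns a streaming body; read() yields bytes.
    return json.loads(response["Body"].read())


class FakeRuntime:
    """Offline stand-in for boto3.client("sagemaker-runtime")."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        reply = {
            "endpoint": EndpointName,
            "malware_probability": 0.5,  # made-up value for illustration
            "bytes": len(Body),
        }
        return {"Body": io.BytesIO(json.dumps(reply).encode())}


if __name__ == "__main__":
    print(classify_file(b"MZ", FakeRuntime()))
```

Against a live deployment, the fake would be swapped for `boto3.client("sagemaker-runtime")`, and the agent framework would register `classify_file` as one of its tools.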

Evaluation

Model performance was assessed primarily by accuracy on a labeled malware test dataset, that is, the proportion of test samples for which the model’s prediction (malware present or not) matches the true label. While this metric captures overall effectiveness, we also qualitatively reviewed the model’s interpretability.
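The accuracy metric can be written out in a few lines. The sketch below also computes recall on the malware class, because "correctly classified malware samples" can be read either way. The label convention (1 for malware, 0 for benign) and the function names are assumptions for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives (malware) that the model flagged."""
    actual = sum(t == positive for t in y_true)
    if actual == 0:
        raise ValueError("no positive samples in y_true")
    hits = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return hits / actual


if __name__ == "__main__":
    # Toy labels: 1 = malware, 0 = benign (assumed convention).
    truth = [1, 1, 0, 0, 1]
    preds = [1, 0, 0, 1, 1]
    print(accuracy(truth, preds), recall(truth, preds))
```

On a test set dominated by one class, the two numbers can diverge considerably.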

To benchmark our performance, we compared our custom CNN with two convolutional layers against conventional machine learning approaches reported in the academic literature. Our model achieved 95% accuracy on the test dataset, closely aligning with the top results reported in similar studies using frequency-domain transformations such as DCT.

However, evaluation came with several caveats:

Data Constraints: Due to limited access to recent and diverse malware samples, we relied on a 2021 dataset. Although this provided volume, it may lack the latest malware variants.
Transformation Tradeoffs: Converting binaries into DCT-based image representations improved CNN compatibility but came at the cost of losing raw byte-level granularity, which may hinder the detection of certain obfuscation techniques.
Model Generalization: The image-based classification approach showed strong accuracy but may require further tuning when applied to novel or packed malware, where visual signatures deviate significantly.

Despite these constraints, the model performed strongly on the held-out test dataset. Its integration with a large language model (LLM) further enhances usability by offering interpretive, natural language insights, bridging the gap between technical analysis and human decision-making.

Key Learnings & Impact

Impact

MalwareLens empowers users, regardless of their technical background, to identify and understand malware through an intuitive, AI-powered platform.
CNN-based malware classification achieves high accuracy while reducing the manual effort typically required by cybersecurity analysts.
LLM Integration provides real-time, natural language insights into malware behavior, structure, and classification, bridging the cybersecurity skills gap for entry-level analysts and general users.
Cloud-Native Architecture enables scalable, accessible malware detection without requiring local software installation.

Top Technical Challenges

Data Availability: Access to recent and diverse malware datasets is limited due to strict containment policies and licensing restrictions.
Non-Standard Input Representation: Converting binary executables into frequency-domain images introduced transformation tradeoffs, sacrificing some raw byte fidelity for model compatibility.
Model Complexity vs. Interpretability: Balancing CNN performance with the ability to explain model outputs in an educational, user-friendly way.
Tool Integration: Customizing LangChain’s agent framework to support real-time tool invocation via LLMs while managing bugs in third-party libraries (e.g., HuggingFace tools).
Cloud Deployment: Deploying a secure, containerized architecture using AWS SageMaker, ECS, and Fargate posed technical and cost-related hurdles.

Future Work

Expand Dataset Coverage by incorporating newer malware variants and obfuscation techniques to improve real-world generalization.
Implement Dynamic Analysis through more advanced sandboxing capabilities that allow deeper behavioral tracking of suspicious files.
Improve LLM Agent Autonomy by refining the ReAct framework to reduce reliance on rigid tool-chaining and improve multi-step reasoning.
Introduce RAG-Enhanced Intelligence using retrieval-augmented generation (RAG) for the LLM to better source context from malware knowledge bases and documentation.
Optimize Model Efficiency to enable local, lightweight deployment for organizations with limited cloud access.

Acknowledgements

We extend our sincere gratitude to our project advisors and the UC Berkeley W210 course instructors for their invaluable guidance and support throughout this project. Special thanks to the organizations and open-source communities that provided access to malware datasets, tools like Ghidra, and APIs such as VirusTotal. We’d also like to thank the developers and contributors behind LangChain and AWS for their robust infrastructure and documentation. Lastly, we are grateful to cybersecurity professionals who shared their insights and feedback, helping us shape MalwareLens into a more impactful and user-centered solution.

References

Yuan, Ziang, Zhenguang Liu, and Yanjun Qi. 2021. “Image-based Malware Classification with Ensemble Learning.” arXiv. https://doi.org/10.48550/arXiv.2101.10578.

Course

Data Science 210. Capstone, Spring 2025

Last updated: March 29, 2025