On deceiving malware classification with section injection

This repo provides the official implementation for "On deceiving malware classification with section injection", available at: https://arxiv.org/abs/2208.06092

Installation

Clone the project.

git clone https://github.com/adeilsonsilva/malware-injection

Using the Docker Container

Copy your datasets to the data directory; the container mounts it as a volume.

Run the following script to build and run the image:

./run.sh

If you run this script, you're all set to use the machine learning models. To use GIST, you're better off using the virtual environment (it requires some quirky and outdated libraries).

Using a virtual environment

python3 -m venv .
source bin/activate
pip3 install -r gist-requirements.txt  # no --user: pip rejects user installs inside a venv

# * Do whatever you want *

deactivate # quit from venv

External Dependencies

Data injection

This repo depends on pe-modifier as a git submodule. Remember to install it by using:

git submodule update --init

NVIDIA Drivers

Our models were trained using TensorFlow GPU 2.3.0, which uses CUDA 10.1 [Source]. To proceed with the installation:

You can use nvidia-docker to run the provided container with host GPUs, assuming you have everything set up locally. Check out nvidia-docker from its source or install it using this script.
You can use this script to install all NVIDIA drivers and CUDA 10.1 locally on your machine.
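
As a quick sanity check after installing the drivers, something like the following can confirm whether TensorFlow sees a GPU. This is a minimal sketch and not part of the repository: visible_gpus is a hypothetical helper name, and it simply returns an empty list when TensorFlow is not installed in the current environment.

```python
import importlib.util


def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or [] if none are found."""
    # Assumption: TensorFlow may be absent from the current environment,
    # in which case we report no GPUs instead of raising ImportError.
    if importlib.util.find_spec("tensorflow") is None:
        return []
    import tensorflow as tf

    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    print(f"{len(gpus)} GPU(s) visible to TensorFlow: {gpus}")
```

An empty list inside the container usually means the container was started without GPU access or the driver/CUDA versions do not match what TensorFlow expects.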

Internal Libraries

Running without Docker

If you don't want to use Docker (you should!), make sure to install the following libraries:

python3
python3-pip
libfftw3-3

Then install the Python requirements to use the machine learning models:

cd code
pip3 install -r requirements.txt

GIST

If you're interested in using the GIST algorithm, install its dependencies:

cd code
pip3 install -r gist-requirements.txt
cd ../dependencies/pyleargist-2.0.5/
python3 setup.py build
python3 setup.py install --user

Usage

Running the code

This project is structured as separate scripts, all located in the code directory. Change to that directory if you are not using the Docker container.

The main scripts for training/handling data are inside src:

├── src
│   ├── gen_dataset_npz.py           # Converts an existing dataset to npz
│   ├── gen_headerless_dataset.py    # Generates a headerless version of the dataset
│   ├── gen_injected_dataset_npz.py  # Generates an injected dataset (.npz)
│   └── run_ml_model.py              # Main script used for training/testing.

You can also look in the models directory to see the architectures used:

├── models
│   ├── Augmenter.py     # Module with code used for data augmentation (data injection/reordering)
│   ├── Chen2018.py      # Module with Inception architecture
│   ├── Data.py          # Main data handler module with various wrappers
│   ├── Le2018.py        # Module with cnn/lstm variations
│   ├── Nataraj2011.py   # Module with KNN

Citation

To cite the paper, kindly use the following BibTeX entry:

@misc{Silva2022,
  doi = {10.48550/ARXIV.2208.06092},
  url = {https://arxiv.org/abs/2208.06092},
  author = {da Silva, Adeilson Antonio and Segundo, Mauricio Pamplona},
  keywords = {Cryptography and Security (cs.CR), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {On deceiving malware classification with section injection},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution Non Commercial No Derivatives 4.0 International}
}

Troubleshooting

A. My dataset is not loaded correctly

A.1 - The dataset handler expects the following directory layout:

dataset/
├── benign
│   ├── sample1.exe
│   ├── ...
│   └── sampleN.exe
└── malware
    ├── sample1.exe
    ├── ...
    └── sampleN.exe

If you are not working on the binary (benign vs. malware) problem, check that your malware families are in the allowed list.
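
To check a dataset against the layout above before training, a small script along these lines may help. It is only a sketch: list_samples is a hypothetical helper, not the repository's Data.py handler, and it assumes the two class folders are named exactly benign and malware as shown in the tree.

```python
from pathlib import Path


def list_samples(dataset_dir, classes=("benign", "malware")):
    """Return (path, label) pairs for every file found in the class folders."""
    root = Path(dataset_dir)
    samples = []
    for label in classes:
        class_dir = root / label
        # Fail early with a clear message when a class folder is missing.
        if not class_dir.is_dir():
            raise FileNotFoundError(f"missing class folder: {class_dir}")
        for path in sorted(class_dir.iterdir()):
            if path.is_file():
                samples.append((path, label))
    if not samples:
        raise ValueError(f"no samples found under {root}")
    return samples


if __name__ == "__main__":
    import sys

    # Usage: python check_dataset.py path/to/dataset
    found = list_samples(sys.argv[1])
    for label in ("benign", "malware"):
        count = sum(1 for _, sample_label in found if sample_label == label)
        print(f"{label}: {count} sample(s)")
```

If the script raises FileNotFoundError, the folder names or nesting differ from what the handler expects.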

Different performance when using png vs exe inputs with BiCNN-LSTM

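import cv2
import numpy as np

# path_img, path_exe, height and width are assumed to be defined by the caller.

# Load as image: decode the png, then flatten the pixels into a single column
image          = cv2.imread(path_img, cv2.IMREAD_GRAYSCALE)
image_reshaped = image.reshape(image.shape[0]*image.shape[1], 1)
image_final    = cv2.resize(image_reshaped, (height, width))

# Load as exe: read the raw bytes as a single column
bin_stream          = np.fromfile(path_exe, dtype='uint8')
bin_stream_reshaped = bin_stream.reshape(bin_stream.shape[0], 1)
bin_final           = cv2.resize(bin_stream_reshaped, (height, width))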

Those methods may produce different results. np.fromfile is not suitable for opening png images: it returns the raw encoded file contents (PNG header and compressed chunks) rather than the decoded pixel values. Use it strictly for raw binary (or text) files, as per its documentation.
When converting exes to images using Nataraj's method, some bytes at the end of the file might be discarded, so loading an image and an exe with the methods above may give different results after reshaping/resizing.
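
The effect of the discarded trailing bytes can be seen in a toy example. This is a sketch under stated assumptions, not the repository's conversion code: bytes_to_image is a hypothetical helper, the width of 4 is an arbitrary toy value, and the 10-byte array stands in for the output of np.fromfile on a real executable.

```python
import numpy as np


def bytes_to_image(byte_stream, width):
    """Reshape a 1-D uint8 stream into rows of `width` pixels, dropping the tail."""
    n_rows = len(byte_stream) // width
    # Bytes that do not fill a complete row are discarded.
    return byte_stream[: n_rows * width].reshape(n_rows, width)


stream = np.arange(10, dtype=np.uint8)   # stand-in for the bytes of an exe
image = bytes_to_image(stream, width=4)  # hypothetical image width

print(image.shape)               # (2, 4)
print(stream.size - image.size)  # 2 trailing bytes were discarded
```

Because the image holds fewer bytes than the original file, flattening it back does not reproduce the byte stream read directly from the exe, which is one reason the two loading paths can diverge.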

License

Copyright 2022 Adeilson Silva

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
