GitHub - DanielLin1986/function_representation_learning: Vulnerability Discovery with Function Representation Learning from Unlabeled Projects

Function Representation Learning for Vulnerability Discovery

Hi there, welcome to this page!

This page contains the code and data used in the paper "Vulnerability Discovery with Function Representation Learning from Unlabeled Projects" by Guanjun Lin, Jun Zhang, Wei Luo, Lei Pan and Yang Xiang.

Requirements:

TensorFlow
Keras
Python >= 2.7
CodeSensor

The dependencies can be installed using Anaconda. For example:

$ bash Anaconda3-5.0.1-Linux-x86_64.sh

Instructions:

The Vulnerabilities_info.xlsx file contains information about the collected function-level vulnerabilities. These vulnerabilities come from 3 open source projects: FFmpeg, LibTIFF and LibPNG. The vulnerability information was collected from the National Vulnerability Database (NVD) up to mid-July 2017.

The "Data" folder contains a Zip file for each of the 3 projects, holding the source code of its vulnerable and non-vulnerable functions. After unzipping the files, one will find that the source file of each vulnerable function is named with its CVE ID. The non-vulnerable functions are named in the "{filename}_{functionname}.c" format.
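
The naming convention above means a label can be derived from the file name alone. The following is a minimal sketch, not code from the repository. It assumes that vulnerable files begin with a CVE ID of the form `CVE-YYYY-NNNN` (the exact names in the archives may differ), and `label_from_filename` is a hypothetical helper.

```python
import re

# Assumption: vulnerable files start with a CVE ID such as "CVE-2015-1234".
CVE_PATTERN = re.compile(r"^CVE-\d{4}-\d{4,}", re.IGNORECASE)


def label_from_filename(filename):
    """Return 1 for a file named after a CVE ID, otherwise 0."""
    return 1 if CVE_PATTERN.match(filename) else 0


# Hypothetical file names, for illustration only.
examples = ["CVE-2015-1234.c", "utils.c_read_header.c"]
labels = [label_from_filename(name) for name in examples]
```

Matching on the prefix rather than the whole name tolerates suffixes after the CVE ID, which may or may not occur in the actual data.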

The "Code" folder contains the Python code samples.

ProcessCFilesWithCodeSensor.py invokes CodeSensor to parse functions into ASTs in serialized format (for details and usage of CodeSensor, please visit the author's blog: http://codeexploration.blogspot.com.au/).
ProcessRawASTs_DFT.py processes the output of ProcessCFilesWithCodeSensor.py and converts the serialized ASTs to textual vectors.
BlurProjectSpecific.py blurs the project-specific content and converts the textual vectors (the output of ProcessRawASTs_DFT.py) to numeric vectors, which can be used as the input of ML algorithms.
LSTM.py contains a Python code sample implementing an LSTM network based on Keras with the TensorFlow backend.
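
The blurring and numeric-conversion steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in BlurProjectSpecific.py: the set of generic tokens, the `id_N` placeholder scheme, the right-padding with zeros, and the helper names `blur_tokens`, `build_vocab` and `to_numeric` are all hypothetical.

```python
# Assumption: each function is already a list of AST tokens in depth-first order.
# Tokens in this (hypothetical) set are treated as generic; every other token is
# considered project specific and gets blurred.
GENERIC_TOKENS = {"func", "params", "param", "stmts", "decl", "if", "for",
                  "while", "return", "call", "op", "int", "char", "void"}


def blur_tokens(tokens):
    """Replace project-specific names with placeholders, consistently per function."""
    mapping = {}
    blurred = []
    for tok in tokens:
        if tok in GENERIC_TOKENS:
            blurred.append(tok)
        else:
            if tok not in mapping:
                mapping[tok] = "id_%d" % len(mapping)
            blurred.append(mapping[tok])
    return blurred


def build_vocab(sequences):
    """Assign integer IDs starting at 1; 0 is reserved for padding."""
    vocab = {}
    for seq in sequences:
        for tok in seq:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def to_numeric(sequences, vocab, max_len):
    """Map tokens to IDs, truncate to max_len, and right-pad with zeros."""
    vectors = []
    for seq in sequences:
        ids = [vocab[tok] for tok in seq][:max_len]
        vectors.append(ids + [0] * (max_len - len(ids)))
    return vectors


# Invented token sequences, for illustration only.
functions = [
    ["func", "int", "png_read_row", "params", "param", "png_ptr",
     "stmts", "return", "png_ptr"],
    ["func", "void", "tiff_close", "stmts", "call", "tiff_close"],
]
blurred = [blur_tokens(f) for f in functions]
vocab = build_vocab(blurred)
vectors = to_numeric(blurred, vocab, max_len=10)
```

After blurring, `png_read_row` and `tiff_close` both become `id_0`, so functions from different projects share a vocabulary. The fixed-length integer vectors are the kind of input an embedding layer followed by an LSTM would expect.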

We used Understand, a commercial code enhancement tool, to extract function-level code metrics. The CodeMetrics.xlsx file includes 23 code metrics extracted from the vulnerable functions of the 3 projects.

Possible Future Work

In our paper, we randomly selected one code metric, "essential complexity", as the proxy (a substitute for the actual label). It would be interesting to examine whether performance can be further improved by combining multiple code metrics, since multiple metrics provide more information and may be more indicative of potential vulnerability (e.g. overly complex code is difficult to understand and therefore harder to debug and maintain).
The proposed LSTM network structure is fairly simple. We believe that performance can be improved with a more complex network structure, for instance by adding pooling layers and/or dropout. One can even try an attention mechanism with the LSTM.
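
As an illustration of the first idea, one hypothetical way to combine several metrics into a single proxy score is to standardize each metric across functions and average the results. This is not the method used in the paper. The metric names, the values and the `combined_proxy` helper below are invented for the example, and the sketch uses the Python 3 `statistics` module.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list of values; a constant metric contributes zeros."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def combined_proxy(metrics):
    """metrics: dict of metric name -> list of values, one per function.

    Returns one score per function: the mean of its standardized metrics.
    """
    standardized = [zscores(vals) for vals in metrics.values()]
    return [mean(column) for column in zip(*standardized)]


# Invented metric values for three functions, for illustration only.
metrics = {
    "essential_complexity": [1, 4, 9],
    "cyclomatic_complexity": [2, 8, 20],
    "lines_of_code": [10, 10, 10],
}
proxy = combined_proxy(metrics)
```

Standardizing first keeps a metric with a large numeric range from dominating the average. Whether such a combined score is a better proxy than a single metric is exactly the open question raised above.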

Contact:

You are welcome to improve our code as well as our method. Please cite our paper if you use the code/data in your work. For more data or other inquiries, please contact: [email protected].

Thanks!
