Vulnerability Discovery with Function Representation Learning from Unlabeled Projects

DanielLin1986/function_representation_learning

Function Representation Learning for Vulnerability Discovery

Hi there, welcome to this page!

The page contains the code and data used in the paper Vulnerability Discovery with Function Representation Learning from Unlabeled Projects by Guanjun Lin, Jun Zhang, Wei Luo, Lei Pan and Yang Xiang.

Requirements:

The dependencies can be installed using Anaconda. For example:

$ bash Anaconda3-5.0.1-Linux-x86_64.sh

Instructions:

The Vulnerabilities_info.xlsx file contains information about the collected function-level vulnerabilities. These vulnerabilities come from 3 open source projects: FFmpeg, LibTIFF and LibPNG. The vulnerability information was collected from the National Vulnerability Database (NVD) up to mid-July 2017.

The "Data" folder contains the source code of the vulnerable and non-vulnerable functions, packaged in the Zip files of the 3 projects. After unzipping the files, you will find that the source file of each vulnerable function is named with its CVE ID, while the non-vulnerable functions are named in the "{filename}_{functionname}.c" format.
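Because the label is encoded in the file name, the unzipped data can be labeled without any extra metadata. The snippet below is a minimal sketch of that idea, not part of the original code. It assumes that vulnerable files start with an ID of the form `CVE-YYYY-NNNN`; the function names `label_from_filename` and `label_directory` and the file names in the usage note are hypothetical.

```python
import re
from pathlib import Path
from typing import Dict

# Assumption: a vulnerable function's file name starts with its CVE ID,
# e.g. "CVE-2017-0000.c". Everything else is treated as non-vulnerable.
CVE_PATTERN = re.compile(r"^CVE-\d{4}-\d{4,}", re.IGNORECASE)


def label_from_filename(filename: str) -> int:
    """Return 1 if the file name starts with a CVE ID, otherwise 0."""
    return 1 if CVE_PATTERN.match(Path(filename).name) else 0


def label_directory(root: str) -> Dict[str, int]:
    """Map every .c file found under `root` to its 0/1 label."""
    return {
        str(path): label_from_filename(path.name)
        for path in sorted(Path(root).rglob("*.c"))
    }
```

For example, `label_from_filename("CVE-2017-0000.c")` gives 1, while a name in the "{filename}_{functionname}.c" style such as `example.c_do_work.c` gives 0.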

The "Code" folder contains the Python code samples.

  1. ProcessCFilesWithCodeSensor.py invokes CodeSensor to parse functions into ASTs in serialized format (for detailed information on CodeSensor and its usage, please visit the author's blog: http://codeexploration.blogspot.com.au/).
  2. ProcessRawASTs_DFT.py processes the output of ProcessCFilesWithCodeSensor.py and converts the serialized ASTs to textual vectors.
  3. BlurProjectSpecific.py blurs the project-specific content and converts the textual vectors (the output of ProcessRawASTs_DFT.py) to numeric vectors that can be used as the input of ML algorithms.
  4. LSTM.py contains a Python code sample for implementing an LSTM network based on Keras with the TensorFlow backend.
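Steps 2 and 3 above can be illustrated with a small, self-contained sketch. It is not the code from this repository: the nested-tuple AST stands in for CodeSensor's serialized output, the single `<id>` placeholder is only one possible way to blur project-specific names, and all function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical AST node: (node_type, value, children). The real CodeSensor
# output is a serialized text format; this tuple only stands in for it.
Node = Tuple[str, str, list]


def depth_first_tokens(node: Node) -> List[str]:
    """Flatten an AST into a token list by pre-order depth-first traversal."""
    node_type, value, children = node
    tokens = [value if value else node_type]
    for child in children:
        tokens.extend(depth_first_tokens(child))
    return tokens


def blur(tokens: List[str], keep: Set[str]) -> List[str]:
    """Replace every token outside `keep` with a generic placeholder."""
    return [tok if tok in keep else "<id>" for tok in tokens]


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Assign integer IDs in order of first appearance; 0 is kept for padding."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab


def encode(seq: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map tokens to IDs, then truncate or zero-pad to exactly `max_len`."""
    ids = [vocab[tok] for tok in seq][:max_len]
    return ids + [0] * (max_len - len(ids))
```

The fixed-length integer sequences produced by `encode` are the kind of input an embedding layer followed by an LSTM typically expects. Which tokens to keep and which to blur is a design choice; here it is driven by an explicit `keep` set supplied by the caller.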

We used Understand, a commercial code enhancement tool, to extract function-level code metrics. The CodeMetrics.xlsx file includes 23 code metrics extracted from the vulnerable functions of the 3 projects.

Possible Future Work

  1. In our paper, we randomly selected one code metric, "essential complexity", as the proxy (a substitute for the actual label). It would be interesting to examine whether the performance can be further improved by combining multiple code metrics, since together they provide more information and may be more indicative of potential vulnerability (e.g. overly complex code is difficult to understand and therefore harder to debug and maintain).

  2. The proposed LSTM network structure is fairly simple. We believe the performance can be improved by using a more complex network structure, for instance by adding pooling layers and/or dropout. One could also try an attention mechanism with the LSTM.
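As a starting point for the first item, one simple way to combine several code metrics into a single proxy is to standardize each metric and average the results. This is only a sketch of one possible approach and was not evaluated in the paper. The function names, the metric names in the usage note and the median threshold are all assumptions.

```python
from statistics import mean, median, pstdev
from typing import Dict, List


def zscores(values: List[float]) -> List[float]:
    """Standardize one metric column; a constant column maps to all zeros."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def combined_proxy_scores(metrics: Dict[str, List[float]]) -> List[float]:
    """Average the z-scores of all metrics for each function.

    `metrics` maps a metric name to its per-function values; every list
    must have the same length and the same function order.
    """
    columns = [zscores(col) for col in metrics.values()]
    return [mean(row) for row in zip(*columns)]


def proxy_labels(scores: List[float]) -> List[int]:
    """Hypothetical binarization: 1 if the score is above the median."""
    threshold = median(scores)
    return [1 if s > threshold else 0 for s in scores]
```

For instance, passing two columns such as `{"essential_complexity": [...], "cyclomatic_complexity": [...]}` yields one combined score per function, which can be used directly as a regression target or binarized with `proxy_labels`.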

Contact:

You are welcome to improve our code as well as our method. Please kindly cite our paper if you use the code/data in your work. To request more data or for other inquiries, please contact: [email protected].

Thanks!
