PythonCloneDetection

Detect semantically similar Python code using a fine-tuned GraphCodeBERT model.

About

This modified GraphCodeBERT model was fine-tuned for 11 hours on an A40 server using the PoolC (1fold) dataset, which contains over 6M pairs of semantically similar Python code snippets.

It was then used to predict the similarity of Python code snippets in the other folds of the PoolC dataset, as well as in the C4 dataset. In several experiments with balanced sampling, it achieved F1 scores above 0.96 on all datasets.
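The README does not say how the balanced sampling was done. One common reading is that the larger class is downsampled until similar and dissimilar pairs appear equally often. The sketch below illustrates that reading only; the function name `balanced_sample`, the tuple layout, and the 0/1 labels are assumptions, not taken from this repository.

```python
import random


def balanced_sample(pairs, seed=0):
    """Downsample the larger class so both labels appear equally often.

    `pairs` is a list of (code1, code2, label) tuples with label 0 or 1
    (an assumed layout, not the repository's own data format).
    """
    rng = random.Random(seed)
    positives = [p for p in pairs if p[2] == 1]
    negatives = [p for p in pairs if p[2] == 0]
    n = min(len(positives), len(negatives))
    sample = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(sample)
    return sample


# Toy data: 2 similar pairs and 5 dissimilar pairs.
toy = [("a", "b", 1), ("c", "d", 1)] + [(f"x{i}", f"y{i}", 0) for i in range(5)]
balanced = balanced_sample(toy)
print(len(balanced), sum(label for _, _, label in balanced))  # prints "4 2"
```

Evaluating on a balanced sample keeps the F1 score from being dominated by whichever class happens to be more frequent in the raw data.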

Prerequisites & Installation

pip

In your virtual environment, run:
```
pip install -r requirements.txt
```
to install the required packages.

conda

To create a new conda environment called PythonCloneDetection with the required packages, run:
```
conda env create -f environment.yml
```
(this may take a while to finish)

The above commands install the CPU-only version of the PyTorch package. Please refer to PyTorch's official website for instructions on how to install other versions of PyTorch on your machine.
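To check which kind of PyTorch build ended up in the environment, a small probe like the following may help. It is a sketch, not part of this repository: the helper name `can_use_fp16` is hypothetical, and it assumes that the fp16 option is only worth enabling when PyTorch can see a CUDA device.

```python
import importlib.util


def can_use_fp16() -> bool:
    """Return True only if PyTorch is installed and reports a CUDA device.

    Hypothetical helper: assumes fp16 inference is only useful on a GPU.
    """
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch is not installed in this environment
    import torch

    return torch.cuda.is_available()


print("fp16 usable:", can_use_fp16())
```

With the CPU-only install described above, this is expected to report `False`, in which case fp16 should be left off.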

Usage

Run python main.py --input <input_path> --output <output_path> to run CloneClassifier on the CSV file at <input_path> and save its predictions to <output_path>. For example:
```
python main.py --input examples/c4.csv --output results/res.csv
```
The input of main.py is a CSV file containing two columns named code1 and code2, where each row contains a pair of Python code snippets to be compared. The output CSV file has three columns named code1, code2, and predictions, where predictions indicates whether the two code snippets in the corresponding row are semantically similar.

Use the command python main.py --help to see other optional arguments, including max_token_size, fp16, and per_device_eval_batch_size.
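Because code snippets contain newlines, commas, and quotes, it is safer to build the input file with a CSV writer than to join strings by hand. The following is a minimal sketch using only the standard library; the helper `write_pairs_csv`, the file name `my_pairs.csv`, and the toy snippets are illustrative assumptions. Only the code1 and code2 column names come from the description above.

```python
import csv
import os
import tempfile


def write_pairs_csv(path, pairs):
    """Write (code1, code2) pairs to a CSV file with the expected header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["code1", "code2"])
        writer.writerows(pairs)  # multi-line snippets are quoted automatically


pairs = [
    ("def add(a, b):\n    return a + b\n", "def plus(x, y):\n    return x + y\n"),
    ("print('hi')\n", "for i in range(3):\n    print(i)\n"),
]

# Hypothetical location; any path can be passed to --input.
input_path = os.path.join(tempfile.mkdtemp(), "my_pairs.csv")
write_pairs_csv(input_path, pairs)

# Read the file back to confirm the snippets survive the round trip.
with open(input_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows[0])    # header row
print(len(rows))  # header plus one row per pair
```

The resulting file can then be passed as the --input argument of main.py.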

You can also import the CloneClassifier class from clone_classifier.py and use it in your own code. For example:

```python
import pandas as pd
from clone_classifier import CloneClassifier

classifier = CloneClassifier(
    max_token_size=512,
    fp16=False,  # set to True for faster inference if available
    per_device_eval_batch_size=8,
)

# Classify the first 10 pairs of the example file.
df = pd.read_csv("examples/c4.csv").head(10)
res_df = classifier.predict(
    df[["code1", "code2"]],
    # save_path="results/res.csv"
)

# Compare each prediction with the ground-truth label, row by row.
print(res_df["predictions"] == df["similar"])
```
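The row-by-row comparison above can be condensed into a single F1 score, the metric quoted in the About section. The sketch below computes it from plain Python lists and is not the repository's evaluation code. It assumes that predictions and labels are 0/1 values or booleans, so check the actual dtype of the predictions column before relying on it. With pandas, the two lists could be obtained via res_df["predictions"].tolist() and df["similar"].tolist().

```python
def f1_score(predictions, labels):
    """F1 of the positive ("similar") class for 0/1 or boolean inputs."""
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing was matched
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy values: 2 true positives, 1 false positive, 1 false negative.
preds = [1, 0, 1, 1, 0]
labels = [1, 0, 0, 1, 1]
score = f1_score(preds, labels)
print(round(score, 3))  # prints 0.667
```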

License

Distributed under the MIT License. See LICENSE for more information.
