Software Similarity Dataset

This project contains the source code for converting raw software into a graph representation and for calculating the similarity score between two software projects. The dataset is here.

Description

GraphRep

This section describes how to convert and clean the raw software into an embedded graph representation dataset for software similarity.

Raw Data Obtain

To obtain the dataset, we use two essential tools: Somef and Inspect4py. The raw data is obtained from the GitHub links provided by Papers with Code.

Somef: Automatically extracts relevant information from README files and downloads the software.
Inspect4py: Inspects Python software and extracts all information (call information).

Pre_Process

The Pre_Process folder processes the raw dataset into a graph representation.

get_files_from_software.py: Gets all the functions within one software project, each with a unique index name.
- Input: Result from Somef and Inspect4py
- Output: dir_info.json
- Format: [[python file name: python file location]]
process_paperwithcode.py: Gets all the descriptive data provided by Papers with Code for the corresponding software.
- Input: paperwithcode dataset
- Output: repo2edu.json
- Format: {"reference": Str, "abstract": Str, "task": List of Str, "author": Str, "url": link, "repo": Str}
data_process_to_graph.py: Processes the data from dir_info.json and repo2edu.json into a graph representation.
- Input: dir_info.json & repo2edu.json & Inspect4py result
- Output: Graph Representation
- Format:
  - line: List <line indices of this function within the file, [num_1, num_2]>
  - call_info: List <all functions and methods called in this function (inside and outside this repo)>
  - type: Str <function or method (from a class)>
  - repo_call_info: List <all functions called within this repo; if none are called --> ["None"]>
  - code_tokens: List
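
As an illustration of the format above, the following is a minimal sketch of one node record. The helper make_node and all example values are hypothetical and are not taken from data_process_to_graph.py; only the five field names and the ["None"] convention come from the description above.

```python
import json
from typing import List, TypedDict


class FunctionNode(TypedDict):
    """One function (node) in the graph representation, using the fields listed above."""
    line: List[int]            # [num_1, num_2] line indices within the file
    call_info: List[str]       # all calls, inside and outside the repo
    type: str                  # "function" or "method"
    repo_call_info: List[str]  # calls that stay within the repo
    code_tokens: List[str]     # tokenized source of the function


def make_node(line, call_info, node_type, repo_call_info, code_tokens) -> FunctionNode:
    # Hypothetical helper: falls back to ["None"] when nothing in the repo is called,
    # mirroring the convention described for repo_call_info.
    return {
        "line": list(line),
        "call_info": list(call_info),
        "type": node_type,
        "repo_call_info": list(repo_call_info) or ["None"],
        "code_tokens": list(code_tokens),
    }


# Illustrative values only.
example = make_node(
    line=[10, 24],
    call_info=["os.path.join", "load_config"],
    node_type="function",
    repo_call_info=[],
    code_tokens=["def", "main", "(", ")", ":"],
)
print(json.dumps(example, indent=2))
```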

SimScore

This section computes a numerical similarity metric between two software projects using pre-trained models provided by Hugging Face. We have used SentenceBert, MiniLM and TSDAE.

get_final_abstract.py: Builds a JSON file with the abstracts corresponding to all the software.
- Input: All software & repo2edu.json
- Output: matching_data.json
select_model_embedding.py: Uses different models to embed the abstract of each software project.
CalSimScore.py: Calculates the cosine similarity between embedded abstracts.
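
The cosine similarity step can be sketched as follows. This is a minimal standard-library illustration, not the code in CalSimScore.py: the function name cosine_similarity and the toy vectors are assumptions, and in practice the inputs would be the abstract embeddings produced by the selected model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat an all-zero embedding as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings"; real abstract embeddings have far more dimensions.
emb_a = [1.0, 0.0, 1.0]
emb_b = [1.0, 1.0, 0.0]
score = cosine_similarity(emb_a, emb_b)
print(round(score, 4))
```

The score lies in [-1, 1]; identical directions give a value close to 1 and orthogonal vectors give 0.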

AE_process

This section is dedicated to using an autoencoder model to deal with the different lengths of functions (nodes) in the graph representation of software; an example is given in Github_Issue.

pre_ae_data.py: Uses a pre-trained model to encode the code_tokens into numerical embeddings. In this file, we applied the UniXcoder model.
- Input: final_data
- Output: ae_data
get_ae_training_data.py: Samples 20000 random UniXcoder-embedded functions for training our autoencoder model. The length of each function is set to 1024.
- Input: ae_data
- Output: data_1024
flat_ae_encode.py: Trains the autoencoder model on the data from data_1024, so that it can later encode all functions within the dataset.
- Input: data_1024
- Output: An autoencoder model
ae_process_data.py: Uses the trained model from flat_ae_encode.py to process the entire dataset with the autoencoder model. If a software project (BERT-embedded) is over 300MB, we discard it because of its size.
- Input: Autoencoder Model & ae_data
- Output: post_process (software count: 2001)
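
The sampling step in get_ae_training_data.py can be sketched as below. This is a hedged illustration only: the README does not say how the length of 1024 is enforced, so truncating long vectors and zero-padding short ones is an assumption, and the names to_fixed_length and sample_training_set are hypothetical.

```python
import random
from typing import List


def to_fixed_length(vec: List[float], length: int = 1024, pad_value: float = 0.0) -> List[float]:
    """Truncate or right-pad a flat embedding so it has exactly `length` values."""
    if len(vec) >= length:
        return vec[:length]
    return vec + [pad_value] * (length - len(vec))


def sample_training_set(embedded: List[List[float]], n_samples: int = 20000,
                        length: int = 1024, seed: int = 0) -> List[List[float]]:
    """Draw up to `n_samples` functions at random and bring each to a fixed length."""
    rng = random.Random(seed)            # fixed seed so the sample is reproducible
    k = min(n_samples, len(embedded))    # guard against a pool smaller than n_samples
    chosen = rng.sample(embedded, k)
    return [to_fixed_length(v, length) for v in chosen]


# Toy pool of five "embedded functions" with lengths 1 to 5.
toy = [[float(i)] * (i + 1) for i in range(5)]
batch = sample_training_set(toy, n_samples=3, length=4)
print(len(batch), [len(v) for v in batch])
```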

Model

We have applied graph similarity GNN models to our dataset.

SimGNN: link

The original GitHub repo is SimGNN; the modified version for our dataset is in ../src/SimGNN/.

Run Command:

python src/test.py --data_path path_to_/post_process --json_path path_to_/final_data --score_path path_to_/lean_simcal.csv

Save Model:

python src/test.py --save_path path_to_/dir

Load Model:

python src/test.py --load_path path_to_/dir

Define Sim Score: [sbert, miniLM, tsdae]

python src/test.py --sim_type sim_score_type
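
The flags shown in the commands above could be combined in a single invocation. The sketch below shows how such flags might be declared with argparse; it is not the parser in src/test.py, and the defaults, help strings and the function name build_parser are assumptions. Only the flag names and the three sim_type values come from the commands above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags listed in this README."""
    parser = argparse.ArgumentParser(description="SimGNN on the software similarity dataset")
    parser.add_argument("--data_path", help="directory with the post_process data")
    parser.add_argument("--json_path", help="directory with final_data")
    parser.add_argument("--score_path", help="CSV file with the similarity scores")
    parser.add_argument("--save_path", default=None, help="directory to save the model to")
    parser.add_argument("--load_path", default=None, help="directory to load a model from")
    # The default value here is an assumption; the README only lists the allowed values.
    parser.add_argument("--sim_type", choices=["sbert", "miniLM", "tsdae"], default="sbert")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--data_path", "path_to_/post_process", "--sim_type", "tsdae"]
)
print(args.data_path, args.sim_type, args.save_path)
```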
