Tutorial

SourcererCC is Sourcerer's token-based code clone detector for very large code bases and Internet-scale project repositories. It works at many levels of granularity, detecting clones between files, methods, statements, or blocks, in any language. This tutorial covers file-level clone detection on Java.

Additional Resources:

For more information about SourcererCC please see the ICSE'16 paper.
SourcererCC supports DéjàVu, a large-scale study of cloning on GitHub. It has a homepage and was published at OOPSLA'17.
A supporting DéjàVu web tool for quick and simple clone analysis can be found here.

Before going through:

We have created an artifact in the form of a virtual machine (VM) that contains a pre-programmed set of instructions that take the user from raw source code to a database with a clone mapping, including all the intermediate steps and an explanation of the intermediate data types. It can be downloaded from the 'Source Materials' section of the paper's ACM website or from the DéjàVu homepage (only the latter is kept updated). This VM is the easiest way to get started with SourcererCC and perform your own clone analysis. It contains most of the information presented here. Please try this VM before contacting us.

Let's get started.

bookkeeping_projs/ - contains a list of processed projects. Each line has the format: project id,project path,project url
files_stats/ - contains lists of files together with various statistics. Each line has the format: file id,project id,project path,project url,file hash,size bytes,lines,LOC,SLOC
files_tokens/ - contains lists of files together with various statistics and their tokenized forms. Each line has the format: file id,project id,total tokens,unique tokens,token hash@#@token1@@::@@frequency,token2@@::@@frequency,...
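
As an illustration of the files_tokens/ format, the following Python sketch splits one line into its documented fields. The names TokenizedFile and parse_tokens_line are hypothetical and not part of SourcererCC, the sample line is made up, and the code assumes that tokens themselves contain no commas.

```python
from typing import Dict, NamedTuple


class TokenizedFile(NamedTuple):
    file_id: str
    project_id: str
    total_tokens: int
    unique_tokens: int
    token_hash: str
    tokens: Dict[str, int]  # token -> frequency


def parse_tokens_line(line: str) -> TokenizedFile:
    # The header fields and the token list are separated by "@#@".
    header, _, body = line.rstrip("\n").partition("@#@")
    file_id, project_id, total, unique, token_hash = header.split(",")
    tokens: Dict[str, int] = {}
    if body:
        # Assumption: entries are comma-separated and tokens contain no commas.
        for entry in body.split(","):
            token, _, frequency = entry.rpartition("@@::@@")
            tokens[token] = int(frequency)
    return TokenizedFile(
        file_id, project_id, int(total), int(unique), token_hash, tokens
    )


# Made-up line in the documented format, for illustration only.
example = "1,1,5,3,abc123@#@public@@::@@2,class@@::@@1,Foo@@::@@2"
parsed = parse_tokens_line(example)
print(parsed.file_id, parsed.project_id, parsed.total_tokens, parsed.tokens)
```

In this made-up line the frequencies add up to the total tokens field and the number of distinct tokens matches the unique tokens field.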

The elements file id and project id always point to the same source code file or project, respectively (they work as a primary key). So a line in files_stats/* that starts with 1,1 represents the same file as the line in files_tokens/* that starts with 1,1, and both come from the project in bookkeeping_projs/* whose line starts with 1. The number of lines in bookkeeping_projs/* equals the total number of projects analyzed. files_stats/* and files_tokens/* have the same number of lines, which equals the total number of files obtained from the projects.
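
The relationship between the three outputs can be checked with a short script. This is a minimal sketch, assuming the ids are the leading comma-separated fields exactly as in the formats listed above. The function names and the sample lines are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def leading_ids(line: str, count: int) -> Tuple[str, ...]:
    # The ids are the first comma-separated fields of every line.
    return tuple(field.strip() for field in line.split(",", count)[:count])


def check_ids(
    project_lines: Iterable[str],
    stats_lines: Iterable[str],
    tokens_lines: Iterable[str],
) -> Dict[str, Set]:
    project_ids = {leading_ids(l, 1)[0] for l in project_lines if l.strip()}
    # Keys are (file id, project id), following the documented field order.
    stats_keys = {leading_ids(l, 2) for l in stats_lines if l.strip()}
    tokens_keys = {leading_ids(l, 2) for l in tokens_lines if l.strip()}
    return {
        "stats_without_tokens": stats_keys - tokens_keys,
        "tokens_without_stats": tokens_keys - stats_keys,
        "unknown_projects": {project for _, project in stats_keys} - project_ids,
    }


# Made-up sample lines, for illustration only.
projects = ["1,projects/demo,https://example.org/demo"]
stats = [
    "1,1,projects/demo/A.java,https://example.org/demo,h1,120,10,9,8",
    "2,1,projects/demo/B.java,https://example.org/demo,h2,150,12,11,10",
]
tokens = [
    "1,1,5,3,t1@#@public@@::@@2,class@@::@@1,Foo@@::@@2",
    "2,1,4,2,t2@#@public@@::@@2,Bar@@::@@2",
]
report = check_ids(projects, stats, tokens)
print(report)
```

All three sets in the report are empty when the outputs are consistent. To run this on real output, pass the lines read from each directory instead of the sample lists.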

Run SourcererCC

For this step we will run SourcererCC, which can be found here.

Start with files_tokens/ from the previous step:

cat files_tokens/* > blocks.file
cp blocks.file SourcererCC/clone-detector/input/dataset/

Inside clone-detector/ it is worth looking at sourcerer-cc.properties, in particular at:

# Ignore all files outside these bounds
MIN_TOKENS=65
MAX_TOKENS=500000

where you can set lower and upper bounds, in tokens, on the files considered for clone detection. You can ignore the other parameters for now.
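
To estimate how many files these bounds would keep before running the tool, a sketch like the following can be applied to the lines of blocks.file. It assumes that the bounds are compared against the total tokens field and that both bounds are inclusive. Neither assumption is confirmed here, and the helper names and sample lines are hypothetical.

```python
from typing import Iterable, Tuple

MIN_TOKENS = 65
MAX_TOKENS = 500000


def within_bounds(
    tokens_line: str, min_tokens: int = MIN_TOKENS, max_tokens: int = MAX_TOKENS
) -> bool:
    # The third field of a files_tokens/ line is the total token count.
    total_tokens = int(tokens_line.split(",", 3)[2])
    # Assumption: both bounds are inclusive.
    return min_tokens <= total_tokens <= max_tokens


def count_in_bounds(tokens_lines: Iterable[str]) -> Tuple[int, int]:
    kept = 0
    skipped = 0
    for line in tokens_lines:
        if not line.strip():
            continue
        if within_bounds(line):
            kept += 1
        else:
            skipped += 1
    return kept, skipped


# Made-up lines; only the header fields matter for this check.
sample = [
    "1,1,64,10,h1@#@a@@::@@64",
    "2,1,65,10,h2@#@a@@::@@65",
    "3,1,500000,10,h3@#@a@@::@@500000",
    "4,1,500001,10,h4@#@a@@::@@500001",
]
kept, skipped = count_in_bounds(sample)
print(kept, skipped)
```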

To change the percentage of clone similarity, look at runnodes.sh, line 9:

threshold="${3:-8}"

where 8 means clones will be flagged at 80% similarity (current setup), 7 at 70%, and so on. The JVM parameters can be configured in the same file, at line 20.
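
To give an intuition for what the threshold means, the sketch below applies a token-overlap criterion to two bags of tokens: the pair counts as a clone when the number of shared tokens reaches the given fraction of the larger file. This is an assumption about how the threshold is interpreted, in the spirit of the overlap similarity described in the ICSE'16 paper. It is not SourcererCC's code, which also applies filtering heuristics, and the function names are hypothetical.

```python
from collections import Counter


def required_overlap(threshold: int, size_a: int, size_b: int) -> int:
    # Ceiling of (threshold / 10) * max(size_a, size_b), in integer arithmetic.
    return -(-threshold * max(size_a, size_b) // 10)


def is_clone(tokens_a: Counter, tokens_b: Counter, threshold: int = 8) -> bool:
    # Counter & Counter keeps the minimum count of every shared token.
    overlap = sum((tokens_a & tokens_b).values())
    needed = required_overlap(
        threshold, sum(tokens_a.values()), sum(tokens_b.values())
    )
    return overlap >= needed


# Made-up token bags of five tokens each that share three tokens.
file_a = Counter({"a": 2, "b": 1, "c": 2})
file_b = Counter({"a": 2, "b": 1, "d": 2})
print(is_clone(file_a, file_b, threshold=8))
print(is_clone(file_a, file_b, threshold=6))
```

With these bags, a threshold of 8 requires 4 shared tokens and a threshold of 6 requires 3, so only the second call reports a clone.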

Finally, run:

python controller.py

This tool splits the task across multiple nodes, whose results must be aggregated at the end:

cat clone-detector/NODE_*/output8.0/query_* > results.pairs

The resulting information is a list of file id pairs which are clones. These ids correspond to the ids generated in the tokenization phase. An example output is:

1,2
2,3

In this case we have the clone pairs (1,2) and (2,3). To know which file corresponds to 1, we can look in files_stats/* for the line whose file id is 1.
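
This lookup can be automated. The sketch below assumes the two-column pair format shown above and the files_stats/ field order listed earlier, where the third field is the path. The function names and the sample data are hypothetical.

```python
import csv
from typing import Dict, Iterable, Iterator, List, Tuple


def load_stats(stats_lines: Iterable[str]) -> Dict[str, List[str]]:
    # csv.reader copes with quoted fields, in case paths or URLs are quoted.
    return {row[0]: row for row in csv.reader(stats_lines) if row}


def resolve_pairs(
    pair_lines: Iterable[str], stats: Dict[str, List[str]]
) -> Iterator[Tuple[str, str]]:
    for line in pair_lines:
        line = line.strip()
        if not line:
            continue
        first, second = line.split(",")
        # The third field is the path column in the documented format.
        yield stats[first][2], stats[second][2]


# Made-up files_stats/ lines and clone pairs, for illustration only.
stats_lines = [
    '1,1,"demo/A.java","https://example.org/demo",h1,120,10,9,8',
    '2,1,"demo/B.java","https://example.org/demo",h2,150,12,11,10',
    '3,1,"demo/C.java","https://example.org/demo",h3,90,8,7,6',
]
stats = load_stats(stats_lines)
resolved = list(resolve_pairs(["1,2", "2,3"], stats))
for left, right in resolved:
    print(left, "<->", right)
```

For real data, the lines of files_stats/* and results.pairs can be passed in place of the sample lists.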

I want to know more!

That is great 👍 In the VM we refer to above you can find instructions and programs to import everything into an easily queryable database and perform statistical analysis on this information. Our OOPSLA'17 paper is a great way to understand our typical pipeline and the kind of results you can obtain. Finally, if you have any questions or need more technical help (tweaking performance parameters for your hardware, for example), feel free to contact us.
