[CPD] How to determine the code duplication? #3049

cyw3 · 2017-11-30T01:45:11Z


I would like to understand how CPD detects duplicated code.

My guesses:

- Is it based on the hash codes of tokens?
- Is it based on the code's AST?


cyw3 (Author) · 2017-11-30T01:51:13Z

And if it is based on the tokens' hash codes, how can I get those hash codes from CPD?


jsotuyod (Maintainer) · 2017-11-30T02:05:26Z

@cyw3 that's an interesting question, with a frustrating answer: it depends.

At the core, CPD takes tokens from a "lexer source" and hashes their contents over a rolling window whose length is set by the --minimum-tokens argument. The general matching algorithm is implemented in MatchAlgorithm.
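To make the rolling-window idea concrete, here is a minimal sketch in Java. It is not PMD's MatchAlgorithm: the class name RollingWindowSketch, the method findDuplicateWindows, and the choice of a base-31 polynomial hash are all assumptions made for illustration. It hashes every window of minimumTokens consecutive tokens, updates the hash incrementally as the window slides, and reports pairs of window start positions whose tokens are actually equal.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: NOT PMD's MatchAlgorithm, just the general technique.
class RollingWindowSketch {
    // Hypothetical base for the polynomial hash; overflow simply wraps around.
    static final long BASE = 31;

    /** Returns pairs {earlierStart, laterStart} of equal token windows. */
    static List<int[]> findDuplicateWindows(List<String> tokens, int minimumTokens) {
        List<int[]> matches = new ArrayList<>();
        int n = tokens.size();
        if (minimumTokens <= 0 || n < minimumTokens) {
            return matches;
        }

        // BASE^(minimumTokens - 1): weight of the token that leaves the window.
        long highPower = 1;
        for (int i = 1; i < minimumTokens; i++) {
            highPower *= BASE;
        }

        Map<Long, List<Integer>> startsByHash = new HashMap<>();
        long hash = 0;
        for (int i = 0; i < n; i++) {
            if (i >= minimumTokens) {
                // Drop the contribution of the token sliding out on the left.
                hash -= tokens.get(i - minimumTokens).hashCode() * highPower;
            }
            // Shift the window and add the token entering on the right.
            hash = hash * BASE + tokens.get(i).hashCode();

            if (i >= minimumTokens - 1) {
                int start = i - minimumTokens + 1;
                List<Integer> earlier =
                        startsByHash.computeIfAbsent(hash, k -> new ArrayList<>());
                for (int other : earlier) {
                    // Equal hashes can collide, so confirm token by token.
                    if (sameWindow(tokens, other, start, minimumTokens)) {
                        matches.add(new int[] {other, start});
                    }
                }
                earlier.add(start);
            }
        }
        return matches;
    }

    static boolean sameWindow(List<String> tokens, int a, int b, int length) {
        for (int k = 0; k < length; k++) {
            if (!tokens.get(a + k).equals(tokens.get(b + k))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("a = b + c ; x = 1 ; a = b + c ;".split(" "));
        for (int[] m : findDuplicateWindows(tokens, 6)) {
            System.out.println("window at " + m[0] + " repeats at " + m[1]);
        }
    }
}
```

A real detector would also merge overlapping windows into a single, longer match; the sketch stops at finding the matching windows.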

However, this "lexer source" is provided by each language module, and the particulars may vary.

The basic lexer (AnyTokenizer) just tokenizes the text of the analyzed file with no knowledge of the grammar itself. No language officially supported by PMD uses it, but it can be used to analyze any text file, with whitespace normalized.

Officially supported languages use actual lexers for the language, which not only normalize whitespace but also give us one extra benefit: we can ignore comments.
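As a rough illustration of why tokenizing normalizes whitespace and lets comments be ignored, consider the sketch below. It is a deliberately naive regex-based lexer, not AnyTokenizer or any real PMD language lexer; the class name NaiveLexerSketch and its tokenize method are hypothetical. It assumes C-style comments and does not handle comment markers inside string literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a naive lexer, far simpler than a real language lexer.
class NaiveLexerSketch {
    // Assumes C-style comments: "// to end of line" and "/* ... */".
    private static final Pattern COMMENTS =
            Pattern.compile("//[^\\n]*|/\\*.*?\\*/", Pattern.DOTALL);
    // A token is a run of word characters or a single punctuation character.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\s\\w]");

    static List<String> tokenize(String source, boolean skipComments) {
        String text = skipComments ? COMMENTS.matcher(source).replaceAll(" ") : source;
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group()); // whitespace never becomes a token
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("int a = 1; // counter\n", true));
        System.out.println(tokenize("int   a=1; /* note */", true));
    }
}
```

Both inputs in main produce the same token list, so differences in spacing or comments alone would not hide a duplicate.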

Moreover, some languages further refine these token sources, allowing extra behaviors, such as:

- normalizing identifier names, to detect identical code where only the names of variables / methods change (--ignore-identifiers on Java)
- normalizing constant values of all types, to detect identical code where only constants differ (--ignore-literals on Java)
- discarding annotation tokens when comparing code (--ignore-annotations on Java)
- discarding using XXX directives when comparing code (--ignore-usings on C#)
- discarding blocks under conditional compilation when comparing code (--skip-blocks-pattern on C/C++)
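The first two refinements can be pictured as a filter over the token stream that runs before hashing. The sketch below is an assumption about the general idea, not PMD's code: TokenNormalizerSketch, its placeholder strings, and its tiny keyword set are all hypothetical. It replaces every identifier or literal with a fixed placeholder, so two snippets that differ only in names or constants yield the same tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative only: mimics the effect of identifier/literal normalization.
class TokenNormalizerSketch {
    // Tiny illustrative keyword set, not the full Java keyword list.
    static final Set<String> KEYWORDS =
            Set.of("int", "return", "if", "else", "for", "while", "new", "class");

    static List<String> normalize(List<String> tokens,
                                  boolean ignoreIdentifiers,
                                  boolean ignoreLiterals) {
        List<String> out = new ArrayList<>(tokens.size());
        for (String t : tokens) {
            if (ignoreLiterals && isLiteral(t)) {
                out.add("<LIT>"); // every constant looks the same
            } else if (ignoreIdentifiers && isIdentifier(t)) {
                out.add("<ID>"); // every name looks the same
            } else {
                out.add(t);
            }
        }
        return out;
    }

    // Simplified: only plain numbers and double-quoted strings count as literals.
    static boolean isLiteral(String t) {
        return t.matches("\\d+(\\.\\d+)?") || t.matches("\"[^\"]*\"");
    }

    static boolean isIdentifier(String t) {
        return t.matches("[A-Za-z_][A-Za-z0-9_]*") && !KEYWORDS.contains(t);
    }

    public static void main(String[] args) {
        List<String> a = List.of("int", "total", "=", "price", "*", "2", ";");
        List<String> b = List.of("int", "sum", "=", "cost", "*", "3", ";");
        System.out.println(normalize(a, true, true).equals(normalize(b, true, true)));
    }
}
```

With both flags off the two statements stay distinct; with both on they become identical token streams, which is the kind of match the options above are meant to enable.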

So, bottom line:

- we hash a rolling window of tokens
- tokenization is based on language features
- additional logic may be applied to sanitize / alter the token stream, either to make the analysis stricter or to avoid reporting code that cannot be de-duplicated.

As for your last question, the hash of each window is not part of the public API; it is computed during analysis and discarded immediately afterwards. It can be obtained directly from the TokenEntry.hashCode() method, but to do so you would have to run the CPD logic manually and revisit the token stream before shutting down.
