
[CPD] How to determine the code duplication? #3049

Answered by jsotuyod
cyw3 asked this question in Q&A

@cyw3 that's an interesting question, with a frustrating answer: it depends.

At its core, CPD takes tokens from a "lexer source" and hashes them over a rolling window whose length is set by the --minimum-tokens argument. The general matching algorithm is implemented in MatchAlgorithm.
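
To make the rolling-window idea concrete, here is a minimal Python sketch. It is not CPD's code: it assumes tokens are plain strings, and the helper name `find_duplicate_windows` is made up for illustration. For simplicity it groups windows in a dictionary keyed on the window's tokens rather than computing a true rolling hash, and it skips the step of extending matches beyond the minimum length, so the real MatchAlgorithm will differ in detail.

```python
from collections import defaultdict


def find_duplicate_windows(tokens, minimum_tokens):
    """Return {window: [start indices]} for windows seen more than once.

    Hypothetical helper for illustration only. It mimics the effect of
    hashing every window of `minimum_tokens` tokens, using a dict lookup
    in place of a rolling hash.
    """
    if minimum_tokens <= 0:
        raise ValueError("minimum_tokens must be positive")

    seen = defaultdict(list)
    # Slide a fixed-size window over the token stream.
    for start in range(len(tokens) - minimum_tokens + 1):
        window = tuple(tokens[start:start + minimum_tokens])
        seen[window].append(start)

    # Keep only windows that occur at least twice.
    return {w: starts for w, starts in seen.items() if len(starts) > 1}


tokens = ["int", "a", "=", "1", ";",
          "int", "b", "=", "2", ";",
          "int", "a", "=", "1", ";"]
duplicates = find_duplicate_windows(tokens, 5)
```

With a window of 5 tokens, only the repeated `int a = 1 ;` statement is reported. A smaller window would also flag shorter shared fragments, which is why the reported duplication is so sensitive to the --minimum-tokens value.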

However, this "lexer source" is provided by each language module, and the particulars may vary.

The basic lexer (AnyTokenizer) just tokenizes the text of the analyzed file with no knowledge of the grammar itself. No language officially supported by PMD uses it, but it can be used to analyze any text file, normalizing whitespace.
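
As a rough illustration of what grammar-agnostic tokenizing with whitespace normalization means, here is a small Python sketch. The token rule (a run of word characters, or a single symbol) and the name `tokenize_any` are assumptions made for this example; they are not taken from AnyTokenizer's source.

```python
import re

# Assumed token rule for illustration: a run of word characters, or a
# single non-space symbol. This is not AnyTokenizer's actual pattern.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize_any(text):
    """Split text into tokens with no grammar knowledge, dropping whitespace."""
    return TOKEN_PATTERN.findall(text)


# The same statement with different spacing and line breaks.
first = tokenize_any("total = price * qty;")
second = tokenize_any("total   =\n\tprice*qty ;")
```

Both inputs produce the same token list, so a window-based comparison over these tokens would treat them as identical even though the raw text differs.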

Officially supported languages use actual lexers for the language, whic…

Answer selected by adangel

This discussion was converted from issue #767 on January 15, 2021 09:40.