Duplicate Code Lab Session

(Last Update: 2020-02-26)

The purpose of this lab session (and its exercises) is to present tools for duplicate code detection. Whether to remove duplicated code is a decision for an expert. Although "Detecting Duplicated Code" has its own chapter in the book (OORP, p.223), I would also like to point out the Setting Direction chapter (OORP, p.19). When dealing with duplicate code we often need to apply the patterns "Fix Problems, Not Symptoms" (OORP, p.33) and "If It Ain't Broke, Don't Fix It" (OORP, p.35). Duplicate code can be a symptom of the programming practices adopted by the developers; therefore, it might be better to address those practices before trying to remove the duplicates. Moreover, duplicated code that is stable and not prone to change may not need to be removed (or the cost of removing it may be higher than the cost of leaving it in place).

Materials & Tools Used for this Session

  • Session slides here.
  • (Same as the previous session) IntelliJ IDE (you can use Eclipse at your discretion, but it may require some adaptations for the project we are using during the lab sessions)
  • (Same as the previous session) JPacman repository.
  • DuplicationSuspect.java - file used as an example for Tasks 1-2
  • simpleDude.pl - Perl script with a simple duplicate code detection algorithm (similar to the one shown in the first lecture of this course)
  • FreeMercator (Java, POS application, website) - sample system used for Tasks 3-4
  • iClones and rcfViewer: here is a direct link to download iClones (please note this file is under an academic license; do not use it for commercial purposes). iClones is a powerful tool for detecting clones, and rcfViewer provides a visualization interface to browse the clones.


Auxiliary Tools

Auxiliary tools are not required for the lab session itself, but they may be useful to get additional information (or alternatives) on a project. Use them at your own discretion. 

    • Dude (website) a clone detection tool that employs line similarity instead of tokens.
    • cyclone is an evolution inspection tool for clones and a part of the iClones tool suite (but you must download cyclone separately). It can analyze the evolution of clones across several versions of a software system. Read the documentation of both iClones and cyclone to learn how to use them to study the evolution of clones. You may need to download several versions of the same software (from GitHub, SourceForge, etc.) and prepare the source code according to the iClones instructions for an evolutionary comparison.
    • Duplicate Detector is an IntelliJ plugin to detect clones (not as powerful as iClones, and it may crash sometimes, but convenient)
    • PMD's Copy and Paste Detector (CPD) is an Eclipse plugin to detect clones.


Setup / Preparation

Be sure to follow the setup from the previous session if you have not done so. In summary: (i) install IntelliJ, (ii) fork the JPacman repository, (iii) build & run JPacman. Also, if you have not already, download the book for this course, "Object-Oriented Reengineering Patterns". (Note: OORP, p.xx refers to a page in the pdf version of this book.)

Get/Install the tools & plugins detailed in the Materials above.


Task 1: Manual Small Scale Clone Detection

We start with the ultimate clone detector: the developer. Look at the class DuplicationSuspect.java in any editor of your choice (this file is listed in the Materials).

Can you detect duplication with the naked eye? Which methods seem to be similar?

Manual clone detection does not scale very well. We will, therefore, use some tools to do the tedious comparisons.


Task 2: Automated Small Scale Clone Detection

Use the simpleDude.pl script on the DuplicationSuspect.java file (both are listed in the Materials above). Remember to rename the script from txt to pl and to make it executable.
	./simpleDude.pl DuplicationSuspect.java > report.txt

Try changing the parameter $slidingWindowSize at the beginning of the simpleDude.pl script from 10 to a higher value, e.g. 20 or 30.

Did the tool detect more or less duplication than you did?
What are the problems with this way of reporting duplication?
The detector uses exact string matching as its comparison mechanism. What are the consequences of that?
Can you identify any refactoring candidates?
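
To make the sliding-window idea concrete, here is a minimal Python sketch of exact line matching over a window of consecutive lines. It is not the simpleDude.pl implementation: the function name find_duplicate_windows, the whitespace stripping, and the skipping of blank lines are all assumptions made for illustration.

```python
from collections import defaultdict


def find_duplicate_windows(lines, window_size=3):
    """Return {window_text: [start_line_numbers]} for windows seen more than once.

    Lines are compared as exact strings after stripping surrounding
    whitespace; blank lines are skipped (both choices are assumptions).
    """
    # Keep (original 1-based line number, stripped text) for non-blank lines.
    cleaned = [(number, text.strip())
               for number, text in enumerate(lines, start=1)
               if text.strip()]

    seen = defaultdict(list)
    for i in range(len(cleaned) - window_size + 1):
        window = cleaned[i:i + window_size]
        key = "\n".join(text for _, text in window)
        seen[key].append(window[0][0])  # remember where the window starts

    return {key: starts for key, starts in seen.items() if len(starts) > 1}


source = """\
int a = read();
a = a * 2;
print(a);

int b = read();
b = b * 2;
print(b);

int a = read();
a = a * 2;
print(a);
""".splitlines()

duplicates = find_duplicate_windows(source, window_size=3)
for key, starts in duplicates.items():
    print("duplicated window starts at lines", starts)
```

In this toy input only the first and last fragments are reported; compare that with what you would flag by hand, and try a larger window_size to see how the report changes.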

Task 3: Large Scale Clone Detection

Let's now try a more sophisticated tool. In this case, instead of exact line matching, the similarity of token chains is used to determine clones. We will use iClones, from the University of Bremen, to analyze the FreeMercator system (JPacman is too small and does not have many clones).
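
As a rough illustration of how token chains differ from exact line matching, the Python sketch below splits two code fragments into tokens and replaces identifiers and numbers with placeholders before comparing them. This is not how iClones is implemented; the tiny keyword set, the regular expression, and the ID/NUM placeholders are assumptions chosen for the example.

```python
import re

# Hypothetical, deliberately tiny keyword set for the illustration.
KEYWORDS = {"int", "return", "if", "else", "for", "while"}

# An identifier/keyword, a run of digits, or any single non-space character.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalized_tokens(code):
    """Split code into tokens, replacing identifiers and numbers by placeholders."""
    tokens = []
    for token in TOKEN_PATTERN.findall(code):
        if token in KEYWORDS:
            tokens.append(token)      # keywords are kept as they are
        elif re.fullmatch(r"[A-Za-z_]\w*", token):
            tokens.append("ID")       # any identifier
        elif token.isdigit():
            tokens.append("NUM")      # any integer literal
        else:
            tokens.append(token)      # operators and punctuation
    return tokens


fragment_a = "int total = price * 2;"
fragment_b = "int   sum = cost * 10;"

print(fragment_a == fragment_b)                                        # exact string comparison
print(normalized_tokens(fragment_a) == normalized_tokens(fragment_b))  # token comparison
print(normalized_tokens(fragment_a))
```

The two fragments differ as strings but yield the same normalized token sequence, which is why a token-based detector can report copies whose names or layout have changed.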
Launch iClones to analyze FreeMercator and generate a report of duplicates. Be aware that iClones is a command line tool, so you need to adjust the paths accordingly. Specifically, <source_path> is the path to the source code of your project, and <iclones_path> is the path where you stored the unpacked iClones folder (and so forth). On Windows, change the ".sh" to ".bat" and the "/" to "\". On Mac, the ".sh" at the end may not be necessary.
cd <source_path>
<iclones_path>/iclones.sh -input . -informat single -output clonereport.rcf -outformat rcf
Now, launch rcfViewer to read the report (you may need to make rcfviewer.sh executable on Mac/Linux).
<rcfviewer_path>/rcfviewer.sh clonereport.rcf

Questions about the report view:
What information do you have in the clone report viewer? What different views?
How many clone classes are reported?
What types of clones does the tool detect?
How many clone pairs does the first clone set comprise?
Questions about the source code view:
Can you determine any refactoring candidates from the source code view?
Which types of clones are the clearest and easiest candidates for refactoring?

Task 4: Tweaking the Parameters on Clone Detection

Now, with a bit of experience, let's play with the different options a little. Run iClones without arguments to read the help documentation. We are mainly interested in two options:
  • minblock: Minimum length of identical token sequences that are used to merge near-miss clones. (Default: 20)
  • minclone: Minimum length of clones measured in tokens. (Default: 100)
Try, for example, this (adapting the command line to run on your sources):
	iclones -minblock 0 -input . -informat single -output clonereport.rcf -outformat rcf
As you can see, if minblock is set to 0, only identical clones are detected. This time, adjust minblock and minclone to remove false positives (you will have to create a new report with iClones and open it again with rcfViewer).
Which options seem to remove more false positives?
You are on the lookout for clones that can be easily refactored. Which option values seem to lead to these clones?
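
If you want to compare several parameter combinations systematically, a small helper can generate the command lines for you. The Python sketch below only builds and prints the commands; it does not run iClones. It assumes that -minclone is passed the same way as -minblock in the command above (check the help output), and the function name, the threshold values, and the report naming scheme are made up for illustration.

```python
import itertools
import shlex


def build_iclones_command(minblock, minclone, executable="iclones", source_dir="."):
    """Return the argument list for one iClones run with the given thresholds."""
    # Hypothetical naming scheme so that each run writes its own report.
    report = f"clonereport_b{minblock}_c{minclone}.rcf"
    return [
        executable,
        "-minblock", str(minblock),
        "-minclone", str(minclone),  # assumed flag, by analogy with -minblock
        "-input", source_dir,
        "-informat", "single",
        "-output", report,
        "-outformat", "rcf",
    ]


# Example values only; pick the ones you want to compare.
minblock_values = [0, 20]
minclone_values = [50, 100]

commands = [build_iclones_command(b, c)
            for b, c in itertools.product(minblock_values, minclone_values)]

for command in commands:
    print(shlex.join(command))
```

Replace the executable argument with your own <iclones_path>/iclones.sh (or the .bat file on Windows), run the printed commands from <source_path>, and open each report in rcfViewer to compare the results.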

Task 5: Clone Detection on JPacman

Now, run the clone detection on JPacman. Adjust the parameters as necessary.
Was it important to adjust the parameters to find clones in JPacman?
How many clones does JPacman have compared to FreeMercator?
Is it necessary to remove all clones you found in JPacman?

Optional Task 1: Comparing Dude and iClones

This is an optional task for those who want to dive deeper into clone detection. Compute the duplication of FreeMercator again, this time with Dude (the link is in the Auxiliary Tools). It uses line similarity instead of token chains.
Comparing iClones with Dude: which tool finds the most clones?
Comparing iClones with Dude: which tool seems the most useful?

Optional Task 2: Clone Detection on other Large Systems

This is another optional task. We have some large systems with a large number of clones for you to play with.

Discussion and Conclusions

In this session, we used duplicate code detection tools to find code clones. Be sure to check the chapter "Detecting Duplicated Code" (OORP, p.223) for more information. As a final discussion, let's consider the following questions.

Detection Quality:
  • How many false positives (clones which you as a programmer would not name as such) has the detector found?
  • Were the options offered by the detector enough to isolate the true positives?
  • Did you feel that the options also removed some true positives?

Refactoring:
  • What were the reasons you could not refactor some clones?
  • Could a tool detect the characteristics that are detrimental to the refactorability of clones?

Tool Support:
  • What are the shortcomings of the tools you used?
  • What feature do you miss most?

Duplication Awareness:
  • If you look at your own code, would you find clones?
  • Will you pay more attention to duplication in your own programming in the future?