2. Duplicated Code

(Last Update: 2022-03-09)

The purpose of this lab session (and its exercises) is to present tools for duplicate code detection. Whether or not to remove duplicated code is a decision for an expert. Although "Detecting Duplicated Code" has its own chapter in the book (OORP, p.223), I would also like to point out the Setting Direction chapter (OORP, p.19). When dealing with duplicated code we often need to apply the patterns "Fix Problems, Not Symptoms" (OORP, p.33) and "If It Ain't Broke, Don't Fix It" (OORP, p.35). Duplicated code can be a symptom of the programming practices adopted by the developers, so it may be better to address those practices before trying to remove the duplicates. Moreover, duplicated code that is stable and not prone to change may not need to be removed (or the cost of removing it may be higher than the cost of leaving it in place).

When is code duplication okay? If duplicating something lets you meet a deadline and not duplicating does not, then I would rather deliver today and fix the technical debt tomorrow. One development paradigm where code duplication is preferred over abstraction is clone-and-own.

In clone-and-own development, a new variant of a software system is created by copying and adapting an existing variant (e.g., using the branching or forking capabilities of a version control system). This way, new variants are created ad hoc and without requiring upfront investments. However, as the number of variants increases, development becomes redundant and maintenance effort grows rapidly. For example, if a bug is discovered and fixed in one variant, it is often unclear which other variants in the family are affected by the same bug and how the bug should be fixed in those variants. A more systematic way of developing variants is a Software Product Line, which consists of a set of similar software products with well-defined commonalities and variabilities. The software product line strategy scales well to many variants, but it is often difficult to adopt because it requires a large upfront investment of time and money. Developers therefore frequently fall back on the ad-hoc reuse of clone-and-own, because it is inexpensive, flexible, and lets developers work independently. On social coding platforms like GitHub, clone-and-own is the prevalent practice for the same reasons.

Materials & Tools Used for this Lab Session

slides here.
Video lecture here, from last year (only relevant if you want to attempt Tasks 3 to 6).
PaReco is a lightweight, syntax-based code clone detection tool that identifies unpatched code clones (missed opportunities in our case) at scale. The tool performs normalization (i.e., it removes language comments, removes all non-ASCII characters, removes redundant whitespace except newlines, and converts all characters to lower case). Next, the tool tokenizes the files extracted from both the source and the target variants. Representing source code as a token sequence enables the detection of clones with different line structures, which a line-by-line algorithm cannot detect.
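
To make the normalization and tokenization steps concrete, here is a minimal sketch in Python. It is not PaReco's actual code: the function names, the comment syntax handled (C/Java-style only), and the token rule (runs of letters, digits, and underscores, or a single punctuation character) are all assumptions made for illustration.

```python
import re


def normalize(source: str) -> str:
    """Illustrative normalization: comments, non-ASCII, extra whitespace, case."""
    # Remove /* ... */ comments but keep their newlines so line breaks survive.
    source = re.sub(r"/\*.*?\*/",
                    lambda m: "\n" * m.group(0).count("\n"),
                    source, flags=re.DOTALL)
    # Remove // comments (naive: ignores "//" inside string literals).
    source = re.sub(r"//[^\n]*", "", source)
    # Drop all non-ASCII characters.
    source = source.encode("ascii", errors="ignore").decode("ascii")
    # Collapse runs of spaces/tabs and trim each line; newlines are preserved.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in source.split("\n")]
    # Convert everything to lower case.
    return "\n".join(lines).lower()


def tokenize(normalized: str) -> list[str]:
    """Split normalized text into tokens (assumed rule, see lead-in)."""
    return re.findall(r"[a-z_0-9]+|[^\sa-z_0-9]", normalized)
```

With this sketch, a statement written on one line and the same statement spread over three lines yield the same token sequence, which is why a token-based comparison can match code that a line-by-line comparison would miss.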

Auxiliary Tools

Auxiliary tools are not required for the lab session itself, but they may be useful to get additional information (or alternatives) on a project. Use them at your own discretion.

DuplicationSuspect.java - file used as an example for Task 3
FreeMercator (Java, Point-of-Sales Application, website) sample system used for Tasks 4-5
Dude (website) a clone detection tool that employs line similarity instead of tokens.
iClones + rcfViewer: iClones is a powerful tool for detecting clones and rcfViewer provides a visualization interface to browse the clones. To download iClones you need to submit a request on their webpage to get the download link (under an academic/personal license, not for use in commercial practices). There is a compressed file inside the JPacman repository called 'clonedetection-02.tar.bz2' which is the current version of iClones.

rcfviewer.bat - for Windows users, there is a small issue in the 'rcfviewer.bat' file in the official download. Replace it with this one to make it work.

simpleDude.pl - Perl script with a simple duplicate code detection algorithm (similar to the one shown in the introduction lecture for this course)
NiCad employs a text-based technique for clone detection (usually faster than token-based but also catches fewer clones).
cyclone is an evolution inspection tool for clones and a part of the iClones tool suite (but you must download cyclone separately). It can analyze the evolution of clones across several versions of a software system. Read the documentation of both iClones and cyclone to use them to study the evolution of clones. You may need to download several versions of the same software (from GitHub, SourceForge, etc.) and prepare the source code according to the instructions of iClones for an evolutionary comparison.
Duplicate Detector is an IntelliJ plugin to detect clones (not as powerful as iClones, and it may crash sometimes, but convenient)
PMD's Copy and Paste Detector (CPD) is an Eclipse plugin to detect clones.

Setup / Preparation

Be sure to follow the setup from the previous session if you have not done so yet (in summary: (i) install IntelliJ, (ii) fork the JPacman repository, (iii) build & run JPacman). Also, if you have not already, download the book for this course, "Object-Oriented Reengineering Patterns" (note: OORP, p.xx refers to a page in the PDF version of this book).

Get/install the tools and code detailed in the Materials above.

Task 1: Manual Small Scale Clone Detection

We start with the ultimate clone detector: the developer. Look at the patches in the following file: patches.

LinkedIn is a variant of Apache Kafka that was created by copying and adapting the existing code of Apache Kafka; it was forked on 2011-08-15T18:06:16Z. The two software systems kept synchronizing their new updates until 2022-02-22T13:32:39Z. Since 2022-02-22T13:32:39Z (the divergence date), the two projects have not shared new commits, yet both actively evolve in parallel. Currently (2022-02-28T15:01:39Z), LinkedIn has 500 individual commits, and Apache Kafka has 3,103 individual commits.
Select one of the patches, say pull request 9305 from the upstream repository. Then look at the same files in the fork variant. Look for the duplicated code.
Repeat the same for other patches and determine whether each patch is classified correctly by the tool PaReco.
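
As a rough aid for the manual comparison, the sketch below measures how many of the lines added by an upstream patch already appear in the fork's version of a file. It is a simplification for illustration and not how PaReco classifies patches; the function names are hypothetical, and it assumes the patch is available as unified diff text.

```python
def added_lines(unified_diff: str) -> list[str]:
    """Collect the non-blank lines a unified diff adds, without the leading '+'."""
    # Lines starting with '+++' are treated as file headers. Limitation: an added
    # line whose own content starts with '++' would be skipped as well.
    return [line[1:].strip()
            for line in unified_diff.splitlines()
            if line.startswith("+") and not line.startswith("+++")
            and line[1:].strip()]


def fraction_present(unified_diff: str, target_source: str) -> float:
    """Fraction of added patch lines found verbatim (modulo indentation) in the target."""
    target = {line.strip() for line in target_source.splitlines()}
    added = added_lines(unified_diff)
    if not added:
        return 0.0
    return sum(line in target for line in added) / len(added)
```

A value close to 1 suggests the fork already contains the change, and a value close to 0 suggests it does not. Either way, you still need to read the code to judge whether the fork needs the patch at all.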

Can you detect the MO/ED of all patches with your bare eyes?
Which methods seem to be similar?

Manual clone detection does not scale very well. We will, therefore, use some tools to do the tedious comparisons.

Please follow the instructions in the readme.md file of PaReco.

Task 2: Large Scale Clone Detection

We shall now use our tool to detect patches and missed patches between two variants.

The following tasks are from last year and are optional.

Task 3: Automated Small Scale Clone Detection

Let's use an automated tool for duplicate detection, Dude. This tool uses line similarity to find clones. Now, we will use Dude to detect clones on the DuplicationSuspect.java file (the Materials above have the files and links).
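
To give an idea of what line-based detection involves, here is a minimal sketch of exact line matching in Python. It is not Dude's implementation (nor a port of simpleDude.pl). The function name, the min_lines threshold, and the choice to strip all whitespace before comparing are assumptions made for illustration.

```python
from collections import defaultdict


def find_duplicate_blocks(lines, min_lines=3):
    """Return sorted (start_a, start_b, length) triples, using 0-based line indices."""
    # Strip all whitespace from each line; blank lines become None and never match.
    cleaned = ["".join(line.split()) or None for line in lines]

    # Map each distinct normalized line to the indices where it occurs.
    positions = defaultdict(list)
    for index, text in enumerate(cleaned):
        if text is not None:
            positions[text].append(index)

    blocks = []
    for indices in positions.values():
        for i, a in enumerate(indices):
            for b in indices[i + 1:]:
                # Skip pairs that merely continue a run starting on an earlier line.
                if (a > 0 and cleaned[a - 1] is not None
                        and cleaned[a - 1] == cleaned[b - 1]):
                    continue
                # Extend the run while lines match, without letting the copies overlap.
                length = 0
                while (a + length < b and b + length < len(cleaned)
                       and cleaned[a + length] is not None
                       and cleaned[a + length] == cleaned[b + length]):
                    length += 1
                if length >= min_lines:
                    blocks.append((a, b, length))
    return sorted(blocks)
```

Because this sketch compares whole lines, renaming a single variable inside a copied block is enough to hide the clone. Keep that limitation in mind when you compare exact matching with the other similarity settings below.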

Download and execute Dude. Since it is a Java executable, Dude should run on any platform with a JVM. On the main interface, click on the "Home/House" icon (or menu 'Search' -> 'Set starting directory') and select the folder with DuplicationSuspect.java. Make sure there are no other files in this folder; otherwise Dude will also try to detect clones in them. Then click on the 'Search' icon (or menu 'Search' -> 'Search').

Try to change the parameters of Dude (menu 'Search' -> 'Configure Parameters') and see how that affects the detected clones.

Did the tool detect more or fewer duplicates than you?

How do the results for exact matching (the default configuration) differ from the results for other similarity settings?

Can you determine any refactoring candidates based on this tool's output?

Task 4: Large Scale Clone Detection

Let's now try a more sophisticated tool. In this case, instead of exact line matching, the similarity of token chains will be used for determining clones. We will use iClones, from the University of Bremen. We will analyze the FreeMercator system with it (JPacman is too small and does not have many clones).

Launch iClones to analyze FreeMercator and generate a report of duplicates. Be aware that iClones is a command-line tool, and you should adjust the paths accordingly. Specifically, <source_path> is the path where you have the source code for your project, and <iclones_path> is the path where you stored the unpacked folder of iClones (and so forth). If you need help using the command line for this tool, watch the extra lecture on it.

First, change your current folder to the one where FreeMercator is located:

cd <source_path>

Then, execute the command below to run iClones on the current folder (which you set above to be the source path). On Windows use "iclones.bat" instead and replace "/" with "\".

<iclones_path>/iclones -input . -output clonereport.rcf -outformat rcf

That will create a file called 'clonereport.rcf' which contains all the clones detected by iClones. To see these clones in a more user-friendly interface, we need to use the rcfViewer. Now, launch rcfViewer to visualize the report (on Mac/Linux you will first need to make rcfviewer.sh executable). On Windows, the command is "rcfviewer.bat" instead (remember to replace the rcfviewer.bat with my bug-free version).

<rcfviewer_path>/rcfviewer.sh clonereport.rcf

Questions about the report view:

What information do you have in the clone report viewer? What different views?

How many clone classes are reported?

What types of clones does the tool detect?

How many clone pairs does the first clone set comprise?

Questions about the source code view:

Can you determine any refactoring candidates from the source code view?

Which types of clones are the clearest and easiest candidates for refactoring?

Task 5: Tweaking the Parameters on Clone Detection

Now, with a bit of experience let's play with the different options a little. Run iClones without arguments to read the help documentation. We are mainly interested in two options:

minblock: Minimum length of identical token sequences that are used to merge near-miss clones. (Default: 20)
minclone: Minimum length of clones measured in tokens. (Default: 100)

Try for example this (adapting the command line accordingly to run on your sources, path, and platform):

	iclones -minblock 0 -minclone 100 -input . -output report2.rcf -outformat rcf

As you can see, if minblock is set to 0, only identical clones are detected. This time, adjust minblock and minclone to remove false positives (you will have to create a new report with iClones and open it again with rcfviewer).
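
To build some intuition for what a minimum clone length in tokens does, the sketch below finds identical token runs shared by two token sequences and keeps only those of at least min_clone tokens. It does not reproduce iClones' algorithm. The function name is hypothetical, and it relies on Python's difflib, which only aligns runs that occur in the same relative order in both sequences.

```python
from difflib import SequenceMatcher


def identical_token_runs(tokens_a, tokens_b, min_clone=100):
    """Return (start_a, start_b, size) for identical runs of at least min_clone tokens."""
    # autojunk=False disables difflib's heuristic that ignores very frequent elements.
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    return [(m.a, m.b, m.size)
            for m in matcher.get_matching_blocks()
            if m.size >= min_clone]
```

For example, with tokens_a = "x = a + b ; y = 1 ;".split() and tokens_b = "z = 0 ; x = a + b ; w = 2 ;".split(), a threshold of 5 keeps the shared six-token run "x = a + b ;", while a threshold of 7 reports nothing. Lower thresholds let more short, coincidental matches through, which is one source of false positives.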

Which options seem to remove more false positives?

You are on the lookout for clones, which can be easily refactored. Which option values seem to lead to these clones?

Task 6: Clone Detection on JPacman

Now, run the clone detection on JPacman. Adjust the parameters as necessary.

Was it important to adjust the parameters to find clones on JPacman?

How many clones does JPacman have compared to FreeMercator?

Is it necessary to remove all clones you found in JPacman?

Optional Task 1: Comparing Dude and iClones

This is an optional task, for those who want to dive deeper into Clone Detection. Compute again the duplication of FreeMercator, this time with Dude. It uses line similarity instead of token chains.

Comparing iClones with Dude: which tool finds the most clones?

Comparing iClones with Dude: Which tool seems the most useful?

Optional Task 2: Clone Detection on other Large Systems

This is another optional task. We have some large systems with a large number of clones for you to play with. For the systems coded in C, you need to use the '-language c++' parameter in iClones.

MegaMek GitHub (Java, Strategy Game, website)
PostgreSQL Git Info (C, Relational DB, website)
Quake 3 Arena GitHub (C, FPS, website)

Discussion and Conclusions

In this session, we used Duplicate Code Detection tools to find Code Clones. Be sure to check the chapter "Detecting Duplicated Code" (OORP, p.223) for more information. As a final discussion let's consider the following questions.

Detection Quality:

How many false positives (clones which you as a programmer would not name as such) has the detector found?
Were the options offered by the detector enough to get to the true positives?
Did you feel that the options also removed some true positives?

Reengineering:

What were the reasons you could not refactor some clones?
Could a tool detect the characteristics that are detrimental to the refactorability of clones?

Tool Support:

What are the shortcomings of the tools you used?
What feature do you miss most?

Duplication Awareness:

If you look at your own code, would you find clones?
Will you pay more attention to duplication in your own programming in the future?