Duplicated Code Session

Objectives

The students know
- why developers duplicate code,
- the problems that can result from duplicating code,
- that duplicating code can sometimes be justified,
- that there exist tools to detect and visualize code cloning in software systems,
- and the advantages and disadvantages of exact string matching and token-based comparison algorithms.
The students are able to
- detect and analyse code clones in software systems using iClones,
- fine-tune the clone detection options to obtain more accurate results,
- select code clones that are refactoring candidates,
- and combine code clones into one function/method to remove them.
The students are aware of problems with the application of the tools and shortcomings that are typical for solutions in a relatively young research domain.

Tasks

Task 1: Introduction

Listen to the assistant's introduction. The slides are in the Documents section.

Task 2: Small scale detection

We start with the ultimate clone detector: the programmer.

Exercise:

Open the class DuplicationSuspect.java in an editor. The file is in the Documents section.

Questions:

Can you detect duplication with the naked eye?
Which methods seem to be similar?

Manual clone detection does not scale very well. We will therefore use some tools to do the tedious comparisons.

Exercise:

Run the simpleDude.pl script on the DuplicationSuspect.java file. The script is in the Documents section. (Alternatively, run launchSimpleDude.bat from the bundle. Remember to make simpleDude.pl executable.)

	simpleDude.pl DuplicationSuspect.java > report.txt

Try changing the parameter $slidingWindowSize at the beginning of the simpleDude.pl script from 10 to something higher, e.g. 20 or 30.
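To get a feeling for what a sliding window size controls, here is a minimal Python sketch of exact line matching. It is not the implementation of simpleDude.pl; the function name find_duplicate_windows, the whitespace stripping, the skipping of blank lines and the sample input are all assumptions made for illustration.

```python
from collections import defaultdict


def find_duplicate_windows(lines, window_size):
    """Return groups of line numbers where identical windows of lines start.

    Assumptions of this sketch: lines are compared after stripping
    surrounding whitespace, and blank lines are ignored.
    """
    # Keep (original_line_number, text) for non-blank lines only.
    cleaned = [(i + 1, line.strip()) for i, line in enumerate(lines) if line.strip()]
    windows = defaultdict(list)
    for start in range(len(cleaned) - window_size + 1):
        chunk = tuple(text for _, text in cleaned[start:start + window_size])
        windows[chunk].append(cleaned[start][0])
    # Only windows that occur more than once are duplication.
    return [starts for starts in windows.values() if len(starts) > 1]


# Hypothetical sample: lines 1-3 and 8-10 are identical,
# lines 5-7 are the same code with a renamed variable.
source = """\
int a = read();
a = a * 2;
print(a);
log("done");
int b = read();
b = b * 2;
print(b);
int a = read();
a = a * 2;
print(a);
""".splitlines()

print(find_duplicate_windows(source, 3))  # [[1, 8]]
print(find_duplicate_windows(source, 4))  # []
```

With a window of 3 lines the sketch reports the copies starting at lines 1 and 8; with a window of 4 it reports nothing. The block at lines 5-7 is never reported, because a renamed variable already breaks an exact string match.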

Questions:

Did the tool detect more or less duplication than you did?
What are the problems with this way of reporting duplication?
The detector uses exact string matching as a comparison mechanism. What are the consequences of that?

Can you determine any refactoring candidates?

Task 3: Large scale detection

Let's now try a more sophisticated tool. Instead of exact line matching, it uses the similarity of token chains to determine clones. We will use iClones, from the University of Bremen, to analyse a real-world system. The lab directory contains four systems that can be investigated: FreeMercator, MegaMek, PostgreSQL and Quake3. The source code archives are in the Documents section. If you want to investigate a project of your own, even better.
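As a rough illustration of how comparing token chains differs from comparing strings, the following Python sketch abstracts identifiers and numbers before comparing two fragments. This is a common idea in token-based detectors in general; how iClones itself tokenizes and normalizes code is not described here, and the keyword list, the regular expression and the function name normalize are assumptions of this sketch.

```python
import re

# A deliberately tiny keyword list; a real lexer knows the whole language.
KEYWORDS = {"int", "return", "if", "else", "for", "while"}
# Identifiers/keywords, integer literals, or any single other character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(code):
    """Turn source text into a token list with identifiers and numbers abstracted."""
    tokens = []
    for tok in TOKEN_RE.findall(code):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif re.fullmatch(r"[A-Za-z_]\w*", tok):
            tokens.append("ID")
        elif tok.isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)
    return tokens


first = "int a = read(); a = a * 2;"
second = "int total=read();\n total = total * 10;"

print(first == second)                        # False: the strings differ
print(normalize(first) == normalize(second))  # True: the token chains agree
```

Layout, variable names and literal values no longer matter after normalization, so this kind of comparison finds renamed copies that exact string matching misses. The price is a higher risk of reporting fragments that merely share the same shape.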

First, we do the duplication analysis of FreeMercator together. Unpack the archive.

Then launch iClones to analyse the software system and generate a report of duplicates (<path to source> is the directory that hosts the project):

	cd <path to source>
	/opt/unibremen[...]/iclones.sh \
		-input . -informat single -output clonereport.rcf -outformat rcf

Now, launch rcfViewer to read the report.

	/opt/unibremen[...]/rcfviewer.sh clonereport.rcf

Exercise:

Compute the duplication of FreeMercator. Start with the default options. Look at the identified clones using rcfviewer.

Questions about the report view:

What information does the clone report viewer show? Which different views does it offer?
How many clone classes are reported?
What types of clones does the tool detect?
How many clone pairs does the first clone set comprise?

Questions about the source code view:

Can you determine any refactoring candidates from the source code view?
Which types of clones are the clearest and easiest candidates for refactoring?

Now, with a bit of experience, let's play with the different options. Run iclones without arguments to read the help documentation. We are mainly interested in two options:

minblock: Minimum length of identical token sequences that are used to merge near-miss clones. (Default: 20)
minclone: Minimum length of clones measured in tokens. (Default: 100)

Try for example this:

	iclones -minblock 0 -input . -informat single -output clonereport.rcf -outformat rcf

	iclones -minblock 10 -input . -informat single -output clonereport.rcf -outformat rcf

	iclones -minblock 50 -input . -informat single -output clonereport.rcf -outformat rcf

As you can see, if minblock is set to 0, only identical clones are detected.
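The three runs above all write to the same clonereport.rcf, so each report overwrites the previous one. A small Python helper (Python 3.8 or later for shlex.join) can build the command lines with a separate report file per setting, so the reports can be compared side by side in rcfviewer. This is only a sketch: the function name iclones_command and the report naming scheme are assumptions, and only options that appear in this handout are used.

```python
import shlex


def iclones_command(minblock, minclone=None, source_dir="."):
    """Build an iclones command line; the report name encodes the settings."""
    report = f"clonereport-minblock{minblock}"
    args = ["iclones", "-minblock", str(minblock)]
    if minclone is not None:
        args += ["-minclone", str(minclone)]
        report += f"-minclone{minclone}"
    args += ["-input", source_dir, "-informat", "single",
             "-output", report + ".rcf", "-outformat", "rcf"]
    return args


# Print the commands instead of running them, so they can be checked first.
for minblock in (0, 10, 50):
    print(shlex.join(iclones_command(minblock)))
```

To execute a command directly from Python, pass the list to subprocess.run(args, check=True) from within the source directory, assuming iclones is on your PATH.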

Exercise:

Compute the duplication of FreeMercator again with iClones. This time, adjust minblock and minclone to remove false positives (you will have to create a new report with iclones and open it again with rcfviewer).

Questions:

Which options seem to remove false positives best?
You are on the lookout for clones that can be easily refactored. Which option values seem to lead to these clones?

Exercise:

Compute the duplication of FreeMercator again, this time with Dude, which uses line similarity instead of token chains.

Questions:

Comparing iClones with Dude: which tool finds more clones?
Comparing iClones with Dude: which tool seems more useful?

Task 4: Refactoring

To make this a reengineering lab, we will refactor some duplication.

Exercise:

If you were able to improve the clone detection with your own settings, use those. If not, go back to the report generated with the default options. Choose some of the clones that were found and remove them, i.e. combine them into a single function/method.

Questions to guide the refactoring process:

Which groups of obviously interconnected files can you see?
Can you find a file that is (almost) a complete copy of another?
Once you have detected the code clones and want to find refactoring candidates, what is the natural thing to look for in the clones?

How much of the code belongs to the clone? Only the part that is actually copied?
Which parts of the common code need to be abstracted to make the code work in general? How many parameters do we need to pass to the extracted functionality?
To which class does the functionality belong? Are the original places related via inheritance relationships that we can exploit (move to superclass)? Do we need to create a new class?
How sure are you that your refactoring did not change the behaviour of the system?
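The questions about abstraction and parameters can be made concrete with a small example. The sketch below is written in Python for brevity and is entirely hypothetical (it is not taken from FreeMercator or any other system in the lab): two clones differ only in a rate and a label, so these two differences become the parameters of the extracted function. The loop at the end is a minimal behaviour check, not a substitute for a real test suite.

```python
# Before: two clones that differ only in a tax rate and a label.
def gross_price_food(net):
    tax = net * 0.07
    return f"food: {net + tax:.2f}"


def gross_price_books(net):
    tax = net * 0.05
    return f"books: {net + tax:.2f}"


# After: the common code exists once; the differences become parameters.
def gross_price(net, rate, label):
    tax = net * rate
    return f"{label}: {net + tax:.2f}"


# Behaviour check: the extracted function must agree with both originals.
for net in (0.0, 1.0, 19.99, 250.0):
    assert gross_price(net, 0.07, "food") == gross_price_food(net)
    assert gross_price(net, 0.05, "books") == gross_price_books(net)
print("refactored function matches both clones")
```

In Java the same step is an Extract Method refactoring followed by replacing each clone with a call to the new method. Where the new method should live depends on how the classes containing the clones are related.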

Task 5: Final Discussion

Discuss the following questions in class.

Detection Quality:

How many false positives (clones which you as a programmer would not name as such) did the detector find?
Were the options offered by the detector enough to get to the true positives?
Did you feel that the options also removed some true positives?

Reengineering:

What were the reasons you could not refactor some clones?
Could a tool detect the characteristics that are detrimental to the refactorability of clones?

Tool Support:

What are shortcomings of the tools you used?
What feature do you miss most?

Duplication Awareness:

If you looked at your own code: have you found any striking examples?
Will you pay more attention to duplication in your own programming in the future?

Bonus task: Further exploration

Explore other tools:

Dude: it uses line similarity instead of token chains.
PMD's Copy/Paste Detector (CPD): it is available as an Eclipse plugin, so it integrates into your environment.
iClones + cyclone: it can also analyse the evolution of clones across several versions of a software system. Read the documentation of iClones and cyclone and use them to study the evolution of clones. You have to download several versions of the same software system (from GitHub, SourceForge, etc.) and prepare the source code following the instructions of iClones.

Read the additional documentation to improve your knowledge of software clones, to inspire your imagination and to identify more clone management tools. Which tool is most appropriate depends on several factors, such as the type of software project you want to analyse and the type of clones you need to detect most accurately.

Documents

The slides of the code duplication introduction.
The list of files/systems under examination:
- DuplicationSuspect.java
- FreeMercator (Java, POS application)
- MegaMek (Java, Strategy Game)
- PostgreSQL (C, Relational DB)
- Quake 3 (C, FPS)
The list of scripts/tools for code duplication detection used in the exercises:

Further reading materials:

C. K. Roy and J. R. Cordy. A survey on software clone detection research. Technical Report TR 2007-541, School of Computing, Queen's University, 2007.
C. K. Roy, J. R. Cordy, and R. Koschke. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming, 74(7):470–495, 2009. Special Issue on Program Comprehension (ICPC 2008).
C. Kapser and M. W. Godfrey. “Cloning Considered Harmful” Considered Harmful. In WCRE ’06: Proceedings of the 13th Working Conference on Reverse Engineering, pages 19–28, Washington, DC, USA, 2006. IEEE Computer Society.
C. K. Roy, M. F. Zibran, and R. Koschke. The vision of software clone management: Past, present, and future. In Proceedings of the Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE), 2014.