Mining Software Repositories

Many companies, as well as open-source projects, use Version Control Systems (VCS) such as CVS, Subversion, ClearCase and Git to record the history of source-code files. A VCS records the development history and the change activities applied to the source code. Furthermore, these versioning systems are often associated with change management systems, which allow users to report bugs and request new features. Popular change management systems include Bugzilla and JIRA. Together, these systems record the entire evolution of a software system.

The highly active Mining Software Repositories community (see the MSR conference) has shown that the evolutionary information implicitly stored in these systems can provide valuable insight into how the software is being developed. The evolution of a source-code file over time may better explain why previous design choices were made in that file. The first step in analysing the evolutionary information of a software system is to extract the necessary information from the version control system or change management system.

For an overview of interesting work in the MSR field, you can refer to this link. Some examples of interesting studies include:

This lab session serves as a first contact with this newly arising domain. You will learn what kind of knowledge can be extracted from a software repository. To gain this understanding, we will use an initial prototype.

Case Study: JabRef

For this exercise we will use JabRef. Open the shell and clone the project. Use git log to explore the project's evolution. 
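
If you prefer to capture the log from a script rather than read it in the terminal, the following minimal sketch may help. It assumes that git is installed and that the project has already been cloned; the helper names and the directory name "jabref" are illustrative assumptions, not part of the lab material.

```python
import subprocess


def git_log_command(repo_dir, max_count=None):
    """Build the argument list for a `git log` call that also lists changed files."""
    cmd = ["git", "-C", str(repo_dir), "log", "--name-only", "--date=iso"]
    if max_count is not None:
        # Limit the output to the most recent commits.
        cmd.append(f"--max-count={max_count}")
    return cmd


def read_git_log(repo_dir, max_count=None):
    """Run `git log` inside repo_dir and return its output as text."""
    result = subprocess.run(
        git_log_command(repo_dir, max_count),
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # tolerate odd bytes in old commit messages
        check=True,        # raise if git reports an error
    )
    return result.stdout


if __name__ == "__main__":
    # "jabref" is an assumed name for the directory created by the clone.
    print(read_git_log("jabref", max_count=5))
```

The --name-only option adds the list of changed files to every commit, which is useful for the queries in the rest of this session.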

Query the Versioning System

First, we will run simple queries against the version control system to identify interesting knowledge about the software system at hand. Study the information provided by the extracted Git log (download the file here: Git.log).

  • Can you recognize any reference to the project JabRef?
  • What valuable information can you extract from this Git log file about the evolution of the software system?
  • (How) can you use the information implicitly stored in the Git log to tackle the following problems:
      • Read all the code in one hour
      • Chat with the maintainers
      • Study the exceptional entities

Using git log and grep, we can extract several pieces of information:

  • Number of commits of Oliver Kopp
  • Number of times that the file has been modified
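
The two counts above can also be computed from the log text with a short script. This is only a sketch: it assumes the log is in the default git log format produced with the --name-only option (an "Author:" line per commit, indented message lines, and unindented file paths), which may not match the provided Git.log exactly. The function names are illustrative.

```python
import re
from collections import Counter

# Matches e.g. "Author: Some Name <mail@example.org>" and captures the name.
AUTHOR_LINE = re.compile(r"^Author:\s*(.*?)\s*<[^>]*>\s*$")

# Header lines of the default `git log` format; everything else that is
# unindented and non-empty is treated as a changed file path.
HEADER_PREFIXES = ("commit ", "Author:", "Date:", "Merge:")


def commits_per_author(log_text):
    """Count commits per author name by counting 'Author:' header lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = AUTHOR_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def modifications_per_file(log_text):
    """Count how many commits touched each file path."""
    counts = Counter()
    for line in log_text.splitlines():
        # Skip blank lines and the indented commit message.
        if not line.strip() or line.startswith((" ", "\t")):
            continue
        if line.startswith(HEADER_PREFIXES):
            continue
        counts[line] += 1
    return counts
```

For example, commits_per_author(log_text)["Oliver Kopp"] gives the number of commits by that author, and modifications_per_file(log_text) can be queried with the path of the file you are interested in.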

First Contact Visualizations

The previous analysis shows that the repository can provide relevant information about the evolution of the system. However, writing queries to understand a software system can be daunting, especially during your first contact with that system. Hence we will introduce a simple visualization of the versioning system: EGIT.

Launch Eclipse. EGIT is an Eclipse plugin created to support Git repositories. Follow the manual and answer the following questions:

  • Which files were recently changed by Kolija Brix?
  • Which developers can you contact for more information about the class?
  • Given the last version of the file, discover:
    • which files have been modified along with it
    • who are the authors of the lines 64, 65
    • which changes have been made since the previous version
  • Can you identify the source code files which are unstable, i.e., files which have been subject to many changes?
  • Can you identify which files are modified to fix bug #1153?
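
Some of these questions can be cross-checked outside Eclipse. The sketch below groups the log into commits and then ranks files by number of changes, lists files that changed together with a given file, and collects the files touched by commits whose message mentions an issue reference. It assumes the default git log format with --name-only and that bug fixes mention the bug number in the commit message; both are assumptions about the project, and all function names are illustrative.

```python
import re
from collections import Counter


def parse_commits(log_text):
    """Split `git log --name-only` text into a list of commit records."""
    commits = []
    current = None
    for line in log_text.splitlines():
        if line.startswith("commit "):
            parts = line.split()
            current = {
                "id": parts[1] if len(parts) > 1 else "",
                "author": "",
                "message": [],
                "files": [],
            }
            commits.append(current)
        elif current is None or not line.strip():
            continue
        elif line.startswith("Author:"):
            current["author"] = line[len("Author:"):].strip()
        elif line.startswith(("Date:", "Merge:")):
            continue
        elif line.startswith((" ", "\t")):
            # Indented lines belong to the commit message.
            current["message"].append(line.strip())
        else:
            # Remaining unindented lines are changed file paths.
            current["files"].append(line)
    return commits


def most_changed_files(commits, top=10):
    """Rank files by the number of commits that touched them."""
    counts = Counter(path for c in commits for path in c["files"])
    return counts.most_common(top)


def co_changed_files(commits, path):
    """Count how often other files were committed together with `path`."""
    counts = Counter()
    for c in commits:
        if path in c["files"]:
            counts.update(f for f in c["files"] if f != path)
    return counts


def files_for_issue(commits, issue_ref):
    """Files touched by commits whose message mentions e.g. '#1153'."""
    # The lookahead avoids matching '#11530' when searching for '#1153'.
    pattern = re.compile(re.escape(issue_ref) + r"(?!\d)")
    files = set()
    for c in commits:
        if any(pattern.search(line) for line in c["message"]):
            files.update(c["files"])
    return sorted(files)
```

A file that appears at the top of most_changed_files is a candidate "unstable" file, but a high change count alone does not say why it changed, so it is worth reading the corresponding commit messages as well.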

EGIT is only one of the tools that can present the information recorded in a software repository in a convenient view. There are many others, often developed by the MSR community (here is an updated list), that explore software evolution. Each one is characterized by the type of repository it handles (e.g. git) and the information it extracts. Here are a few examples:

  • Softwarenaut. Using the "filmstrip" function, it is possible to visualize how the packages of the system evolve.
  • Gource. The tool visualizes the evolution of the repository.
  • CVSGrab. The tool supports only CVS repositories (the commercial version extends this support). For details refer to the manual.
  • Kumpel. It supports SVN but requires the MOOSE infrastructure. 
  • CVSAnalY. The data extracted from the repository is stored in a db. This data can be queried easily according to the analysis we want to perform. For details refer to the manual.

We will use gource to visualize the evolution of a project, more specifically which developers worked on which files and when. It can be run from the command line using the "gource" command with the project root folder as its argument.

  • Use gource on a group project you made for the university. The only precondition is that you used a version control system (git or mercurial).
  • Alternatively, if you don't have an interesting project, use gource on the JabRef project.