Mining Software Repositories

Many companies, as well as open-source projects, use Version Control Systems (VCS) such as CVS, Subversion, ClearCase, and Git to record the history of source-code files. A VCS records the development history and the changes applied to the source code. Furthermore, these versioning systems are often associated with change management systems, which allow users to report bugs and request new features. Popular change management systems include Bugzilla and JIRA. Together, these systems record the entire evolution of a software system.

The highly active Mining Software Repositories community (see the MSR conference) has shown that the evolutionary information implicitly stored in these systems may provide valuable insight into how the software is being developed. For example, the evolution of a source-code file over time may help explain why earlier design choices were made in that file. The first step in analyzing the evolutionary information of a software system is to extract the necessary information from the version control system or change management system.

To explore some of the interesting work done in the MSR field, you can refer to this link. Some examples of interesting studies include:

This lab session serves as a first contact with this emerging field. You will learn what kind of knowledge can be extracted from a software repository. To gain this understanding, we will use an initial prototype.

Case Study: JabRef

For this exercise, we will use JabRef. Open the shell and clone the project. Use git log to explore the project's evolution. 

Query the Versioning System

First, we will use simple queries applied to the version control system to identify interesting knowledge about the software system at hand. Study the information provided by the extracted Git log.

  • Can you recognize any reference to the project JabRef?
  • What valuable information can you extract from this Git log file about the evolution of the software system?
  • (How) can you use the information implicitly stored in the Git log to tackle the following problems:
      • Read all the code in one hour
      • Chat with the maintainers
      • Study the exceptional entities

Using git log and grep we can extract information such as:

  • Number of commits by Oliver Kopp
  • Number of times that a given file has been modified
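
The two counts above can also be computed from a small script. The following is a minimal Python sketch, not part of the lab material. It assumes that Git is installed and that the log is produced with the standard options `--pretty=format:%an` (one author name per commit) and `--name-only --pretty=format:` (changed paths only). The helper names `git_log`, `commits_per_author`, and `changes_per_file` are hypothetical and chosen for illustration.

```python
import subprocess
from collections import Counter


def git_log(repo_dir, *args):
    """Run `git log <args>` inside repo_dir and return its textual output."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def commits_per_author(author_output):
    """Count commits per author.

    Expects the output of `git log --pretty=format:%an`,
    i.e. one author name per commit, one name per line.
    """
    names = (line.strip() for line in author_output.splitlines())
    return Counter(name for name in names if name)


def changes_per_file(name_only_output):
    """Count how many commits touched each path.

    Expects the output of `git log --name-only --pretty=format:`,
    i.e. the changed paths of each commit, separated by blank lines.
    """
    paths = (line.strip() for line in name_only_output.splitlines())
    return Counter(path for path in paths if path)
```

With a local clone in a directory named `jabref` (an assumed path), `commits_per_author(git_log("jabref", "--pretty=format:%an"))["Oliver Kopp"]` would give the first count, and `changes_per_file(git_log("jabref", "--name-only", "--pretty=format:"))` the per-file counts. A renamed file is counted separately under each of its paths.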

First Contact Visualizations

The previous analysis shows that the repository may provide relevant information about the evolution of the system. However, writing queries to understand a software system can be daunting, especially during your first contact with a system. Hence, we will introduce a simple visualization of the versioning system: EGIT.

Launch Eclipse. EGIT is an Eclipse plugin created to support Git repositories. Follow the manual and answer the following questions:

  • Which files were recently changed by Sascha Zeller?
  • Which developers can you contact for more information about the class?
  • Find the last committed version of file, and discover:
    • which files have been modified along with it
    • who is the author of the last change
    • does this file exist in the current version of the source
  • Can you identify which files are modified to fix bug #1153? 
  • Can you identify the source code files which are unstable, i.e., files that have been subject to many changes?
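
Some of these questions (co-changed files, files touched by a bug fix, unstable files) can also be approached from the raw Git log. The following Python sketch rests on several assumptions: the log text comes from `git log --name-only --pretty=format:"@@%H%x09%an%x09%s"`, where `@@` is an arbitrary marker chosen here and `%x09` is a tab; and bug fixes mention the bug as `#<number>` in the commit subject, which the project history may not do consistently. All function names are hypothetical.

```python
import re
from collections import Counter

MARKER = "@@"  # arbitrary prefix chosen for this sketch to mark commit header lines


def parse_commits(log_text):
    """Parse `git log --name-only --pretty=format:"@@%H%x09%an%x09%s"` output.

    Returns a list of dicts with the keys: hash, author, subject, files.
    """
    commits = []
    for line in log_text.splitlines():
        if line.startswith(MARKER):
            fields = line[len(MARKER):].split("\t", 2)
            fields += [""] * (3 - len(fields))  # tolerate a missing subject
            commits.append({"hash": fields[0], "author": fields[1],
                            "subject": fields[2], "files": []})
        elif line.strip() and commits:
            # Any other non-blank line is a path changed by the current commit.
            commits[-1]["files"].append(line.strip())
    return commits


def files_for_issue(commits, issue_number):
    """Files touched by commits whose subject mentions '#<issue_number>'."""
    pattern = re.compile(r"#%d\b" % issue_number)
    touched = set()
    for commit in commits:
        if pattern.search(commit["subject"]):
            touched.update(commit["files"])
    return sorted(touched)


def co_changed_files(commits, path):
    """Count how often other files were committed together with `path`."""
    counts = Counter()
    for commit in commits:
        if path in commit["files"]:
            counts.update(f for f in commit["files"] if f != path)
    return counts


def most_changed(commits, top=10):
    """The `top` most frequently changed files, as (path, count) pairs."""
    counts = Counter(path for commit in commits for path in commit["files"])
    return counts.most_common(top)
```

A high change count is only a rough indicator of instability: a file may change often simply because it is central to the system, so the result is best read as a list of candidates to inspect.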

EGIT is only one of the tools that present the information recorded in a software repository in a convenient view. There are many others, often developed by the MSR community (here is an updated list), that explore software evolution. Each one is characterized by the repositories it handles (e.g. Git) and the information it extracts. Here are a few examples:

  • CodeScene. This tool integrates with GitHub and provides metrics and details useful for repository analysis.
  • Gource. The tool visualizes the evolution of the repository.
  • CVSGrab. The tool supports only CVS repositories (the commercial version extends this support). For details refer to the manual.
  • Kumpel. It supports SVN but requires the MOOSE infrastructure. 
  • CVSAnalY. The data extracted from the repository is stored in a database, which can then be queried according to the analysis we want to perform. For details refer to the manual.
  • Softwarenaut. Using the "filmstrip" function, it is possible to visualize how the packages of the system evolve.

We are going to use CodeScene and/or Gource to visualize JabRef's evolution, more specifically which developers worked on which files.

  • In CodeScene, import the JabRef project and look at the Social Networks -> Individuals visualization. Click on any file to see all the authors that contributed to its source.
  • For Gource, run the "gource" command from the command line with the project root folder as its argument.
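
The question of which developers worked on which files can also be answered in text form from the data that Gource animates. The following Python sketch assumes Gource's pipe-delimited custom log format, with one line per file change in the form `timestamp|username|type|file` and an optional trailing colour field; such a log can, to our knowledge, be exported with `gource --output-custom-log FILE`. The function names are hypothetical, and paths containing a `|` character are not handled.

```python
from collections import defaultdict


def authors_per_file(custom_log_text):
    """Map each file path to the set of users who changed it.

    Expects Gource custom-log lines: timestamp|username|type|file[|colour].
    """
    authors = defaultdict(set)
    for line in custom_log_text.splitlines():
        fields = line.strip().split("|")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _timestamp, username, _change_type, path = fields[:4]
        authors[path].add(username)
    return dict(authors)


def files_per_author(custom_log_text):
    """Inverse view: map each user to the set of files they changed."""
    files = defaultdict(set)
    for path, users in authors_per_file(custom_log_text).items():
        for user in users:
            files[user].add(path)
    return dict(files)
```

Reading the exported log file into a string and passing it to `authors_per_file` yields, for every path, the developers to contact about it; `files_per_author` gives the complementary view per developer.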