Mining Repositories

Mining software repositories is related to both data mining and reverse engineering. It focuses on mining software artefacts such as code bases, program states and structural entities for useful information about the characteristics of a system [Xie et al. 2007]. In this lab you will mine software repositories and identify frequently touched files.

Material and Tools


The following tasks will guide you through the lab on mining software repositories. In the lecture notes, we looked at three ways of getting data from GitHub: cloning the repos using Git, using GHTorrent, and using the GitHub API. In this lab we shall go through getting GitHub data using the GitHub API.

Task 1

Create one or more GitHub tokens by following the tutorial on this link. Each token corresponds to a GitHub account. Download the Python file that collects files from a repo on GitHub and run it on your machine. The script collects all the files in a repo, together with the number of times each file has been touched throughout its lifetime.
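To make the idea of "touches" concrete, here is a minimal sketch of how such a count could be produced with the GitHub REST API. It is not the provided script. The function names (`get_json`, `count_touches`, `fetch_commit_details`) are hypothetical, and the sketch assumes that a file is "touched" once for every commit whose detail record lists it under `files`.

```python
import json
import urllib.request
from collections import Counter

API_ROOT = "https://api.github.com"


def get_json(url, token):
    """Fetch one GitHub API URL and decode the JSON body (hypothetical helper)."""
    request = urllib.request.Request(url, headers={"Authorization": f"token {token}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def count_touches(commit_details):
    """Count how many commits touched each file.

    Each item is the JSON of one commit as returned by
    GET /repos/{owner}/{repo}/commits/{sha}, which lists the changed
    files under the "files" key.
    """
    touches = Counter()
    for detail in commit_details:
        for changed in detail.get("files", []):
            touches[changed["filename"]] += 1
    return touches


def fetch_commit_details(owner, repo, token):
    """Yield the detail record of every commit, page by page."""
    page = 1
    while True:
        url = f"{API_ROOT}/repos/{owner}/{repo}/commits?per_page=100&page={page}"
        commits = get_json(url, token)
        if not commits:  # an empty page means there are no more commits
            break
        for commit in commits:
            yield get_json(f"{API_ROOT}/repos/{owner}/{repo}/commits/{commit['sha']}", token)
        page += 1
```

With a valid token, `count_touches(fetch_commit_details(owner, repo, token))` would return the touch count per file. Because this makes one request per commit, large repos can exhaust the API rate limit of a single token.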

The graphs for the number of touches on the files in the following GitHub repos can be found here:

You can find the data for the graphs above in the csv folder on GitHub.

A repository contains both source files and other files such as configuration files. Developers spend most of their time changing source files, for example to fix bugs, add new features or refactor. The script collects all files in a repository. Your first task is to adapt the script so that it collects only the source files. First browse through the files of each repo to identify the kinds of source files it contains (some repos may be written in more than one programming language).
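One simple way to restrict the collection to source files is to filter on the file extension. The sketch below assumes this approach. The extension set is only a placeholder that you would replace after inspecting each repo, and the names `SOURCE_EXTENSIONS`, `is_source_file` and `keep_source_files` are hypothetical.

```python
import os

# Assumption: placeholder list; adjust it to the languages each repo actually uses.
SOURCE_EXTENSIONS = {".java", ".kt", ".py", ".c", ".cpp", ".h"}


def is_source_file(path, extensions=SOURCE_EXTENSIONS):
    """Return True if the path ends with one of the source extensions."""
    _, ext = os.path.splitext(path)
    return ext.lower() in extensions


def keep_source_files(touches, extensions=SOURCE_EXTENSIONS):
    """Drop every entry of a {path: touch_count} mapping that is not a source file."""
    return {path: count for path, count in touches.items()
            if is_source_file(path, extensions)}
```

For example, `keep_source_files({"app/Main.java": 3, "README.md": 5})` keeps only the `app/Main.java` entry.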

Task 2

First, fork my repo johnxu21/sre. Write a script with the name <'your_firstname'> that collects, for each file in the list generated by the adapted script (source files only), the authors who touched it and the dates on which they did so.
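As a starting point, the sketch below shows one possible way to build the author and date history per file. It assumes you already have the commit detail records from the GitHub API (the JSON of `GET /repos/{owner}/{repo}/commits/{sha}`), where the author name and date sit under `commit.author`. The function name `collect_author_touches` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def collect_author_touches(commit_details, source_files):
    """Map each source file to a list of (author_name, datetime) touches."""
    wanted = set(source_files)
    history = defaultdict(list)
    for detail in commit_details:
        author = detail["commit"]["author"]
        # GitHub returns timestamps in the form 2020-01-02T03:04:05Z (UTC).
        when = datetime.strptime(author["date"], "%Y-%m-%dT%H:%M:%SZ")
        for changed in detail.get("files", []):
            if changed["filename"] in wanted:
                history[changed["filename"]].append((author["name"], when))
    return dict(history)
```

The result can be written to a CSV file with one row per (file, author, date) so that the plotting script does not need to query the API again.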

Second, write a script that generates a scatterplot (using matplotlib) of the date and file variables, where the points are coloured according to the author variable. Each author should have a distinct colour. Looking at the scatterplot, one should be able to tell which files are touched many times, and by whom. This can help, for example, when a refactoring opportunity is identified and you need to decide which developer should be given the task, since a developer who has touched a file many times is likely to know it well. Since some repos contain a very large number of files, you can create the scatterplot for only the 50 most frequently touched files. It is also wise to code the file names, since long names clutter the plot. You could, for example, use f1, f2, f3, ..., f50.

You can name the script for drawing the scatter plot <'your_firstname'>. For a hint on how to draw the scatter plot, see this link: Stackoverflow
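The plotting step described above could be sketched as follows. This is one possible layout, not a reference solution. It assumes the input is a mapping from file path to a list of (author, date) pairs, puts dates on the x-axis and coded files on the y-axis, and samples the `hsv` colormap so that each author gets a different colour. The names `prepare_points` and `draw_scatter` are hypothetical.

```python
def prepare_points(history, top_n=50):
    """Pick the top_n most touched files and code them as f1, f2, ...

    history maps a file path to a list of (author, date) pairs.
    Returns (codes, authors, points), where points holds
    (date, file_position, author) triples.
    """
    ranked = sorted(history, key=lambda path: len(history[path]), reverse=True)[:top_n]
    codes = {path: f"f{i}" for i, path in enumerate(ranked, start=1)}
    positions = {path: i for i, path in enumerate(ranked, start=1)}
    authors = sorted({author for path in ranked for author, _ in history[path]})
    points = [(when, positions[path], author)
              for path in ranked for author, when in history[path]]
    return codes, authors, points


def draw_scatter(history, out_path="scatter.png", top_n=50):
    """Draw the date vs file scatterplot, one colour per author."""
    import matplotlib.pyplot as plt  # imported here so prepare_points works without it

    codes, authors, points = prepare_points(history, top_n)
    cmap = plt.get_cmap("hsv")
    fig, ax = plt.subplots(figsize=(10, 8))
    for i, author in enumerate(authors):
        xs = [when for when, _, name in points if name == author]
        ys = [pos for _, pos, name in points if name == author]
        # i / len(authors) stays below 1.0, so no two authors share a colour
        ax.scatter(xs, ys, color=cmap(i / len(authors)), label=author, s=20)
    ax.set_yticks(range(1, len(codes) + 1))
    ax.set_yticklabels(list(codes.values()))
    ax.set_xlabel("date")
    ax.set_ylabel("file")
    ax.legend(fontsize="small")
    fig.autofmt_xdate()
    fig.savefig(out_path)
    plt.close(fig)
```

With many authors, neighbouring `hsv` colours become hard to tell apart, so you may prefer to group infrequent authors or to use different markers as well.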

Hypothetical example

Let us say a repo has six source files (file1, file2, file3, file4, file5 and file6) and three authors (red, blue and green). The authors have been updating the files and committing their changes. This scatterplot shows the authors' activities over time.

Task 3 (Optional)

Write a script that extracts the merged pull requests and the closed but not merged pull requests from each of the repositories above. You can read about the structure of pull requests here: Pull Requests. Draw one graph containing all five repos above that compares merged pull requests with closed (but not merged) pull requests.
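A possible starting point for the classification is sketched below. It relies on the fact that the GitHub API lists closed pull requests via `GET /repos/{owner}/{repo}/pulls?state=closed` and that each entry carries a `merged_at` field, which is null when the pull request was closed without being merged. The function names `fetch_closed_pulls` and `summarize_pull_requests` are hypothetical.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def fetch_closed_pulls(owner, repo, token):
    """Yield every closed pull request of a repo, page by page."""
    page = 1
    while True:
        url = (f"{API_ROOT}/repos/{owner}/{repo}/pulls"
               f"?state=closed&per_page=100&page={page}")
        request = urllib.request.Request(url, headers={"Authorization": f"token {token}"})
        with urllib.request.urlopen(request) as response:
            pulls = json.load(response)
        if not pulls:  # an empty page means there are no more pull requests
            break
        yield from pulls
        page += 1


def summarize_pull_requests(closed_pulls):
    """Count merged versus closed-but-not-merged pull requests."""
    summary = {"merged": 0, "closed_not_merged": 0}
    for pull in closed_pulls:
        if pull.get("merged_at"):  # a timestamp means the pull request was merged
            summary["merged"] += 1
        else:
            summary["closed_not_merged"] += 1
    return summary
```

Calling `summarize_pull_requests(fetch_closed_pulls(owner, repo, token))` once per repo gives the two numbers per repo, which can then be drawn, for instance, as a grouped bar chart.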


Send me a pull request for Task 2 and Task 3 once you are done.