4. Mining Repositories

Mining software repositories is related to both data mining and reverse engineering. It focuses on mining software artefacts such as code bases, program states and structural entities for useful information about the characteristics of a system [Xie et al. 2007]. In this lab you will mine software repositories and identify frequently touched files.

Material and Tools


The following tasks guide you through the lab on mining software repositories. In the lecture slides, we looked at three ways of getting data from GitHub: cloning the repos using Git, using GHTorrent, and using the GitHub API. In this lab we shall go through getting GitHub data using the GitHub API.

Task 1

  • Create one or more GitHub tokens by following the linked tutorial. Each token corresponds to a GitHub account.
  • Fork my repo johnxu21/sre2020-21, then clone the fork onto your laptop to have a local copy of the source code.
  • Browse the src directory and rename the file CollectFiles.py to <your-names>_CollectFiles.py.
  • Replace the fake tokens in the code with your own token.
  • Then run <your-names>_CollectFiles.py and look at the output. The code collects all the files in a repo, along with the number of times each file has been touched throughout its lifetime.
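To make the idea of "touches" concrete, here is a minimal sketch of the counting step. It is not the code in CollectFiles.py; it only assumes that each commit has been fetched as JSON from the GitHub API endpoint GET /repos/{owner}/{repo}/commits/{sha}, whose response lists the changed files under "files". The function name count_file_touches and the sample data are hypothetical.

```python
from collections import Counter


def count_file_touches(commit_details):
    """Count how many commits touched each file.

    commit_details: iterable of dicts shaped like the JSON of a single
    commit from the GitHub API (assumed to carry a "files" list whose
    entries have a "filename" key).
    """
    touches = Counter()
    for commit in commit_details:
        for changed in commit.get("files", []):
            touches[changed["filename"]] += 1
    return touches


# Hypothetical sample data standing in for real API responses.
sample = [
    {"sha": "a1", "files": [{"filename": "src/App.java"}, {"filename": "README.md"}]},
    {"sha": "b2", "files": [{"filename": "src/App.java"}]},
]

# Print the files from most touched to least touched.
for name, count in count_file_touches(sample).most_common():
    print(count, name)
```

With real data, the same function can be fed the decoded JSON of every commit in the repo, fetched with your token in the request's Authorization header.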

The stats for the number of touches on the files of the GitHub repos can be found here:

Task 2

A repository contains both source files and other files, such as configuration files. Developers spend most of their time changing source files for many reasons, for example, fixing bugs, extending them with new features or refactoring. The script CollectFiles.py collects all files in a repository. Your first task is to adapt the script to collect only the source files. You can find a repo's programming languages at the bottom right of the repo's page on GitHub (some repos are written in more than one programming language).
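One possible way to keep only the source files is to filter by file extension. The sketch below assumes a Java/Kotlin repo; the set SOURCE_EXTENSIONS, the helper is_source_file and the sample paths are hypothetical and should be adjusted to the languages of the repo you are mining.

```python
import os

# Assumption: extensions that count as "source" for the repo being mined.
SOURCE_EXTENSIONS = {".java", ".kt"}


def is_source_file(path, extensions=SOURCE_EXTENSIONS):
    """Return True if the path ends with one of the source extensions."""
    return os.path.splitext(path)[1].lower() in extensions


# Hypothetical file list such as the one CollectFiles.py might produce.
files = ["app/src/Main.java", "build.gradle", "README.md", "lib/Util.KT"]
source_files = [f for f in files if is_source_file(f)]
print(source_files)
```

Filtering on the extension is a simplification: it ignores source files without an extension and treats test code the same as production code.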

  • First, write a script named <'your_firstname'_authorsFileTouches.py> that collects, for each file in the list generated by the adapted CollectFiles.py (source files only), the authors who touched it and the dates on which they did so.
  • Second, write a script that generates a scatterplot (using matplotlib) of weeks vs. files, where the points are shaded according to the author. Each author should have a distinct color. Looking at the scatterplot, one should be able to tell which files are touched many times and by whom. This can help, for example, when a refactoring opportunity is identified and you need to decide which developer should be allocated the task, since that developer has touched the file many times. You can name the script for drawing the scatterplot <'your_firstname'_scatterplot.py>. You can get a hint on how to draw the scatterplot from the linked Stackoverflow post.
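The two steps above can be sketched as follows. This is only an illustration under stated assumptions: the touches are already available as (filename, author, date) tuples, dates use the ISO 8601 form returned by the GitHub API (for example 2020-01-01T10:00:00Z), and weeks are counted from the earliest touch. The names to_points and plot_points, the colormap and the sample data are hypothetical choices, not part of the lab material.

```python
from datetime import datetime


def to_points(touches):
    """Turn (filename, author, iso_date) tuples into (file_index, week, author).

    Files are numbered in order of first appearance; weeks are whole weeks
    elapsed since the earliest touch in the data.
    """
    dates = [datetime.strptime(d, "%Y-%m-%dT%H:%M:%SZ") for _, _, d in touches]
    start = min(dates)
    file_ids = {}
    points = []
    for (filename, author, _), when in zip(touches, dates):
        file_index = file_ids.setdefault(filename, len(file_ids))
        points.append((file_index, (when - start).days // 7, author))
    return points


def plot_points(points, out_path="scatterplot.png"):
    """Draw one scatter series per author, each in its own color."""
    # Imported here so the data preparation works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    authors = sorted({a for _, _, a in points})
    cmap = plt.get_cmap("gist_rainbow")
    fig, ax = plt.subplots()
    for i, author in enumerate(authors):
        xs = [f for f, _, a in points if a == author]
        ys = [w for _, w, a in points if a == author]
        # Spread authors evenly over the colormap so colors stay distinct.
        ax.scatter(xs, ys, color=cmap(i / max(len(authors) - 1, 1)), label=author)
    ax.set_xlabel("file")
    ax.set_ylabel("weeks")
    ax.legend(fontsize="small")
    fig.savefig(out_path)
    plt.close(fig)


# Hypothetical sample data.
touches = [
    ("A.java", "alice", "2020-01-01T10:00:00Z"),
    ("B.java", "bob", "2020-01-09T12:00:00Z"),
    ("A.java", "bob", "2020-01-20T08:00:00Z"),
]
points = to_points(touches)
print(points)
```

Calling plot_points(points) writes the figure to scatterplot.png. Using integer file indices on the x-axis keeps the axis readable when a repo has many files with long paths.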

Example (scottyab/rootbeer)

The repository scottyab/rootbeer has a total of 17 unique source files ('.java'). A total of 33 authors have updated these 17 files and committed their changes (the data points in the graph). The scatterplot below shows the authors' activities over time for the repository scottyab/rootbeer.

Task 3 (Optional)

Write a script that extracts the merged pull requests and the closed (but not merged) pull requests from each of the repositories above. You can read about the structure of pull requests in the linked Pull Requests documentation. Draw one graph covering all five repos above that compares merged pull requests with closed (but not merged) ones.
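As a starting point, the classification step might look like the sketch below. It assumes the pull requests were fetched from the GitHub API endpoint GET /repos/{owner}/{repo}/pulls with state=closed, where each entry carries a "state" field and a "merged_at" timestamp that is null when the pull request was closed without being merged. The function name summarize_pull_requests and the sample data are hypothetical.

```python
def summarize_pull_requests(pulls):
    """Count merged vs. closed-but-not-merged pull requests.

    pulls: iterable of dicts with "state" and "merged_at" keys, as assumed
    for the GitHub API pull request list.
    """
    pulls = list(pulls)
    merged = sum(1 for pr in pulls if pr.get("merged_at"))
    closed_not_merged = sum(
        1 for pr in pulls
        if pr.get("state") == "closed" and not pr.get("merged_at")
    )
    return {"merged": merged, "closed_not_merged": closed_not_merged}


# Hypothetical sample data standing in for real API responses.
sample_pulls = [
    {"number": 1, "state": "closed", "merged_at": "2020-02-01T09:00:00Z"},
    {"number": 2, "state": "closed", "merged_at": None},
    {"number": 3, "state": "closed", "merged_at": "2020-03-05T14:30:00Z"},
]
summary = summarize_pull_requests(sample_pulls)
print(summary)
```

Running this once per repo yields one pair of counts per repo, which can then be drawn side by side, for example as a grouped bar chart.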

Push the updated changes in your local repo to your fork online.