Mining software repositories is related to both data mining and reverse engineering. It focuses on mining software artefacts such as code bases, program states and structural entities for useful information about the characteristics of a system [Xie et al. 2007]. In this lab you will mine software repositories and identify frequently touched files.
Material and Tools
The following tasks will guide you through the lab on mining software repositories. In the lecture slides, we looked at three ways to get data from GitHub: cloning the repos using Git, using GHTorrent, and using the GitHub API. In this lab we shall go through getting GitHub data using the GitHub API.
Task 1
The stats for the number of touches on the files of the GitHub repos can be found here:
Task 2
A repository contains both source files and other files like configuration files. Developers spend most of the time changing source files for many reasons, for example, fixing bugs, extending them with new features or refactoring. The script CollectFiles.py collects all files in a repository. Your first task is to adapt the script to collect only the source files. You can find a repo's programming languages on the bottom right of the repo's page on GitHub (some repos could be written in more than one programming language).
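One possible way to approach the adaptation is to filter file names by extension before they are recorded. The sketch below is not taken from CollectFiles.py; the names SOURCE_EXTENSIONS, is_source_file and filter_source_files are hypothetical, and the extension list is an assumption that should be adjusted to the languages shown on each repo's page.

```python
import os

# Assumed extension list (hypothetical); adjust it to the languages of the repo.
SOURCE_EXTENSIONS = {".java", ".kt", ".py", ".c", ".cpp", ".h", ".js", ".ts"}


def is_source_file(filename, extensions=SOURCE_EXTENSIONS):
    """Return True if the file name ends with one of the given extensions."""
    _, ext = os.path.splitext(filename)
    return ext.lower() in extensions


def filter_source_files(filenames, extensions=SOURCE_EXTENSIONS):
    """Keep only the file names that look like source files."""
    return [name for name in filenames if is_source_file(name, extensions)]


if __name__ == "__main__":
    sample = ["app/src/Main.java", "README.md", "build.gradle", ".gitignore"]
    print(filter_source_files(sample))
```

Passing a smaller set, for example {".java"}, restricts the result to a single language, which may be enough for a repo such as scottyab/rootbeer.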
Example (scottyab/rootbeer)
The repository scottyab/rootbeer has a total of 17 unique source files ('.java'). A total of 33 authors have touched these 17 files (the data points in the graph) by updating them and committing their changes. The scatterplot below shows the authors' activity over time for the repository scottyab/rootbeer.
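A minimal sketch of how the data points for such a scatterplot could be prepared is shown below. It assumes each touch is available as a (filename, author, date) tuple with dates in the "YYYY-MM-DDTHH:MM:SSZ" format; this record shape, the function name scatter_points and the choice of axes (file index against weeks since the first touch) are assumptions made for illustration, not the layout used by the lab scripts.

```python
from datetime import datetime

# Assumed timestamp format for the touch records.
DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def scatter_points(touches):
    """Turn (filename, author, date) touches into (file_index, week, author) points.

    file_index numbers the unique files in sorted order; week counts whole
    weeks since the earliest touch in the input.
    """
    if not touches:
        return []
    dates = [datetime.strptime(date, DATE_FORMAT) for _, _, date in touches]
    first = min(dates)
    unique_files = sorted({name for name, _, _ in touches})
    file_index = {name: i for i, name in enumerate(unique_files)}
    return [
        (file_index[name], (when - first).days // 7, author)
        for (name, author, _), when in zip(touches, dates)
    ]


if __name__ == "__main__":
    sample = [
        ("A.java", "alice", "2020-01-01T00:00:00Z"),
        ("B.java", "bob", "2020-01-15T00:00:00Z"),
        ("A.java", "bob", "2020-01-09T00:00:00Z"),
    ]
    for point in scatter_points(sample):
        print(point)
```

The resulting tuples can be passed to a plotting library of your choice, using the author to pick the colour of each point.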
Task 3 (Optional)
Write a script that extracts the merged pull requests and the closed (but not merged) pull requests from each of the repositories above. You can read about the structure of pull requests here: Pull Requests. Draw one graph covering all five repos above that compares merged pull requests with closed (but not merged) pull requests.
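As a starting point, the distinction between the two groups can be sketched as follows. In the GitHub REST API, a pull request object returned by the list endpoint carries a "state" field and a "merged_at" field, and "merged_at" is null for pull requests that were closed without being merged. The function names and the sample records below are hypothetical; fetching the data from the API and handling pagination are left out.

```python
from collections import Counter


def classify_pull_request(pr):
    """Classify one pull request dict as 'open', 'merged' or 'closed_not_merged'."""
    if pr.get("state") != "closed":
        return "open"
    # A closed pull request was merged only if merged_at is set.
    return "merged" if pr.get("merged_at") else "closed_not_merged"


def count_pull_requests(pull_requests):
    """Count the pull requests of a repository per category."""
    return Counter(classify_pull_request(pr) for pr in pull_requests)


if __name__ == "__main__":
    # Made-up records that mimic the fields used above.
    sample = [
        {"number": 1, "state": "closed", "merged_at": "2020-01-02T10:00:00Z"},
        {"number": 2, "state": "closed", "merged_at": None},
        {"number": 3, "state": "open", "merged_at": None},
    ]
    print(count_pull_requests(sample))
```

Running count_pull_requests once per repository gives the two numbers per repo that the comparison graph needs.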
Push the updated changes in your local repo to your fork online.