Position Paper: Experiences on Clustering High-Dimensional Data Using pbdR
Workshop: The 2017 International Workshop on Software Engineering for High Performance Computing in Computational and Data-Enabled Science and Engineering (SE-CoDeSE 2017)
Abstract: Motivation: Software engineering for HPC environments in general, and for big data in particular, faces a set of unique challenges, including the high complexity of middleware and of computing environments. Tools that make it easier for scientists to use HPC are, therefore, of paramount importance. We provide an experience report on using one such highly effective piece of middleware, pbdR, which allows scientists to use the R programming language without, at least nominally, having to master many layers of HPC infrastructure, such as OpenMPI and ScaLAPACK.
Objective: To evaluate the extent to which middleware helps improve scientist productivity, we use pbdR to solve a real problem that we, as scientists, are investigating. Our big data comes from the commits on GitHub and other project hosting sites, and we are trying to cluster developers based on the text of these commit messages.
Context: We need to be able to identify the developer for every commit and to identify all commits made by a single developer. Developer identifiers in the commits, such as login, email, and name, are often spelled in multiple ways, since that information may come from different version control systems and may depend on which computer is used.
Method: We train a Doc2Vec model where existing credentials are used as a document identifier and then use the resulting 200-dimensional vectors for the 2.3M identifiers to cluster these identifiers so that each cluster represents a specific individual.
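The clustering step in the Method can be illustrated with a minimal sketch. It is not the pbdR-based implementation used in this work: it assumes the identifier vectors are already available, replaces the 200-dimensional Doc2Vec vectors with hypothetical 3-dimensional toy values, and groups identifiers by a simple cosine-similarity threshold (single linkage). The function names, the threshold of 0.9, and the example identifiers are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_identifiers(vectors, threshold=0.9):
    """Group identifiers whose vectors have cosine similarity >= threshold.

    Single-linkage grouping via union-find. All pairs are compared, so this
    is only suitable for toy input, not for millions of identifiers.
    """
    ids = list(vectors)
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for i in ids:
        groups[find(i)].append(i)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical 3-dimensional stand-ins for the 200-dimensional vectors.
toy_vectors = {
    "jsmith <jsmith@example.com>": [0.9, 0.1, 0.0],
    "John Smith <john@example.org>": [0.88, 0.12, 0.01],
    "adoe <adoe@example.com>": [0.0, 0.2, 0.95],
}
clusters = cluster_identifiers(toy_vectors)
```

In this toy input the two spellings of the first developer have nearly parallel vectors and fall into one cluster, while the third identifier stays alone. The all-pairs comparison is quadratic in the number of identifiers, which is one reason a distributed approach is needed at the scale of 2.3M identifiers.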