Author/Presenters
Event Type: Workshop
Tags: Deep Learning, Machine Learning, SIGHPC Workshop
Time: Monday, November 13th, 4:42pm - 5:06pm
Location: 502-503-504
Description: The k-means clustering method is one of the most widely used techniques in big data analytics. In this paper, we explore the ideas of software blocking, asynchronous local optimizations, and simulated-annealing heuristics to improve the performance of k-means clustering. Like most machine learning methods, the performance of k-means clustering depends on two main factors: the per-iteration computing speed and the convergence rate. A straightforward realization of the software-blocking, synchronization-reducing clustering algorithm, however, sporadically converges more slowly than the standard k-means algorithm. To tackle this issue, we design an annealing-enhanced algorithm that introduces stop-condition and annealing-step heuristics to deliver performance as good as or better than the standard k-means algorithm. The enhanced k-means clustering algorithm offers the same clustering quality as standard k-means. Experiments with real-world datasets show that the new parallel implementation is faster than the open-source HPC library Parallel K-Means Data Clustering (e.g., 19% faster on relatively large datasets with 32 CPU cores, and 11% faster on a large dataset with 1,024 CPU cores). Moreover, the extent of the performance improvement is largely determined by the algorithm's actual convergence rate on different datasets.
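
For context, the sketch below shows the standard (Lloyd's) k-means baseline with a simple relative-improvement stop condition, roughly the kind of convergence check that the paper's stop-condition and annealing heuristics refine. It is an illustrative sketch only, not the authors' blocked, synchronization-reducing, annealing-enhanced algorithm; the function name kmeans, the tolerance tol, and the initialization scheme are assumptions made for this example.

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Standard (Lloyd's) k-means with a relative-improvement stop condition.

    Illustrative baseline only; tol, seed, and the initialization are
    assumptions for this sketch, not the paper's algorithm.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    prev_inertia = None
    for _ in range(max_iter):
        # Assignment step: label each point with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        inertia = float((dists[np.arange(len(X)), labels] ** 2).sum())
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
        # Stop condition: relative improvement in total squared distance
        # (inertia) falls below the tolerance tol.
        if prev_inertia is not None and prev_inertia - inertia <= tol * prev_inertia:
            break
        prev_inertia = inertia
    return centroids, labels, inertia

A call such as kmeans(X, k=5) on an (n, d) NumPy array returns the final centroids, the per-point cluster labels, and the final inertia; tightening or loosening tol trades extra iterations (computing speed) against how fully the method converges, which is exactly the trade-off the abstract highlights.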