Machine Learning for Big Data: Integrated, Collaborative, Multi-Technological Solutions to Multi-Objective Problems
Authors: Dr. Sofia Vallecorsa (CERN)
BP
Abstract: This BoF explores the growing interaction between Machine Learning and Big Data. In particular, it focuses on how integrated collaborative platforms, multi-architecture frameworks, and cloud computing enable the solution of multi-objective problems (e.g. multivariate regression, sequence and multi-task learning, multi-label classification).
The session is intended to bring together data scientists and researchers interested in applying Big Data technology to Machine Learning in their respective fields, as well as technology experts, in order to highlight innovative strategies for boosting collaboration and solving more complex problems. The chosen format is a session of talks.
Long Description: As techniques for efficiently collecting massive data at high rates are refined and enhanced, data analytics can no longer rely on traditional tools. We face an ever-growing need to select what is useful, finding the famous “needle in a haystack”, while extracting more value from the rich collected data. Since Machine Learning is about learning from experience in order to derive insights and predictions and to deal with unfamiliar features, applying it to Big Data analytics is a natural evolution. The more data available, the more effective the learning and the more complicated the problems that can be addressed.
Multi-label classification, multivariate regression, sequence learning, structured output prediction, preference learning and multi-task learning are used more and more extensively in fields ranging from science to social sciences to industry: physics, biology, medicine and diagnostics, drug discovery, document categorisation, natural language processing and marketing. For example, Machine Learning can be used to predict what a particle collision looks like in a detector at a high-energy collider such as the Large Hadron Collider at CERN. It can also be used to optimise the reconstruction and imaging of the very detailed sky maps that will be recorded by the future Square Kilometre Array radio telescope.
The availability of HPC multi-architecture frameworks (ranging from large multi-core systems to hardware accelerators like GPUs and FPGAs), coupled with a variety of memory technologies, now makes it possible to better address the complex and massive needs in terms of storage, workflow and efficiency. An efficient way to share models and fast access to data will boost the performance of the ML approach and accelerate the production of results.
This session creates a framework for discussion in which data scientists, experts from different domains and technology experts can share their experience and find common solutions to similar problems. Platform-independent machine learning software and tools such as containers can be used to provide homogeneous software environments across different systems. Interactive frameworks that allow rapid prototype development and testing and ease the connection between model descriptions and data are key, along with straightforward means of visualization. A non-exclusive list of potential topics includes Machine Learning as a service on public clouds, parallel data processing platforms and efficient data access patterns.
A summary of this session will be collected in a short white paper describing the different experiences, needs and solutions, as a further step in strengthening the collaboration among different communities and fostering a more multi-disciplinary approach.
This BoF will consist of a series of talks from:
CERN
SKA
Alan Turing Institute
(more to be announced)
Conference Presentation: pdf