Optimisation of experiment workflows in the Worldwide LHC Computing Grid

Project name

Optimisation of experiment workflows in the Worldwide LHC Computing Grid

Project description

The LHC experiments perform the vast majority of their data processing and analysis on the Worldwide LHC Computing Grid (WLCG), a globally distributed infrastructure with more than 500k cores used to analyse the tens of PB of data collected each year. Profiling the computing infrastructure with respect to the impact of different workloads is crucial for finding the most efficient match between resources and use cases. The current analysis makes it clear that the efficiency is neither perfect nor well understood.

A wealth of information is collected by the communities' monitoring infrastructures. The scale and complexity of this data present an analytics challenge of their own, and so far its full potential has not been exploited. The data covers all areas of the computing activities, such as host monitoring, storage, network and batch system performance, and user-level job monitoring. Extracting useful knowledge from it requires state-of-the-art data analytics tools and processes. The final goal is to gain a deep understanding of what determines the efficiency and how it can be improved.

Elasticsearch is a distributed search and analytics engine that is used at CERN to store and process large amounts of monitoring data for several experiments.
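
As an illustration, monitoring documents can be pulled out of Elasticsearch and loaded into a pandas DataFrame with a few lines of Python. The sketch below assumes the official elasticsearch Python client; the cluster URL, the index name "job-monitoring" and the field names are placeholders, not the actual CERN schema.

    from elasticsearch import Elasticsearch
    import pandas as pd

    # Connect to an Elasticsearch cluster (the URL is a placeholder).
    es = Elasticsearch("http://localhost:9200")

    # Retrieve recent job monitoring documents; index and field names are hypothetical.
    response = es.search(
        index="job-monitoring",
        body={
            "size": 1000,
            "query": {"range": {"timestamp": {"gte": "now-1d"}}},
        },
    )

    # Flatten the hits into a DataFrame for further analysis.
    records = [hit["_source"] for hit in response["hits"]["hits"]]
    df = pd.DataFrame(records)
    print(df.describe())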

It has been observed that differences in data access patterns lead to significantly different utilisation of the resources; however, the concrete causes and the quantitative relations are still poorly understood. Similarly, job failures due to a variety of underlying causes lead to a loss of productivity, without the exact causes and the scale of the different issues being known.
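
As a toy example of how the scale of the different issues could be quantified, job records in a pandas DataFrame can be grouped by an error classification and ranked by the walltime they waste. The data and the column names ('site', 'exit_code', 'walltime') below are invented for illustration and do not reflect the real monitoring schema.

    import pandas as pd

    # Toy job records standing in for the real monitoring data; the column
    # names and values are invented for illustration.
    df = pd.DataFrame({
        "site": ["CERN", "CERN", "FNAL", "RAL", "RAL"],
        "exit_code": [0, 134, 0, 134, 85],          # 0 = success
        "walltime": [7200, 5400, 3600, 4800, 600],  # seconds
    })

    failed = df[df["exit_code"] != 0]

    # Scale of each failure class: number of jobs and walltime lost, in hours.
    summary = (failed.groupby("exit_code")
                     .agg(n_jobs=("exit_code", "size"),
                          lost_hours=("walltime", lambda s: s.sum() / 3600.0))
                     .sort_values("lost_hours", ascending=False))
    print(summary)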

To improve the overall efficiency, we propose to study how performance depends on a variety of variables. Based on these findings, which could be obtained with classical and/or machine-learning-based data analysis techniques, new strategies should be developed. Since the expected gains are of the order of 10-20%, the outcome of this work is of great importance for the experiments and the infrastructure providers.
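
As a purely illustrative first step, such a dependency could be quantified with a classical fit, for example relating the CPU efficiency of jobs to the fraction of input data read over the network. The sketch below uses synthetic numbers and SciPy's linear regression; the variable names and the assumed relation are placeholders, not measured WLCG behaviour.

    import numpy as np
    from scipy import stats

    # Synthetic sample: CPU efficiency vs. fraction of input read remotely
    # (both between 0 and 1); the values are made up for illustration.
    rng = np.random.default_rng(0)
    remote_fraction = rng.uniform(0.0, 1.0, size=200)
    cpu_efficiency = 0.9 - 0.3 * remote_fraction + rng.normal(0.0, 0.05, size=200)

    # Classical linear regression as a first quantitative handle on the dependency.
    fit = stats.linregress(remote_fraction, cpu_efficiency)
    print(f"slope = {fit.slope:.3f}, r^2 = {fit.rvalue ** 2:.3f}")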

The work in this project will be done in close collaboration with experts from CERN IT and the LHC experiments. 

Required skills

Comfortable with Python programming. Some basic notions of statistics and probability.

Learning experience

Large-scale data analytics with real-world data; understanding of different approaches to handling data processing at the PB scale in a complex distributed environment; the Python data analysis ecosystem (NumPy, pandas, SciPy, matplotlib, Jupyter); direct interaction with members of the LHC collaborations and an insight into their computing systems.

Project duration

3 to 6 months

Project area

Data Analytics

Contact for further details

Andrea.Sciaba@cern.ch

References

  • WLCG: http://wlcg.web.cern.ch/
  • ElasticSearch: https://www.elastic.co/

CERN group

IT-DI

Status

Submitted

Submitted by sciaba on Monday, December 12, 2016 - 15:05.