Monitoring data transfers and access activities in WLCG with Hadoop, Spark and the Lambda architecture.

Project name

Monitoring data transfers and access activities in WLCG with Hadoop, Spark and the Lambda architecture.

Project description

Monitoring the WLCG infrastructure requires gathering and analyzing a high volume of heterogeneous data (e.g. data transfers, job monitoring, site tests) coming from different services and experiment-specific frameworks, in order to provide a uniform and flexible interface for scientists and sites. The current architecture, in which relational database systems store, process and serve monitoring data, has limitations in coping with the foreseen growth in the volume (e.g. higher LHC luminosity) and the variety (e.g. new data-transfer protocols and new resource types, such as cloud computing) of WLCG monitoring events.

The goal of this project is to build a new scalable data store and analytics platform, in collaboration with the Support for Distributed Computing (SDC) group in the CERN IT department. The platform leverages a stack of technologies, each targeting a specific aspect of large-scale distributed data processing (an approach commonly referred to as the lambda architecture).

The project can be decomposed into three main objectives and areas of work. The first objective is the batch layer, which stores a constantly growing dataset and provides the ability to compute arbitrary functions on it. The second objective is the serving layer, which stores the batch-processed views and uses indexing techniques to make them efficiently queryable. The third objective is the real-time processing layer, which performs analytics on fresh data with incremental algorithms to compensate for batch-processing latency. Moreover, the real-time analytics layer can be used as input for active reaction, adopting a classical pattern-matching approach to promptly detect errors and failures in the stream of monitoring events.
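
The interplay of the three layers can be illustrated with a small sketch. It is a toy, single-process model in plain Python, not the project's implementation (which targets Hadoop, Flume and Spark). The class name `LambdaSketch`, the event fields `site` and `status`, the per-site transfer count used as the view, and the "N consecutive failures" alert pattern are all assumptions made for illustration.

```python
from collections import Counter, deque


class LambdaSketch:
    """Toy, single-process model of the three layers (illustrative only)."""

    def __init__(self, window=3):
        self.master = []                # batch layer: append-only event log
        self.batch_view = Counter()     # serving layer: view keyed by site
        self.realtime_view = Counter()  # real-time layer: incremental counts
        self.window = window            # consecutive failures that trigger an alert
        self.recent = {}                # site -> statuses of its latest events

    def ingest(self, event):
        """Store the event, update the incremental view, check the failure pattern."""
        self.master.append(event)
        site = event["site"]
        self.realtime_view[site] += 1
        statuses = self.recent.setdefault(site, deque(maxlen=self.window))
        statuses.append(event["status"])
        # Assumed pattern: the last `window` events from this site all failed.
        if len(statuses) == self.window and all(s == "FAILED" for s in statuses):
            return f"ALERT: {self.window} consecutive failed transfers at {site}"
        return None

    def run_batch(self):
        """Recompute the view from the full dataset, then drop absorbed real-time state."""
        self.batch_view = Counter(e["site"] for e in self.master)
        self.realtime_view.clear()

    def query(self, site):
        """Answer a query by merging the batch view with the real-time view."""
        return self.batch_view[site] + self.realtime_view[site]


# Usage with hypothetical events.
store = LambdaSketch(window=2)
store.ingest({"site": "SITE_A", "status": "OK"})
store.ingest({"site": "SITE_A", "status": "FAILED"})
print(store.ingest({"site": "SITE_A", "status": "FAILED"}))  # alert message
print(store.query("SITE_A"))  # count served from the real-time view only
store.run_batch()
print(store.query("SITE_A"))  # same count, now served from the batch view
```

The point of the design is that a query returns the same answer before and after a batch run: the real-time view only covers events the batch layer has not yet processed, and it is discarded once the batch view catches up.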

Required skills

Knowledge of distributed systems, Java or Scala programming

Learning experience

Hadoop, Flume, Spark, Spark Streaming

Project duration

1.5 years

Project area

Monitoring of the distributed infrastructure

Contact for further details

luca.magnoni@cern.ch

CERN group

IT-SDC

Status

Accomplished
Student Information
Student name: Uthayanath Suthakar
University: Brunel University - Uxbridge
CERN supervisor: Luca Magnoni
Thesis type: PhD
Project started: 21 Apr 2014
Project finished: 29 Nov 2015
Defence status: not scheduled yet

Reference to the project tracker
