Submitted projects in full
Performance optimization in a High Throughput Computing environment
Project description:
Profiling computing resources with respect to WLCG experiment workloads is crucial for selecting the most effective resources and optimising their usage.
There is a rich amount of data collected by the CERN and WLCG monitoring infrastructures just waiting to be turned into useful information. This data covers all areas of the computing activity, such as (real and/or virtual) machine monitoring, storage, network, batch system performance and experiment job monitoring.
The data gathered by these systems has great intrinsic value, but the information needs to be extracted and understood through a predictive data analytics process. The final purpose of this process is to support decisions and improve the efficiency and the reliability of the related services.
For instance, with the adoption of remote data access it becomes mandatory to understand the impact of this approach on job efficiency. Here the interplay of network and CPU effects, as well as the resource usage by multiple VOs, needs to be studied and understood. An interesting topic of study is the performance of job processing at the WLCG distributed T0 center, which is physically split between Computer Centers in Meyrin and Wigner. The goal of the project will be to understand the difference in performance and to suggest possible optimizations.
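As an illustration of the kind of comparison described above, the following minimal sketch computes the mean CPU efficiency (CPU time divided by wall-clock time) per site. The job records and the function name are hypothetical and only stand in for real monitoring data, whose schema is not specified here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical job records: (site, cpu_seconds, wallclock_seconds).
JOBS = [
    ("Meyrin", 3400.0, 4000.0),
    ("Meyrin", 1800.0, 2000.0),
    ("Wigner", 3000.0, 4000.0),
    ("Wigner", 1500.0, 2000.0),
]


def cpu_efficiency_by_site(jobs):
    """Return the mean CPU efficiency (CPU time / wall-clock time) per site."""
    per_site = defaultdict(list)
    for site, cpu_s, wall_s in jobs:
        if wall_s > 0:  # skip malformed records
            per_site[site].append(cpu_s / wall_s)
    return {site: mean(values) for site, values in per_site.items()}


efficiency = cpu_efficiency_by_site(JOBS)
for site, value in sorted(efficiency.items()):
    print(f"{site}: {value:.3f}")
```

A real study would also have to control for job type, data location and VO before attributing a difference to the site itself.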
The work will be conducted in close contact with the experts (CERN analytics working group, system managers, developers) and will provide a deep insight into the computing infrastructure of a WLCG datacenter, its design, technical requirements and operational challenges.
Project duration: 6 to 12 months
Contact for further details: email@example.com
Learning experience: Using analytics approaches already consolidated in other scientific domains, such as physics and finance, the candidate will learn and adopt techniques for data mining (trend analysis, result visualization, forecasting and predictive modeling) using cutting-edge tools such as the analytics Python ecosystem (IPython, numpy, matplotlib, scipy, pandas, scikit-learn, etc.).
Required skills: Python, matplotlib. Some experience in data analysis and statistics would be an advantage.
Project area: Data Analytics
CERN group: IT-SDC
Status: Submitted
File Transfer Service (FTS) extensions
Project description:
The File Transfer Service (FTS) manages the global distribution of LHC data, moving multiple petabytes per month during a run and underpinning the whole data lifecycle. Join the FTS team in their development of this critical service. Possible projects include
- authorised proxy sharing: allowing a production service to delegate a proxy and authorising others to use it
- incorporation of support for new types of endpoint, for example cloud or archival storage
Project duration: From 3 months, depending on task selected
Contact for further details: firstname.lastname@example.org
Learning experience: This project offers the chance to become involved with one of the critical data management systems used in computing for LHC and
Required skills: C++/Linux, Python
Project area: Data Management
CERN group: IT-SDC
Status: Submitted
Dynamic storage federations
Project description:
The group runs a project whose goal is the dynamic federation of HTTP based storage systems, allowing a set of globally distributed resources to be integrated and appear via a single entry point. The task is to work on the development of this project (“dynafed”), implementing functional and performance extensions, in particular
- Redirection monitoring, to allow the logging of federator behaviour for real-time monitoring and subsequent analytics
- Metadata integration, beginning with the incorporation of space usage information, allowing the federator to expose grid-wide storage metrics
- An endpoint status/management subsystem. The basic feature would be an interface that publishes endpoint status (up/down/degraded). Management functions could also be incorporated, including ways to add/enable/disable endpoints without having to restart the service.
- Semantic enhancements to the embedded rule-based authorization implementation, including turning the authorization subsystem into a pluggable authorization manager.
- Deployment tests and development with other Apache security plugins, to natively support identity federations such as the CERN SSO, Facebook, Google and others. This may benefit from the authorization work in the previous point.
- Integration with experiment catalogues to benefit from available metadata and replica placement information.
Project duration: From 3 to 9 months depending on the selected task
Contact for further details: email@example.com
Learning experience: This project offers experience in how advanced, distributed storage systems are being used to handle the peta-scale data requirem
Required skills: C++/Linux
Project area: Data Management
CERN group: IT-SDC
Status: Submitted
The CERN volunteer computing platform
Project description:
CERN-IT is developing a volunteer computing solution intended to be a common platform for the LHC experiments’ activities in this area and which should help to maximise the number of cycles they can acquire. The task is to accompany this project through its initial prototyping, work on all problems discovered and help to guide it to the level of maturity required for production. A major component of the system is based on the storage federation technology of the group (“dynafed”) which mediates data transfer between the trusted grid infrastructure and the untrusted volunteer domain.
Project duration: 3 months
Contact for further details: firstname.lastname@example.org
Learning experience: The project will give a chance to work on enabling a potentially large computing resource for the LHC experiments, and will give
Required skills: Some experience with C++/Linux, Virtualisation technology, HTTP would be an advantage
Project area: Data Management
CERN group: IT-SDC
Status: Submitted
QA in distributed cloud architecture: injection-fault testing
Project description:
Clients of the sync&share system (CERNBOX) are particularly exposed to "operational failures" due to the heterogeneity of hardware, OS and network environments.
The sync&share system operates in a very heterogeneous network environment: from the fast, reliable network inside the computing center to unreliable, high-latency ad-hoc connections, such as those in airports.
Windows filesystems have substantially different semantics (e.g. locking) from Unix filesystems, and these differences affect the synchronization process.
The goal of the R&D is to analyze the environment and identify the relevant classes of failures, in order to provide a reproducible framework for injecting faults at the system level to test client-server data transmission:
* network slowdown or packet loss
* local disk failure
* checksum errors
* failed software upgrades
The work is supported by real monitoring and logging data: failure patterns in an existing service (CERNBOX).
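To make the idea of reproducible fault injection concrete, here is a minimal sketch in which a wrapper around a transport drops or corrupts chunks at configurable rates, while the transfer loop retries until each chunk's checksum matches. The class and function names are hypothetical and are not part of CERNBOX or smashbox; the seeded random generator is what allows a failing run to be replayed exactly.

```python
import hashlib
import random


class FaultyChannel:
    """Hypothetical transport wrapper that drops or corrupts chunks at given rates."""

    def __init__(self, drop_rate=0.0, corrupt_rate=0.0, seed=0):
        self.drop_rate = drop_rate
        self.corrupt_rate = corrupt_rate
        self.rng = random.Random(seed)  # seeded so a failing run can be replayed

    def send(self, chunk: bytes):
        """Return the chunk as received, or None if it was 'lost'."""
        if self.rng.random() < self.drop_rate:
            return None  # simulated packet loss
        if chunk and self.rng.random() < self.corrupt_rate:
            i = self.rng.randrange(len(chunk))
            # Flip every bit of one byte to simulate a checksum error.
            return chunk[:i] + bytes([chunk[i] ^ 0xFF]) + chunk[i + 1:]
        return chunk


def transfer(data: bytes, channel, chunk_size=4, max_retries=50):
    """Send data chunk by chunk, retrying until each chunk's checksum matches."""
    received = bytearray()
    retries = 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        expected = hashlib.sha256(chunk).digest()
        for _ in range(max_retries):
            got = channel.send(chunk)
            if got is not None and hashlib.sha256(got).digest() == expected:
                received += got
                break
            retries += 1
        else:
            raise IOError("chunk could not be delivered")
    return bytes(received), retries


payload = b"hello CERNBOX"
clean_out, clean_retries = transfer(payload, FaultyChannel())
faulty_out, faulty_retries = transfer(payload, FaultyChannel(0.3, 0.3, seed=1))
```

Faults such as local disk failure or failed software upgrades would need injection at the operating-system level rather than in an in-process wrapper like this one.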
QA in distributed cloud architecture: evolution of smashbox framework
Project description:
Cloud synchronization and sharing is an evolving area, with innovative services being built on top of different platforms. CERNBOX is a service run at CERN to provide both synchronisation services (based on the OwnCloud software) and high-performance data access and sharing (based on EOS, the CERN disk storage system for large-scale physics data analysis).
The Smashbox framework (https://github.com/cernbox/smashbox) is successfully used on Linux clients to test OwnCloud/CERNBOX installations. Plans to extend it require porting it to non-Linux platforms:
* Smashbox port to Windows platforms
* Smashbox port to Android
* Smashbox port to iOS
* Smashbox orchestration (concurrent execution across platforms)
Project duration: 3-12 months depending on the agreed scope
Contact for further details: email@example.com
Learning experience: Testing, distributed data management, cloud storage
Required skills: Languages: Python. Operating systems: at least one among Windows, iOS, Android
Project area: Data Management
Status: Submitted
Adding Webhooks, or similar, support to DPM
Project description:
Cloud storage services such as Dropbox and Google Drive implement an API that allows an authorized party to be notified via callbacks (Webhooks in Dropbox, Push Notifications in Google Drive) when some event occurs in the storage, for instance when a file is uploaded, modified or destroyed.
The objective for this project is to study and implement a similar mechanism for DPM, so external software components can get notified about changes with no need for polling.
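The push model described above can be sketched as a small subscriber registry: clients register a callback URL for an event type, and the storage side sends a JSON notification to each registered URL when the event happens. This is only an illustration under assumptions; the class, the event names and the payload fields are hypothetical and do not reflect any existing DPM or dmlite interface. Delivery is injected as a function so the sketch runs without a network, whereas a real service would issue an HTTP POST.

```python
import json
from collections import defaultdict


class WebhookRegistry:
    """Hypothetical registry mapping event types to subscriber callback URLs."""

    def __init__(self, deliver):
        # deliver(url, body) is injected; a real service would HTTP POST the body.
        self._deliver = deliver
        self._subscribers = defaultdict(set)

    def subscribe(self, event_type, url):
        self._subscribers[event_type].add(url)

    def notify(self, event_type, path):
        """Push one JSON notification to every subscriber of event_type."""
        body = json.dumps({"event": event_type, "path": path}, sort_keys=True)
        for url in sorted(self._subscribers[event_type]):
            self._deliver(url, body)


sent = []
registry = WebhookRegistry(deliver=lambda url, body: sent.append((url, body)))
registry.subscribe("file_uploaded", "https://example.org/hook")
registry.notify("file_uploaded", "/dpm/example/file.root")
registry.notify("file_deleted", "/dpm/example/old.root")  # no subscriber, nothing sent
```

Because the subscriber is called only when something changes, it never has to poll the storage for updates.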
If time allows, a simple proof-of-concept client could be provided (e.g. a web board or mail notifications).
Project duration: 3 months
Contact for further details: firstname.lastname@example.org
Learning experience: With this project the student will be exposed to a wide range of technologies, from insight into the DPM internals up to web technologies. The student will also gain experience in system integration, deciding which blocks to use and how they fit together in order to reach the final goal.
Required skills: C++, Linux, Web and HTTP
Project area: Data Management
CERN group: IT-SDC
Status: Submitted
Distributed storage systems for big data
Project description:
The group maintains a framework called dmlite which is used to integrate various types of storage with different protocol frontends. It is the basis of a number of the group’s products such as the Disk Pool Manager (DPM), a grid storage system which holds over 50PB of storage in the global infrastructure.
DPM/dmlite extensions
The task is to contribute to the dmlite project by working on functional extensions to the framework. Example projects include
- Exposing system data through a “procfs” style plugin
- Incorporation of new AA mechanisms, e.g. OAuth
- Creation of a web admin interface
- Work on draining and file placement within the system
Help to realise the group's vision of a “dmliteSE” by working on the gradual retirement of legacy daemons within the DPM system. In this context, tackle the modernisation of pool management and file placement, and the incorporation of new resource types (e.g. cluster file systems) into the system. Complete the functional development required to allow operation of a disk storage system purely through standard protocols.
Project duration: 3 to 9 months depending on the selected task
Contact for further details: email@example.com
Learning experience: This project offers the chance to become involved with one of the storage systems used in computing for LHC and will give an opp
Required skills: C++/Linux
Project area: Data Management
CERN group: IT-SDC
Status: Submitted
The potential of HTTP proxy caches for LHC computing
Project description:
Managing storage is one of the major contributors to operational costs on the LHC’s grid infrastructure (WLCG). The task is to design and prototype an HTTP proxy cache system, built using standard components, intended to allow pure unmanaged cache storage at a grid site or to accelerate data access in cloud environments. This project could reduce the costs of running the LHC’s grid infrastructure by removing storage management overheads at smaller sites and by improving the efficiency of cloud computing resources.
Project duration: 6 months
Contact for further details: firstname.lastname@example.org
Learning experience: This project offers the opportunity to understand how advanced, peta-scale storage systems work and to get to grips with the tec
Required skills: Some experience with System integration, HTTP, Linux, Python would be an advantage
Project area: Data Management
CERN group: IT-SDC
Status: Submitted
Using data analytics for WLCG data transfer optimization
Project description:
The overall success of LHC data processing depends heavily on the stable, reliable and fast data distribution performed by the WLCG File Transfer Service (FTS). FTS transfers around 15 PB of data each month, representing millions of files per day. The efficient functioning of this service is crucial for successful exploitation of the LHC data. The large scale of the transfer activity and the shared nature of the LHC computing infrastructure, which is used by several virtual organizations, create a challenge for the FTS service.
The project proposes the exploration of the FTS historical monitoring data with the aim of improving the service efficiency. Data analysis should consider all kinds of transfer routes, protocols, and experiments’ data transfer workflows with various FTS configurations. The goal of the project is to help the FTS3 infrastructure sustain higher traffic while optimizing the resource usage and reducing data transfer latencies. This includes creating a data analytics platform for FTS performance analysis and predictions.
Project duration: 9 months
Contact for further details: email@example.com
Learning experience: The project offers an opportunity to contribute to the evolution of the WLCG data transfer service by taking part in the design
Required skills: Some experience with Python, SQL and basic knowledge of the TCP/IP protocol would be an advantage
Project area: Data Management
CERN group: IT-SDC
Status: Submitted
Cloud data analysis
Project description:
Cloud synchronization and sharing is a promising area for the preparation of powerful transformative data services.
The goal of this project is to prepare CERNBOX to be used in connection with heavy-duty activities (large-scale batch processing) on the current LXBATCH infrastructure (LSF) and on its future evolution (HT-Condor): physicists can let their data move across their private workstations (such as a private laptop) while the bulk of the data is accessed directly from the EOS infrastructure. At the same time, users can control the progress of their activity from mobile clients (such as a smartphone), via optimised client applications or standard browsers.
The student will participate in the preparation and validation of these use cases. The student will participate in the deployment of the necessary infrastructure (EOS Fuse access from interactive and batch services), support the alpha users (physicists) and extend the current testing and validation system to these new use cases and to new platforms (acceptance tests, in connection with other sites running CERNBOX, and monitoring, using the CERN monitoring infrastructure).
Advanced Notifications for Network Incidents
Project description:
One of the main challenges in LHCOPN/LHCONE networking is network diagnostics and advanced notification of the issues seen in the network. Currently, most issues are visible only to the applications and need to be debugged after the incident and the performance degradation have already occurred. This is primarily due to the underlying complexity of the (multi-domain) WLCG network and the difficulty of understanding the state of the network and how it changes over time. This project will aim to use current open-source event processing systems (such as Spark/Hadoop) to automate the detection and location of network problems using the existing streams. The project will be done in collaboration with the NSF-funded PUNDIT.
The project will build on the standard WLCG perfSONAR network measurement infrastructure and will aim to gather and analyze complex real-world network topologies and their corresponding network metrics to identify possible signatures of network problems. It will provide the experiments' operations teams with a real-time view of the diagnosed issues, together with a list of existing downtimes from the network providers.
Project duration: 12 months
Contact for further details: Marian.Babik@cern.ch
Learning experience: The student will acquire practical experience in the design and development of an advanced notification platform based on network latencies, paths and throughputs
Required skills: TCP/IP networking, Java/Scala experience (Spark/Hadoop)
Project area: Monitoring of the distributed infrastructure
CERN group: IT/CM
Status: Submitted
e-learning - video production and Academic Training video archive promotion
Project description:
The Academic Training (AT) video archive in CDS contains a wealth of knowledge that we could promote on YouTube as part of CERN's mission around Education. To prepare:
- Check other such sites on the web, e.g. NASA, Fermilab, Argonne, ESA, EPFL, UniGe, Google, etc., and also some sites of famous art institutions, and write down what we can learn from the best ones. This is now done. The CERN communications team will make a teaser of a few seconds to place at the entry of the future YouTube channel. It can be inspired by the introduction to TEDx talks, where keywords fly around to music to introduce the topic.
- CERN Academic Training Committee members and lecture series' sponsors to select 'best-of' past series in CDS, classify them per discipline domain and equip them with keywords that will help web searches in the future.
- Prepare the script (in the CERN IT CDA-IC section) that will merge the channels (slides mostly with temporary intersection of the lecturer's face) so that the existing CDS records can be uploaded to YouTube. In the IT e-learning project we documented the tool to use for that (ffmpeg). This is not yet available in the Audiovisual infrastructure.
- Make CDS video playlists per domain so they can be fed into YouTube as such. This is not yet available in CDS but is planned for the next release, later in 2017.
- Propose content for these corporate slides (with help from the CERN designer).
- Use an automated tool to wrap existing CDS video records in 'corporate' slides (start/end). This will be made available by the Audiovisual infrastructure in autumn 2017.
- Publish existing CDS records from the AT lectures' category in a new, dedicated CERN AcademicTraining (final name to be decided) YouTube channel, inspired by relevant examples from other organisations, as concluded in point 1 above.
- Participate in the making of a short (1 minute) clip, by the CERN studio experts, that announces the series to attract audience interest.
- Document patterns that can lead to a process automation for future publishing of other CDS educational collections.
Project initiator/coordinator: Maria Dimou / Academic Training Committee chairperson and IT e-learning project leader.
Information on the collaborating unit within Geneva University:
Director: Mireille Bétrancourt
Professor: Daniel K. Schneider
This activity is related to the master programme MALTT http://tecfa.unige.ch/maltt
References: A COAS request in this index page with request date 07/11/2016.
Project duration: 6 months
Contact for further details: Maria Dimou
Learning experience: This project requires both pedagogical and technical skills. The student will work as an editor and advisor, with the CERN Communications team and the Academic Training sponsors, to emphasise the interesting points while respecting the historical content. The student will then need technical skills to do the video editing. Experience from interactions with users and educational material content owners, as well as the documentation and presentation of the results, will be gained. Formal notification techniques of the conclusions drawn from data patterns observed in the video parameters of our archived AT lectures.
Required skills: Good video and text editing. Modern education and documentation management knowledge.
Project area: Learning
CERN group: IT-DI
Status: Submitted
Optimisation of experiment workflows in the Worldwide LHC Computing Grid
Project description:
The LHC experiments perform the vast majority of the data processing and analysis on the Worldwide LHC Computing Grid (WLCG), which provides a globally distributed infrastructure with more than 500k cores to analyse the tens of PB of data collected each year. Profiling of the computing infrastructure with respect to the impact of different workloads is a crucial factor to find the most efficient match between resources and use cases. From the current analysis it is clear that the efficiency is neither perfect nor well understood.
There is a rich amount of information collected by the communities' monitoring infrastructures. The scale and complexity of this data presents an analytics challenge on its own. So far the full potential hasn't been exploited. This data covers all the areas of the computing activities such as host monitoring, storage, network, batch system performance, user level job monitoring. Extracting useful knowledge from this data requires the use of state of the art data analytics tools and processes. The final purpose is to gain deep understanding of what determines the efficiency and how it can be improved.
ElasticSearch is a distributed search and analytics engine that is used at CERN to store and process large amounts of monitoring data for several experiments.
It has been noted that differences in data access patterns lead to significantly different utilisation of the resources. However, the concrete causes and quantitative relations are still poorly understood. In the same way, job failures due to a variety of underlying causes lead to loss of productivity, while the exact causes and the concrete scale of the different issues remain unknown.
To be able to improve the overall efficiency we suggest studying the dependency of the performance on a variety of variables. Based on these findings, which could be obtained by classical and/or machine learning based data analysis techniques, new strategies should be developed. Since the expected gains are on the order of 10-20%, the outcome of this work is of great importance for the experiments and the infrastructure providers.
The work in this project will be done in close collaboration with experts from CERN IT and the LHC experiments.
Project duration: 3 to 6 months
Contact for further details: Andrea.Sciaba@cern.ch
Learning experience: Large scale data analytics with real world data, understanding of different approaches to handle the processing of data at the PByte scale in a complex distributed environment. Python data analysis ecosystem (NumPy, pandas, SciPy, matplotlib, Jupyter). Direct interaction with members of the LHC collaborations and an insight into their computing systems.
Required skills: Comfortable with Python programming. Some basic notion of statistics and probability.
Project area: Data Analytics
CERN group: IT-DI
Status: Submitted
e-learning - IT Collaboration, Devices & Applications - document with the user in mind
Project description:
The CERN IT Collaboration, Devices & Applications (CDA) group in general and the Integrated Collaboration (IC) section host services used widely at CERN and beyond. Examples from IC:
- Conference room equipment, configuration, documentation and support
- audiovisual services' support (webcast and recording)
- video conferencing (vidyo service)
- email service management and support
- IP telephony (e.g. skype for business)
- Login accounts, the CERN Single Sign-On (SSO) service, Authentication/Authorisation/Digital Certificates
In this project, we wish to understand the best way to document and promote these services with the user in mind, i.e. not considering service organisational, administrative or technical boundaries.
After getting the service managers' input, navigating through existing leaflets or web documentation and identifying out-of-date or obsolete parts in it, we shall come up with a proposal on:
- a user-friendly information organisation, based on user stories, rather than services.
- suggestions of quick How-Tos as the users would expect them, rather than sequentially describing each service or group of services. The purpose here is to make our services more understandable and accessible.
- tips for best promoting existing documentation.
- a recommendation on how to best get feedback from users on our services or on the documentation itself...
There is already a wealth of information being assembled in a group internal page to facilitate the investigation and enhance the student's technical experience.
References:
Related project: https://twiki.cern.ch/ELearning
Collaborating institute: https://www.hesge.ch/heg/en/core-programmes/bachelors-science/information-studies
Project duration: 6 months (Jan-June 2017), 1 day/week, often from home
Contact for further details: Maria Dimou
Learning experience: The variety of the services, their wide use and legacy documentation, or absence of it, offer opportunities to an external evaluator to come up with technical views based on functionality, especially if he/she has a special interest in Information Sciences and Documentation.
Required skills: Understanding Information Organisation, Usability and Processes. Good notion of A to Z service workflows.
Project area: Learning
CERN group: IT-CDA
Status: Submitted
Analysis of the I/O performance of LHC computing jobs at the CERN computing centre
Project description:
The LHC experiments execute a significant fraction of their data reconstruction, simulation and analysis on the CERN computing batch resources. One of the most important features of these data processing jobs is their I/O pattern in accessing the local storage system, EOS, which is based on the xrootd protocol. In fact, the way experiment applications access the data can have a considerable impact on how efficiently the computing, storage and network resources are used, and has important implications on the optimisation and size of these resources.
A promising approach is to study the logs of the storage system to identify and characterise the job I/O, which is strongly dependent on the type of jobs (simulation, digitisation, reconstruction, etc.). A direct link between the information in the storage logs and the information in the monitoring systems of the experiments (which contain detailed information about the jobs) is possible, as it can be derived from a cross analysis of the aforementioned data sources together with information from the CERN batch systems. The goal of this project is to study this connection, use it to relate I/O storage patterns to experiment job types, look for significant variations within a given job type, identify important sources of inefficiency and describe a simple model for the computer centre (batch nodes, network, disk servers) that would increase the efficiency of the resource utilisation.
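The cross analysis described above amounts to joining storage records to job records on a shared identifier and aggregating per job type. The sketch below shows this with a single derived metric, the mean number of bytes per read operation. The record layouts, the job identifiers and the function name are hypothetical; real EOS logs and experiment monitoring data will look different and may need the batch system information to establish the link.

```python
from collections import defaultdict

# Hypothetical, simplified records; real log formats will differ.
STORAGE_LOG = [  # (job_id, bytes_read, read_operations)
    ("j1", 4_000_000, 40),
    ("j2", 1_000_000, 1000),
    ("j3", 6_000_000, 60),
    ("j9", 500_000, 5),  # no matching job record
]
JOB_TYPES = {"j1": "reconstruction", "j2": "analysis", "j3": "reconstruction"}


def io_profile_by_job_type(storage_log, job_types):
    """Join storage records to job types and return mean bytes per read for each type."""
    totals = defaultdict(lambda: [0, 0])  # job type -> [bytes, reads]
    for job_id, nbytes, nreads in storage_log:
        job_type = job_types.get(job_id)
        if job_type is None:
            continue  # unmatched records are dropped in this sketch
        totals[job_type][0] += nbytes
        totals[job_type][1] += nreads
    return {t: b / r for t, (b, r) in totals.items() if r > 0}


profile = io_profile_by_job_type(STORAGE_LOG, JOB_TYPES)
for job_type, bytes_per_read in sorted(profile.items()):
    print(f"{job_type}: {bytes_per_read:.0f} bytes per read")
```

In a real analysis the fraction of unmatched records would itself be worth reporting, since it indicates how reliable the link between the two data sources is.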
In case inefficiencies are detected that could be alleviated by changes in the way experiments run their jobs, this information should be passed to the experiments.
The analysis can be initially based on the jobs of a single large LHC experiment (ATLAS or CMS) and extended to other experiments if time allows.
Project duration: 3 to 6 months
Contact for further details: Andrea.Sciaba@cern.ch
Learning experience: Large scale data analytics with real world data. Python data analysis ecosystem (NumPy, pandas, SciPy, matplotlib, Jupyter). Direct interaction with members of the LHC collaborations and an insight into their computing systems. Complex storage systems in a large data centre environment.
Required skills: Python programming. Familiarity with data analytics techniques and tools is desirable.
Project area: Data Analytics
Status: Submitted