Correct automatically transcribed videos for input to MLLP
Project name
Correct automatically transcribed videos for input to MLLP - Machine Learning Language Processing tool
Project description
The CERN IT Collaboration, Devices & Applications group, in particular its Digital Repositories (IT CDA/DR) and Integrated Collaboration (IT CDA/IC) sections, runs many highly visible and popular services. These enable researchers and institutions to share and preserve their research data, software and publications, as well as to meet, present and record lectures, projects, plans and decisions, both of academic content and of very large experiment collaborations.
The CERN Document Server (CDS) is the official document repository for the laboratory and annually serves around 2 million visitors. Thousands of videos are recorded at CERN via the CDA/IC recording and transcoding infrastructure. They are uploaded and made viewable via CDS or the more recent videos portal.
We need to equip all CERN-made videos with subtitles.
This project is about post-processing and translating 6 hours of transcribed videos from the LHCP event, chosen from within the CERN videos selected for "teaching" the MLLP tool, in order to extend its "vocabulary". The quality of the existing transcription is often low due to environment noise. Some of the subtitles fixed by the speakers contain the intended text rather than the actually spoken text. The aim also includes evaluating and fixing the automatic translation from English into French. Guidelines and related links are given in the References section below.
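Most of the correction work happens on subtitle files. As an illustration of the kind of scripting the project involves, the sketch below shows one possible way to batch-clean an automatically generated .srt file before manual review; the file names, the filler-word list and the cleanup rules are assumptions made for this example, not MLLP or project requirements.

```python
"""Minimal sketch: pre-clean an auto-generated .srt subtitle file before manual review.
File names, filler words and cleanup rules are illustrative assumptions only."""
import re
from pathlib import Path

# Hypothetical filler tokens that automatic transcription of noisy lecture
# audio tends to produce; a real list would come from reviewing the LHCP talks.
FILLERS = {"um", "uh", "er", "hmm"}

# One SRT cue: index line, timestamp line, then the cue text up to a blank line.
BLOCK_RE = re.compile(
    r"(\d+)\s*\n"
    r"(\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3})\s*\n"
    r"(.*?)(?:\n\n|\Z)",
    re.DOTALL,
)

def clean_text(text: str) -> str:
    """Drop filler words and collapse whitespace; timestamps are left untouched."""
    words = [w for w in text.split() if w.lower().strip(".,") not in FILLERS]
    return " ".join(words)

def clean_srt(src: Path, dst: Path) -> None:
    """Read an SRT file, clean each cue's text and write the result for human review."""
    raw = src.read_text(encoding="utf-8")
    blocks = [
        f"{index}\n{timing}\n{clean_text(text)}\n"
        for index, timing, text in BLOCK_RE.findall(raw)
    ]
    dst.write_text("\n".join(blocks), encoding="utf-8")

if __name__ == "__main__":
    # Example input/output names are placeholders.
    clean_srt(Path("lhcp_talk_auto.srt"), Path("lhcp_talk_for_review.srt"))
```

The script is only a starting point: the human reviewer still checks every cue against the audio, which is where the "patience" listed under required skills comes in.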
Output
Working document https://codimd.web.cern.ch/s/AxIt_f2PQ
Required skills
Scripting, documenting in Markdown, interest in science lectures, patience.
Learning experience
The project will give the student the opportunity to work with a leading ML tool for transcription and be exposed to its algorithms and its development team.
Project duration
1-3 months
Project area
Data Management, Data Analytics, Learning
Contact for further details
Maria Dimou
References
- Project notes https://codimd.web.cern.ch/rAX3vM6XTi657uCYsMyZXQ#
- Machine Learning Language Processing (MLLP) https://ttp.mllp.upv.es/index.php?page=faq
- MLLP guidelines https://cernbox.cern.ch/index.php/s/fGADW4qK610Ykyd
- More notes and guidelines, MLLP-specific: https://codimd.web.cern.ch/CkA_VyauS_CYqXZrqPzPQg?view#Guidelines-for-human-verifed-transcripts
- LHCP Conference https://indico.cern.ch/event/856696/
References for the second stage:
- E-learning video on how to download and modify subtitles https://cds.cern.ch/record/2276913?ln=en
- CERN Lectures' videos on YouTube with auto-transcription https://www.youtube.com/watch?v=epVbtwPJbcI&list=UUwXkOx0EuKBR5m_OOiaZRUA&index=9
- Amara, the application to customize the subtitles https://www.youtube.com/watch?v=epVbtwPJbcI&list=UUwXkOx0EuKBR5m_OOiaZRUA&index=9
- CERN IT e-learning FAQ, how to use Amara https://it-e-learning.docs.cern.ch/video/faq/#q5-how-do-i-introduce-subtitles
CERN group
IT-CDA
Status
Accomplished
Submitted by Maria Dimou on Monday, October 5, 2020 - 13:04.
Ahmed-Amine Hadjiat
University of Geneva - Faculté des Sciences - Centre Universitaire Informatique
Maria Dimou
Project finished 15 Dec 2020