Correct automatically transcribed videos for input to MLLP
Project name
Correct automatically transcribed videos for input to MLLP - Machine Learning Language Processing tool
Project description
The CERN IT Collaboration, Devices & Applications group, in particular its Digital Repositories (IT CDA/DR) and Integrated Collaboration (IT CDA/IC) sections, runs many highly visible and popular services. These enable researchers and institutions to share and preserve their research data, software and publications, as well as to meet, present and record lectures, projects, plans and decisions, both of academic content and of very large experiment collaborations.
The CERN Document Server (CDS) is the official document repository for the laboratory and annually serves around 2 million visitors. Thousands of videos are recorded at CERN via the CDA/IC recording and transcoding infrastructure. They are uploaded and made viewable via CDS or the more recent videos portal.
We need to equip all CERN-made videos with subtitles.
This project is about post-processing and translating 6 hours of transcribed videos from the LHCP event, chosen from within the CERN videos selected for "teaching" the MLLP tool, in order to extend its "vocabulary". The quality of the existing transcription is often low due to environment noise. Some of the subtitles fixed by the speakers contain the intended text rather than the actually spoken text. The aim also includes evaluating and fixing the automatic translation from English into French. Guidelines and related links are given in the References section below.
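Most of the correction work happens on subtitle files. As an illustration of the kind of scripting the project involves, the sketch below shows one possible way to batch-clean an automatically generated .srt file before manual review; the file names, the filler-word list and the cleanup rules are assumptions made for this example, not MLLP or project requirements.

```python
"""Minimal sketch: pre-clean an auto-generated .srt subtitle file before manual review.
File names, filler words and cleanup rules are illustrative assumptions only."""
import re
from pathlib import Path

# Hypothetical filler tokens that automatic transcription of noisy lecture
# audio tends to produce; a real list would come from reviewing the LHCP talks.
FILLERS = {"um", "uh", "er", "hmm"}

# One SRT cue: index line, timestamp line, then the cue text up to a blank line.
BLOCK_RE = re.compile(
    r"(\d+)\s*\n"
    r"(\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3})\s*\n"
    r"(.*?)(?:\n\n|\Z)",
    re.DOTALL,
)

def clean_text(text: str) -> str:
    """Drop filler words and collapse whitespace; timestamps are left untouched."""
    words = [w for w in text.split() if w.lower().strip(".,") not in FILLERS]
    return " ".join(words)

def clean_srt(src: Path, dst: Path) -> None:
    """Read an SRT file, clean each cue's text and write the result for human review."""
    raw = src.read_text(encoding="utf-8")
    blocks = [
        f"{index}\n{timing}\n{clean_text(text)}\n"
        for index, timing, text in BLOCK_RE.findall(raw)
    ]
    dst.write_text("\n".join(blocks), encoding="utf-8")

if __name__ == "__main__":
    # Example input/output names are placeholders.
    clean_srt(Path("lhcp_talk_auto.srt"), Path("lhcp_talk_for_review.srt"))
```

The script is only a starting point: the human reviewer still checks every cue against the audio, which is where the "patience" listed under required skills comes in.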
Output
Working document https://codimd.web.cern.ch/s/AxIt_f2PQ
Required skills
Scripting, documenting in Markdown, interest in science lectures, patience.
Learning experience
The project will give the student the opportunity to work with a leading ML tool for transcription and be exposed to its algorithms and its development team.
Project duration
1-3 months
Project area
Data Management, Data Analytics, Learning
Contact for further details
Maria Dimou
References
- Project notes https://codimd.web.cern.ch/rAX3vM6XTi657uCYsMyZXQ#
- Machine Learning Language Processing (MLLP) https://ttp.mllp.upv.es/index.php?page=faq
- MLLP guidelines https://cernbox.cern.ch/index.php/s/fGADW4qK610Ykyd
- More notes and guidelines, MLLP-specific: https://codimd.web.cern.ch/CkA_VyauS_CYqXZrqPzPQg?view#Guidelines-for-human-verifed-transcripts
- LHCP Conference https://indico.cern.ch/event/856696/
References for the second stage:
- E-learning video on how to download and modify subtitles https://cds.cern.ch/record/2276913?ln=en
- CERN Lectures' videos on YouTube with auto-transcription https://www.youtube.com/watch?v=epVbtwPJbcI&list=UUwXkOx0EuKBR5m_OOiaZRUA&index=9
- Amara, the application to customize the subtitles https://www.youtube.com/watch?v=epVbtwPJbcI&list=UUwXkOx0EuKBR5m_OOiaZRUA&index=9
- CERN IT e-learning FAQ, how to use Amara https://it-e-learning.docs.cern.ch/video/faq/#q5-how-do-i-introduce-subtitles
CERN group
IT-CDA
Status
Accomplished
Submitted by Maria Dimou on Monday, October 5, 2020 - 13:04.
Ahmed-Amine Hadjiat
University of Geneva - Faculté des Sciences - Centre Universitaire Informatique
Maria Dimou
Project finished 15 Dec 2020