Neural networks for singing voice processing in polyphonic music signals

  1. Chandna, Pritish
Supervised by:
  1. Emilia Gómez Gutiérrez Thesis supervisor

University of defense: Universitat Pompeu Fabra

Date of defense: 23 September 2021

Examination committee:
  1. Antoine Liutkus Chair
  2. Marius Miron Secretary
  3. Estefanía Cano López Member

Type: Doctoral thesis

Teseo: 683675 DIALNET

Abstract

This thesis focuses on singing voice extraction from polyphonic musical signals. In particular, we consider two cases: contemporary popular music, which typically has a processed singing voice with instrumental accompaniment, and ensemble choral singing, which involves multiple singers singing in harmony and unison. Over the last decade, several deep learning-based models have been proposed to separate the singing voice from instrumental accompaniment in a musical mixture. Most of these models assume that the musical mixture is a linear sum of the individual sources and estimate time-frequency masks to filter the sources out of the input mixture. While this assumption does not always hold, deep learning-based models have shown a remarkable capacity to model the separate sources in a mixture. In this thesis, we propose an alternative method for singing voice extraction. This methodology assumes that the perceived linguistic and melodic content of a singing voice signal is retained even when the signal is put through a non-linear mixing process. To this end, we explore language-independent representations of linguistic content in a voice signal, as well as generative methodologies for voice synthesis. Using these, we propose a framework for synthesizing a clean singing voice signal from the underlying linguistic and melodic content of a processed voice signal in a musical mixture. In addition, we adapt and evaluate state-of-the-art source separation methodologies to separate the soprano, alto, tenor and bass parts of choral recordings. We also use the proposed extraction-via-synthesis methodology, along with other deep learning-based models, to analyze unison singing within choral recordings.
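
As an illustration of the masking approach that the abstract contrasts with, the following is a minimal sketch of time-frequency masking under the linear mixing assumption. It is not taken from the thesis: the function name `ratio_mask`, the toy array sizes, and the random complex arrays standing in for spectrograms are all assumptions made for illustration. In a real system a neural network would estimate the mask from the mixture alone, whereas here an oracle mask is computed from the known sources.

```python
import numpy as np

def ratio_mask(voice_mag, accomp_mag, eps=1e-8):
    """Soft time-frequency mask with values in [0, 1] (hypothetical helper)."""
    return voice_mag / (voice_mag + accomp_mag + eps)

rng = np.random.default_rng(0)
shape = (513, 100)  # (frequency bins, time frames); arbitrary toy size

# Toy complex arrays standing in for the STFTs of real recordings (assumption).
voice = rng.normal(size=shape) + 1j * rng.normal(size=shape)
accomp = rng.normal(size=shape) + 1j * rng.normal(size=shape)

# Linear mixing assumption: the mixture is the sum of the individual sources.
mixture = voice + accomp

# Oracle mask computed from the known source magnitudes.
mask = ratio_mask(np.abs(voice), np.abs(accomp))

# Filter the mixture with the mask to obtain an estimate of the voice.
voice_estimate = mask * mixture
```

When the mixing is non-linear, as with heavily processed vocals, the sum in `mixture = voice + accomp` no longer describes the signal, which is the case the synthesis-based approach described above is meant to address.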