Workshop: Advances in Speech Technologies
June 23, 2011 | IRCAM
Speaker Recognition: A New Binary Representation
Laboratoire Informatique d'Avignon. University of Avignon
The main approaches to speaker recognition are based on statistical modeling of the acoustic space. This modeling usually relies on a Gaussian Mixture Model (GMM), called the Universal Background Model (UBM), which has a large number of components and is trained on a large set of speech data gathered from hundreds of speakers. Each target model is derived from the UBM by MAP adaptation of the Gaussian mean parameters only. An important evolution of the UBM/GMM paradigm was to treat the UBM as the definition of a new data representation space, formed by concatenating the Gaussian mean parameters. This space, called the "supervector" space, made it possible to use Support Vector Machine (SVM) classifiers fed with supervectors. A second step was the direct modeling of session variability in the supervector space using the Joint Factor Analysis (JFA) approach. More recently, the Total Variability Space was introduced as an evolution of JFA. It consists of modeling the total variability in the supervector space in order to build a smaller space that concentrates the information and in which it is easier to model session and speaker variability jointly. Looking at this evolution, three remarks can be made. First, the evolution is always tied to large models with thousands of parameters. Second, the new approaches are largely unable to work at the frame-by-frame level. Third, these approaches rely on the general statistical paradigm in which a piece of information is considered strong when it occurs very often. This talk analyzes the consequences of these remarks and presents a new paradigm for speaker recognition, based on a discrete binary representation, which is able to overcome the limitations of the previous approaches.
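As an illustration of the UBM/GMM baseline described in the abstract, the following is a minimal sketch of mean-only MAP adaptation followed by supervector construction. It is not taken from the talk: the function names `map_adapt_means` and `supervector`, the diagonal-covariance assumption, and the relevance factor of 16 are assumptions chosen for the example, following the classical relevance-MAP formulation.

```python
import numpy as np

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance UBM (sketch).

    frames:    (T, D) acoustic feature vectors of one speaker
    weights:   (K,)   UBM mixture weights
    means:     (K, D) UBM component means
    variances: (K, D) UBM diagonal variances
    relevance: assumed relevance factor (hypothetical default)
    """
    # Log-likelihood of each frame under each Gaussian: shape (T, K)
    diff = frames[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )
    # Component posteriors per frame, normalized in a numerically stable way
    log_post = np.log(weights) + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Zeroth- and first-order statistics
    n = post.sum(axis=0)              # (K,) soft frame counts
    first = post.T @ frames           # (K, D) weighted frame sums
    data_mean = first / np.maximum(n, 1e-10)[:, None]

    # Interpolate between the data mean and the UBM mean
    alpha = (n / (n + relevance))[:, None]
    return alpha * data_mean + (1.0 - alpha) * means

def supervector(adapted_means):
    """Concatenate the K adapted means into one vector of size K * D."""
    return adapted_means.reshape(-1)

# Toy usage: a 2-component UBM and 16 identical frames near component 0
ubm_weights = np.array([0.5, 0.5])
ubm_means = np.array([[0.0, 0.0], [10.0, 10.0]])
ubm_vars = np.ones((2, 2))
speaker_frames = np.ones((16, 2))

sv = supervector(map_adapt_means(speaker_frames, ubm_weights, ubm_means, ubm_vars))
```

In this toy case only the component that actually receives frames moves toward the data, while the unobserved component keeps its UBM mean. The resulting vector has one entry per mean parameter, which is why real systems with many components produce the large models the abstract refers to.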
Nicolas Obin - IRCAM