Overview

This is my (Nikolaos - Nikos - Flemotomos a.k.a. “team” DeepSpeech) final project report for the class Deep Learning and its Applications, as offered during the Fall semester of 2017-2018 at the University of Southern California.

What is SCD?

Speaker Change Detection (SCD) is the task of dividing an audio signal into chunks such that each chunk is a speaker-homogeneous region, i.e., a time interval containing speech from only one speaker. SCD is an important problem in Speech Processing with various applications, such as Speaker Tracking (SCD + Speaker Verification) and Speaker Diarization (SCD + Speaker Clustering). In this work, intervals of silence between speakers and intervals of overlapping speech were not taken into consideration. A hypothetical output of an SCD system is depicted in the following figure:
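To make the definition concrete, here is a minimal sketch of how detected change points induce speaker-homogeneous regions. The function name and the convention of representing change points as timestamps in seconds are my own illustrative choices, not part of any particular SCD toolkit:

```python
def segments_from_change_points(change_points, duration):
    """Turn a sorted list of detected speaker-change timestamps (in seconds)
    into speaker-homogeneous (start, end) regions covering the whole signal.

    Silence and overlapping speech are ignored, as in this work: every
    instant is assumed to belong to exactly one speaker."""
    boundaries = [0.0] + list(change_points) + [duration]
    return [(boundaries[i], boundaries[i + 1])
            for i in range(len(boundaries) - 1)]

# Two detected speaker changes in a 10 s signal yield three regions:
print(segments_from_change_points([2.5, 6.0], 10.0))
```

Each returned pair is one region that a downstream module (e.g., a speaker clustering step in diarization) would then label with a speaker identity.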

The mel-spectrogram of a signal, with the timestamps of the true speaker turns (solid green lines) and the speaker turns predicted by a hypothetical Speaker Change Detector (dashed white lines). The detections marked as correct assume that a reasonable tolerance margin is applied.

What are Siamese Neural Networks?

Siamese networks, introduced in Chopra et al. 2005 and Hadsell et al. 2006, aim at learning a (dis)similarity metric between two inputs by comparing their outputs, as produced by two identical subnetworks (sharing the same architecture and weights). Originally, it was suggested that this “comparison” be based on the Euclidean distance between the subnetworks’ outputs, but other approaches can also be taken, as discussed in the Implementation and experiments section.
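The Euclidean-distance variant above is usually trained with the contrastive loss of Hadsell et al. 2006. The following is a dependency-free sketch of that idea; the toy one-layer `embed` subnetwork, the weight matrix, and the default margin are illustrative assumptions, not the architecture actually used in this project:

```python
import math

def embed(x, weights):
    # Toy shared subnetwork: one linear layer followed by tanh.
    # Both inputs of the pair are passed through the SAME weights,
    # which is what makes the two branches "Siamese".
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)))
            for row in weights]

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def contrastive_loss(x1, x2, y, weights, margin=1.0):
    """Contrastive loss (Hadsell et al. 2006).
    y = 1 for a similar pair (e.g., same speaker),
    y = 0 for a dissimilar pair (e.g., different speakers)."""
    d = euclidean(embed(x1, weights), embed(x2, weights))
    # Similar pairs are pulled together; dissimilar pairs are pushed
    # apart until their distance exceeds the margin.
    return y * d ** 2 + (1 - y) * max(0.0, margin - d) ** 2

w = [[0.5, -0.2], [0.1, 0.3]]   # illustrative shared weights
x = [1.0, 2.0]
print(contrastive_loss(x, x, 1, w))  # identical similar pair: loss 0.0
```

An identical pair labeled dissimilar (`y = 0`) instead incurs the full margin penalty, which is exactly the pressure that spreads different-speaker embeddings apart.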

Main Results

Assuming that the desired behavior is a balance between missed detections and false alarms (Type II and Type I errors, respectively), the best system that was trained yielded $MDR=29.22\%$, $FAR=27.08\%$ and $F_1 = 0.681$.
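For orientation before the detailed definitions later in the report, here is a sketch of how such metrics can be computed from reference and hypothesized change points. The greedy one-to-one matching within a tolerance margin, and the normalizations (misses over reference turns, false alarms over hypothesized turns), are one common convention and an assumption on my part; the exact definitions used in this work are given in the Implementation and experiments section:

```python
def scd_metrics(ref, hyp, tolerance=0.5):
    """MDR, FAR and F1 for lists of change-point timestamps (seconds).

    Each reference turn is greedily paired with at most one hypothesized
    turn lying within the tolerance margin; unmatched reference turns are
    misses, unmatched hypothesized turns are false alarms."""
    matched_hyp = set()
    hits = 0
    for r in ref:
        for i, h in enumerate(hyp):
            if i not in matched_hyp and abs(h - r) <= tolerance:
                matched_hyp.add(i)
                hits += 1
                break
    mdr = (len(ref) - hits) / len(ref)   # missed detection rate
    far = (len(hyp) - hits) / len(hyp)   # false alarm rate
    precision = hits / len(hyp)
    recall = hits / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return mdr, far, f1

# Three true turns, three detections; one detection is off by 1 s:
print(scd_metrics([1.0, 5.0, 9.0], [1.2, 4.0, 9.1], tolerance=0.5))
```

With this convention, lowering the detection threshold trades MDR for FAR, which is why the headline numbers above are reported at a balanced operating point.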

Here is an image demo of the result for a small segment of the test set:

First row (top): the original signal in the time domain; second row: the ground truth for the speaker turns; third row: the result of the proposed algorithm. (Sorry, no audio is provided, since TIMIT - which is used for testing - is a licensed database.)

For extensive experimental results, as well as for details about the evaluation metrics used and their definitions, please refer to the Implementation and experiments section.

Let’s Have a Deeper Understanding…

In the following pages you can find details about: