Speaker Role Recognition

Individuals assume distinct roles in different situations throughout their lives and people who consistently adopt particular roles develop specific commonalities in behavior. As a result, roles can be defined in terms of observable tendencies and behavioral patterns that can be manifest through a wide range of modalities during a conversational interaction. For instance, an interviewer is expected to use more interrogative words than the interviewee and a teacher is likely to speak in a more didactic style than the student.

Examples of conversational interactions where speakers assume distinct roles.
Images from the Noun Project; creators: Nubaia Karim Barsha, Gan Khun Lay, Arafat Uddin, Llisole, ProSymbols

Speaker role recognition is the task of assigning a role label in a speech segment where a single speaker is active, through computational models that capture such behavioral characteristics. The approaches that tackle this problem depend on successful pre-processing steps applied on the recorded conversation, such as speaker segmentation and clustering or automatic speech recognition (ASR), something that inevitably leads to error propagation. At the same time, accurate role information can provide valuable contextual cues for the aforementioned speech processing tasks.

What I was trying to do in this research thread, which was the subject of my PhD dissertation, supervised by Professor Shrikanth Narayanan, is combine speaker role recognition with other speech processing modules, in order to i) alleviate the problem of error propagation and ii) improve the performance of the modules that role recognition is combined with. Since, in a real-world scenario, the role recognizer gets the textual information from an ASR system, here we propose building role-specific ASR outputs through lattice rescoring, exploiting the rich information in the generic decoding lattice before pruning it. In this paper we propose an end-to-end rich transcription system which performs simultaneously the tasks of ASR, diarization, and role recognition for the case of dyadic interactions. Here we combine a speaker role recognizer with a traditional agglomerative speaker clustering module, exploiting information from both the acoustic and linguistic modalities. Finally, those papers present a way to utilize in-domain information about speaker roles in order to improve the performance of a traditional speaker diarization system, answering the question ``which role spoke when”.