Speaker role recognition is the task of assigning a specific role to a speaker-homogeneous segment, where a role is characterized by the task a speaker performs. Broadcast news programs, call centers, therapy sessions, and interviews are examples of conversational scenarios in which each participant performs a well-defined task and thus plays a specific role.

Examples of conversational interactions where speakers assume distinct roles.
Images from the Noun Project; creators: Nubaia Karim Barsha, Gan Khun Lay, Arafat Uddin, Llisole, ProSymbols

The approaches that tackle this task depend on successful pre-processing steps applied to the conversation, such as speaker diarization or Automatic Speech Recognition (ASR). In real-world settings, however, errors in these steps propagate to the role recognizer. Additionally, knowledge of the roles can prove useful for those pre-processing steps as well. For example, the language used by an interviewer is expected to differ from that used by the interviewee, which a role-aware ASR system can ideally exploit by using role-specific language models.
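To make the idea of role-specific language models concrete, here is a minimal sketch, not the system described on this page. It assumes toy add-one-smoothed unigram models trained on a few invented sentences per role, and it assigns roles to the two speakers of a dyadic conversation by picking the one-to-one speaker-to-role mapping with the highest joint log-likelihood. All function names (`train_unigram`, `log_prob`, `assign_roles`) and all example sentences are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def train_unigram(sentences):
    """Return (token counts, total token count) for a toy unigram LM."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    return counts, sum(counts.values())


def log_prob(text, lm, vocab_size):
    """Log-likelihood of text under an add-one-smoothed unigram LM."""
    counts, total = lm
    return sum(
        math.log((counts[tok] + 1) / (total + vocab_size))
        for tok in text.lower().split()
    )


def assign_roles(speaker_texts, role_lms, vocab_size):
    """Pick the one-to-one speaker-to-role mapping with the highest joint score."""
    speakers = sorted(speaker_texts)
    best, best_score = None, -math.inf
    for perm in permutations(role_lms, len(speakers)):
        score = sum(
            log_prob(speaker_texts[s], role_lms[r], vocab_size)
            for s, r in zip(speakers, perm)
        )
        if score > best_score:
            best, best_score = dict(zip(speakers, perm)), score
    return best


# Invented training text for each role (assumption, for illustration only).
role_lms = {
    "interviewer": train_unigram(
        ["what do you think about that", "how did you start", "why did you leave"]
    ),
    "interviewee": train_unigram(
        ["i think it was hard", "i started in college", "i left because of work"]
    ),
}

# Shared vocabulary size: all training word types plus one slot for unseen words.
vocab = set()
for counts, _ in role_lms.values():
    vocab.update(counts)
vocab_size = len(vocab) + 1

# Text attributed to each speaker, e.g. by diarization followed by ASR.
speaker_texts = {
    "A": "i worked there because i liked it",
    "B": "why did you like it",
}

roles = assign_roles(speaker_texts, role_lms, vocab_size)
print(roles)  # {'A': 'interviewee', 'B': 'interviewer'}
```

Scoring the two speakers jointly, rather than each one independently, enforces that every role is used exactly once, which matches the dyadic setting. A realistic system would replace the unigram models with stronger language models and would have to cope with diarization and ASR errors in `speaker_texts`.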

The goal of this research thread is to combine speaker role recognition with other speech processing modules, both to improve the performance of role recognition itself and to provide useful contextual information to the module it is combined with. Since, in a real-world scenario, the role recognizer receives its textual input from an ASR system, here we propose building role-specific ASR outputs through lattice rescoring, exploiting the rich information in the generic decoding lattice before it is pruned. In this paper we propose an end-to-end rich transcription system that simultaneously performs ASR, diarization, and role recognition for dyadic interactions. Here we combine a speaker role recognizer with a traditional agglomerative speaker clustering module, exploiting information from both the acoustic and linguistic modalities. Finally, this paper presents a way to exploit in-domain information about speaker roles to improve the performance of a traditional speaker diarization system, answering the question "which role spoke when".