Psychotherapy quality assessment is typically performed by human raters who evaluate recorded sessions along specific behavioral codes, as defined by standard coding manuals. The recordings capture the complex series of interactions between the therapist and the client, and as such, they encode the active ingredients of the therapy. However, the time and cost of such a procedure make it hard to apply in real-world settings.

The question that naturally comes to mind - or, at least, to an engineer’s mind - is “can we use machine learning techniques to facilitate this task?” This is exactly the question we are trying to answer through this research thread. With collaborators from the University of Washington, the University of Utah, the University of Pennsylvania, and Arizona State University working in the clinical field, the Signal Analysis and Interpretation Laboratory leads the engineering efforts towards automating behavioral coding in various types of psychotherapy, such as Motivational Interviewing (MI), Cognitive Behavioral Therapy (CBT), and coaching to decrease childhood obesity.

Automatic quality assessment of a psychotherapy session.
Image from Flemotomos et al., Automated evaluation of psychotherapy skills using speech and language technologies, Behavior Research Methods, 2021

A robust system for the task described depends heavily on a speech pipeline which, given the raw audio signal,

  • extracts the appropriate acoustic features,
  • detects the voiced regions (voice activity detection),
  • segments such regions into speaker-homogeneous segments (speaker change detection),
  • groups the resulting segments into same-speaker clusters (speaker clustering),
  • assigns each cluster either to the therapist or the client (speaker role recognition),
  • and converts the speech to text (automatic speech recognition).

Such a pipeline is presented and analyzed in this paper.
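
As a rough illustration of the first two stages only, the sketch below computes a very simple acoustic feature (frame-level log-energy) and applies a threshold-based voice activity detector to a synthetic signal. It is a toy under stated assumptions, not the pipeline from the paper: the function names, the 25 ms frame length, and the -40 dB threshold are all hypothetical choices made for this example, and the later stages (speaker change detection, clustering, role recognition, speech recognition) rely on trained models and are omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds


def frame_energies(samples, frame_len):
    """Log-energy (dB) per non-overlapping frame; a stand-in for real acoustic features."""
    energies = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-10))  # floor avoids log(0)
    return energies


def detect_speech(energies, frame_dur, threshold_db=-40.0):
    """Toy voice activity detection: merge runs of frames above the threshold."""
    segments, start = [], None
    for idx, energy in enumerate(energies):
        active = energy > threshold_db
        if active and start is None:
            start = idx  # a new active region begins
        elif not active and start is not None:
            segments.append(Segment(start * frame_dur, idx * frame_dur))
            start = None
    if start is not None:  # signal ended while still active
        segments.append(Segment(start * frame_dur, len(energies) * frame_dur))
    return segments


# Synthetic input (assumption): 1 s silence, 1 s of a 220 Hz tone, 1 s silence.
sample_rate = 8000
silence = [0.0] * sample_rate
tone = [0.5 * math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate)]
samples = silence + tone + silence

frame_len = 200  # 25 ms at 8 kHz
energies = frame_energies(samples, frame_len)
segments = detect_speech(energies, frame_len / sample_rate)

for seg in segments:
    print(f"active region: {seg.start:.2f}-{seg.end:.2f} s")
# prints "active region: 1.00-2.00 s"
```

A fixed energy threshold only works on clean, synthetic input like this; on real session recordings, each stage would normally be a trained model, with the output of one stage feeding the next.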

After the rich transcription of the dialogue between the therapist and the client is generated, natural language processing methods can be applied towards the final behavioral coding. This paper describes the first effort found in the literature for automated CBT evaluation, purely based on linguistic features, while here we introduce a multimodal system applied to MI data. Representations based on contextualized language models (e.g., BERT) can improve the predictive power of systems used in behavioral coding, as shown in our works here (where we introduce a multi-task model augmented with therapy-related metadata) and here (where we incorporate a local quality estimator to analyze long therapy sessions).
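
Long sessions are a practical obstacle for models such as BERT, which accept a bounded number of input tokens (512 for the original BERT models). One generic way around this, sketched below, is to score overlapping windows of utterances and aggregate the local scores into a session-level score. This is an illustrative assumption and not necessarily the approach taken in the works cited above: `split_into_windows`, `session_score`, the window parameters, and the toy `question_ratio` scorer (which stands in for a trained model) are all hypothetical.

```python
def split_into_windows(utterances, window_size, stride):
    """Split a list of utterances into overlapping windows of at most window_size items."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if not utterances:
        raise ValueError("utterances must not be empty")
    windows, start = [], 0
    while True:
        windows.append(utterances[start:start + window_size])
        if start + window_size >= len(utterances):
            break  # the last window reached the end of the session
        start += stride
    return windows


def session_score(utterances, score_window, window_size=4, stride=2):
    """Average a local (per-window) score over the whole session."""
    windows = split_into_windows(utterances, window_size, stride)
    scores = [score_window(w) for w in windows]
    return sum(scores) / len(scores)


def question_ratio(window):
    """Toy local scorer (hypothetical): fraction of utterances that are questions."""
    return sum(u.endswith("?") for u in window) / len(window)


session = [
    "How are you feeling today?",
    "Tired, mostly.",
    "What makes you say that?",
    "I have not been sleeping.",
    "That sounds hard.",
    "It is.",
]
score = session_score(session, question_ratio)
print(score)  # prints 0.375
```

Averaging is the simplest possible aggregation; a learned aggregator, or one that weights windows differently, would fit into the same structure by replacing the last line of `session_score`.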