Being able to communicate with machines, such as robots, through spoken interaction has been a long-standing vision in both science fiction and research labs. For a long time, this vision was hard to fulfil, mostly due to poor speech recognition performance. Thanks to recent developments, performance has now passed the threshold at which the technology is perceived as useful rather than painful. As an example, millions of Americans talk to their smart speakers (Amazon Echo, Google Home) every day. Spoken dialogue will also be essential for the interaction between humans and social robots, which are soon expected to serve as receptionists, teachers, companions, etc. One example of this is the Furhat robot head, which started as a research project at KTH and is now being used in commercial applications, such as serving as a concierge at Frankfurt Airport.
However, despite this progress, current systems are still limited in several ways. In this talk, I will discuss some of the speech and language technology challenges that lie ahead, and that we are currently addressing in our research. First, turn-taking in the interaction is typically not very fluent, and not very similar to how we are used to talking to each other. I will present our efforts to model and utilize the multi-modal signals that the face and voice provide, in order to continuously anticipate what will happen in the interaction, and show how this can be used to coordinate the robot's behaviour with the user. Second, current systems are typically built to handle one limited domain at a time, and extending a system to new domains is costly and requires technical expertise. I will talk about how we addressed this issue in the Amazon Alexa Prize, in which KTH was selected to participate, and where the task was to build a social chatbot that could talk about anything in a coherent and engaging manner with thousands of American users. A third limitation of current systems is that they rely on a single generic model of the user and of the language being spoken. Humans, by contrast, adapt their language to each other and invent terms for new things they need to talk about. I will present an initial study on how a computer could learn to understand the semantics of referring language by observing humans interacting with each other in a collaborative task, and on how the language used by the two interlocutors converges after repeated interactions.
Gabriel Skantze is a Professor in speech technology, with a specialization in dialogue systems, at the Department of Speech, Music and Hearing at KTH. His research on dialogue systems and human-robot interaction over the last 15 years has resulted in multiple nominations and awards for best papers at conferences. He has been the PI of several interdisciplinary projects, and he currently serves on the scientific advisory board of SigDial and on the jury for the IBM Watson AI XPRIZE, a five-year $5 million global challenge on AI and cognitive computation. In 2018, he led a team of PhD and Master's students that was selected to compete in the Amazon Alexa Prize, a $3.5 million university challenge to advance the state of the art in conversational AI. He is also co-founder and chief scientist of the company Furhat Robotics, and as a researcher he has collaborated with major companies developing social robots, including Toyota and Honda.