Voice-based User Interfaces (VUIs), a.k.a. spoken dialogue systems, enable spoken communication between humans and machines, typically complementing traditional human-machine interaction modalities such as visual output (screen, head-up display) and haptic input (scroll wheels, buttons, etc.) with spoken language input and output. The main components of a dialogue system are (1) input modules for speech (ASR - Automatic Speech Recognition) and other modalities, (2) natural language interpretation, (3) a dialogue manager which takes interpreted user input and interacts with databases and services to provide appropriate and helpful responses, (4) natural language generation which renders system output in linguistic form, and (5) output modules for speech (TTS - Text To Speech) and other modalities.
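The five-component pipeline described above can be illustrated with a minimal sketch of a single dialogue turn. All function and intent names here are hypothetical and the ASR/TTS stages are stubbed; this is an illustration of the architecture, not any particular system's implementation:

```python
from dataclasses import dataclass

@dataclass
class Interpretation:
    """Result of natural language interpretation: an intent plus parameters."""
    intent: str
    slots: dict

def asr(audio: bytes) -> str:
    """(1) Speech input: audio -> text (stubbed for illustration)."""
    return "find a gas station"

def interpret(text: str) -> Interpretation:
    """(2) Natural language interpretation: text -> intent and slots."""
    if "gas station" in text:
        return Interpretation("find_poi", {"category": "gas_station"})
    return Interpretation("unknown", {})

def dialogue_manager(interp: Interpretation) -> dict:
    """(3) Chooses a system action, in a real system after consulting
    databases and backend services (here hard-coded)."""
    if interp.intent == "find_poi":
        return {"act": "inform", "poi": "Shell, 1.2 km ahead"}
    return {"act": "clarify"}

def generate(action: dict) -> str:
    """(4) Natural language generation: system action -> output text."""
    if action["act"] == "inform":
        return f"The nearest gas station is {action['poi']}."
    return "Sorry, what would you like to do?"

def tts(text: str) -> bytes:
    """(5) Speech output: text -> audio (stubbed for illustration)."""
    return text.encode()

# One turn through the pipeline, component (1) to (5):
response = generate(dialogue_manager(interpret(asr(b""))))
tts(response)
```

In a deployed system each stage is of course far richer (statistical ASR, grammar- or model-based interpretation, service integration), but the data flow between the five components follows this shape.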
Compared to other access modes, using voice-based user interfaces has several advantages (Cohen, Giangola & Balogh, 2004):
In order to minimise the driver’s distraction, and thus to increase safety, automotive user interfaces need to be adapted to the in-vehicle environment (Labský et al., 2011). Voice-based user interfaces currently in use in cars are of two kinds: either a built-in system supplied with and integrated in the car, or a mobile device used while driving.
The state of the art in the latter category is perhaps best represented by Apple's Siri, supplied with the Apple iPhone 4S and later models. Siri offers speech-based dialogue interaction with several apps and services, using high quality server-based speech recognition tolerant for variation in how user utterances are formulated, intelligent backend integration with services, and relatively sophisticated dialogue management mechanisms such as context-dependent interpretation of user utterances. For example, a request for a taxi will be interpreted as a request for a taxi from the current location of the user to the user’s home.
While an impressive step forward for commercial speech-based interfaces, Siri requires the user to look at the screen at several points in almost every interaction. For example, after asking for a restaurant nearby, the system presents a list of restaurants on the screen. The list is not read out using speech, so the user must look at the screen and tap one of the alternatives to proceed in the dialogue. This is not so much a design flaw as a consequence of the fact that Siri is designed for a user who interacts with the system using voice, eyes and hands.
Furthermore, Siri lacks some basic dialogue behaviours that humans frequently depend on when interacting with each other. For example, if the user interrupts a task with another task (as one might well need to do, e.g. to ask for a gas station while in the process of selecting what music to listen to), the first task is forgotten and any progress made there is lost, forcing the user to start the first task over. In addition, if given a piece of information that does not explicitly state the associated task (e.g. "7 o'clock"), Siri tends to jump to conclusions about what the user wants the system to do with this information, rather than asking for clarification. In effect, this forces the user to formulate utterances for Siri more explicitly, which may result in a more distracting interaction.
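The missing behaviour (resuming an interrupted task with its progress intact) is commonly modelled in dialogue management research with a stack of active tasks. The following is a minimal sketch of that idea, with all task names and state contents hypothetical:

```python
class TaskStack:
    """Keeps suspended tasks so their progress survives an interruption."""

    def __init__(self):
        self._stack = []

    def push(self, task_name, state=None):
        # Starting a new task suspends, rather than discards, the current one.
        self._stack.append({"task": task_name, "state": state or {}})

    def current(self):
        return self._stack[-1] if self._stack else None

    def finish(self):
        """Pop the finished task; the suspended task underneath resumes."""
        done = self._stack.pop()
        return done, self.current()

# The scenario from the text: the user is selecting music,
# then interrupts to ask for a gas station.
dm = TaskStack()
dm.push("select_music", {"genre": "jazz"})   # progress made so far
dm.push("find_gas_station")                  # interruption
finished, resumed = dm.finish()              # gas station task completed
# `resumed` is the music task, with its earlier progress intact.
```

A dialogue manager built on this structure can return to "select_music" with the genre already set, instead of forcing the user to start over as Siri does.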
The Google Now / Google Search system distributed with Android OS was originally limited to voice search but has recently been extended with the possibility of making calls and accessing other non-search features (Google, 2014). It is in many respects similar to Siri. Spoken interaction is used in conjunction with GUI/haptic interaction, and most interaction follows the pattern of an initial voice command and subsequent GUI/haptic interaction. Samsung S Voice distributed with Samsung handsets is another variation on the same theme, with the screen displaying the spoken dialogue so far, until an external app (such as the music player) is launched and regular GUI/haptic interaction takes over (Samsung, 2014).
Another recent mobile-based VUI, Microsoft's Cortana (Martin, 2014; Pathak, 2014), is widely regarded as a response to Apple's Siri, with which it shares many properties. It has one novel feature of relevance to SIMPLI-CITY that is not present in Siri: an API enabling third-party developers to voice-enable their apps by connecting them to Cortana. However, the current API does not support extended spoken dialogue interactions with third-party applications; rather, it allows developers to define a set of voice commands, optionally containing parameters, that the application can handle. Thus, Cortana supports no voice interaction with third-party applications beyond simple commands. Moreover, voice commands are defined as strings, so no natural language processing such as syntactic parsing is involved.
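The command-with-parameters model can be illustrated as follows. This is not Cortana's actual API, but a sketch of the general approach: command patterns are plain strings with parameter placeholders, matched against the recognised utterance without any syntactic parsing:

```python
import re

class CommandRegistry:
    """Illustrative string-pattern command matcher. Patterns such as
    'call {contact}' are compiled to regexes; there is no parsing of
    the utterance beyond surface pattern matching."""

    def __init__(self):
        self._commands = []

    def register(self, pattern, handler):
        # Turn each {name} placeholder into a named capture group.
        regex = re.sub(r"\{(\w+)\}", r"(?P<\1>.+)", pattern)
        self._commands.append((re.compile(f"^{regex}$", re.I), handler))

    def dispatch(self, utterance):
        for regex, handler in self._commands:
            match = regex.match(utterance)
            if match:
                return handler(**match.groupdict())
        return None  # utterance matched no registered command

registry = CommandRegistry()
registry.register("call {contact}", lambda contact: f"Calling {contact}")
result = registry.dispatch("Call Alice")
```

The limitation noted above falls out directly: any utterance that deviates from the registered string patterns is simply rejected, since there is no linguistic analysis to fall back on.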
The state of the art in integrated in-vehicle speech interaction is well represented by the Nuance Automotive Speech system, as implemented e.g. in the Audi A8. This system offers spoken and multimodal interaction with a range of services and technologies in the car. More recently, Nuance has launched a revamped demo under the name VoiceCar, which improves on Nuance Automotive Speech in certain respects. VoiceCar enables “one-shot searching”, meaning that the user can give a single command that includes search parameters, such as “play X by Y”. VoiceCar also offers so-called “Modality Cross-over”, which allows the user to switch between voice interaction and traditional haptic/visual interaction (turning a knob and looking at a screen).
A recent development is the adaptation of mobile devices to the in-vehicle environment, exemplified by Apple's CarPlay (https://www.apple.com/ios/carplay/) (Savov, 2014) and the Open Automotive Alliance (http://www.openautoalliance.net) (Lavrinc, 2014), which aims to bring the Android OS to cars. The level of integration between mobile devices and vehicles, and the openness of these systems to third-party developers, is as yet unclear. Nevertheless, in late 2014, Apple CarPlay became available to end users through an aftermarket device from Pioneer.
There are concerns in the industry about driver distraction and safety risks associated with bringing apps designed for mobile devices into the vehicle environment, and recognition of the need to create safe voice-based user interfaces (Yoshida, 2014). Recently, the AAA Foundation for Traffic Safety released a report (Cooper et al., 2014) on cognitive distraction while interacting by voice with in-car and mobile infotainment systems. Concerning Siri, the report states that “Siri received the worst rating (...). Twice test drivers using Siri in a driving simulator rear-ended another car.” Unfortunately, the test was not conducted on the recent Apple CarPlay variant of Siri, but the authors argue that the results are likely to be valid for it as well. These tests do not appear to investigate visual distraction as a separate factor; hopefully, this will be done in future reports. After all, most if not all existing in-car or mobile dialogue systems are in fact multimodal, which makes visual distraction a key factor for safe in-vehicle dialogue interaction. Larsson et al. (2014) show that TALK's Speech Cursor solution, to be implemented in the SIMPLI-CITY platform, considerably reduces visual distraction in certain types of in-vehicle interaction.
In research on dialogue systems over the last three decades or so, a wide range of research systems have been built, with implementations ranging from very basic systems to full-fledged applications. Many different types of dialogue and architecture have been explored, from simple state-based systems for information collection to advanced negotiation systems based on general mechanisms such as planning, plan recognition and inference. Over the last decade, research has moved away from the symbol-processing methods of classical Artificial Intelligence towards statistical methods for dialogue management, but success has been limited. It may be noted that no existing commercial system uses statistical dialogue management.
Another active area of research has been speech-based multimodal interaction, often in an in-vehicle setting. Considerable research has been conducted comparing voice-based user interfaces to other access modes in a driving scenario; see e.g. Barón & Green (2006). That study concludes that although voice-based user interfaces still reduce driving performance, they significantly improve driving quality compared to other common interface types (e.g., text-based interfaces). The design of safe and non-distracting voice-based and multimodal user interfaces for in-vehicle use is an active area of research and development in academia and industry.
Today’s in-car user interfaces provide information, entertainment and comfort. Future user interfaces will have an increasing range of functionalities, including personalized and situation-aware information (Feld & Müller, 2011), as is also the aim of SIMPLI-CITY. It is expected that future voice-based user interfaces will be able to adapt directly to different drivers and surrounding conditions. In order to collect the relevant information needed by such a system, collaboration among a multitude of entities inside and outside a car (e.g., sensors, other in-car functions) must be enabled. For this purpose, a common understanding (e.g., in the form of an ontology) and a platform for knowledge exchange are required (Feld & Müller, 2011). This also underpins another important aim of SIMPLI-CITY, namely to enable the development of full multimodal dialogue interfaces for third-party applications, a feature not supported by any commercially available VUI.