The advent of new technologies such as Artificial Intelligence (AI) has driven drastic changes across several industries, automating processes for better efficiency and performance. Collaboration is an essential component of a conversational experience in this context, and the best performance improvements are realized when humans and machines work together. Through cooperative intelligence, humans and AI can enhance one another's complementary strengths, including leadership, teamwork, creativity, speed, scalability, and social skills.
Interactive communication with today's voice agents fails to deliver a human-centered style; instead, it provides restricted, task-oriented interaction. A more human-like conversational capacity can help facilitate long-term human-agent interaction. Conversational agents promise conversational contact, but they don't deliver: attempts merely replicate practical rules from human speech without taking into account the main features a dialogue should encompass, e.g. use of natural language, understanding of context, and problem-solving capabilities.
Some of the reasons why this issue exists with voice agents are:
This project focuses solely on conversational UX, i.e., UX for voice-based AI products such as voice assistants. Voice assistants have recently been developed to comprehend human language and complete tasks such as reading text, placing phone calls, and taking dictation.
We tried to address some of these problems and limitations by employing a psychological approach to designing utterances, combined with empirical research and exploratory methods.
It’s crucial to build a bridge between the different concepts and how they relate so that we can identify all possible scopes. As part of this process, we defined the requirements and needs from an individual's perspective, the goals they would want to accomplish while cooking, and the type of environment and situation they belong to. To build an accurate coach persona, we defined the individual’s personality, behaviour, and knowledge base as a master chef. These scopes include:
We designed our agent to combine the characteristics of both an experienced and a celebrity chef from Italy in order to establish trust with users, as Italy is known for its pasta. The agent was designed to be high in openness, meaning it is adaptable to changing circumstances, and high in extraversion, meaning it is outgoing and sociable. Extraversion is particularly important for conversational agents as it focuses on social interaction.
We iterated through more than 22 versions of our dialog trees based on the initial conceptual and collaboration model.
Using data and information gathered from numerous cooking resources across the internet, we decided to work on the following pasta cooking skills. I worked on the pasta sauce cooking skill.
To give the user the freedom to choose among various pasta shapes and sauces, we also researched which shapes our agent could teach and their compatibility with various sauces. To reduce system development complexity, we narrowed the final options down to two pasta shapes and one pasta sauce.
Working in small groups, we assumed the role of either user or agent and went through 15 sets of natural conversations, like a coach teaching the four pasta cooking skills to a student based on the student’s requirements.
The Wizard of Oz (WoZ) method was used to simulate the conversational flow and get a better sense of our agent’s behaviour in various user situations.
We worked on making the user more comfortable with precise measurements, e.g. the thickness of dough: each precise measurement is accompanied by a description.
To increase user confidence and provide a better experience, we implemented a strategy for making as many measurements and serving options as possible flexible and customizable. The system should be designed to adapt to the user's preferences by offering different options or branches.
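As a minimal sketch of this strategy, a lookup table could pair each precise measurement with an everyday description, so the agent can speak both. The keys, values, and hint wordings below are hypothetical, not taken from the actual system:

```python
# Hypothetical mapping: precise measurement -> (value, friendly description).
MEASUREMENT_HINTS = {
    "dough_thickness_mm": (2, "about as thick as a coin"),
    "salt_per_litre_g": (10, "roughly a heaped teaspoon per litre"),
}

def describe(key: str) -> str:
    """Combine the precise value with its everyday description in one utterance."""
    value, hint = MEASUREMENT_HINTS[key]
    return f"{value} ({hint})"
```

The same table could hold per-user overrides, which is one way to make the measurements customizable as described above.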
We were able to get the stable version of our dialog tree, which we put into multiple iterations of WoZ phases. Below is a glimpse of part of our bigger tree, the pasta boiling skill learning conversation.
Our study examined 12 types of grounding actions that could improve collaboration between AI and humans, including Acknowledgements, Apologies, Commands, Confirmations, Discourse Matters, Endings, Errors, Greetings, Information, Questions, and Suggestions.
Based on our initial, limited findings, we identified 5 basic guidelines for the project, which we expected to implement during the final development of the system.
The final step was the implementation phase on Google’s Dialogflow platform. I transcribed our polished dialogue/decision tree into intents, contexts, events, entities, and implicit invocations.
Before starting the actual development, we devised a structure to interlink the intents with parts of our dialog tree and improve the interaction experience. We also integrated Speech Synthesis Markup Language (SSML) to allow for more human-like speech output.
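As an illustration of what the SSML integration involves, a response can be wrapped in standard SSML tags such as `<speak>`, `<break>`, and `<prosody>` to add pauses and vary delivery. The helper below is a hypothetical sketch, not the project's actual code, and the follow-up prompt text is invented:

```python
def ssml_response(text: str, pause_ms: int = 300) -> str:
    """Wrap a reply in SSML: speak the text, pause briefly, then ask a
    slightly slower follow-up question so the agent sounds less robotic."""
    return (
        "<speak>"
        f"{text}"
        f'<break time="{pause_ms}ms"/>'
        '<prosody rate="95%">Shall we move on to the next step?</prosody>'
        "</speak>"
    )
```

Each intent's fulfillment text can then be emitted through a helper like this instead of as plain text.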
Here are some of the results:
The agent is able to track a time interval and notify the user accordingly. Additionally, the user can choose to listen to Italian music while following instructions or while a timer is running.
The user is able to choose a serving quantity for 1 to 10 people, which can vary depending on the skill chosen. We capped the servings because larger quantities would be hard for a beginner to handle.
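Serving selection boils down to scaling per-serving quantities while clamping the requested count to the supported 1–10 range. The ingredient names and quantities below are hypothetical placeholders:

```python
# Hypothetical per-serving quantities for a pasta dough recipe.
BASE_RECIPE = {
    "flour_g": 100,
    "eggs": 1,
}

def scale_recipe(servings: int) -> dict:
    """Scale per-serving quantities, clamping servings to the 1-10 range
    the agent supports."""
    servings = max(1, min(10, servings))
    return {ingredient: qty * servings for ingredient, qty in BASE_RECIPE.items()}
```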
For certain interactions, the user can resume the process from a specific point. For example, a user might not want to cook the pasta right after shaping it, so they have the liberty to resume from the last save point.
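One way to sketch this save-point behaviour is to store the last completed step in the session state and resume from the step after it. The step names below are illustrative, not the project's actual intent names:

```python
# Hypothetical ordered steps of the pasta-making flow.
STEPS = ["make_dough", "rest_dough", "shape_pasta", "boil_pasta"]

def save_point(session: dict, step: str) -> None:
    """Record the last step the user completed."""
    session["last_completed"] = step

def resume_step(session: dict) -> str:
    """Return the step to resume from: the one after the last save point,
    or the first step if nothing was saved."""
    last = session.get("last_completed")
    if last is None:
        return STEPS[0]
    idx = STEPS.index(last)
    return STEPS[min(idx + 1, len(STEPS) - 1)]
```

In Dialogflow, this kind of state would typically ride along in contexts or session parameters rather than a plain dictionary.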
The agent is able to take control of the conversation at the required time, hold and continue the interaction, or release control when necessary.
I conducted rigorous and continuous testing of the system and rebuilt our agent based on the bottlenecks and incorrect responses we found.
The test results showed that the agent couldn’t respond properly when contexts were missing and failed to recognise correct but incomplete phrases.
We resolved the issue by limiting user sessions to a maximum of 5 minutes, which helped avoid complexity. When a longer duration was required, we implemented an implicit invocation, such as having the user say "Talk to chef Antonio and say I waited 15 minutes" during a waiting period in the dough-making process. We had originally considered including a timer with the option to play music, but ultimately decided that the 5-minute time frame was the most practical solution considering all factors.
One of the challenges we encountered was managing remote group work across different time zones. Additionally, our approach to developing the agent involved implementing four different options (making dough, cutting and shaping dough, boiling, and making pesto sauce), which required a lot of intents and a framework to track user progress and conversation flow. Testing was a particular challenge, as any errors that were identified required us to start the testing process from the beginning.
In the future, we can test our system with real users to gather feedback and identify areas for improvement. We may also conduct surveys to better understand different user contexts and gather recommendations. Once we have refined our system, we can publish it on the Google Action Console for others to use.