UX For AI: Designing for voice

Crafting a voice-centric, conversational skill-teaching experience tailored for specific users through the application of empirical methods and collaborative intelligence.

University of Siegen (concept project)
User Research, Dialogue design, AI Design


With the advent of new technologies such as Artificial Intelligence (AI), several industries have undergone drastic changes, automating processes for better efficiency and performance. Collaboration is an essential component of a conversational experience in this context, and the best performance improvements are realized when humans and machines work together. Through cooperative intelligence, humans and AI can enhance one another's complementary strengths, including leadership, teamwork, creativity, speed, scalability, and social skills.


Interactive communication with voice agents today fails to deliver a human-centered style; instead, it provides restricted, task-oriented interaction. A more human-like conversational capacity can help facilitate long-term human-agent interaction. Conversational agents promise conversational contact, but they don't deliver. Many attempts also replicate practical rules from human speech without taking into account the main features a dialogue should encompass, e.g. use of natural language, understanding of context, and problem-solving capabilities.
Some of the reasons why this issue exists with voice agents are:

  • They lack context in speech and are not truly conversational in nature.
  • They’re designed to act as “information collectors.”
  • The dialogues are communicated in written language and not spoken.

I recognized that the technology itself may not be the cause of a bad experience, but rather the design of the utterances.

Gathering insights

This project focuses solely on conversational UX, i.e. UX for voice-based AI products like voice assistants. Voice assistants have recently been developed to comprehend human language and complete tasks such as reading text, placing phone calls, and taking dictation.

Research shows that:


We tried to solve some of the problems and limitations by employing a psychological approach in designing utterances, empirical research and exploratory methods.

Collaborative conceptual modeling

It's crucial to provide a bridge between the different concepts and show how they relate, so that we can identify all possible scopes. As part of this process, we defined the requirements or needs from an individual's perspective, the goals they would want to accomplish while cooking, and the type of environment and situation they are in. To build an accurate coach persona, we defined the coach's personality, behaviour, and knowledge base as a master chef. These scopes include:

  1. User needs & context
  2. Agent personality

User needs & context

Defining the agent personality

We designed our agent to combine the characteristics of both an experienced and a celebrity chef from Italy in order to establish trust with users, as Italy is known for its pasta. The agent was designed to be high in openness, meaning it is adaptable to changing circumstances, and high in extraversion, meaning it is outgoing and sociable. Extraversion is particularly important for conversational agents as it focuses on social interaction.

Initial Tree Based Dialog Map

We ran more than 22 versions/iterations of our dialog trees based on the definition of the initial conceptual and collaboration model.
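One way to picture such a tree in code is as nodes that hold an agent prompt and the user replies leading to the next node. The sketch below is only an illustration under that assumption, not the tree we built; the `DialogNode` class, the `walk` helper, and all prompts are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DialogNode:
    """One agent turn plus the user replies that lead to the next turn."""
    prompt: str
    branches: dict[str, DialogNode] = field(default_factory=dict)


def walk(node: DialogNode, replies: list[str]) -> list[str]:
    """Return the agent prompts spoken along the path selected by `replies`."""
    spoken = [node.prompt]
    for reply in replies:
        next_node = node.branches.get(reply.lower())
        if next_node is None:
            break  # unknown reply: a real agent would trigger a fallback here
        node = next_node
        spoken.append(node.prompt)
    return spoken


# Hypothetical fragment of a tree; the prompts are placeholders.
boil = DialogNode("Great, let's boil the pasta. Is your water ready?")
sauce = DialogNode("Great, let's make the sauce. Do you have all the ingredients?")
root = DialogNode(
    "What would you like to learn today: boiling or sauce?",
    {"boiling": boil, "sauce": sauce},
)

path = walk(root, ["Boiling"])
```

Keeping each iteration of the tree as plain data like this makes it easy to compare versions and to replay a role-played conversation against a given version.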

First iteration

Second iteration

Cooking skill research

Using information gathered from cooking content across the internet, we decided to work on the following pasta cooking skills. I worked on the pasta sauce cooking skill.

To give the user the freedom to choose among pasta shapes and sauces, we also researched the shapes our agent could teach and their compatibility with various sauces. To reduce system development complexity, we narrowed the options at the final stage to two pasta shapes and one pasta sauce.

Role play & dialogue research

Working in small groups, we each assumed the role of either user or agent. We went through 15 sets of natural conversations in which a coach taught the four pasta cooking skills to a student based on the student's requirements.

Key takeaways from the role-playing phase included, but weren't limited to:
  • Unpredictable conversation: It is important to have a generic utterance that fits different unexpected scenarios without sounding like a "design error", to avoid user distrust of the agent.
  • Flexibility without repetition: Once the user answers a question, the agent should store the answer and provide an alternative flow. The user's preferred measurement unit, for instance, shouldn't have to be asked again from scratch, nor assumed to always be preferred, e.g. “Now before we start, tell me; are we using [X] as the measurement unit this time too?”
  • Resource confirmation: Confirming the availability of ingredients before giving instructions was identified as critical to the success of the implementation. For this, the agent provides a voice-based checklist before starting the skill coaching.
  • Clarifying questions: When steps haven't been completed, the agent can ask the user whether they want to resume from where the conversation ended or start all over again. This is especially important when the full experience mode has been selected or the user doesn't have all the ingredients needed.
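Two of these takeaways, remembering a preference and checking resources up front, can be sketched in a few lines. This is an illustrative sketch only: the function names, the first-visit wording, and the ingredient names are assumptions, and only the returning-user prompt is taken from our role-play notes.

```python
from typing import Optional


def unit_prompt(stored_unit: Optional[str]) -> str:
    """Ask for the measurement unit, or confirm the one remembered from last time."""
    if stored_unit is None:
        # Hypothetical wording for a first-time user.
        return "Before we start, which measurement unit would you like to use?"
    return (
        "Now before we start, tell me; are we using "
        f"{stored_unit} as the measurement unit this time too?"
    )


def missing_ingredients(required: list[str], available: set[str]) -> list[str]:
    """Return the required ingredients the user has not confirmed, in order."""
    return [item for item in required if item not in available]


# Placeholder data for illustration.
first_visit = unit_prompt(None)
returning = unit_prompt("grams")
to_fetch = missing_ingredients(["flour", "eggs", "salt"], {"flour", "salt"})
```

The stored answer changes the question rather than removing it, so the user stays in control without being asked the same thing from scratch.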

Wizard of Oz (WoZ)

The Wizard of Oz method was used to simulate the conversational flow and get a better sense of our agent's behaviour in various user situations.

After analyzing the results from the tests, we faced a challenge:

We worked on making the user more comfortable with precise measurements. For example, for the thickness of the dough, the precise measurement is accompanied by a description.

To increase user confidence and provide a better experience, we made as many measurements and serving options as possible flexible and customizable. The system should adapt to the user's preferences by offering different options or branches.

We arrived at a stable version of our dialog tree, which we put through multiple iterations of WoZ phases. Below is a glimpse of one part of the bigger tree, the conversation for learning the pasta boiling skill.

Modified dialogue tree after the Wizard of Oz exercise

Grounding the communication

Our study examined 12 types of grounding actions that could improve collaboration between AI and humans, including Acknowledgements, Apologies, Commands, Confirmations, Discourse Markers, Endings, Errors, Greetings, Information, Questions, and Suggestions.

Utterance Classification for a Pasta Boiling Skill

AI design guidelines

Based on our initial, limited findings, we identified 5 basic guidelines for the project, which we expected to be implemented during the final development of the system.

  1. Customize the system for effective interaction
    System - user
    Ask clarifying questions before responding, in accordance with the actual user needs, so that the system can avoid user frustration and customize the user experience, e.g. define the type of cake and find similar cake suggestions.
    User - system
    The system should let the user customize system settings, address technical issues, etc.
  2. Refine technical terms and visuals
    The system should have a digital screen to aid users with visual content and describe visual hints of milestones, e.g. showing sample pictures of steps or of unknown words or phrases.
  3. Pre-check the process to meet users' needs
    Identify the user's ingredients, materials, and tools to verify that the user has everything required before starting the process.
  4. Have a flexible and dynamic conversation flow
    The assistant is able to follow up on the user's questions even after long pauses during the session.
  5. Provide supportive features and alternatives
    Offer task-related supportive tools and alternatives.

Dialogue development & testing

The final step was the implementation phase on Google's Dialogflow platform. I translated our polished dialogue/decision tree into intents, contexts, events, entities, as well as implicit invocations.
Before starting the actual development, we devised a structure to interlink the intents with parts of our dialog tree and improve the interaction experience. We also integrated Speech Synthesis Markup Language (SSML) to allow for a more human-sounding interaction.
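To give a sense of what SSML adds, the sketch below wraps a few sentences in standard SSML elements (`<speak>`, `<prosody>`, `<s>`, `<break>`) so that the agent pauses between steps and speaks at a chosen rate. The `to_ssml` helper, the pause length, and the example sentences are assumptions for illustration, not our actual responses.

```python
from xml.sax.saxutils import escape


def to_ssml(sentences: list[str], pause_ms: int = 400, rate: str = "medium") -> str:
    """Wrap sentences in SSML, with a short pause between consecutive sentences."""
    pause = f'<break time="{pause_ms}ms"/>'
    # escape() protects characters such as "&" that are not valid in raw XML.
    body = pause.join(f"<s>{escape(sentence)}</s>" for sentence in sentences)
    return f'<speak><prosody rate="{rate}">{body}</prosody></speak>'


# Placeholder sentences for illustration.
ssml = to_ssml(["Salt the water.", "Stir gently & wait."], pause_ms=500, rate="slow")
```

A slower rate and explicit pauses suit step-by-step coaching, where the user is listening with their hands busy.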
Some of the results are shown below:

Implement music-based timer option

The agent is able to track a time interval and notify the user accordingly. Additionally, the user can choose to listen to Italian music while following instructions or while a timer is running.

Option to choose servings

The user is able to choose the serving quantity, from 1 to 10 people; the available range can vary depending on the skill chosen. We limited the servings because larger quantities would be hard for a beginner to handle.
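The serving choice boils down to validating the requested number and scaling each ingredient by the same factor. Here is a minimal sketch under that assumption; the `scale_recipe` function and the base quantities are hypothetical placeholders, not our recipe.

```python
from fractions import Fraction


def scale_recipe(
    base: dict[str, int],
    base_servings: int,
    servings: int,
    max_servings: int = 10,  # upper limit; may differ per skill
) -> dict[str, Fraction]:
    """Scale ingredient quantities from `base_servings` to `servings`."""
    if not 1 <= servings <= max_servings:
        raise ValueError(f"servings must be between 1 and {max_servings}")
    factor = Fraction(servings, base_servings)
    # Fractions keep the amounts exact, e.g. 3/2 eggs instead of 1.5000001.
    return {name: quantity * factor for name, quantity in base.items()}


# Placeholder quantities for two servings.
base_recipe = {"flour_g": 200, "eggs": 2}
for_five = scale_recipe(base_recipe, base_servings=2, servings=5)
```

Rejecting out-of-range values in one place lets the agent respond with a single clarifying prompt instead of failing later in the flow.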

Resume Option

For certain interactions, the user is able to resume the process from a specific point. For example, the user might not want to cook the pasta right after shaping it, so they are free to resume later from the last save point.
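Conceptually, a save point is just the last completed step stored with the user's session. The sketch below assumes a fixed order of three stages and a hypothetical `Session` class; it does not reflect how the state was actually stored in our agent.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed order of stages, for illustration only.
STEPS = ("making dough", "cutting and shaping dough", "boiling")


@dataclass
class Session:
    """Tracks the user's save point between conversations."""
    last_completed: Optional[str] = None

    def complete(self, step: str) -> None:
        """Record `step` as the new save point."""
        if step not in STEPS:
            raise ValueError(f"unknown step: {step}")
        self.last_completed = step

    def resume_step(self) -> Optional[str]:
        """Return the step to continue with, or None if everything is done."""
        if self.last_completed is None:
            return STEPS[0]
        next_index = STEPS.index(self.last_completed) + 1
        return STEPS[next_index] if next_index < len(STEPS) else None


session = Session()
session.complete("cutting and shaping dough")
next_step = session.resume_step()
```

When the user returns, the agent can offer `next_step` as the resume point or start over, which matches the clarifying question identified during role play.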

Control Conversational Flow (Turn Taking, Fallback)

The agent is able to take command of the conversation when required, hold and continue the interaction, or release control when necessary.
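A common pattern for fallback handling is to escalate the reply as consecutive misunderstandings pile up, and eventually release control. The sketch below illustrates that pattern under stated assumptions; the `fallback_reply` function, the thresholds, and the wording are hypothetical and not the utterances used in our agent.

```python
def fallback_reply(no_match_count: int, last_prompt: str) -> str:
    """Pick a fallback reply based on how many times in a row the agent failed to match."""
    if no_match_count <= 1:
        # First miss: apologize briefly and repeat the question.
        return f"Sorry, I didn't catch that. {last_prompt}"
    if no_match_count == 2:
        # Second miss: offer explicit options before repeating.
        return (
            "I'm still not sure I understood. "
            f"You can say 'repeat' or 'help'. {last_prompt}"
        )
    # Third miss or more: release control instead of looping.
    return "Let's pause here. You can come back and resume whenever you're ready."


question = "Is your water boiling?"
first = fallback_reply(1, question)
third = fallback_reply(3, question)
```

Repeating the last prompt keeps the agent in command of the turn, while the final branch hands control back to the user.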

Agent testing

I conducted rigorous and continuous testing of the system and rebuilt our agent based on the bottlenecks and incorrect responses we found.

The test results showed that the agent couldn't respond properly when contexts were missing and failed to recognise phrases that were correct but incomplete.


We were able to resolve the issue by limiting the duration of user sessions to a maximum of 5 minutes. This helped to avoid complexity. If a longer duration was required, we implemented an implicit invocation, such as having the user say "Talk to chef Antonio and say I waited 15 minutes" during a waiting period in the dough-making process. Originally, we had considered including a timer with the option to play music, but we ultimately decided that a 5-minute time frame was the most practical solution considering all factors.

One of the challenges we encountered was managing remote group work across different time zones. Additionally, our approach to developing the agent involved implementing four different options (making dough, cutting and shaping dough, boiling, and making pesto sauce), which required a lot of intents and a framework to track user progress and conversation flow. Testing was a particular challenge, as any errors that were identified required us to start the testing process from the beginning.

Next steps

In the future, we can test our system with real users to gather feedback and identify areas for improvement. We may also conduct surveys to better understand different user contexts and gather recommendations. Once we have refined our system, we can publish it on the Google Action Console for others to use.