Voice interfaces and assistants
1) What VUI is and when it's needed
Voice user interface (VUI): a way of interacting through speech. Examples: assistants in an app or browser, smart speakers, IVR/telephony, voice control in cars and on TVs.
Suitable for: hands-busy scenarios (driving, cooking), quick commands ("turn on...", "call..."), accessibility, navigating complex menus.
Not suitable for: precise visual selection (catalogs, tables), lengthy entry of structured data without a screen.
2) Dialogue model: intents, entities and context
Intent: what the user wants, e.g. `Create_payment`, `Check_balance`.
Slots/entities: the parameters of the request: amount, currency, recipient, date.
Context/dialogue state: what is already known, what still needs clarifying, where the dialogue branches.
Confirmation rules: what must be confirmed explicitly (money, personal data).
json
{
  "intent": "MakeDeposit",
  "slots": {
    "amount": {"type": "number", "required": true, "confirm": "sensitive"},
    "currency": {"type": "currency", "required": true, "default": "UAH"},
    "method": {"type": "payment_method", "required": false}
  }
}
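One way such a schema could drive the dialogue is sketched below. This is a minimal illustration, not a reference implementation: the function `next_step`, its return tuples, and the policy (apply defaults, ask for the first missing required slot, confirm when a sensitive slot is filled) are assumptions made for the example.

```python
import json

# Slot schema copied from the MakeDeposit example above.
SCHEMA = json.loads("""
{
  "intent": "MakeDeposit",
  "slots": {
    "amount": {"type": "number", "required": true, "confirm": "sensitive"},
    "currency": {"type": "currency", "required": true, "default": "UAH"},
    "method": {"type": "payment_method", "required": false}
  }
}
""")


def next_step(schema, filled, confirmed=False):
    """Return (action, slot_name, slots) for the next dialogue turn.

    Hypothetical policy: apply defaults, ask for the first missing
    required slot, then confirm once if a sensitive slot is filled.
    """
    slots = dict(filled)
    specs = schema["slots"]
    # Fill in defaults for slots the user did not mention.
    for name, spec in specs.items():
        if name not in slots and "default" in spec:
            slots[name] = spec["default"]
    # Ask one clear question per missing required slot.
    for name, spec in specs.items():
        if spec.get("required") and name not in slots:
            return ("ask", name, slots)
    # Sensitive values need an explicit confirmation before execution.
    if not confirmed:
        for name, spec in specs.items():
            if name in slots and spec.get("confirm") == "sensitive":
                return ("confirm", name, slots)
    return ("execute", None, slots)


print(next_step(SCHEMA, {}))
print(next_step(SCHEMA, {"amount": 500}))
print(next_step(SCHEMA, {"amount": 500}, confirmed=True))
```

With an empty slot set the sketch asks for `amount`; once the amount is known it requests confirmation, and only after that does it return `execute`.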
3) Patterns of dialogue
1. One-shot command: "Top up my account with 500 hryvnias via Apple Pay." → confirmation → action.
2. Clarifying dialogue: "Who should I transfer to?" → "How much?" → confirmation.
3. Step-by-step wizard: complex scenarios with data validation and a back step.
4. Intent recognition + NLU paraphrasing: support for varied phrasings.
5. Quick help: "What are the withdrawal limits?" - a short answer + "Show on screen."
4) Wording: voice and tone
Brand voice: confident, calm, friendly; no diminutives or jokes in critical steps (payments, security).
Maximum assistant utterance length: 1-2 sentences; split long answers into parts and offer "Continue?"
Ask specific questions: "How much would you like to top up?" instead of "What do we do next?"
5) Confirmations, safety and ethics
Strict confirmation of sensitive actions: read back the key parameters ("Top up 500 hryvnias with card ... 4581?").
Double confirmation for irreversible operations.
Never read out full personal data.
Cancel/undo option: "Cancel", "Stop", "Undo last step".
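The rules above could be expressed as a small policy, sketched here under stated assumptions: the names `confirmation_level`, `mask_card`, and `confirmation_prompt` are hypothetical, the "card ending in" wording is illustrative, and the card number in the usage line is a made-up test value.

```python
def mask_card(pan: str) -> str:
    """Keep only the last four digits so the full number is never spoken."""
    digits = [c for c in pan if c.isdigit()]
    return "card ending in " + "".join(digits[-4:])


def confirmation_level(sensitive: bool, irreversible: bool) -> str:
    """Map an action's risk flags to a confirmation level."""
    if irreversible:
        return "double"    # confirm twice before executing
    if sensitive:
        return "explicit"  # read back key parameters once
    return "none"


def confirmation_prompt(amount: int, currency_word: str, pan: str) -> str:
    """Build the read-back for a sensitive top-up."""
    return (f"Top up {amount} {currency_word} with the {mask_card(pan)}? "
            'Say "confirm" or "cancel".')


# Made-up test card number, used only to show the masking.
print(confirmation_prompt(500, "hryvnias", "4111 1111 1111 4581"))
print(confirmation_level(sensitive=True, irreversible=False))
```

Keeping the masking in one helper makes it harder for a new prompt template to read out a full card number by accident.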
6) Mistakes and misunderstanding
Failure types and responses:
- ASR error (did not hear): "I did not hear the amount. Please repeat it."
- NLU failure (did not understand): "I did not understand the request. I can top up your account or show your balance. Which would you like?"
- Missing data/restriction: "This method is not available in your region. Would you like to hear the other options?"
- Network/service: "There is no connection to the payment service right now. Do you want to try again in a minute?"
Rule: at most 2 attempts per question, then offer an alternative (screen/human operator).
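The two-attempt rule can be sketched as a per-slot failure counter. This is an illustration only: `RepromptPolicy`, the `"asr"`/`"nlu"` keys, and the escalation text are hypothetical, and the rule is read here as "escalate on the second failed attempt".

```python
from dataclasses import dataclass, field

MAX_ATTEMPTS = 2  # from the rule above

# Re-prompt texts modeled on the examples above; the keys are hypothetical labels.
REPROMPTS = {
    "asr": "I did not hear the {slot}. Please repeat it.",
    "nlu": "I did not understand the request. Please say the {slot} again.",
}
ESCALATION = "I sent the options to the screen. You can also ask for an operator."


@dataclass
class RepromptPolicy:
    failures: dict = field(default_factory=dict)

    def on_failure(self, slot: str, kind: str):
        """Count a failed attempt and decide: re-prompt or escalate."""
        count = self.failures.get(slot, 0) + 1
        self.failures[slot] = count
        if count >= MAX_ATTEMPTS:
            return ("escalate", ESCALATION)
        return ("reprompt", REPROMPTS[kind].format(slot=slot))

    def on_success(self, slot: str) -> None:
        """Reset the counter once the slot is filled."""
        self.failures.pop(slot, None)


policy = RepromptPolicy()
print(policy.on_failure("amount", "asr"))  # first failure: re-prompt
print(policy.on_failure("amount", "nlu"))  # second failure: escalate
```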
7) Speed and barge-in (interrupting)
Time to first response (TTFB): target under 300-500 ms; if it takes longer, play a short filler ("mm-hmm") or an earcon.
Barge-in: the user can interrupt the assistant at any time; handle the interruption correctly.
Streaming the answer: start speaking before the full text is ready, but never cut off mid-phrase.
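Streaming without cutting a phrase can be approximated by buffering generated text and releasing it to TTS only at sentence boundaries. The sketch below assumes a text generator that yields arbitrary chunks; `stream_sentences`, `needs_earcon`, and the 500 ms threshold (the upper end of the target above) are illustrative choices. The boundary regex is deliberately naive and would split after abbreviations such as "approx.".

```python
import re
from typing import Iterable, Iterator

EARCON_THRESHOLD_MS = 500  # assumed: upper end of the latency target


def needs_earcon(elapsed_ms: float) -> bool:
    """True when the response is late enough to warrant a filler sound."""
    return elapsed_ms > EARCON_THRESHOLD_MS


# A sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they are available."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    # Flush whatever remains when the generator is exhausted.
    if buffer.strip():
        yield buffer.strip()


chunks = ["Your balance is 12", "50 hryvnias. Shall", " we continue?"]
for sentence in stream_sentences(chunks):
    print(sentence)
```

Each yielded sentence can be sent to TTS immediately, and barge-in handling only needs to stop consuming the iterator.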
8) TTS/ASR and SSML: how to sound human
Pronunciation of numbers/currencies/dates: use local formats (in Ukrainian, "p'yatsot hryven" for 500 UAH, "15 lystopada" for 15 November).
Pauses and emphasis: SSML `<break time="300ms"/>`, `<emphasis level="moderate">`.
Reading abbreviations/codes: `<say-as interpret-as="characters">IBAN</say-as>`.
Rate and timbre: no faster than 0.9× the base rate, so speech stays intelligible.
xml
<speak>
  Top up by <say-as interpret-as="cardinal">500</say-as>
  <sub alias="hryvnia">UAH</sub>?
  <break time="300ms"/>
  Please confirm.
</speak>
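When SSML like the example above is assembled from runtime values, the values should be escaped so the markup stays well-formed. A minimal sketch using only the Python standard library follows; the function name `confirm_deposit_ssml` and its parameters are hypothetical, and support for individual SSML tags varies between TTS engines.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def confirm_deposit_ssml(amount: int, currency_code: str,
                         currency_word: str, pause_ms: int = 300) -> str:
    """Build a confirmation prompt as an SSML string."""
    return (
        "<speak>"
        f'Top up by <say-as interpret-as="cardinal">{int(amount)}</say-as> '
        # quoteattr adds the surrounding quotes and escapes the value.
        f"<sub alias={quoteattr(currency_word)}>{escape(currency_code)}</sub>? "
        f'<break time="{int(pause_ms)}ms"/> '
        "Please confirm."
        "</speak>"
    )


ssml = confirm_deposit_ssml(500, "UAH", "hryvnia")
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

Parsing the result with `ET.fromstring` is a cheap well-formedness check that can run in unit tests before any prompt reaches the TTS engine.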
9) Multimodality: voice + screen
Visual cues: confirmation card, list of methods, progress.
Hand-off to the screen: "I sent the options to the screen. Please select a method."
State synchronization: an action started by voice can be finished on the screen, and vice versa.
10) Multilingualism and localization
Auto-detect the language from the session/settings, not from a single phrase.
Glossary of terms: common terminology for RU/UA/TR/EN.
Regional formats of numbers/currencies/dates, pronunciation of names/toponyms.
Switching mid-dialogue: only by an explicit command such as "Switch to Ukrainian".
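Session-level language selection could look like the sketch below: per-utterance detection results are only votes, and the active language changes either on an explicit command or after several consistent detections. The class `SessionLanguage`, the command table, and the threshold of 3 votes are assumptions for illustration; the `detected` argument stands in for whatever language-ID component the system uses.

```python
from collections import Counter

# Hypothetical explicit switch commands and the language codes they select.
EXPLICIT_SWITCH = {
    "switch to ukrainian": "uk",
    "switch to english": "en",
}


class SessionLanguage:
    def __init__(self, default: str, min_votes: int = 3):
        self.current = default
        self.min_votes = min_votes
        self.votes = Counter()

    def observe(self, utterance: str, detected: str) -> str:
        """Update the session language and return the active one."""
        command = EXPLICIT_SWITCH.get(utterance.strip().lower())
        if command:
            # An explicit command always wins and resets the votes.
            self.current = command
            self.votes.clear()
            return self.current
        self.votes[detected] += 1
        lang, count = self.votes.most_common(1)[0]
        # A single differing phrase is not enough to switch.
        if lang != self.current and count >= self.min_votes:
            self.current = lang
        return self.current


session = SessionLanguage(default="en")
print(session.observe("top up my account", detected="uk"))    # stays "en"
print(session.observe("switch to ukrainian", detected="en"))  # switches to "uk"
```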
11) Accessibility (A11y) in voice
Action confirmations are clear and short.
Repeat on demand: "Repeat" replays the last utterance.
Volume/speed: "Speak slower/quieter/louder."
For the hearing impaired: subtitles/transcript on the screen, vibration signals.
For users with speech impairments: alternative input methods (buttons, presets).
12) Privacy, logging and compliance
Wake-word and recording indicator: explicit "listening" state.
Local processing, if possible; otherwise, data minimization.
Masking of sensitive fragments in logs (PAN, IBAN, address) and automatic redaction of audio.
Retention periods and the right to deletion on request; a "do not save history" setting.
Age restrictions/parental controls (children's voices/commands).
Transparency: "I am recording this command to improve recognition. You can disable this in settings."
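Log masking could start from pattern-based redaction applied before a transcript is written. The sketch below is an assumption-heavy illustration: the regular expressions are simplified, the function name `mask_log_line` is hypothetical, and production detection would normally add checksum validation (Luhn for PAN, mod-97 for IBAN) to limit false positives such as long numeric IDs. The card number and IBAN in the usage line are sample values.

```python
import re

# Simplified patterns (assumptions): 13-19 digits with optional separators,
# and a country code + 2 check digits + 11-30 alphanumerics.
PAN_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")


def _mask_pan(match: re.Match) -> str:
    """Replace all but the last four digits with asterisks."""
    digits = re.sub(r"\D", "", match.group())
    return "*" * (len(digits) - 4) + digits[-4:]


def mask_log_line(line: str) -> str:
    """Mask IBANs first, then card numbers, in a single log line."""
    line = IBAN_RE.sub(lambda m: m.group()[:4] + "****", line)
    return PAN_RE.sub(_mask_pan, line)


raw = "pay card 4111 1111 1111 4581 to UA213223130000026007233566001"
print(mask_log_line(raw))
```

IBANs are masked before card numbers so the long digit run inside an IBAN is never treated as a PAN.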
13) Assistant persona
Name/persona: a short biography and an area of competence: what it can and cannot do.
Tone for situations: normal (friendly), critical (neutral), educational (supportive).
Boundaries: "I don't give financial advice, but I can show you the help section."
14) VUI Quality Metrics
Intent recognition rate.
Slot fill rate and average turns to fill.
ASR WER/CER (word/character error rate).
Task success/completion rate and time to complete.
Escalation rate (to an operator/the screen).
Barge-in usage and p95 latency.
User satisfaction/CSAT after the scenario.
Abandonment by step.
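Two of these metrics are easy to compute offline from transcripts and timing logs. The sketch below shows WER as word-level edit distance divided by reference length, and p95 latency using the nearest-rank method; the function names and sample data are illustrative, and other percentile definitions (for example, interpolated) give slightly different values.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def percentile(values, p: int):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceiling division, no floats
    return ordered[max(rank, 1) - 1]


print(wer("top up five hundred hryvnias", "top up nine hundred hryvnias"))
print(percentile([220, 310, 280, 900, 350, 400, 260, 300, 330, 290], 95))
```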
15) Voice testing and QA
Test phrase sets: synonyms, colloquial forms, accents, errors.
Environment noises: street/car/kitchen, different microphones.
Dialogue replay: replayable scripts, a golden set for regression testing.
Wizard-of-Oz testing in the early stages.
Legal scenarios: how the assistant responds to potentially dangerous requests.
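A golden-set regression check can be as small as a list of (phrase, expected intent) pairs replayed against the NLU on every change. In the sketch below the keyword-based `classify` function is only a stand-in for a real NLU call, and the intent names and phrases are hypothetical examples.

```python
# Hypothetical golden set: (utterance, expected intent).
GOLDEN_SET = [
    ("top up my account with 500 hryvnias", "MakeDeposit"),
    ("deposit 200", "MakeDeposit"),
    ("what is my balance", "CheckBalance"),
    ("put 200 on my account", "MakeDeposit"),  # paraphrase the stub misses
]


def classify(utterance: str) -> str:
    """Stand-in keyword classifier; a real system would call its NLU here."""
    text = utterance.lower()
    if "top up" in text or "deposit" in text:
        return "MakeDeposit"
    if "balance" in text:
        return "CheckBalance"
    return "Unknown"


def run_regression(classify_fn, golden):
    """Return (accuracy, failures) for a classifier over the golden set."""
    failures = []
    for utterance, expected in golden:
        got = classify_fn(utterance)
        if got != expected:
            failures.append((utterance, expected, got))
    accuracy = 1 - len(failures) / len(golden)
    return accuracy, failures


accuracy, failures = run_regression(classify, GOLDEN_SET)
print(f"accuracy: {accuracy:.2f}")
for utterance, expected, got in failures:
    print(f"FAIL: {utterance!r} expected {expected}, got {got}")
```

The failure list is the useful output: each entry is a phrase to add to the intent's synonym list or training data.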
16) Product integration (iGaming cases)
Balance/deposit/withdrawal: "What's my balance?", "Top up 200 UAH...", "Withdrawal status."
Bonuses/missions: "What bonuses are available?", "Activate weekly cashback."
Responsible gaming: "Set a deposit limit of 1000 UAH per week."
System status: "Is there any maintenance in progress right now?"
17) Anti-patterns
Long monologues of the assistant without the opportunity to interrupt.
Implicit confirmations of monetary transactions.
A bare "I didn't understand" with no suggested options.
Excessive sounds/jingles that get in the way of comprehension.
Trying to solve by voice a task that needs detailed visual selection.
18) Prompt and response templates
Slot refinement (amount):
- Assistant: "How much would you like to top up?"
- User: "Five hundred."
- Assistant: "Top up 500 hryvnias? Please confirm."
- "Confirm a top-up of 500 hryvnias with card ... 4581. Say 'confirm' or 'cancel'."
- "I didn't hear the payment method. I can offer Apple Pay, card, or crypto wallet. Which would you like?"
- "I've sent the available methods to the screen. Choose one and say 'done' to continue."
19) Examples of SSML patterns
Numbers/currency and pause:
xml
<speak>
  Your current balance is
  <say-as interpret-as="cardinal">1250</say-as>
  <sub alias="hryvnia">UAH</sub>.
  <break time="250ms"/>
  Shall we continue?
</speak>
Emphasis on the important word:
xml
<speak>
  <emphasis level="moderate">Caution</emphasis>: verification is required for withdrawal.
</speak>
Pronunciation of the abbreviation:
xml
<speak>
  Top up via <say-as interpret-as="characters">IBAN</say-as>?
</speak>
20) Checklists
Pre-Release Dialogue/Content
- For each intent - a list of synonyms/phrase variants.
- One clear question per required slot.
- Sensitive actions - with explicit confirmation.
- A short alternative via screen/operator is available.
- Utterances ≤ 2 sentences; longer answers are split with "Continue?".
Technique and quality
- Barge-in is supported, and the dialogue resumes correctly after an interruption.
- p95 latency is within target; earcons play on delay.
- SSML is configured: pauses, numbers, emphasis.
- Logs are anonymized/masked; history management is available.
- Multilingualism and local formats tested.
A11y and safety
- "Repeat/Speak Slower/Louder" works.
- Complete personal/payment data is not announced.
- There is a cancellation/rollback of the action by voice.
- Age and regional limits tested.
21) Dialog specification framework (template)
Scenario goal: (for example, "Deposit in ≤ 90 seconds")
Intents and synonyms: a list of example phrases.
Slots: `amount` (req, confirm), `currency` (default=UAH), `method` (enum).
Confirmation rules: which values/thresholds to read back.
Error cases: ASR, NLU, service unavailable - texts + branches.
Multimodal outputs: which cards/screens we show.
Logs and privacy: what we mask and how, retention TTL.
Final cheat sheet
Define intents, slots, and confirmation rules first; write the texts afterwards.
Speak briefly, let them interrupt and cancel.
Configure SSML, local formats, and tone by context.
Keep privacy and logging under control.
Measure Intent/Slot/ASR metrics, Task Success, and latency.
Always offer a screen alternative and a path to a human.