Best Practices for Evaluating ASR and TTS in Companion AI Assistants

Introduction

Companion AI assistants rely on Automatic Speech Recognition (ASR) to understand users and Text-to-Speech (TTS) to respond with natural voice. Ensuring these components are high-quality is critical for a good user experience. Evaluating ASR and TTS involves preparing robust test sets, using objective and subjective metrics (e.g. WER for ASR accuracy, MOS for TTS quality), measuring latency, and following industry-standard methodologies. Below, we outline best practices for evaluating ASR and TTS, including how to prepare evaluation data, key performance metrics, and real-world evaluation methodologies and case studies.

Evaluating Automatic Speech Recognition (ASR) Performance

Preparing ASR Evaluation Datasets

Creating a thoughtful evaluation dataset is the first step to reliably measure an ASR system’s performance. Diversity and realism are key considerations: the test set should represent the range of scenarios the assistant will encounter. Best practices for ASR evaluation data include:

  • Collect Diverse Speakers and Accents: Include voices of different genders, ages, accents, and dialects. Many teams initially overlook diversity, leading to bias and poor recognition for certain user groups (www.willowtreeapps.com). Actively seek out varied accents and speaking styles to ensure the ASR works for a broad user base.
  • Include Noisy and Realistic Conditions: Incorporate background noise, various microphone qualities, and distances. Real-world speech often occurs with noise or over imperfect connections, so test data should reflect those conditions (www.willowtreeapps.com). This ensures the ASR’s noise robustness is evaluated, not just its performance in quiet settings.
  • Cover Typical and Edge-case Utterances: It’s easy to focus on common phrases and miss unusual but important cases (www.willowtreeapps.com). Brainstorm edge cases (e.g. very fast speech, uncommon words, slang, or tricky proper nouns) and include them. This helps reveal where the ASR might fail on less frequent inputs.
  • Use Real Speech (Augmented with Synthetic): If possible, have human speakers record the test utterances for natural variability. Synthetic voices (TTS-generated) can supplement to quickly expand diversity, but avoid over-reliance on synthetic data (www.willowtreeapps.com). Real human recordings capture spontaneous speech nuances that TTS might miss, like natural hesitations or coarticulation.
  • Define a “Gold Standard” Transcript: For each test audio, maintain a reference text (what should be recognized). This gold standard will be used to compare the ASR output and calculate accuracy (www.willowtreeapps.com). Ensure transcripts are carefully vetted for correctness, since they serve as ground truth; a minimal manifest sketch that ties transcripts to speaker and noise metadata follows this list.

By preparing a comprehensive evaluation set with varied speakers, noise conditions, and utterance types, you can fairly assess the ASR’s accuracy and robustness before deployment. As one voice assistant developer notes, capturing many different voices and scenarios in testing is essential to uncover issues before users do (www.willowtreeapps.com).
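
To keep these properties auditable and easy to slice during analysis, many teams store a per-utterance manifest alongside the audio. The sketch below is a minimal, hypothetical example in Python; the field names (accent, noise_condition, is_edge_case, etc.) are illustrative choices, not a standard schema.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class AsrTestUtterance:
    """One entry in a hypothetical ASR evaluation manifest."""
    audio_path: str          # path to the recorded (or synthetic) audio
    gold_transcript: str     # carefully vetted reference text
    speaker_id: str          # anonymized speaker identifier
    accent: str              # e.g. "en-IN", "en-US-southern"
    age_group: str           # e.g. "18-25", "65+"
    noise_condition: str     # e.g. "quiet", "cafe", "car", "far-field"
    is_synthetic: bool       # True if TTS-generated rather than human speech
    is_edge_case: bool       # fast speech, slang, rare proper nouns, etc.

manifest = [
    AsrTestUtterance("audio/0001.wav", "turn off the kitchen lights",
                     "spk_042", "en-IN", "18-25", "far-field", False, False),
    AsrTestUtterance("audio/0002.wav", "play the new single by Rosalía",
                     "spk_108", "en-US", "65+", "cafe", False, True),
]

# Persisting the manifest makes it easy to slice results later
# (e.g. WER broken down by accent or by noise condition).
with open("asr_eval_manifest.jsonl", "w", encoding="utf-8") as f:
    for utt in manifest:
        f.write(json.dumps(asdict(utt), ensure_ascii=False) + "\n")
```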

Key ASR Performance Metrics (Accuracy and Latency)

Once the test set is ready, we need to quantify ASR performance. The primary metric for ASR accuracy is Word Error Rate (WER). WER compares the ASR’s transcript to the reference transcript, calculating the proportion of errors (substitutions, deletions, insertions) per total words (waywithwords.net). For example, a WER of 5% means 5% of the words were recognized incorrectly. A lower WER indicates higher accuracy, making it the de facto standard in industry and academia for ASR evaluation (machinelearning.apple.com). WER is useful for benchmarking models and tracking improvements over time (waywithwords.net). However, WER alone has limitations: it treats all errors equally, even though some mistakes are more damaging to understanding than others (machinelearning.apple.com). For instance, misrecognizing a keyword (“turn off the lights” vs “turn on the lights”) is far worse than dropping a minor filler word. Thus, teams often look beyond raw WER to get a fuller picture of ASR quality.
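
As a concrete illustration, WER can be computed with the open-source jiwer library. The snippet below is a minimal sketch that assumes `pip install jiwer` and applies a deliberately simple text normalization; real pipelines normalize numbers, casing, and punctuation more carefully so that formatting differences are not counted as recognition errors.

```python
# pip install jiwer
import jiwer

def normalize(text: str) -> str:
    """Very simple normalization so casing/punctuation differences
    are not counted as word errors."""
    return " ".join(
        text.lower().replace(",", "").replace(".", "").replace("?", "").split()
    )

references = [
    "turn off the kitchen lights",
    "what's the weather like in Boston tomorrow",
]
hypotheses = [
    "turn on the kitchen lights",          # substitution: "off" -> "on"
    "what's the weather like in Boston tomorrow",
]

refs = [normalize(r) for r in references]
hyps = [normalize(h) for h in hypotheses]

# Corpus-level WER: total errors divided by total reference words.
print(f"WER: {jiwer.wer(refs, hyps):.1%}")
```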

Other metrics and qualitative assessments used alongside WER include:

  • Semantic Understanding Metrics: Especially for voice assistants, capturing the meaning correctly is more important than verbatim transcription (www.amazon.science). An ASR might skip a trivial word yet still capture the user’s intent. To account for this, researchers propose metrics that measure semantic similarity between the ASR output and reference. For example, Amazon scientists developed a system to detect whether an ASR hypothesis preserves the semantic meaning of the request (www.amazon.science). This helps identify “significant errors” – those likely to cause a misunderstanding of the user’s intent – even if the overall WER is low. In practice, teams might compute an intent error rate or use human judgment to flag transcripts that would lead the assistant to the wrong action.
  • Major/Minor Error Classification: Similarly, Apple’s ASR team introduced a refined metric called Human Evaluation WER (HEWER) to focus on errors that impact transcript readability and meaning. HEWER ignores minor issues like filler words or harmless misspellings and only counts major errors that change meaning or hinder understanding (machinelearning.apple.com). In an Apple case study transcribing podcasts, the raw WER was about 9.2%, but the “major” error rate (HEWER) was only 1.4%, indicating the transcripts were much better than WER alone suggested (machinelearning.apple.com). This kind of nuanced evaluation is an emerging best practice to ensure ASR quality is judged by user-relevant criteria, not just raw error counts.
  • Sentence Error Rate (SER): This metric measures the percentage of sentences that contain any error. It’s a harsher measure – if a single word in a sentence is wrong, that sentence is counted as incorrect. SER is useful for tasks where an entire command must be understood perfectly or not at all. For example, a voice command either succeeds or fails; SER tracks how often failures occur. Teams sometimes use SER alongside WER to gauge whole-utterance success rate (waywithwords.net).
  • Latency (Response Time): In interactive assistants, speed is as important as accuracy. Latency refers to how long the ASR takes to process speech and return a result. High accuracy is great, but if it takes several seconds to get a transcription, the user experience suffers. Therefore, real-time or near-real-time transcription is a requirement (waywithwords.net). Typically, modern ASR systems aim to operate faster than the spoken audio duration (a real-time factor < 1.0). For streaming speech input, partial results should appear almost instantaneously as the user speaks. Industry practice is to measure end-to-end ASR latency from speech input to final text output. Many voice assistants strive for total response times (ASR + response generation + TTS) within a second or two, so the ASR component is budgeted only a few hundred milliseconds. There is an inherent trade-off between accuracy and speed – larger, more accurate models may run slower (www.dialpad.com). Engineers thus optimize models and use techniques like streaming decoding to balance this trade-off. The goal is to keep ASR latency low enough that interactions feel natural and uninterrupted (the sketch below shows a minimal way to measure SER and real-time factor).

In summary, WER remains the key metric for ASR accuracy (machinelearning.apple.com), but it should be complemented with deeper analysis. Looking at semantic errors or task success rates ensures the ASR is evaluated on what truly matters for an assistant’s performance (did it understand the user?). And beyond accuracy, latency and real-time capabilities are essential metrics to track, since even the most accurate ASR isn’t useful if it can’t respond in a timely manner.
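
To make the sentence-level and latency metrics concrete, here is a minimal sketch of SER and real-time-factor measurement; `transcribe` is a hypothetical stand-in for whatever ASR entry point is under test.

```python
import time

def sentence_error_rate(refs, hyps):
    """Fraction of utterances with at least one error (exact-match comparison here)."""
    wrong = sum(1 for r, h in zip(refs, hyps) if r.strip() != h.strip())
    return wrong / len(refs)

def measure_latency_and_rtf(transcribe, audio, audio_duration_s):
    """Wall-clock latency and real-time factor for one utterance.

    `transcribe` is a hypothetical stand-in for the ASR under test;
    RTF < 1.0 means the recognizer runs faster than the audio it consumes.
    """
    start = time.perf_counter()
    hypothesis = transcribe(audio)
    latency_s = time.perf_counter() - start
    return hypothesis, latency_s, latency_s / audio_duration_s
```

Reporting the full latency distribution over the evaluation set (e.g. median and 95th percentile) is usually more informative than a single average.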

ASR Evaluation Standards and Methodologies in Practice

Industry standards for ASR evaluation typically involve testing on established datasets and using consistent metrics like WER. For example, ASR systems are often benchmarked on datasets like LibriSpeech or Switchboard to compare performance using WER under common conditions. In the voice assistant domain, companies also create internal test sets representative of their users’ requests (including far-field speech from smart speakers, conversational queries, etc.) and measure WER on those to track real-world performance. It’s standard to report WER along with conditions (noisy vs quiet, native vs accented speakers, etc.), since context greatly affects error rates (waywithwords.net).

In real-world projects, evaluation methodologies frequently combine objective metrics with human judgment. A common workflow is: run the ASR on a large sample of real user utterances (or realistic simulated queries), then have human annotators review a subset of transcripts. This reveals errors that metrics might miss and helps assess whether errors would lead to misunderstanding or just minor typos. For instance, Amazon’s Alexa team uses manual review to identify high-impact ASR errors that lead to failed voice commands, helping prioritize what to fix. The earlier Amazon Science study is an example where an automated system was built to flag transcription differences that are semantically significant, because those are the errors likely to cause downstream task failures (www.amazon.science). Similarly, Apple’s HEWER metric came from having annotators classify errors in transcripts by severity (machinelearning.apple.com) – an approach that blends human evaluation with metric design.
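
The snippet below sketches one way such semantic flagging could be automated with an off-the-shelf sentence-embedding model (the sentence-transformers library); the model choice and the 0.85 threshold are illustrative assumptions, not the specific approach Amazon or Apple used.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder

def flag_semantic_errors(references, hypotheses, threshold=0.85):
    """Return (index, similarity) pairs whose meaning likely drifted from the reference.

    The 0.85 cosine-similarity threshold is an illustrative assumption; in practice
    it should be tuned against human judgments of intent preservation.
    """
    ref_emb = model.encode(references, convert_to_tensor=True)
    hyp_emb = model.encode(hypotheses, convert_to_tensor=True)
    flagged = []
    for i in range(len(references)):
        sim = util.cos_sim(ref_emb[i], hyp_emb[i]).item()
        if sim < threshold:
            flagged.append((i, sim))
    return flagged

print(flag_semantic_errors(
    ["turn off the lights in the bedroom"],
    ["turn on the lights in the bedroom"],   # one-word change that flips the intent
))
```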

Another best practice is end-to-end voice assistant testing: rather than evaluating ASR in isolation, the ASR output is fed into the assistant’s language understanding and the overall success of answering a query is measured. One industry case study outlines a four-step end-to-end test: (1) Create a “golden” set of user queries with expected assistant responses, (2) Generate audio for these queries with many voices and accents, (3) Run the audio through the full voice assistant (ASR → NLU → response generation → TTS), (4) Compare the assistant’s actual responses to the expected answers (www.willowtreeapps.com). Such holistic testing can reveal if ASR errors (or any component errors) actually cause user-visible failures. If the assistant still responds correctly, an ASR mistake on an irrelevant word might be tolerated; if the assistant misbehaves, that’s a critical issue. This methodology was applied in a taco-ordering voice app example: the team synthesized diverse user utterances (or recorded them) and verified the assistant’s spoken reply was correct for each scenario (www.willowtreeapps.com). This end-to-end approach is increasingly common in industry to ensure the ASR and subsequent components work together seamlessly.
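
A harness for this four-step test can be quite small. The sketch below is a hypothetical outline; `synthesize_query_audio`, `run_voice_assistant`, and `responses_match` are stand-ins for your own TTS-based audio generation, the full ASR → NLU → response → TTS pipeline, and whatever response comparator (exact match, regex, or human/LLM judgment) the team chooses.

```python
def run_end_to_end_eval(golden_set, synthesize_query_audio, run_voice_assistant,
                        responses_match):
    """Hypothetical four-step end-to-end harness.

    golden_set: list of dicts like
        {"query_text": "...", "expected_response": "...", "voice": "en-US-accent-3"}
    """
    failures = []
    for case in golden_set:
        audio = synthesize_query_audio(case["query_text"], voice=case["voice"])  # step 2
        actual = run_voice_assistant(audio)                                      # step 3
        if not responses_match(actual, case["expected_response"]):               # step 4
            failures.append({**case, "actual_response": actual})
    pass_rate = 1 - len(failures) / len(golden_set)
    return pass_rate, failures
```

Tracking the pass rate per voice or accent (rather than only overall) helps pinpoint whether failures come from ASR, NLU, or response generation.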

Case Study – Custom Metrics for ASR: A noteworthy example comes from Apple’s evaluation of podcast transcription. Standard WER was used to gauge overall accuracy, but Apple found it necessary to invent a new metric (HEWER) to focus on errors affecting user experience (readability and meaning) (machinelearning.apple.com). By analyzing transcripts, they discovered that many “errors” WER counted were actually minor (e.g., including a filler word or a harmless casing difference). By ignoring those, HEWER gave a much lower error rate (1.4% vs 9.2% WER) and correlated better with what a user would consider a high-quality transcript (machinelearning.apple.com). This case highlights that industry teams often adjust their evaluation criteria to match the application’s needs. In voice assistants, an analogous practice is to focus on command understanding rate – e.g., measuring the percentage of voice commands handled correctly end-to-end, which might forgive small ASR glitches as long as the intent was correctly captured.

Overall, evaluating ASR for companion AI assistants should follow industry standards (like reporting WER) while also incorporating use-case-specific metrics and tests. Ensuring tests cover realistic conditions and using semantic or user-centric error measures will lead to a more accurate reflection of an ASR system’s usefulness in a real assistant.

Evaluating Text-to-Speech (TTS) Quality and Performance

Preparing TTS Evaluation Sets

Evaluating a TTS system requires test data that reflects the utterances and styles the assistant will speak. Unlike ASR, where the input is audio, for TTS the input is text and the output is audio. Key guidelines for preparing TTS evaluation sets include:

  • Representative Text Samples: Collect a set of sentences or prompts that cover the range of things the assistant will say. This should include typical user-facing responses (e.g. answering questions, giving confirmations, small talk), and any special content like reading out names, dates, or addresses. If the assistant must convey emotion or varying tone, include prompts that elicit those speaking styles. A diverse text set ensures the TTS is evaluated on simple sentences as well as phonetically or linguistically challenging ones (tongue-twisters, uncommon words, etc.). For example, one TTS benchmark included various real-world use case phrases to compare voice quality across systems (smallest.ai).
  • Coverage of Phonetic/Linguistic Variety: Ensure the evaluation texts cover a broad set of phonemes, stress patterns, and sentence lengths. In practice, TTS evaluations often include phonetically balanced sentences or even semantically unpredictable sentences (SUS) – sentences that are grammatically correct but meaningless, used to test intelligibility without semantic context. In the Blizzard Challenge (a yearly TTS evaluation), systems are tested on SUS to assess raw intelligibility, with listeners transcribing the speech and computing WER on those transcripts (www.isca-archive.org). By including such material, evaluators can catch if certain sounds or words are consistently mispronounced or unintelligible.
  • Reference Audio (for comparison): If evaluating how natural or human-like the TTS is, it helps to have reference recordings. Many studies include a human narration of the same test sentences as a “ground truth” for natural speech quality. This can be used in certain test paradigms (e.g., ABX tests where listeners compare the synthetic voice to a real voice, or MOS comparisons between synthetic and human speech). If the assistant’s voice is modeled after a specific persona or voice actor, recordings of that voice actor can serve as a reference to measure speaker similarity in the synthetic speech.
  • Multiple Speaking Styles if Applicable: Some advanced assistants can adjust speaking style, speed, or emotion. In such cases, evaluation should test each style. For instance, if a TTS can speak both neutrally and excitedly, include prompts where an excited tone is expected to see if the TTS conveys it. Similarly, test different speeds or volumes if the assistant will use them. The evaluation set effectively becomes a script that probes all important aspects of the TTS system’s capability.
  • Listener Panel Diversity: While not part of the dataset per se, when preparing for subjective evaluation (like MOS tests), plan to recruit a diverse set of listeners. Make sure the group of evaluators (or crowd-workers) includes people from the target user demographic, and enough listeners to get reliable scores. Standard practice in research is to have dozens of listeners each rate many samples, to average out individual subjective biases (arxiv.org). Also, ensure listeners have good headphones and a quiet environment, or specify listening conditions as part of the test design.

By crafting a text corpus that is wide-ranging yet relevant, and by planning a solid evaluation procedure (including reference recordings and diverse listeners), one can thoroughly assess a TTS system’s quality in the context of a companion assistant. The goal is to verify that the synthetic voice will sound clear, natural, and appropriate across all the content it needs to speak; a small prompt-manifest sketch illustrating this kind of coverage follows below.
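
A prompt manifest that tags each sentence with what it is probing keeps this coverage explicit. The example below is a small illustrative sketch; the categories and style labels are assumptions, not a standard taxonomy.

```python
import csv

# Each prompt records what it is probing so results can be broken down later.
tts_eval_prompts = [
    {"text": "Your reservation is confirmed for 7:30 pm on March 3rd.",
     "category": "confirmation", "style": "neutral"},
    {"text": "Great news! You just hit your weekly step goal!",
     "category": "small_talk", "style": "excited"},
    {"text": "The contact's address is 1142 Beaumont-Kilbride Road, Unit 6B.",
     "category": "entity_readout", "style": "neutral"},
    {"text": "The vast clocks rode a purple lunch below the drum.",
     "category": "sus_intelligibility", "style": "neutral"},  # semantically unpredictable
]

with open("tts_eval_prompts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "category", "style"])
    writer.writeheader()
    writer.writerows(tts_eval_prompts)
```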

Key TTS Performance Metrics (MOS, Intelligibility, Latency)

Evaluating TTS involves both subjective measures (human judgment of quality) and objective measures (automated or task-based metrics). The most important metrics include:

  • Mean Opinion Score (MOS): MOS is the classic metric for speech quality, widely used to rate TTS output naturalness. In a MOS test, human listeners hear samples of the synthetic speech and rate them on a scale (typically 1 to 5, where 5 means the speech sounds completely natural and pleasant, and 1 means it is poor quality). MOS provides an average quality score for the system. It captures aspects like the naturalness of the voice, clarity, and absence of distortions (smallest.ai). A high MOS (closer to 5) indicates the TTS is nearly indistinguishable from a human speaker, whereas a lower MOS indicates the speech may sound robotic, muffled, or unnatural. For example, modern neural TTS systems (like WaveNet and beyond) often achieve MOS in the 4.0+ range, approaching human speech, which would score around 4.5–5.0 (arxiv.org). MOS testing is considered an industry standard for TTS evaluation – virtually every research paper or product update on TTS will report MOS to demonstrate quality improvements. However, MOS tests are subjective and can be influenced by the test setup. Best practices (per research guidelines) include: ensuring enough listeners and ratings per sample, clearly defining the rating question (e.g. “rate the naturalness of this speech on a 1-5 scale”), and performing statistical analysis to confirm differences are significant (arxiv.org). It’s also common to report confidence intervals for MOS scores, given the variability of human opinions (a small aggregation sketch below shows one way to compute them).

  • Other Listening Test Paradigms: In addition to MOS, evaluators use alternative subjective tests to get more detailed comparisons. AB tests ask listeners to pick which of two samples is better (or whether they have no preference), which can be more sensitive when differences are small. ABX tests have listeners compare two samples and a reference (X) – often used to test whether X is closer to A or B (for example, whether the synthetic voice is closer to the original speaker’s voice). MUSHRA tests (a scaled comparison test) are used in audio coding and sometimes adapted for TTS by rating multiple samples on a scale in direct comparison, though true MUSHRA requires a hidden reference and anchor, which aren’t always applicable for TTS (arxiv.org). Another method is Best-Worst Scaling (BWS), where listeners rank a set of samples from best to worst (arxiv.org). These methods can complement MOS. In fact, one best-practice guideline suggests that if only a basic MOS is reported, authors should also analyze why one system is better than another, because MOS alone can become “saturated” at high levels (arxiv.org). For a companion assistant, AB tests might be used during development to quickly decide between two TTS voices (“Which do users prefer, Voice A or Voice B?”).

  • Intelligibility Metrics: A crucial aspect of TTS (especially for an assistant giving information) is that the synthesized speech is intelligible – the user can understand every word. Intelligibility can be measured by having listeners transcribe what they hear or answer questions about the content. For instance, an intelligibility test might use semantically unpredictable sentences (so the listener can’t guess words from context) and then calculate the WER of the listener’s transcriptions against the original text. This was done in the Blizzard Challenge evaluations, where some systems’ intelligibility was compared by WER on listener transcriptions (e.g., one system had 17.7% WER in the SUS intelligibility test, meaning listeners misheard roughly 17% of the words) (www.isca-archive.org). A lower WER in this context means the TTS is easier to understand. Intelligibility is often near 100% for today’s high-quality TTS on clear sentences, but it can drop if the TTS has glitches, unnatural prosody, or if the content has difficult words (for example, spelling out an email address). In industry, a proxy for intelligibility is to run a strong ASR on the TTS output and see if it transcribes it correctly – essentially treating the TTS+ASR loop as an automatic intelligibility check (a minimal version is sketched below). While not perfect, this automated method is sometimes used in regression testing to catch any new TTS issue that causes words to become garbled (if the ASR’s WER on TTS output spikes, there may be a problem).

  • Latency and Throughput: Just like ASR, TTS must also operate within tight time constraints for a smooth dialogue. TTS latency is the time from receiving text to the audio being ready to play. A companion assistant should respond quickly after the user stops speaking, which means the TTS likely has only a fraction of a second to start producing speech. Modern TTS systems often measure latency in terms of real-time factor (RTF) – if RTF = 0.5, it means generating 1 second of speech takes 0.5 seconds of processing (twice as fast as real-time). Cloud TTS APIs usually report an end-to-end latency; for instance, one benchmarking report compared two TTS engines and found they could generate complete responses in a few hundred milliseconds on average (around 330–340 ms) (smallest.ai). Those sub-second latencies are suitable for real-time voice assistants. To evaluate latency, developers measure the distribution of TTS response times for a set of test phrases or use automated tools to log time-to-speech. Some industry guidelines suggest targets like under 200 ms for short prompts to be “instantaneous,” though longer utterances will take more time. The key is that the first audio should be heard quickly (low time-to-first-frame), and overall the TTS should not introduce noticeable delay in conversation. In addition to raw speed, streaming TTS is an emerging approach: instead of waiting to generate the full sentence audio, the system can start outputting audio while still computing the rest. This can effectively hide latency for longer outputs and is something to consider when evaluating real-world responsiveness.

  • Voice Naturalness and Expressiveness: These aspects are usually covered by MOS, but sometimes more granular metrics or tests are used. For example, if the assistant’s voice should convey emotions or a persona, evaluators might specifically ask listeners to rate the expressiveness or appropriateness of emotion in the speech. There is no single numeric metric for “expressiveness,” but one can design a Likert-scale opinion test for it or use qualitative feedback. In research, some papers use separate MOS scores for attributes like “speaker similarity” (how much the TTS voice matches a target voice) or “prosody naturalness.” In an assistant context, one might measure how well the TTS handles emphasis – e.g., does it stress the correct words in a question. These are specialized evaluations depending on the use case of the assistant.

In practice, MOS remains the cornerstone metric for TTS quality, often supplemented by intelligibility checks and other subjective tests. A high MOS indicates the TTS is generally pleasing to listeners (smallest.ai), but developers will also pay attention to specific problem cases (such as mispronunciations of certain words, or failure to render the correct tone for a question vs. a statement). By combining MOS results with targeted tests (like intelligibility or AB comparisons), one can diagnose not just how good the TTS sounds overall, but where it might need improvement. Additionally, tracking latency ensures the TTS is not only good-sounding but also fast enough for interactive use.

In summary, a well-rounded TTS evaluation assesses naturalness (MOS), intelligibility (can users understand it), and efficiency (latency/RTF), covering both the perceptual quality and the performance of the speech synthesis.
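
For the MOS reporting above, a minimal sketch of computing a mean score with a t-distribution confidence interval (using NumPy and SciPy) might look like the following; a real analysis would also model per-listener and per-sentence effects.

```python
# pip install numpy scipy
import numpy as np
from scipy import stats

def mos_with_ci(ratings, confidence=0.95):
    """Mean opinion score with a t-distribution confidence interval.

    `ratings` is a flat list of 1-5 listener scores for one system.
    """
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean()
    sem = stats.sem(ratings)
    half_width = sem * stats.t.ppf((1 + confidence) / 2, len(ratings) - 1)
    return mean, (mean - half_width, mean + half_width)

system_a = [4, 5, 4, 4, 3, 5, 4, 4, 5, 4]
system_b = [3, 4, 4, 3, 4, 3, 4, 3, 3, 4]
print("A:", mos_with_ci(system_a))
print("B:", mos_with_ci(system_b))
# A significance test indicates whether the MOS gap is more than listener noise.
print(stats.ttest_ind(system_a, system_b))
```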
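
And for the automated intelligibility and latency checks described in the list, a single helper can time synthesis, compute the real-time factor, and run the TTS → ASR round trip. `synthesize` and `transcribe` are hypothetical stand-ins, and the round-trip WER is only a regression-test proxy, not a substitute for listening tests.

```python
import time
import jiwer  # pip install jiwer

def evaluate_tts_utterance(text, synthesize, transcribe, sample_rate=22050):
    """Latency, real-time factor, and an automatic intelligibility proxy.

    `synthesize` and `transcribe` are hypothetical stand-ins for the TTS under
    test and a strong reference ASR.
    """
    start = time.perf_counter()
    audio = synthesize(text)                 # e.g. an array of audio samples
    latency_s = time.perf_counter() - start

    audio_duration_s = len(audio) / sample_rate
    rtf = latency_s / audio_duration_s       # < 1.0 means faster than real time

    round_trip_wer = jiwer.wer(text.lower(), transcribe(audio).lower())
    return {"latency_s": latency_s, "rtf": rtf, "round_trip_wer": round_trip_wer}
```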

TTS Evaluation Standards and Real-World Methodologies

In industry and academia, subjective listening tests following established standards are the norm for TTS evaluation. A common standard is derived from ITU-T recommendations (like P.800) for conducting MOS tests, which provide guidelines on how to collect opinions reliably. For example, tests are typically double-blind (listeners don’t know which system produced which sample), and samples are randomized to avoid bias. It’s also standard to normalize audio playback volume, since volume differences can skew perception of quality (arxiv.org). These practices ensure that MOS scores from different experiments or papers are as fair and comparable as possible. In academic competitions like the Blizzard Challenge, all participant TTS systems are evaluated by the same panel of listeners under the same conditions. These evaluations often report multiple metrics: MOS for naturalness, MOS for voice similarity to a target speaker, and a word error rate for intelligibility, among others (www.isca-archive.org). The Blizzard Challenge has noted over the years that some top TTS systems achieved MOS scores indistinguishable from human recordings on certain tests (papers.ssrn.com), which shows how far the technology has come.
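
A small amount of tooling is enough to enforce the blinding and randomization described above. The sketch below builds a per-listener playlist with hidden system labels; it is an illustrative outline under simple assumptions, not a replication of any particular challenge’s infrastructure.

```python
import random

def build_blinded_playlist(samples_by_system, listener_seed):
    """Shuffle (system, sample) pairs per listener and hide the system label.

    samples_by_system: e.g. {"system_A": ["a/001.wav", ...], "system_B": [...]}
    Returns the listener-facing playlist plus a private key for later unblinding.
    """
    rng = random.Random(listener_seed)       # reproducible per listener
    items = [(system, path) for system, paths in samples_by_system.items()
             for path in paths]
    rng.shuffle(items)
    playlist = [{"item_id": i, "audio_path": path}
                for i, (_, path) in enumerate(items)]
    unblinding_key = {i: system for i, (system, _) in enumerate(items)}
    return playlist, unblinding_key
```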

Real-world product evaluations of TTS (for a voice assistant’s voice) often involve both expert listening panels and crowd-sourced user testing. Companies like Google, Amazon, and Apple routinely run internal studies where dozens of people listen to new TTS versions and rate them against the current version. If the new model doesn’t achieve a significantly higher MOS or preference score, it might not be deployed. For instance, when Amazon introduced a new voice for Alexa or when Google launched its WaveNet voices, they reported MOS improvements approaching human quality to validate the upgrade. In addition, these companies pay attention to qualitative feedback from users – e.g., monitoring forums or feedback channels for complaints about pronunciation or tone, which then feed into evaluation criteria for the next iteration.

An industry case study in TTS evaluation can be seen in a public benchmark comparing two TTS services (Smallest.ai and ElevenLabs). They focused on two critical metrics, latency and MOS, to reflect real-time voice assistant needs (smallest.ai). In their tests, both engines were evaluated on the same set of sentences, and MOS was obtained for speech quality. Interestingly, they used modern techniques to estimate MOS: rather than a large listening test, they employed open-source MOS prediction models (like Wav2Vec-based MOS predictors) to automate the quality scoring (smallest.ai). This allowed them to efficiently benchmark many samples. The results showed one system had a clear edge in latency and a measurable advantage in MOS, demonstrating how even small MOS differences can be meaningful when averaged over many samples (smallest.ai). This example illustrates a practical methodology: use a well-chosen set of test sentences, measure latency objectively, and use either human raters or reliable MOS prediction models to assess quality, then compare systems side by side. Transparency in such evaluations (sharing test data and methods) is important for credibility (smallest.ai).
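
Automated MOS prediction of this kind can be wrapped in a few lines once a predictor is chosen. In the sketch below, `load_audio` and `predict_mos` are hypothetical stand-ins for an audio loader and whichever open-source MOS-prediction model a team adopts; predicted scores are best treated as a screening signal and spot-checked against real listener ratings.

```python
import statistics

def benchmark_predicted_mos(audio_paths, load_audio, predict_mos):
    """Aggregate automatically predicted MOS over a benchmark set.

    `load_audio` and `predict_mos` are hypothetical stand-ins; the report above
    mentions Wav2Vec-based predictors as one option for the latter.
    """
    scores = [predict_mos(load_audio(path)) for path in audio_paths]
    return {
        "mean_predicted_mos": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "n_samples": len(scores),
    }
```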

When deploying a TTS voice for a companion assistant, teams also consider long-form evaluations and user feedback. A voice that scores well on short prompts might still have issues in longer monologues (like reading a news briefing) – maybe it becomes monotonous or the prosody issues become more noticeable over time. Thus, some evaluations involve listening to the TTS over several minutes to ensure it remains pleasant. Additionally, since the ultimate judges are end-users, companies sometimes do A/B tests in production: a fraction of users get the new TTS voice and their engagement or satisfaction ratings are monitored relative to the old voice. If users with the new voice interact longer or report higher satisfaction, that’s strong evidence of improvement beyond the lab metrics (a minimal analysis sketch appears at the end of this subsection).

To sum up, evaluating TTS for a voice assistant adheres to industry standards like MOS testing for quality and latency measurement for performance. It also requires careful test design (covering various content and conditions) and often a combination of subjective listening tests and objective checks. Case studies from both academia (e.g., Blizzard Challenge) and industry show the importance of multiple evaluation dimensions: great TTS systems are those that score well in naturalness, are intelligible, and meet real-time constraints. As a best practice, any changes to the TTS should be validated on a representative evaluation set and ideally with real user trials to ensure the voice truly enhances the companion assistant’s user experience.
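
For the production A/B or preference tests described above, even a simple significance check on the preference counts helps guard against reading too much into small differences. The sketch below uses SciPy’s binomial test on hypothetical counts.

```python
from scipy.stats import binomtest  # requires SciPy >= 1.7

# Hypothetical outcome of a forced-choice preference test (or an in-production
# A/B survey): of 400 listeners who expressed a preference, 236 chose the new
# voice. Test whether that beats the 50/50 split expected under "no preference".
n_preferred_new, n_total = 236, 400
result = binomtest(n_preferred_new, n_total, p=0.5, alternative="greater")
print(f"new-voice preference: {n_preferred_new / n_total:.1%}, p = {result.pvalue:.4f}")
```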

Conclusion

ASR and TTS evaluation in companion AI assistants is a multi-faceted process that blends scientific rigor with practical user-centric considerations. By preparing diverse and representative evaluation sets, we ensure that tests mirror real-world usage. By tracking key performance metrics – accuracy (WER) and latency for ASR, naturalness (MOS) and intelligibility for TTS, among others – we can quantify improvements and regressions. And by following industry standards and proven methodologies, such as subjective listening tests for TTS or end-to-end voice assistant simulations, we gain meaningful insights into how these systems will perform for users.

Real-world case studies underscore that metrics should align with user experience: companies have extended beyond raw WER to semantic error detection (www.amazon.science) or readability-focused error rates (machinelearning.apple.com), and they validate TTS not just with MOS, but with direct user feedback and A/B tests. In practice, the best evaluation strategy is one that is holistic – it considers the component-level performance of ASR and TTS as well as their impact on the overall assistant interaction. By adhering to these best practices and continually refining our evaluation methods, we can reliably improve ASR and TTS systems, bringing voice-enabled AI assistants ever closer to natural, seamless communication.