The value of good speech to text.
Capturing What is Said, By Esther Shein
Commissioned by CACM Staff, March 23, 2023
A very basic flow chart for the conversion of speech to text.
New AI-enabled capabilities for speech-to-text systems include taking actions based on a transcript, prompting someone to ask a follow-up question, and summarizing a conversation at the end of a call, said Christina McAllister at Forrester Research.
ChatGPT and generative artificial intelligence (AI) may be having a moment, but don't underestimate the value of speech-to-text transcription, sometimes referred to as automatic speech recognition (ASR) software, which continues to improve.
ASR technology converts human speech into text using machine learning and AI. There are two types: synchronous transcription, which is typically used in chatbots, and asynchronous transcription, which occurs after the fact to capture customer/agent conversations, notes Cobus Greyling, chief evangelist at HumanFirst, which makes a productivity suite for natural language data.
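The difference between the two modes can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `toy_recognize` is a hypothetical stand-in for a real ASR engine, and each "audio chunk" is simply bytes holding the words that were spoken. The function names are invented for this example and do not come from any vendor's API.

```python
from typing import Callable, Iterable, Iterator, List


def toy_recognize(chunk: bytes) -> str:
    """Hypothetical stand-in for an ASR engine: one audio chunk in, text out."""
    return chunk.decode("utf-8")


def transcribe_sync(
    chunks: Iterable[bytes],
    recognize: Callable[[bytes], str] = toy_recognize,
) -> Iterator[str]:
    """Synchronous use: yield a running transcript as each chunk arrives."""
    pieces: List[str] = []
    for chunk in chunks:
        pieces.append(recognize(chunk))
        yield " ".join(pieces)


def transcribe_async(
    recording: List[bytes],
    recognize: Callable[[bytes], str] = toy_recognize,
) -> str:
    """Asynchronous use: transcribe the whole recording after the fact."""
    return " ".join(recognize(chunk) for chunk in recording)


# Assumed sample data: three chunks of a customer call.
call_audio = [b"hello", b"I need help", b"with my bill"]

partials = list(transcribe_sync(call_audio))  # what a chatbot would see, step by step
final = transcribe_async(call_audio)          # what a post-call archive would store
```

In the synchronous case the caller can react to each partial transcript while the speaker is still talking; in the asynchronous case only the finished text is available, which suits recording and later analysis.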
ASR made some waves in recent months with the announcement of Whisper from OpenAI, the organization that created ChatGPT. Whisper was trained on 680,000 hours of multilingual and multitask supervised data collected from the Web. OpenAI claims that this large and diverse dataset has improved the accuracy of Whisper's transcriptions; the company says Whisper can also transcribe speech in multiple languages.
"What that means is that it's extremely accurate, right off the top, without much tuning or training," says Christina McAllister, a senior analyst at research and advisory company Forrester Research. "The large language model aspect, which is based on huge amounts of data, is what's new and is the most innovative aspect of the ASR market today," she says.
Because speech-to-text can transcribe meetings and interviews efficiently and accurately, one of its broadest enterprise use cases is in customer call centers. The next phase in the development of ASR is to use artificial intelligence to analyze call center conversations for customer sentiment and to validate compliance in regulated industries, according to Annette Jump, a vice president analyst at Gartner.
The benefits of ASR in the call center context are its ability to identify customer problems early and to improve customer satisfaction by resolving issues sooner, says Jump.
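To make the call center scenario concrete, here is a deliberately simple Python sketch of checking a transcript for compliance and sentiment signals. It uses plain keyword matching, not the AI models Jump describes, and every rule in it is an assumption made up for illustration: the required disclosure phrases, the list of negative words, and the sample transcript are all hypothetical.

```python
import re
from typing import List

# Hypothetical rules, for illustration only; real deployments would use
# trained models and industry-specific policies rather than keyword lists.
REQUIRED_DISCLOSURES = ["this call may be recorded", "terms and conditions"]
NEGATIVE_WORDS = {"frustrated", "cancel", "unacceptable"}


def missing_disclosures(transcript: str, required: List[str]) -> List[str]:
    """Return the required phrases that never appear in the transcript."""
    text = transcript.lower()
    return [phrase for phrase in required if phrase not in text]


def negative_word_count(transcript: str) -> int:
    """Count words from the negative list as a crude sentiment signal."""
    words = re.findall(r"[a-z']+", transcript.lower())
    return sum(1 for word in words if word in NEGATIVE_WORDS)


# Assumed sample transcript produced by an ASR engine.
transcript = (
    "Agent: Thank you for calling. This call may be recorded. "
    "Customer: I am frustrated and I want to cancel my plan."
)

report = {
    "missing": missing_disclosures(transcript, REQUIRED_DISCLOSURES),
    "negative_hits": negative_word_count(transcript),
}
```

A report like this could flag a call for review when a disclosure is missing or negative language piles up, which is the kind of early problem detection the transcript makes possible.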
Other use cases include generating closed captions for movies, television, video games, and other forms of media. ASR is widely used in healthcare by physicians to convert dictated clinical notes into electronic medical records.
Speech vendors typically leverage a third-party ASR engine so they don't have to build their own, McAllister says. That frees them up so they can "do all the rest of their magic from the transcript point forward," she says.
Some of the new AI capabilities for speech-to-text systems include taking actions based on a transcript, prompting someone when it's appropriate to ask a follow-up question, and summarizing a conversation at the end of a call, McAllister says.
One frequently used AI-powered speech-to-text transcription service is Otter.ai, which has added capabilities aimed at improving meetings, including integration with collaboration tools such as Zoom and Microsoft Outlook. ...