Speech Technology group

SECURING YOUR VOICE – VOICE BIOMETRICS ON THE RISE

speechtechnologygroup_r4qjtz — Sun, 07 Mar 2021 10:40:00 +0000

Voice biometrics and voice verification technology are getting more attention in the industry. Researchers turn voiceprints into passwords to avoid storing your actual speech anywhere.

Voice authentication is increasingly used by tens of millions of people, including bank and telecom customers: you record a sample upon enrollment, and then speak that passage each time you call in, confirming your identity with a certainty regular passwords can’t match. But if hackers obtain your voiceprint, in a breach akin to those of credit-card and other personal data, they could use it to break into other systems that use voice authentication.

Now researchers at Carnegie Mellon University say they’ve developed voice-verification technology that can transform your voice into a series of password-like data strings, in a process that can be handled on the average smartphone. Your actual voice never leaves your phone, during enrollment or later authentication.

“We are the first to convert a voice recording to something like passwords,” says Bhiksha Raj, the CMU computer scientist who led the research. “With fingerprints, this is exactly what is done, but nobody has figured out how to do it with voice until now.” The work will be presented as a keynote speech at an information security conference in Passau, Germany next month.

The technology handles the slight differences in the way people speak from day to day by making multiple password-like data strings using different mathematical functions. By comparing how many of those match, it can determine whether the speaker is the person who enrolled. “The key to making it work is that instead of converting it to just one password, we convert it to a large collection of them,” Raj says.

The technology also throws in a dash of extra data specific to your phone, so that “nobody else besides you, using your smartphone, can generate the specific strings that you did,” he says. Then it encrypts those data strings for their journey across the network.
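The matching idea described above, where many password-like strings are generated and then compared by count, can be sketched in a few lines. This is only an illustration under stated assumptions, not the CMU system: the random-hyperplane projection, the SHA-256 hashing, and the names `make_projections`, `voice_strings`, and `match_fraction` are all hypothetical stand-ins chosen for the sketch, and the input is assumed to be a fixed-length numeric feature vector already extracted from the speech.

```python
import hashlib
import random

def make_projections(dim, num_strings, bits_per_string, seed):
    """Build num_strings groups of random hyperplanes (stand-ins for the 'different functions')."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits_per_string)]
        for _ in range(num_strings)
    ]

def voice_strings(features, projections, device_salt):
    """Turn one feature vector into a list of salted, hashed strings."""
    strings = []
    for index, planes in enumerate(projections):
        # One bit per hyperplane: which side of the plane the vector falls on.
        bits = "".join(
            "1" if sum(w * x for w, x in zip(plane, features)) >= 0 else "0"
            for plane in planes
        )
        # The device-specific salt (an assumption of this sketch) ties the strings to one phone.
        payload = device_salt + index.to_bytes(2, "big") + bits.encode("ascii")
        strings.append(hashlib.sha256(payload).hexdigest())
    return strings

def match_fraction(enrolled, attempt):
    """Fraction of positions where the enrolled and attempted strings agree."""
    hits = sum(1 for a, b in zip(enrolled, attempt) if a == b)
    return hits / len(enrolled)

projections = make_projections(dim=4, num_strings=16, bits_per_string=8, seed=7)
enrolled = voice_strings([0.2, -1.3, 0.7, 0.5], projections, b"phone-A")
attempt = voice_strings([0.2, -1.3, 0.7, 0.5], projections, b"phone-A")
print(match_fraction(enrolled, attempt))
```

A verifier built this way would accept the speaker when the match fraction exceeds some tuned threshold. The threshold, like the feature extraction, is left open here because the source does not specify it.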

The CMU system is accurate 95 percent of the time using a test dataset. (Errors would simply require a speaker to repeat the authentication process.) That’s not quite as good as commercial systems that use stored voiceprints, but the technology is still being honed, and improvements are expected, says Raj. He adds that the method, though still in the research phase, is computationally efficient enough to work on most smartphones.

Other research efforts to protect voice privacy in voice verification have tried to work with encrypted versions of voice files without ever decrypting them. (See “Homomorphic Encryption.”) But that method takes so much computational horsepower that it’s “currently impractical,” says Shantanu Rane, principal research scientist at Mitsubishi Electric Research Laboratories in Cambridge, Massachusetts. Raj’s technology “works fast while giving reasonable verification accuracy,” Rane adds.

Other groups are working on different methods to protect voice privacy. For the speech recognition used by Apple’s Siri app, researchers at BBN, of Cambridge, Massachusetts, have proposed only sending certain features of your voice to Apple (see “Wiping Away Your Siri Fingerprint”), rather than the voice itself.

It might take only one bad voice-data breach to shock users and shake the industry, says Prem Natarajan, executive vice president at Raytheon BBN Technologies in Cambridge, Massachusetts. “Privacy-preserving speech processing, including for voice verification, is likely to be of increasing importance” given the surging popularity of voice interfaces, he says (see “Where Speech Recognition is Going”). “I would like nothing more than to be able to carry only one password with me – my voice.”


The post SECURING YOUR VOICE – VOICE BIOMETRICS ON THE RISE first appeared on Speech Technology group.

FUTURIST RAY KURZWEIL WANTS TO MOVE YOUR BRAIN INTO THE CLOUD

speechtechnologygroup_r4qjtz — Mon, 14 Dec 2020 20:57:00 +0000

Here’s a way you can achieve 100% accuracy without touchtone technology by allowing the caller to speak to your IVR. It’s a mix of speech recognition and agent-assisted IVR technology. Source: google.com

If you’ve ever used an interactive voice response (IVR) system – and unless you’ve been living on a desert island, you have – you’ll know that there are some truly bad systems out there. The earliest voice-driven IVRs that used speech recognition often required users to shout into the telephone, repeating themselves until they nearly spontaneously combusted in frustration when the system returned the message, “I didn’t understand your response” for the sixth time.

Luckily, technology has come a long way.

Once upon a time, the number of response combinations that the systems could understand was very limited, which is why they had to restrict your responses (“say ‘one’ for the customer service department” instead of simply asking you to describe what you’re looking for in natural language). Those days are nearly behind us, thanks to newer solutions offered by companies such as Massachusetts-based Interactions, which offers a conversational natural language solution that allows people to speak to computers as if they were live agents.

Speech Technology Group (www.speechtechnologygroup.com) offers an agent-assisted IVR system that works in conjunction with the powerful Microsoft speech engine. The combination of the two offers the highest accuracy and extraordinary value.

Interactions’ solution leverages a combination of automated speech recognition (ASR) and what the company calls “human-assisted understanding” (HAU). HAU improves accuracy and natural-language understanding by supplementing speech recognition when it can’t perform. In traditional speech-recognition applications, all requests get routed directly to an ASR engine. When the engine can’t recognize something, it keeps re-prompting the caller, or eventually gives up and transfers the call to a live agent. This limitation causes poor application design and performance – and frustrates callers. Interactions says it has overcome this application-design limitation.
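The routing pattern described above, where automated recognition goes first and a person steps in when it cannot cope, can be sketched as follows. This is a minimal illustration, not Interactions’ implementation: the `AsrResult` type, the `understand` function, the confidence score, and the 0.8 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class AsrResult:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical ASR engine

def understand(audio, asr, human_agent, threshold=0.8):
    """Return (interpretation, source), falling back to a human when ASR is unsure."""
    result = asr(audio)
    if result.confidence >= threshold:
        return result.text, "asr"
    # Instead of re-prompting the caller, hand the same audio to a person.
    return human_agent(audio), "human"

# Stub engines standing in for real services.
def confident_asr(audio):
    return AsrResult(text="check my balance", confidence=0.93)

def unsure_asr(audio):
    return AsrResult(text="wreck a nice beach", confidence=0.41)

def agent(audio):
    return "recognize speech"

print(understand(b"...", confident_asr, agent))
print(understand(b"...", unsure_asr, agent))
```

The design point is that the caller is never asked to repeat themselves: a low-confidence result triggers the human path silently, and the caller only hears the next prompt.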

Rachel Metz of MIT Technology Review says it’s about more than simply routing calls with less frustration.

“Interactions’ software is, hopefully, more than a solution to impossibly annoying automated support systems,” writes Metz. “It’s also an example of software and human intelligence working together. Rather than relying entirely on software to handle calls, Interactions automatically hands speech that its software can’t cope with over to human agents, who select an appropriate response.”

Who would have thought that humans interacting vocally with computers could be a source of anything but a nervous breakdown?


The post FUTURIST RAY KURZWEIL WANTS TO MOVE YOUR BRAIN INTO THE CLOUD first appeared on Speech Technology group.

Google explains how more data means better speech recognition

speechtechnologygroup_r4qjtz — Mon, 04 Nov 2019 21:08:17 +0000

Will Google’s almost infinite access to data eventually give it an edge over Apple when it comes to speech recognition performance? Source: GigaOM

A new research paper from Google highlights the importance of big data in creating consumer-friendly services such as voice search on smartphones. More data helps train smarter models, which can then better predict what someone says next — letting you keep your eyes on the road.

A research paper out of Google describes in some detail the data science behind the company’s speech recognition applications, such as voice search and adding captions or tags to YouTube videos. And although the math might be beyond most people’s grasp, the concepts are not. The paper underscores why everyone is so excited about the prospect of “big data” and also how important it is to choose the right data set for the right job.

Google has always been a fan of the idea that more data is better, as exemplified by Research Director Peter Norvig’s stance that, generally speaking, more data trumps better algorithms (see, e.g., his 2009 paper titled “The Unreasonable Effectiveness of Data”). Although some hair-splitting does occur about the relative value (or lack thereof) of algorithms in Norvig’s assessment, it’s pretty much an accepted truth at this point and drives much of the discussion around big data. The more data your models have from which to learn, the more accurate they become, even if they weren’t cutting-edge to begin with.

No surprise, then, that more data is also better for training speech-recognition systems. The researchers found that larger data sets and larger language models (here’s a Wikipedia explanation of the n-gram type involved in Google’s research) result in fewer errors predicting the next word based on the words that precede it. Discussing the research in a blog post, Google research scientist Ciprian Chelba gives the example that a good model will attribute a higher probability to “pizza” as the next word than to “granola” if the previous two words were “New York.” When it comes to voice search, his team found that “increasing the model size by two orders of magnitude reduces the [word error rate] by 10% relative.”
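To make the “pizza” versus “granola” example concrete, here is a toy trigram model that estimates the probability of the next word from the two words before it. It is a sketch for illustration only: the three-sentence corpus is invented, the function names are hypothetical, and Google’s models are trained on vastly more data with smoothing techniques that this example omits.

```python
from collections import Counter, defaultdict

def train_trigrams(sentences):
    """Count which word follows each pair of consecutive words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for first, second, third in zip(words, words[1:], words[2:]):
            counts[(first, second)][third] += 1
    return counts

def next_word_probability(counts, context, word):
    """Estimate P(word | context) from raw counts, without smoothing."""
    followers = counts[tuple(w.lower() for w in context)]
    total = sum(followers.values())
    return followers[word.lower()] / total if total else 0.0

# Invented toy corpus; a real model would see billions of words.
corpus = [
    "we ordered new york pizza last night",
    "nothing beats new york pizza",
    "she sells new york granola online",
]
model = train_trigrams(corpus)

print(next_word_probability(model, ("New", "York"), "pizza"))    # 2 of 3 continuations
print(next_word_probability(model, ("New", "York"), "granola"))  # 1 of 3 continuations
```

With more training text the counts become more reliable, which is the intuition behind the finding quoted above. As for “10% relative”: it means the error rate shrinks by a tenth of its own value, so a hypothetical word error rate of 20% would fall to 18%, not to 10%.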

The real key, however — as any data scientist will tell you — is knowing what type of data is best to train your models, whatever they are. For the voice search tests, the Google researchers used 230 billion words that came from “a random sample of anonymized queries from google.com that did not trigger spelling correction.” However, because people speak and write prose differently than they type searches, the YouTube models were fed data from transcriptions of news broadcasts and large web crawls.

“As far as language modeling is concerned, the variety of topics and speaking styles makes a language model built from a web crawl a very attractive choice,” they write.

This research isn’t necessarily groundbreaking but helps drive home the reasons that topics such as big data and data science get so much attention these days. As consumers demand ever smarter applications and more frictionless user experiences, every last piece of data and every decision about how to analyze it matters.


The post Google explains how more data means better speech recognition first appeared on Speech Technology group.

IVRS WITHOUT FRUSTRATION: SPEECH RECOGNITION GETS THE HUMAN TOUCH

speechtechnologygroup_r4qjtz — Wed, 12 Jun 2019 21:03:36 +0000

Here’s a way you can achieve 100% accuracy without touchtone technology by allowing the caller to speak to your IVR. It’s a mix of speech recognition and agent-assisted IVR technology.

Luckily, technology has come a long way.

Once upon a time, the number of response combinations that the systems could understand was very limited, which is why they had to restrict your responses (“say ‘one’ for the customer service department” instead of simply asking you to describe what you’re looking for in natural language). Those days are nearly behind us, thanks to newer conversational natural-language solutions that allow people to speak to computers as if they were live agents.

These types of systems leverage a combination of automated speech recognition (ASR) and “human-assisted understanding,” which improves accuracy and natural-language understanding by supplementing speech recognition when it can’t perform. In traditional speech-recognition applications, all requests get routed directly to an ASR engine. When the engine can’t recognize something, it keeps re-prompting the caller, or eventually gives up and transfers the call to a live agent. This limitation causes poor application design and performance – and frustrates callers. The combined approach overcomes this application-design limitation.

This is an example of software and human intelligence working together. Rather than relying entirely on software to handle calls, these types of systems automatically hand speech that the ASR can’t cope with over to human agents, who select an appropriate response.

Who would have thought that humans interacting vocally with computers could be a source of anything but a nervous breakdown?


The post IVRS WITHOUT FRUSTRATION: SPEECH RECOGNITION GETS THE HUMAN TOUCH first appeared on Speech Technology group.

Hands down the most advanced Text-to-Speech app available

speechtechnologygroup_r4qjtz — Wed, 23 May 2018 21:11:30 +0000

VoiceDream Reader has pulled out all the stops to offer the best user experience for listening to and reading along with books and a variety of other content using the latest Text-To-Speech technology.

The quality of the TTS voices is highly natural, particularly the new Julie, Paul, Kate and Bridget voices, which have been available now for the past few months.

Altogether there are 78 languages available. The recently added Chinese and Japanese voices Hui, Liang, Misaki and Show have received the highest quality ratings from native speakers.

As if this weren’t enough, the user can also create custom pronunciations and adjust the speed, pitch and volume of the voices individually. Custom pronunciations can be very handy for abbreviations and words that are not pronounced the way the phonetic rules suggest.

Even though the voices already come with a huge database of custom pronunciations for proper names such as streets and first and last names, there will always be a word whose pronunciation you want to adjust.

The process is simple. Just type the word in the intuitive user interface and then spell out the word the way you want it to sound. That’s it – the next time this word comes up it will sound just the way you like it.
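Conceptually, a custom pronunciation works like a small lookup table applied to the text before it reaches the speech engine. The sketch below shows that idea only; it is not VoiceDream Reader’s implementation, and the function name `apply_pronunciations` and the sample entries are invented for illustration.

```python
import re

def apply_pronunciations(text, lexicon):
    """Replace whole words found in the lexicon with their respelling before synthesis."""
    if not lexicon:
        return text
    # Longest entries first so a longer word is preferred over a shorter one it contains.
    words = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in words) + r")\b",
        re.IGNORECASE,
    )
    lowered = {word.lower(): spoken for word, spoken in lexicon.items()}
    return pattern.sub(lambda match: lowered[match.group(0).lower()], text)

# Hypothetical user entries: the word as written, then how it should sound.
lexicon = {"SQL": "sequel", "Nguyen": "win"}
print(apply_pronunciations("Dr. Nguyen teaches SQL.", lexicon))
# prints: Dr. win teaches sequel.
```

Matching on word boundaries keeps an entry such as “SQL” from altering longer words that merely contain those letters.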

Variable speech rates can be set by content or voice, which comes in really handy if you want to “speed listen” to a certain type of content or if you want to enjoy the pleasant sound of the voices reading your favorite piece of content back to you.

Gesture-driven navigation to move around within the content and control the application makes for a super easy interface for every type of user and environment. The latest version introduces the intriguing concept of navigation units.

“Navigation units can be set to Sentence, Paragraph, Page, Chapter, Bookmark, Highlight, 15, 30, or 60 seconds – basically all important markers in the text. In the previous version, the rewind and fast forward buttons go backward or forward by 30 seconds. In the new version, they move the speech cursor backward or forward by any Navigation Unit you set. For example, you can go to the next page in a PDF document or DAISY eBook. Or, go through all your highlighted text one by one. And you can set a Navigation Unit on the fly by tapping on the rewind or fast-forward button and hold.”
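The navigation-unit idea can be pictured as a cursor that jumps between precomputed boundaries instead of by a fixed number of seconds. The following sketch covers only the Sentence and Paragraph units on plain text. The boundary rules, function names, and sample text are assumptions made for illustration, not the app’s actual logic.

```python
import bisect
import re

# Simplified boundary rules (assumptions): whitespace after ., ! or ? ends a sentence;
# a blank line ends a paragraph.
BOUNDARIES = {
    "sentence": re.compile(r"(?<=[.!?])\s+"),
    "paragraph": re.compile(r"\n\s*\n"),
}

def unit_starts(text, unit):
    """Character offsets where each unit of the given kind begins."""
    return [0] + [match.end() for match in BOUNDARIES[unit].finditer(text)]

def fast_forward(text, cursor, unit):
    """Move the cursor to the start of the next unit, or stay put at the last one."""
    starts = unit_starts(text, unit)
    index = bisect.bisect_right(starts, cursor)
    return starts[index] if index < len(starts) else cursor

def rewind(text, cursor, unit):
    """Move the cursor to the nearest unit start strictly before it."""
    starts = unit_starts(text, unit)
    index = bisect.bisect_left(starts, cursor)
    return starts[index - 1] if index > 0 else 0

text = "One. Two!\n\nThree? Four."
print(fast_forward(text, 0, "sentence"))   # 5, the start of "Two!"
print(fast_forward(text, 0, "paragraph"))  # 11, the start of "Three?"
print(rewind(text, 13, "sentence"))        # 11, back to the start of "Three?"
```

Switching the unit changes only which boundary list the same two buttons consult, which is what lets one pair of controls serve every granularity.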

Even though the app now offers an amazing number of useful features, the user interface remains very clean and easy to use.

If you use services like Instapaper, Pocket, Evernote, Dropbox or Google Drive, you can simply connect them with VoiceDream Reader, which pulls the content you stored there into the app. You can then read and listen to all of your favorite content right within VoiceDream Reader.

With all of the “bells and whistles” that the app provides, the Bookshare integration brings listening to your favorite book to a whole new level.

For a combination of visual and voice reading, you can scroll by page or use free scrolling and follow along with the highlighted words, which are presented in customizable font sizes all the way up to 80 pt. When you stop or pause, the app remembers the voice and visual position for when you resume reading.

A number of the new features that have been implemented were suggested by enthusiastic users who wanted to make “their app” even better.

Having a good application idea and listening to your customers made this app truly a gem and created an ever-growing loyal user base.

Click here and download the app. You won’t regret it!


The post Hands down the most advanced Text-to-Speech app available first appeared on Speech Technology group.

An Introduction to Text-to-Speech Synthesis

speechtechnologygroup_r4qjtz — Tue, 12 Sep 2017 21:14:21 +0000

An Introduction to Text-to-Speech Synthesis is a comprehensive introduction to the subject. The author treats two areas of speech synthesis: Part I of the book concerns natural language processing and the inherent problems it presents for speech synthesis; Part II focuses on digital signal processing, with an emphasis on the concatenative approach. Both parts of the text guide the reader through the material in a step-by-step easy-to-follow way. This is the first book to treat the topic of speech synthesis from the perspective of two different engineering approaches. The book will be of interest to researchers and students in phonetics and speech communication, in both academia and industry.

The post An Introduction to Text-to-Speech Synthesis first appeared on Speech Technology group.

MEET THE COMIC WHO DOESN’T SAY A WORD

speechtechnologygroup_r4qjtz — Tue, 17 Jan 2017 20:59:16 +0000

At first, the laughs are, perhaps understandably, nervous. But then Lee Ridley is far from a conventional stand-up comic.

For a start, he doesn’t utter a single word on stage.

Lee, who has cerebral palsy, cannot speak, so he uses a text-to-speech iPad app to deliver his lines. During his show, Lost Voice Guy, lines are delivered in the synthetic tones of a computer-generated male voice.

“When I realized I’d never be able to talk again, I was speechless,” he jokes, to chuckles from the audience.

In a recent stunt at the UK’s X Factor auditions, he used an iPad to deliver a rendition of R Kelly’s “I Believe I Can Fly”. Unfortunately, the judges didn’t see the funny side, and he was cut short after a few verses.

Still, the experience has made ripe material for his act: “I used to be in a disabled Steps tribute band. We were called Ramps. We faced an uphill struggle.” At this, the audience erupts with laughter.

Over the weekend, Lee was in London for a gig in Covent Garden. Our interview is conducted via a specialized computer called a Lightwriter rather than the iPad app, which is “more fun for the show but harder to type on – this is what I use in everyday life”, says Lee.

The Lightwriter is portable and has two screens so I can see the words as Lee writes them. I ask a question and after a few taps of the keypad, the computer voice replies.

He says: “I’ve always loved stand-up but I never thought about trying it myself until friends suggested that it might work and that I’d be unique.

“I’m comfortable making fun of myself. People don’t expect it and there’s that awkward feeling in the room initially when I get on stage. But when I’m funny that goes away.”

Lee’s first gig was at a friend’s comedy night in Sunderland in February and he has been overwhelmed at the positive response so far – he recently supported Ross Noble on tour while “Little Britain” star Matt Lucas is a fan. Now he is planning a tour of his own next year.

“I was very nervous. I thought no one would understand me,” explains Lee. “After a few minutes of it going well, I started to enjoy it. It was a massive buzz knowing people were laughing at stuff I’d written. I managed only two hours’ sleep afterward because I was on such a high.”

Cerebral palsy is an umbrella term covering a range of neurological conditions that affect movement, coordination and speech. It is caused by damage to the brain that can happen during pregnancy, birth or soon after.

Aged six months, Lee contracted the brain infection encephalitis – triggered by a cold sore – which put him in a coma for two weeks. It left him with hemiplegia, a common type of cerebral palsy.

Lee’s right side is much weaker than his left, which means he walks with a limp. He finds it hard to swallow and the muscles in his mouth and tongue are too weak for him to speak – despite years of working with a speech therapist in early childhood.

About three-quarters of people with cerebral palsy suffer from some sort of speech difficulty. Lee, who grew up in Co Durham in England, attended a specialist primary school where he learned sign language.

Aged 12, he was given his purpose-built Lightwriter. Despite its American-accented voice, you can still detect a hint of Geordie in his syntax and with words such as Mam, which he spells phonetically.

“I take the Lightwriter for granted now but it changed my life and made me a lot more independent,” says Lee, who works at Newcastle City Council’s press office.

“It’s frustrating that I can’t just instantly say something, that I have to type it out first – although if I’m angry that can be a good thing.”

Despite his disability, teachers at his primary school realized Lee was very bright and needed to be challenged. When he was old enough, he was able to attend many lessons at a mainstream secondary school and he went on to gain a place at the University of Central Lancashire to study journalism.

“Luckily I had a really good English teacher who pushed me to my limits. I’ve always loved writing and English and until now journalism was all I ever wanted to do,” says Lee, who has had stints as a sports reporter for his local newspaper and the BBC. “I always seem to choose strange careers for someone who can’t speak.”

He is able to do most of his research via email but uses his Lightwriter device to conduct interviews over the phone.

“Some people just assume I’m an answering machine or get impatient but I’m used to it and it’s turned into great comedy material,” he says.


The post MEET THE COMIC WHO DOESN’T SAY A WORD first appeared on Speech Technology group.