Speech Recognition

Tens of millions of people use smart speakers and their voice software to play games, find music or trawl for trivia. Millions more are reluctant to invite the devices and their powerful microphones into their homes out of concern that someone might be listening.

Sometimes, someone is.

Amazon.com Inc. employs thousands of people around the world to help improve the Alexa digital assistant powering its line of Echo speakers. The team listens to voice recordings captured in Echo owners’ homes and offices. The recordings are transcribed, annotated and then fed back into the software as part of an effort to eliminate gaps in Alexa’s understanding of human speech and help it better respond to commands.

The Alexa voice review process, described by seven people who have worked on the program, highlights the often-overlooked human role in training software algorithms. In marketing materials Amazon says Alexa “lives in the cloud and is always getting smarter.” But like many software tools built to learn from experience, humans are doing some of the teaching.

The team comprises a mix of contractors and full-time Amazon employees who work in outposts from Boston to Costa Rica, India and Romania, according to the people, who signed nondisclosure agreements barring them from speaking publicly about the program. They work nine hours a day, with each reviewer parsing as many as 1,000 audio clips per shift, according to two workers based at Amazon’s Bucharest office, which takes up the top three floors of the Globalworth building in the Romanian capital’s up-and-coming Pipera district. The modern facility stands out amid the crumbling infrastructure and bears no exterior sign advertising Amazon’s presence.

The work is mostly mundane. One worker in Boston said he mined accumulated voice data for specific utterances such as “Taylor Swift” and annotated them to indicate the searcher meant the musical artist. Occasionally the listeners pick up things Echo owners likely would rather stay private: a woman singing badly off key in the shower, say, or a child screaming for help. The teams use internal chat rooms to share files when they need help parsing a muddled word—or come across an amusing recording.

Amazon has offices in this Bucharest building. Photographer: Irina Vilcu/Bloomberg

Sometimes they hear recordings they find upsetting, or possibly criminal. Two of the workers said they picked up what they believe was a sexual assault. When something like that happens, they may share the experience in the internal chat room as a way of relieving stress. Amazon says it has procedures in place for workers to follow when they hear something distressing, but two Romania-based employees said that, after requesting guidance for such cases, they were told it wasn’t Amazon’s job to interfere.

“We take the security and privacy of our customers’ personal information seriously,” an Amazon spokesman said in an emailed statement. “We only annotate an extremely small sample of Alexa voice recordings in order [to] improve the customer experience. For example, this information helps us train our speech recognition and natural language understanding systems, so Alexa can better understand your requests, and ensure the service works well for everyone.

“We have strict technical and operational safeguards, and have a zero tolerance policy for the abuse of our system. Employees do not have direct access to information that can identify the person or account as part of this workflow. All information is treated with high confidentiality and we use multi-factor authentication to restrict access, service encryption and audits of our control environment to protect it.”

Amazon, in its marketing and privacy policy materials, doesn’t explicitly say humans are listening to recordings of some conversations picked up by Alexa. “We use your requests to Alexa to train our speech recognition and natural language understanding systems,” the company says in a list of frequently asked questions.

In Alexa's privacy settings, Amazon gives users the option of disabling the use of their voice recordings for the development of new features. The company says people who opt out of that program might still have their recordings analyzed by hand over the regular course of the review process. A screenshot reviewed by Bloomberg shows that the recordings sent to the Alexa reviewers don’t provide a user’s full name and address but are associated with an account number, as well as the user’s first name and the device’s serial number.

The Intercept reported earlier this year that employees of Amazon-owned Ring manually identify vehicles and people in videos captured by the company’s doorbell cameras, an effort to better train the software to do that work itself.

“You don’t necessarily think of another human listening to what you’re telling your smart speaker in the intimacy of your home,” said Florian Schaub, a professor at the University of Michigan who has researched privacy issues related to smart speakers. “I think we’ve been conditioned to the [assumption] that these machines are just doing magic machine learning. But the fact is there is still manual processing involved.”

“Whether that’s a privacy concern or not depends on how cautious Amazon and other companies are in what type of information they have manually annotated, and how they present that information to someone,” he added.

When the Echo debuted in 2014, Amazon’s cylindrical smart speaker quickly popularized the use of voice software in the home. Before long, Alphabet Inc. launched its own version, called Google Home, followed by Apple Inc.’s HomePod. Various companies also sell their own devices in China. Globally, consumers bought 78 million smart speakers last year, according to researcher Canalys. Millions more use voice software to interact with digital assistants on their smartphones.

Alexa software is designed to continuously record snatches of audio, listening for a wake word. That’s “Alexa” by default, but people can change it to “Echo” or “computer.” When the wake word is detected, the light ring at the top of the Echo turns blue, indicating the device is recording and beaming a command to Amazon servers.
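
The wake-word flow described above can be sketched in a few lines of Python. This is a minimal illustration, not Amazon's implementation; the `detect_wake_word` stub is a hypothetical stand-in for the on-device acoustic model, which scores raw audio frames rather than words of text:

```python
from collections import deque

WAKE_WORDS = {"alexa", "echo", "computer"}  # user-selectable wake words

def detect_wake_word(frame: str) -> bool:
    # Stand-in for the on-device model: here a "frame" is just a word of
    # text; a real detector scores short windows of raw audio.
    return frame.lower().strip(",.!?") in WAKE_WORDS

def listen(frames, buffer_len=3):
    """Keep only a short rolling buffer until the wake word fires, then
    start collecting frames (the command sent to the server)."""
    buffer = deque(maxlen=buffer_len)  # old audio scrolls off and is discarded
    recording = []
    triggered = False
    for frame in frames:
        if triggered:
            recording.append(frame)        # streaming the command
        elif detect_wake_word(frame):
            triggered = True               # light ring turns blue
        else:
            buffer.append(frame)           # pre-wake audio, never kept
    return triggered, recording

print(listen(["what", "a", "day", "alexa", "play", "music"]))
```

The rolling `deque` mirrors the design goal that audio before the wake word is continuously overwritten rather than stored.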

An Echo smart speaker inside an Amazon 4-star store in Berkeley, California. Photographer: Cayce Clifford/Bloomberg

Most modern speech-recognition systems rely on neural networks patterned on the human brain. The software learns as it goes, by spotting patterns amid vast amounts of data. The algorithms powering the Echo and other smart speakers use models of probability to make educated guesses. If someone asks Alexa if there’s a Greek place nearby, the algorithms know the user is probably looking for a restaurant, not a church or community center.
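
As a toy illustration of that kind of educated guess, the sketch below ranks candidate senses of “Greek place” by how often each co-occurs with a context word; the counts are invented for illustration:

```python
from collections import Counter

# Invented co-occurrence counts between each candidate sense of "place"
# and words that might appear in the user's request.
cooccur = {
    "restaurant": Counter({"nearby": 120, "menu": 80}),
    "church": Counter({"nearby": 10, "service": 40}),
    "community center": Counter({"nearby": 15, "events": 30}),
}

def best_sense(context_word):
    # Simple maximum-likelihood guess over the candidate senses.
    total = sum(c[context_word] for c in cooccur.values())
    scores = {s: c[context_word] / total for s, c in cooccur.items()}
    return max(scores, key=scores.get), scores

sense, scores = best_sense("nearby")
print(sense)
```

With these counts, "nearby" pushes the guess toward the restaurant sense, which is the shape of the probabilistic reasoning the article describes.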

But sometimes Alexa gets it wrong—especially when grappling with new slang, regional colloquialisms or languages other than English. In French, avec sa, “with his” or “with her,” can confuse the software into thinking someone is using the Alexa wake word. Hecho, Spanish for a fact or deed, is sometimes misinterpreted as Echo. And so on. That’s why Amazon recruited human helpers to fill in the gaps missed by the algorithms.

Apple’s Siri also has human helpers, who work to gauge whether the digital assistant’s interpretation of requests lines up with what the person said. The recordings they review lack personally identifiable information and are stored for six months tied to a random identifier, according to an Apple security white paper. After that, the data is stripped of its random identification information but may be stored for longer periods to improve Siri’s voice recognition.

At Google, some reviewers can access some audio snippets from its Assistant to help train and improve the product, but it’s not associated with any personally identifiable information and the audio is distorted, the company says. 

A recent Amazon job posting, seeking a quality assurance manager for Alexa Data Services in Bucharest, describes the role humans play: “Every day she [Alexa] listens to thousands of people talking to her about different topics and different languages, and she needs our help to make sense of it all.” The want ad continues: “This is big data handling like you’ve never seen it. We’re creating, labeling, curating and analyzing vast quantities of speech on a daily basis.”

Amazon’s review process for speech data begins when Alexa pulls a random, small sampling of customer voice recordings and sends the audio files to the far-flung employees and contractors, according to a person familiar with the program’s design.

The Echo Spot. Photographer: Daniel Berman/Bloomberg

Some Alexa reviewers are tasked with transcribing users’ commands, comparing the recordings to Alexa's automated transcript, say, or annotating the interaction between user and machine. What did the person ask? Did Alexa provide an effective response?

Others note everything the speaker picks up, including background conversations—even when children are speaking. Sometimes listeners hear users discussing private details such as names or bank details; in such cases, they’re supposed to tick a dialog box denoting “critical data.” They then move on to the next audio file.
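
The workflow in the last two paragraphs suggests a record shape like the following. The `ReviewItem` class and its fields are hypothetical, a sketch of the described process rather than Amazon's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ReviewItem:
    clip_id: str
    auto_transcript: str                 # Alexa's automated transcript
    human_transcript: str = ""           # filled in by the reviewer
    critical_data: bool = False          # ticked when private details are heard
    annotations: list = field(default_factory=list)

    def matches(self) -> bool:
        # Crude agreement check between machine and human transcription.
        return (self.auto_transcript.strip().lower()
                == self.human_transcript.strip().lower())

item = ReviewItem("clip-001", "play taylor swift")
item.human_transcript = "Play Taylor Swift"
item.annotations.append(("taylor swift", "musical artist"))
print(item.matches())
```

Flagging `critical_data` and moving on, as described above, would simply mean setting the boolean and skipping further annotation.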

According to Amazon’s website, no audio is stored unless Echo detects the wake word or is activated by pressing a button. But sometimes Alexa appears to begin recording without any prompt at all, and the audio files start with a blaring television or unintelligible noise. Whether or not the activation is mistaken, the reviewers are required to transcribe it. One of the people said the auditors each transcribe as many as 100 recordings a day when Alexa receives no wake command or is triggered by accident. 

In homes around the world, Echo owners frequently speculate about who might be listening, according to two of the reviewers. “Do you work for the NSA?” they ask. “Alexa, is someone else listening to us?” 

— With assistance by Gerrit De Vynck, Mark Gurman, and Irina Vilcu

Source: Bloomberg

Chinese conglomerate Alibaba is one of the world’s largest ecommerce companies, but it’s increasingly turning its attention to artificial intelligence (AI). In March 2017, it launched an AI services division for health care and manufacturing, and in September its public cloud division — Alibaba Cloud — unveiled plans to set up a dedicated subsidiary and produce a self-developed AI inference chip that could be used for logistics and autonomous driving.

Alibaba has its fingers in plenty of AI pies, needless to say. And during a presentation at NeurIPS 2018 in Montreal this morning, it delivered an update on those cross-company efforts.

“We’re solving … scenarios [with] unseen difficulties,” said Rong Jin, dean of the Alibaba Institute of Data Science. “AI together with innovation [is helping] to solve some interesting challenges.”

One of those challenges is speech recognition in noisy environments, like a crowded subway system or congested convention center. Alibaba’s solution is part hardware, part software: a far-field microphone array paired with deep learning algorithms that isolate voices in a crowd, drastically reducing the error rate.

Compared to the 84 percent accuracy the “best” speech recognition technologies are able to achieve with a mic array alone, Alibaba claims its model is between 94 and 95 percent accurate, even with heavily accented speakers. It has already been deployed as part of a voice-based subway ticketing system in Shanghai, and Alibaba is in talks to bring it to “a number of [additional] cities.”
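
Those figures are easier to compare as relative error reduction: moving from 84 percent to 94-95 percent accuracy eliminates roughly two-thirds of the remaining errors. A quick check:

```python
def relative_error_reduction(base_acc, new_acc):
    """Fraction of the baseline's remaining error that the new system removes."""
    base_err, new_err = 1 - base_acc, 1 - new_acc
    return (base_err - new_err) / base_err

# Alibaba's reported 94-95% accuracy vs. the 84% baseline.
for new in (0.94, 0.95):
    print(round(relative_error_reduction(0.84, new), 3))
```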

“Nothing can save you if you don’t get enough signal to be recognized in the first place,” Jin said.

The spoken word isn’t the only domain Alibaba is tackling with AI. Using natural language processing, it’s performing automatic translation in real time, in the cloud, so that Alibaba retail customers in countries such as Russia and the Malay region can converse with human agents in their native tongues. And it’s tapping algorithms to field a portion of the tens of thousands of calls its support centers receive each day with AliMe, Alibaba’s intelligent customer service engine.

AliMe, much like Google’s Duplex, can carry on a phone conversation and answer basic questions without involving a human agent. Perhaps more impressively, in a chatbot context, it’s able to automatically extract text and images from a supplied document with “better than human” performance.

In an onstage demo, a customer asked Dian Xiaomi — Alibaba’s answering bot — about sales promotions for a particular Bluetooth speaker, like what sort of free gifts they’d receive with their purchase and how the gifts would be delivered to their residence. (A version rolling out later this year will add sentiment analysis and automated alerts for priority cases.) Another demo showed a humanoid embodiment of the chatbot — a prototype, Jin told the audience — with coordinated eye, lip, and head movements.

It’s a boon for bustling Alibaba divisions like AliExpress, which has over 150 million users and millions of merchants, and Cainiao, whose human workers and robots fulfill more than a billion orders each year. On Singles’ Day — the November 11 Chinese shopping holiday that this year generated $30.8 billion — Alibaba’s agents receive 5 times the typical number of calls in a 24-hour period, which would have been nearly impossible to juggle without a helping hand from AI.

Dian Xiaomi currently serves almost 3.5 million users a day, Alibaba says.

But natural language processing is just the tip of Alibaba’s AI iceberg. On Xian Yu, the retailer’s used goods marketplace, the company deployed a negotiation bot that talks to buyers to settle on a price.

The bot’s development wasn’t a cakewalk — it needed to learn negotiating strategies and efficient ways to generate text that would incentivize back-and-forth negotiation — but the end result is impressive. When rolled out to 10 million users on the platform, the bot was 20 percent more likely to close a deal than a typical human seller.

“Most of the [users] are not professional sellers,” Jin said. “They don’t know how to set a price or talk to buyers.”

On the inventory management and image search front, Alibaba is leveraging a scalable computer vision architecture to sift through hundreds of millions of entities. Its Cloud Image Search algorithm can recognize objects and find images containing similar or identical ones, and one of its store management apps — which picks out multiple items on a shelf to generate a summary that includes a distribution of different brands — can detect more than 100,000 SKUs with “high accuracy.” (Alibaba’s working toward a goal of 10 million SKUs.)
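
Image search of the kind described above can be sketched as nearest-neighbor lookup over feature vectors. This is a minimal illustration; the three-dimensional vectors below are invented stand-ins for the high-dimensional features a real vision model would produce:

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

index = {  # invented 3-d embeddings standing in for real CNN features
    "red-sneaker.jpg": [0.9, 0.1, 0.0],
    "blue-sneaker.jpg": [0.8, 0.2, 0.1],
    "coffee-mug.jpg": [0.0, 0.1, 0.95],
}

def search(query_vec, k=2):
    # Rank every indexed image by similarity to the query embedding.
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]),
                    reverse=True)
    return ranked[:k]

print(search([0.85, 0.15, 0.05]))
```

At the scale of hundreds of millions of entities, exhaustive sorting like this gives way to approximate nearest-neighbor indexes, but the similarity-ranking idea is the same.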

Both complement Alibaba’s Ali Smart Supply Chain (ASSC), a suite of AI tools that help Alibaba merchants forecast product demand, allocate inventory, and select pricing strategies.

Alibaba’s machine vision work extends to satellite images. Using data gathered from AutoNavi, the largest map and navigation provider in China, with over 70 million users, its systems are able to identify recently constructed buildings, for example, and gather information related to road work and points of interest.

Alibaba is also using computer vision to prevent shoplifting. At its more than 66 Hema brick-and-mortar stores, offline algorithms at its self-checkout kiosks prevent ne’er-do-well customers from scanning only the first item in a basket, or concealing items from the overhead camera’s view.

“The goal is to … have a computer vision system figure out if a customer is intentionally or unintentionally scanning items,” Jin said. “The machine sees that things aren’t scanned.”

It’s powered by a deep learning algorithm — AliFPGA-X100 — that runs on a field-programmable gate array, a reconfigurable integrated circuit within the kiosks. Alibaba says it’s able to process images up to 170 times faster than a comparable GPU-based system.

Alibaba is also applying AI to Youku, its video hosting service. Machine learning algorithms automatically generate thumbnails for the roughly 200,000 videos its tens of millions of active users upload each day. And it can target certain audience segments with said thumbnails. Female users might see a different preview image for a given video than male users, for example. This has led to a 15 percent improvement in click-through rates and a 12 percent uptick in dwell time.
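
At its simplest, per-segment thumbnail targeting reduces to picking, for each audience segment, the thumbnail with the best observed click-through rate. The counts below are invented for illustration:

```python
# Invented (clicks, impressions) counts per (segment, thumbnail) pair.
ctr = {
    ("female", "thumb_a"): (120, 1000),
    ("female", "thumb_b"): (95, 1000),
    ("male", "thumb_a"): (80, 1000),
    ("male", "thumb_b"): (130, 1000),
}

def best_thumbnail(segment):
    # Estimate CTR per candidate thumbnail and pick the best for the segment.
    candidates = {t: c / n for (s, t), (c, n) in ctr.items() if s == segment}
    return max(candidates, key=candidates.get)

print(best_thumbnail("female"), best_thumbnail("male"))
```

A production system would learn these preferences from far richer features, but the segment-conditioned selection is the core idea behind serving different users different previews.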

Today’s survey comes just over a year after the debut of Alibaba’s new research organization — the Academy for Discovery, Adventure, Momentum, and Outlook (or DAMO) — aimed at tackling emerging technologies, like machine learning and network security, and the opening of labs in San Mateo, Seattle, Moscow, Tel Aviv, and Singapore. It also closely follows the launch of Alibaba’s Tmall Genie, an AI-powered voice assistant that’s sold over 5 million units since it hit store shelves in July 2017.

And the company is arguably just getting started. Alibaba plans to spend more than $15 billion on research and development by 2020, it told Quartz in October 2017.

Read the source article by @KYLE_L_WIGGERS at VentureBeat.

#AI #Alibaba #SpeechRecognition #ArtificialIntelligence #DataScience #AIInferenceChip

Speech recognition is a technology that lets users speak to devices such as computers and mobile phones, issuing commands and instructions for the system to carry out. Dedicated software recognizes the spoken commands and converts them into a machine-readable format so the requested action can be performed. Reliance on other input methods, such as typing and selecting options, has fallen sharply since the introduction of speech-driven virtual agents such as Microsoft’s Cortana and the voice search feature in Google Search.

How does it work?

A speech recognition system’s algorithms combine acoustic and language modelling to recognize and distinguish words with higher accuracy. Acoustic modelling maps audio signals to linguistic units such as phonemes, while language modelling matches candidate sounds to actual word sequences, helping to disambiguate words that sound similar.
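
A decoder combines the two models by maximizing the sum of their log-probabilities. The classic homophone pair below shows a language model breaking a tie the acoustic model cannot; all probabilities here are invented for illustration:

```python
import math

# The acoustic model alone scores these two near-homophones similarly,
# so the language model's prior over word sequences decides the output.
candidates = {
    "recognize speech": {"acoustic": 0.40, "language": 0.010},
    "wreck a nice beach": {"acoustic": 0.42, "language": 0.0001},
}

def decode(cands):
    # Decoders pick argmax over log P(audio | words) + log P(words).
    return max(cands, key=lambda w: math.log(cands[w]["acoustic"])
                                    + math.log(cands[w]["language"]))

print(decode(candidates))
```

Even though "wreck a nice beach" has the slightly better acoustic score here, its tiny language-model probability makes "recognize speech" the winning hypothesis.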

Many current speech recognition systems are still based on hidden Markov models (HMMs), which treat speech as a sequence of hidden states emitting observable acoustic features and help improve overall efficiency and accuracy.
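
A minimal Viterbi decoder shows how a hidden Markov model recovers the most likely hidden state sequence from observations. The two states and all probabilities below are invented for illustration; real recognizers use phone-level states and acoustic feature vectors:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observation list."""
    # Probability of the best path ending in each state at time 0.
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            # Best predecessor state for reaching s while emitting o.
            prob, prev = max((V[-2][p] * trans_p[p][s] * emit_p[s][o], p)
                             for p in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

states = ("silence", "speech")
start_p = {"silence": 0.7, "speech": 0.3}
trans_p = {"silence": {"silence": 0.8, "speech": 0.2},
           "speech": {"silence": 0.3, "speech": 0.7}}
emit_p = {"silence": {"quiet": 0.9, "loud": 0.1},
          "speech": {"quiet": 0.2, "loud": 0.8}}

print(viterbi(["quiet", "loud", "loud"], states, start_p, trans_p, emit_p))
```

Dynamic programming keeps the search linear in the number of observations, which is what makes HMM decoding efficient in practice.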

Uses: -

Speech recognition has tons of applications in distinct industries. Some of them are listed below: -

  1. Military: - Militaries have actively used speech recognition in operations such as training air traffic controllers, and in helicopters and fighter jets, where pilots use it to give commands to the autopilot, set steering coordinates and adjust radio frequencies.
  2. Education: - Uses in the education sector include learning a second language, improving spoken proficiency and hearing how new words are pronounced. Blind students can operate a computer through spoken commands and audio feedback, and conversing with the computer about a topic can help students understand the subject better.
  3. Day-to-Day life: - Voice search, speech-to-text, voice calls and similar features have made everyday tasks easier and more efficient.

Positives and Negatives: -

Although the sector sees continuous improvement, speech recognition still needs considerable work before it appeals to an even wider public. Its biggest strength is ease of use, and the technology is now widely available for the public to try out for themselves.

The main drawbacks are limited support for languages other than English and difficulty capturing words spoken with different accents and pronunciation styles, both of which lead to a higher error rate. In addition, the system works best against a quiet background with no noise other than the user’s voice, a condition that is practically impossible to achieve.

Conclusion: -

Overall, the industry has seen massive recent progress, and further developments are expected to make the technology a success in the near future. Features such as background noise cancellation and support for non-English languages are still needed for it to appeal to a broader audience.


© copyright 2017 www.aimlmarketplace.com. All Rights Reserved.

A Product of HunterTech Ventures