Speech Recognition for Mobile Devices at Google

Mike Schuster

Google Research, 1600 Amphitheatre Pkwy., Mountain View, CA 94043, USA
[email protected]

Abstract. We briefly describe here some of the content of a talk to be given at the conference.



At Google, we focus on making information universally accessible through many channels, including spoken input. Since the speech group started in 2005, we have developed several successful speech recognition services for the US and some other countries. In 2006 we launched GOOG-411 in the US, a speech-recognition-driven directory assistance service that works from any phone. As smartphones such as the iPhone, BlackBerry, Nokia S60 devices and Android phones like the Nexus One became more widely used, we shifted our efforts to providing speech input for the search engine (Search by Voice) and other applications on these phones. Many recent smartphones have only soft keyboards, which can be difficult to type on, especially for longer words and sentences. Some Asian languages, for example Japanese and Chinese, are more difficult to type because their basic character sets are very large compared to those of Latin-alphabet languages. Spoken input is a natural way to alleviate many of these problems; more details are discussed in the sections below. We have also been working on voice mail transcription and YouTube transcription for US English, both of which are publicly available products in the US, but the focus here is on speech recognition in the context of mobile devices.



GOOG-411 is Google’s speech-recognition-based directory assistance service operating in the US and Canada [1], [2]. This application uses a toll-free number, 1-800-GOOG-411 (1-800-4664-411). The user is prompted to say the city, the state and the name of the business he or she is looking for. Using text-to-speech, the service can read out the address and phone number, or it can connect the user directly to the business. The backend uses information from Google Maps Local. While this is a useful application for finding restaurants, stores, etc., it is limited to businesses. Other difficulties with this kind of service include the necessity of a dialog, relatively expensive operating costs, listing errors in the backend database and, most importantly, the inability to give richer information (as on a smartphone screen) back to the user.

B.-T. Zhang and M.A. Orgun (Eds.): PRICAI 2010, LNAI 6230, pp. 8–10, 2010. © Springer-Verlag Berlin Heidelberg 2010


Voice Search

In 2008 Google launched Voice Search in the US for several types of smartphones [3]. Voice Search simply adds the ability to speak a search query to the phone instead of having to type it into the browser. The audio is sent to Google servers, where it is recognized, and the recognition result is sent back to the phone along with the search result. The data goes over the data channel instead of the voice channel, which allows higher-quality audio transmission and therefore better recognition rates. Our speech recognition technology is relatively standard; some details follow.

Front-End and Acoustic Model. For the front-end we use 39-dimensional PLP (perceptual linear prediction) features with LDA (linear discriminant analysis). The acoustic models are ML- and MMI-trained (maximum likelihood and maximum mutual information), triphone, decision-tree-tied 3-state HMMs with currently up to 10k states in total. The state distributions are modeled by 50k to 300k diagonal-covariance Gaussians with STC (semi-tied covariances). We use a time-synchronous finite-state transducer (FST) decoder with Gaussian selection for fast likelihood calculation.

Dictionary. Our phone set contains between 30 and 100 phones, depending on the language. We use between 200k and 1.5M words in the dictionary, which are automatically extracted from the web-based query stream. The pronunciations for these words are mostly generated by an automatic system, with special treatment for numbers, abbreviations and other exceptions.

Language Model. As our goal is to recognize search queries, we mine our language model data from anonymous web-based search queries. We mostly use 3-grams or 5-grams with Katz backoff, trained on months or years of query data. The language models have to be pruned appropriately so that the final decoder graphs fit into the memory of the servers.

Acoustic Data. To train an initial system we collect roughly 250k spoken queries using an Android application specifically designed for this purpose [4]. Several hundred speakers read queries off a screen and the corresponding voice samples are recorded.
As most queries are spoken without errors, we don’t have to transcribe them manually.

Metrics. We want to optimize the user experience. Traditionally, speech recognition systems focus on minimizing word error rate. This is also a useful measure for us, but a normalized sentence error rate is better, as it doesn’t depend as much on the definition of a word. As the metric that best approximates user experience we use WebScore: we send the hypothesis and the reference to a search backend and compare the links we get back. Assuming that the reference generates the correct search result, we then know whether the correct result is among the first three results returned for the hypothesis, so that the user can see it on the smartphone screen.

Languages. After US English we launched Voice Search for the UK, Australia and India. In late 2009, Mandarin Chinese [5] and Japanese were added. Foreign languages pose many additional challenges. For example, some Asian languages like Japanese and Chinese don’t have spaces between words. For these we wrote a segmenter which optimizes the word definitions by maximizing sentence likelihood. Most languages have characters outside the normal ASCII set, in some cases thousands, which complicates automatic pronunciation rules.

Additional Challenges. There are many details that are critical to get right for a good user experience but that we cannot discuss here because of space constraints. These include getting the user interface right, optimizing protocols for minimum latency, dealing correctly with special cases like numbers, dates and abbreviations, avoiding the display of offensive queries, and improving the system efficiently after launch using the incoming data.
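
The WebScore idea described under Metrics can be illustrated with a small sketch. This is not the production metric: the function name `web_score`, the toy backend `toy_search`, the index `TOY_INDEX` and the example URLs are all hypothetical, and the sketch assumes that "the correct search result" means the top link returned for the reference transcript.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical backend type: maps a query string to a ranked list of result links.
SearchFn = Callable[[str], Sequence[str]]


def web_score(pairs: Sequence[Tuple[str, str]], search: SearchFn, top_n: int = 3) -> float:
    """Fraction of (hypothesis, reference) pairs for which the top link of the
    reference query appears among the first top_n links of the hypothesis query."""
    if not pairs:
        raise ValueError("need at least one (hypothesis, reference) pair")
    hits = 0
    for hypothesis, reference in pairs:
        reference_links = search(reference)
        if not reference_links:
            continue  # assumption: no correct link to look for, counted as a miss
        correct_link = reference_links[0]
        if correct_link in list(search(hypothesis))[:top_n]:
            hits += 1
    return hits / len(pairs)


# Toy stand-in for a search backend (made-up queries and links).
TOY_INDEX = {
    "pizza in mountain view": ["a.example/pizza-mv", "b.example/pizza", "c.example/food"],
    "pizza mountain view": ["b.example/pizza", "c.example/food", "a.example/pizza-mv"],
    "pete's a in mountain view": ["d.example/pete", "e.example/names"],
}


def toy_search(query: str) -> Sequence[str]:
    return TOY_INDEX.get(query, [])


pairs = [
    ("pizza in mountain view", "pizza in mountain view"),     # exact match: hit
    ("pizza mountain view", "pizza in mountain view"),        # word error, still a hit
    ("pete's a in mountain view", "pizza in mountain view"),  # miss
]
score = web_score(pairs, toy_search)
print(f"WebScore: {score:.3f}")
```

The second pair shows why such a metric can be closer to user experience than word error rate: the hypothesis drops a word, yet the user would still see the correct link on the screen.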

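As an illustration of segmentation by maximizing sentence likelihood, the following sketch finds the most likely split of an unspaced string under a fixed unigram word model, using dynamic programming. It covers only the decoding step and is not the segmenter used in the system: the function `segment`, the parameters `max_word_len` and `unknown_logprob`, and the toy probabilities are assumptions made for this example. Learning the word inventory itself is not shown.

```python
import math
from typing import Dict, List, Tuple


def segment(text: str, word_logprob: Dict[str, float],
            max_word_len: int = 4, unknown_logprob: float = -20.0) -> Tuple[List[str], float]:
    """Return the word sequence with the highest total unigram log-likelihood."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-likelihood of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            piece = text[start:end]
            if piece in word_logprob:
                logprob = word_logprob[piece]
            elif end - start == 1:
                logprob = unknown_logprob  # unknown single characters keep a path alive
            else:
                continue
            candidate = best[start] + logprob
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start
    # Follow the back-pointers from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    words.reverse()
    return words, best[n]


# Toy unigram probabilities (made up for illustration).
toy_probs = {"北京": 0.05, "大学": 0.04, "学生": 0.03, "大学生": 0.02,
             "北京大学": 0.01, "大": 0.01, "生": 0.005}
word_logprob = {word: math.log(p) for word, p in toy_probs.items()}

words, loglik = segment("北京大学生", word_logprob)
print(words)  # -> ['北京', '大学生']
```

With these toy numbers, the split 北京 / 大学生 (probability 0.05 × 0.02) beats 北京大学 / 生 (0.01 × 0.005), so the dynamic program prefers it even though the latter starts with a longer dictionary word.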


For mobile devices, speech is an attractive input modality. Besides Voice Search we have been working on other features, including more general Voice Input [6], contact dialing (as launched in the US) and recognition of special phrases to trigger certain applications on the phone. We believe that in the next few years speech input will become more accurate, more accepted and useful enough to help users efficiently access and navigate information provided through mobile devices.

References

1. Bacchiani, M., Beaufays, F., Schalkwyk, J., Schuster, M., Strope, B.: Deploying GOOG-411: Early lessons in data, measurement, and testing. In: Proceedings of ICASSP, pp. 5260–5263 (2008)
2. van Heerden, C., Schalkwyk, J., Strope, B.: Language Modeling for What-with-Where on GOOG-411. In: Proceedings of Interspeech, pp. 991–994 (2009)
3. Schalkwyk, J., Beeferman, D., Beaufays, F., Byrne, B., Chelba, C., Cohen, M., Kamvar, M., Strope, B.: Google Search by Voice: A case study. In: Weinstein, A. (ed.) Visions of Speech: Exploring New Voice Apps in Mobile Environments, Call Centers and Clinics. Springer, Heidelberg (2010) (in press)
4. Hughes, T., Nakajima, K., Ha, L., Vasu, A., Moreno, P., LeBeau, M.: Building transcribed speech corpora quickly and cheaply for many languages. In: Interspeech (submitted 2010)
5. Shan, J., Wu, G., Hu, Z., Tang, X., Jansche, M., Moreno, P.: Search by Voice in Mandarin Chinese. In: Interspeech (submitted 2010)
6. Ballinger, B., Allauzen, C., Gruenstein, A., Schalkwyk, J.: On-Demand Language Model Interpolation for Mobile Speech Input. In: Interspeech (submitted 2010)
