Voice Browser and Multimodal Interaction In 2009

Paolo Baggia Director of International Standards March 6th, 2009

Google TechTalk

Google TechTalk – Mar 6th, 2009

Paolo Baggia


Overview

• A Bit of History
• W3C Speech Interaction Framework Today
  – ASR/DTMF
  – TTS
  – Lexicons
  – Voice Dialog and Call Control
  – Voice Platforms and Next Evolutions
• W3C Multimodal Interaction Today
  – MMI Architecture
  – EMMA and InkML
  – A language for Emotions
• Next Future

Company Profile

• Privately held company (fully owned by Telecom Italia), founded in 2001 as a spin-off from Telecom Italia Labs, capitalizing on 30 years of experience and expertise in voice processing.
• Global company, leader in Europe and South America for award-winning, high-quality voice technologies (synthesis, recognition, authentication and identification), available in 26 languages and 62 voices.
• Multilingual, proprietary technologies protected by over 100 patents worldwide.
• Financially robust: break-even reached in 2004, revenues and earnings growing year on year.
• Growth-plan investment approved for the evolution of products and services.
• Offices in New York. Headquarters in Torino; local representative sales offices in Rome, Madrid, Paris, London, Munich.
• Flexible: about 100 employees, plus a vibrant ecosystem of local freelancers.

International Awards

• “2008 Frost & Sullivan European Telematics and Infotainment Emerging Company of the Year” Award
• Winner of “Market leader-Best Speech Engine” Speech Industry Award 2007 and 2008
• Loquendo MRCP Server: Winner of 2008 IP Contact Center Technology Pioneer Award
• “Best Innovation in Automotive Speech Synthesis” Prize, AVIOS-SpeechTEK West 2007
• “Best Innovation in Expressive Speech Synthesis” Prize, AVIOS-SpeechTEK West 2006
• “Best Innovation in Multi-Lingual Speech Synthesis” Prize, AVIOS-SpeechTEK West 2005

A Bit of History


Standards Bodies

Two main standards bodies:

W3C – World Wide Web Consortium
• Founded in 1994 by Tim Berners-Lee, with a mission to lead the Web to its full potential.
• Staff based at MIT (USA), ERCIM (France), Keio Univ (Japan).
• 400 members all over the world; 50 Working, Interest and Coordination Groups.
• W3C is where the framework of today’s Web is developed (HTML, CSS, XML, DOM, SOAP, RDF, OWL, VoiceXML, SVG, XSLT, P3P, Internationalization, Web Accessibility, Device Independence).

IETF – Internet Engineering Task Force
• Founded in 1986, but grew in 1991 as Internet Society.
• 1300 members.
• HTTP, SIP, RTP and many other protocols.
• Media Resource Control Protocol (MRCP) is very relevant for speech platforms.

Two industrial forums:

VoiceXML Forum (www.voicexml.org)
• Inventors of VoiceXML 1.0, then submitted to W3C for standardization.
• Current goal is to promote, disseminate and support VoiceXML and related standards.

SALT Forum (www.saltforum.org)
• Supported by Microsoft to define a lightweight markup for telephony and multimodal applications.

Other relevant bodies: 3GPP, OMA, ETSI, NIST

The (r)evolution of VoiceXML 1998 - 2004

[Timeline figure with year marks 1998, 1999, 2000, 2002, 2004, 2007, 2008, 2009. Events shown:]
• W3C Voice Browser Workshop
• VoiceXML Forum birth (by AT&T, IBM, Lucent, Motorola)
• W3C charters Voice Browser WG
• VoiceXML 1.0 released
• SALT Forum birth (by Cisco, Comverse, Intel, Microsoft, Philips, SpeechWorks)
• W3C charters Multimodal Interaction WG
• VoiceXML 2.0 W3C Rec, SRGS 1.0 W3C Rec, SSML 1.0 W3C Rec
• SISR 1.0 W3C Rec, VoiceXML 2.1 W3C Rec
• PLS 1.0 W3C Rec
• EMMA 1.0 W3C Rec

[Photo] Preparing to announce VoiceXML 1.0, Friday Feb. 25th, 2000, Lucent, Naperville, Illinois. Left to right: Gerald Karam (AT&T), Linda Boyer (IBM), Ken Rehor (Lucent), Bruce Lucas (IBM), Pete Danielsen (Lucent), Jim Ferrans (Motorola), Dave Ladd (Motorola).

Speech Interface Framework in 2000 (by Jim Larson)

[Framework diagram. Processing components: User, Telephone System, ASR, DTMF Tone Recognizer, Language Understanding, Context Interpretation, Dialog Manager, Media Planning, Language Generation, TTS, Pre-recorded Audio Player, Reusable Components, World Wide Web. Specifications: Speech Recognition Grammar Spec. (SRGS), N-gram Grammar ML, Semantic Interpretation for Speech Recognition (SISR), EMMA, Natural Language Semantics ML, Pronunciation Lexicon Specification (PLS), VoiceXML 2.0, VoiceXML 2.1, Speech Synthesis Markup Language (SSML), Call Control XML (CCXML).]

Speech Interface Framework - Today (by Jim Larson)

[Framework diagram. Processing components: User, Telephone System, ASR, DTMF Tone Recognizer, Language Understanding, Context Interpretation, Dialog Manager, Media Planning, Language Generation, TTS, Pre-recorded Audio Player, Reusable Components, World Wide Web. Specifications: Speech Recognition Grammar Spec. (SRGS), N-gram Grammar ML, Semantic Interpretation for Speech Recognition (SISR), EMMA 1.0, Natural Language Semantics ML, Pronunciation Lexicon Specification (PLS), VoiceXML 2.0, VoiceXML 2.1, Speech Synthesis Markup Language (SSML), Call Control XML (CCXML).]

Speech Interface Framework - End of 2009 (by Jim Larson)

[Framework diagram. Processing components: User, Telephone System, ASR, DTMF Tone Recognizer, Language Understanding, Context Interpretation, Dialog Manager, Media Planning, Language Generation, TTS, Pre-recorded Audio Player, Reusable Components, World Wide Web. Specifications: Speech Recognition Grammar Spec. (SRGS), N-gram Grammar ML, Semantic Interpretation for Speech Recognition (SISR), EMMA 1.0, Natural Language Semantics ML, Pronunciation Lexicon Specification (PLS), VoiceXML 2.0, VoiceXML 2.1, Speech Synthesis Markup Language (SSML), Call Control XML (CCXML).]

W3C Process


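Architectural Changes

Traditional (proprietary) architecture
[Diagram: the User talks to a proprietary platform (ASR / DTMF, TTS / Audio) running a Speech Applic. built with a proprietary SCE.]

VoiceXML architecture
[Diagram: the User talks to a VoiceXML platform (ASR / DTMF, VoiceXML Browser, TTS / Audio), which talks over HTTP to a Web Applic. Resources shown: .vxml for the VoiceXML Browser; .grxml/.gram and .pls for ASR / DTMF; .ssml, .wav/.mp3 and .pls for TTS / Audio.]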
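As a rough illustration of the VoiceXML architecture, the resource types on the slide can be written as a lookup table from file extension to the platform component that consumes it. The component labels are copied from the diagram; the `consumers` helper and the table name are hypothetical and not part of any VoiceXML platform API.

```python
import os

# File extensions from the slide, mapped to the component(s) that use them.
# Note that .pls lexicons are shared by recognition and synthesis.
RESOURCE_CONSUMERS = {
    ".vxml": {"VoiceXML Browser"},
    ".grxml": {"ASR / DTMF"},
    ".gram": {"ASR / DTMF"},
    ".ssml": {"TTS / Audio"},
    ".wav": {"TTS / Audio"},
    ".mp3": {"TTS / Audio"},
    ".pls": {"ASR / DTMF", "TTS / Audio"},
}


def consumers(resource_name):
    """Return the set of components that would fetch this resource."""
    extension = os.path.splitext(resource_name)[1].lower()
    # Copy the set so callers cannot mutate the table by accident.
    return set(RESOURCE_CONSUMERS.get(extension, ()))


print(consumers("cities.grxml"))
print(consumers("names.pls"))
```

The point of the sketch is only that every resource is a plain Web document fetched over HTTP, so the application side needs nothing beyond a Web server.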


The VoiceXML Impact

VoiceXML changed the landscape of IVRs and speech application creation: from proprietary to standards-based speech applications.

Before
• Proprietary platforms (HW & SW)
• Proprietary applications (by proprietary SCE)
• Mainly DTMF and pre-recorded prompts
• First attempts to add speech into IVR

After
• Standard VoiceXML platforms
• Standards for Speech Technologies
• Standard tools for VoiceXML applications
• Integration of DTMF and ASR
• Still predominance of DTMF, but more and more speech applications


Standards for ASR and DTMF
SRGS 1.0, SISR 1.0

W3C Standards for Speech/DTMF Grammars

SYNTAX (SRGS): the speech grammar defines constraints on admissible sentences for a specific recognition turn
• Formats: ABNF, XML
• Modes: voice, dtmf
http://www.w3.org/TR/speech-grammar/

SEMANTICS (SISR): describes how to produce results after an utterance is recognized
• Tag formats: literal, script
http://www.w3.org/TR/semantic-interpretation/

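SRGS/SISR Grammars for “Torino”

SRGS XML, SISR literal (tag-format="semantics/1.0-literal")

    <rule id="main" scope="public">Torino <tag>10100</tag></rule>

SRGS ABNF, SISR literal

    #ABNF 1.0 iso-8859-1;
    mode voice;
    tag-format <semantics/1.0-literal>;
    public $main = Torino {10100} ;

SRGS XML, SISR script (tag-format="semantics/1.0")

    <tag>var unused=7;</tag>
    <rule id="main" scope="public">Torino <tag>out="10100";</tag></rule>

SRGS ABNF, SISR script

    #ABNF 1.0 iso-8859-1;
    mode voice;
    tag-format <semantics/1.0>;
    {var unused=7;};
    public $main = Torino {out="10100";} ;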
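To make the literal case concrete, here is a minimal Python sketch that reads an SRGS XML grammar and pairs each spoken item with the literal content of its tag. The complete grammar document below is an assumed expansion of the “Torino” example (the `xml:lang` value and the `<item>` wrapper are guesses), and `literal_interpretations` is a hypothetical helper, not a real recognizer API. It only handles flat word lists, which is exactly the case where SISR literal semantics is useful.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"

# Assumed full form of the "Torino" grammar; xml:lang is illustrative.
TORINO_GRXML = """
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="it-IT" mode="voice" root="main"
         tag-format="semantics/1.0-literal">
  <rule id="main" scope="public">
    <item>Torino <tag>10100</tag></item>
  </rule>
</grammar>
"""


def literal_interpretations(grxml):
    """Map the text of each <item> to the literal content of its <tag>."""
    root = ET.fromstring(grxml.strip())
    if root.get("tag-format") != "semantics/1.0-literal":
        raise ValueError("expected a SISR literal grammar")
    results = {}
    for item in root.iter(f"{{{SRGS_NS}}}item"):
        tag = item.find(f"{{{SRGS_NS}}}tag")
        if tag is not None:
            spoken = (item.text or "").strip()
            results[spoken] = (tag.text or "").strip()
    return results


print(literal_interpretations(TORINO_GRXML))  # {'Torino': '10100'}
```

With the script tag format the tag content would be ECMAScript to evaluate rather than a string to copy, which this sketch deliberately does not attempt.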




SRGS/SISR Standards – Pros

• Powerful syntax (CFG) and very powerful semantics (ECMAScript)
• DTMF and voice input are transparent to the application
• Wide and consistent adoption among technology vendors
• Two syntaxes, XML and ABNF, are great!
  – Developers can choose (XML validation vs. compact format)
  – Transformations are possible: XML → ABNF (easy, simple XSLT); ABNF → XML (requires an ABNF parser)
• Open Source tools might be created to:
  – Validate grammar syntax
  – Transform grammars
  – Debug grammars on written input
  – Run coverage tests: explode covered sentences, GenSem, SemTester, etc.
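A coverage tool of the kind suggested above could start from something as small as the following sketch, which "explodes" every sentence covered by a grammar. The grammar here is a hypothetical toy held in a Python dict rather than parsed from SRGS, and the approach assumes the grammar is non-recursive (otherwise the set of sentences is infinite).

```python
from itertools import product

# Hypothetical toy grammar: each rule maps to a list of alternatives,
# and each alternative is a sequence of words or $rule references.
GRAMMAR = {
    "$main": [["$polite", "$city"], ["$city"]],
    "$polite": [["to"], ["going", "to"]],
    "$city": [["Torino"], ["Roma"]],
}


def explode(symbol, grammar):
    """Yield every word sequence covered by `symbol` (non-recursive grammars only)."""
    if symbol not in grammar:
        # A terminal word expands to itself.
        yield [symbol]
        return
    for alternative in grammar[symbol]:
        expansions = [list(explode(part, grammar)) for part in alternative]
        # Combine one expansion per part, in order.
        for combination in product(*expansions):
            yield [word for piece in combination for word in piece]


sentences = sorted(" ".join(words) for words in explode("$main", GRAMMAR))
for sentence in sentences:
    print(sentence)
```

Feeding each generated sentence back through the recognizer's text-input mode would then give a basic regression test for both syntax and semantics.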


SRGS/SISR Standards – Small Issues

Semantics declaration: tag-format attribute
• If value “semantics/1.0”? Mandate SISR Script semantics inside semantic tags
• If value “semantics/1.0-literal”? Mandate SISR Literal semantics inside semantic tags
• If missing? Unclear! Risk of interoperability troubles

SISR Script Semantics
• Clumsy default assignment: returns last referenced rule only
• Developers must explicitly pop results up to the enclosing rule
• Be careful when redefining “out”
• Assigning a scalar value might result in errors

SISR Literal Semantics
• Only useful for very simple word-list rules
• No support for encapsulating rules
• SISR Literal grammars as external references ONLY!
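The tag-format issue is easy to check mechanically. The sketch below classifies an SRGS XML grammar by its tag-format attribute and flags the missing case; the function name and the returned labels are hypothetical, and what a platform actually does with a missing or unknown value is vendor-specific.

```python
import xml.etree.ElementTree as ET


def semantics_mode(grxml):
    """Classify an SRGS XML grammar by its tag-format attribute."""
    tag_format = ET.fromstring(grxml).get("tag-format")
    if tag_format == "semantics/1.0":
        return "script"
    if tag_format == "semantics/1.0-literal":
        return "literal"
    if tag_format is None:
        # The unclear case from the slide: behaviour may differ per platform.
        return "unspecified"
    return "other"


SCRIPT = '<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0" tag-format="semantics/1.0"/>'
LITERAL = '<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0" tag-format="semantics/1.0-literal"/>'
MISSING = '<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"/>'

for grammar in (SCRIPT, LITERAL, MISSING):
    print(semantics_mode(grammar))
```

A grammar linter could refuse to deploy anything that comes back as "unspecified", which sidesteps the interoperability risk.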


SRGS/SISR – Encapsulated Grammars

[Diagram of grammars referencing one another with mixed semantics: Gr1.grxml (Script), Gr2.gram (Literal), Gr3.grxml (Script), Gr41.grxml (Literal), Gr42.gram (Script).]

SRGS/SISR Standards – Rich XML Results

Section 7 of SISR 1.0 specification
http://www.w3.org/TR/semantic-interpretation/#SI7

Serialization rules from SISR ECMA results into XML

Edge cases:
• Arrays
• Special variables “_attributes” and “_value”
• Creation of namespaces and prefixes

ECMAScript result:

    {
      drink: {
        _nsdecl: { _prefix:"n1", _name:"http://www.example.com/n1" },
        _nsprefix:"n1",
        liquid: {
          _nsdecl: { _prefix:"n2", _name:"http://www.example.com/n2" },
          _attributes: { color: { _nsprefix:"n2", _value:"black" } },
          _value:"coke"
        },
        size:"medium"
      }
    }

XML serialization:

    <n1:drink xmlns:n1="http://www.example.com/n1">
      <liquid n2:color="black" xmlns:n2="http://www.example.com/n2">coke</liquid>
      <size>medium</size>
    </n1:drink>
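The serialization idea can be sketched in a few lines of Python, with the ECMAScript result written as a nested dict. This is a simplified illustration, not the normative algorithm from Section 7: it handles only `_nsdecl`, `_nsprefix`, `_attributes` and `_value`, ignores arrays, and the `to_xml` helper is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def to_xml(name, value):
    """Serialize one property of a SISR-style result into an XML element string."""
    if not isinstance(value, dict):
        return f"<{name}>{escape(str(value))}</{name}>"
    attributes = []
    nsdecl = value.get("_nsdecl")
    if nsdecl:
        attributes.append(f"xmlns:{nsdecl['_prefix']}={quoteattr(nsdecl['_name'])}")
    for attr_name, attr in value.get("_attributes", {}).items():
        if isinstance(attr, dict):
            prefix, attr_value = attr.get("_nsprefix"), attr["_value"]
        else:
            prefix, attr_value = None, attr
        qname = f"{prefix}:{attr_name}" if prefix else attr_name
        attributes.append(f"{qname}={quoteattr(str(attr_value))}")
    prefix = value.get("_nsprefix")
    tag = f"{prefix}:{name}" if prefix else name
    text = escape(str(value.get("_value", "")))
    # Properties starting with "_" are serialization hints, not child elements.
    children = "".join(to_xml(k, v) for k, v in value.items() if not k.startswith("_"))
    return f"<{' '.join([tag] + attributes)}>{text}{children}</{tag}>"


RESULT = {
    "drink": {
        "_nsdecl": {"_prefix": "n1", "_name": "http://www.example.com/n1"},
        "_nsprefix": "n1",
        "liquid": {
            "_nsdecl": {"_prefix": "n2", "_name": "http://www.example.com/n2"},
            "_attributes": {"color": {"_nsprefix": "n2", "_value": "black"}},
            "_value": "coke",
        },
        "size": "medium",
    }
}

xml_text = to_xml("drink", RESULT["drink"])
root = ET.fromstring(xml_text)  # parsing confirms the output is well-formed
print(xml_text)
```

Parsing the output back with `ElementTree` is a cheap sanity check that the namespace declarations and prefixes line up.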


SRGS/SISR Standards – Next Steps

Adoption of the PLS 1.0 lexicon
• Clear entry point into PLS lexicons: <lexicon> element
• Missing role attribute in <token> to allow homograph disambiguation

Next extensions via Errata
• XML 1.1 support and IR
• Update normative references

No Major Extensions are needed!

Speech Synthesis
SSML 1.0/1.1

TTS – Functional Architecture and Markup/Non-Markup support

Structure Analysis
• Markup support: <p>, <s>
• Non-Markup support: infer the structure by automatic text analysis

Text Normalization
• Markup support: <say-as> for date, time, phone number, numbers; <sub> for acronyms and transliterations
• Non-Markup support: automatically identify and convert constructs

Text-to-Phoneme Conversion
• Markup support: <phoneme>, <lexicon>
• Non-Markup support: look up in pronunciation dictionary

Prosody Analysis

Waveform Production
• Markup support: <voice>, <audio>

SSML 1.0 – Language description (I)

I don't speak Japanese.
Nihongo-ga wakarimasen.

Processing and Pronunciation
– <p> and <s> (paragraph and sentence) elements give a structure to the text
– <say-as> element indicates the type of text construct contained within the element, e.g. date, numbers, etc.
– <phoneme> element provides a phonetic pronunciation for the contained text in IPA
– <sub> element provides substitutions, e.g. for expanding acronyms into a sequence of words

http://www.w3.org/TR/speech-synthesis/
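The elements above can be seen together in one small document. The SSML text below is an invented example (the sentences, the `interpret-as="date"` / `format="mdy"` values and the IPA string are illustrative assumptions, and `say-as` values are not fixed by SSML 1.0 itself); the Python around it only checks that the markup is well-formed and lists the elements in document order.

```python
import xml.etree.ElementTree as ET

# Illustrative SSML 1.0 document using p, s, say-as, sub and phoneme.
SSML_DOC = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <p>
    <s>The talk is on <say-as interpret-as="date" format="mdy">3/6/2009</say-as>.</s>
    <s>It covers <sub alias="World Wide Web Consortium">W3C</sub> standards.</s>
    <s>The speaker comes from <phoneme alphabet="ipa" ph="toˈriːno">Torino</phoneme>.</s>
  </p>
</speak>"""


def local_names(ssml):
    """Return element names in document order, without the SSML namespace."""
    root = ET.fromstring(ssml)
    return [element.tag.split("}", 1)[-1] for element in root.iter()]


print(local_names(SSML_DOC))
```

A synthesizer without markup support would have to infer all of this (sentence boundaries, the date reading, the acronym expansion, the Italian pronunciation) from the raw text.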


SSML 1.0 – Language description (II)

Style
– <voice> element selects the speaking voice
  The moon is rising on the beach, when John says, looking Mary in the eyes: I love you! but she suddenly replies: Please, be serious!
  Other voice selection attributes are: name, xml:lang, gender, age, and variant
– <emphasis> element requests that the contained text be spoken with emphasis
  The level attribute can be set to strong, moderate, reduced, or none
– <break> element controls the pausing between words
  time attribute: time expressions in seconds or milliseconds, e.g. “5s”, “20ms”
  strength attribute with values: none, x-weak, weak, medium (default value), strong, or x-strong

http://www.w3.org/TR/speech-synthesis/
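As a sketch of how the style elements combine, the document below marks up the example sentence with voice, emphasis and break. The choice of gender and age values and the placement of the breaks are assumptions made for illustration; the small `style_problems` checker is hypothetical and only validates the two value lists quoted on the slide.

```python
import xml.etree.ElementTree as ET

EMPHASIS_LEVELS = {"strong", "moderate", "reduced", "none"}
BREAK_STRENGTHS = {"none", "x-weak", "weak", "medium", "strong", "x-strong"}

# Illustrative markup of the slide's example text.
SSML_STYLE = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice gender="female">The moon is rising on the beach, when John says, looking Mary in the eyes:</voice>
  <voice gender="male"><emphasis level="strong">I love you!</emphasis></voice>
  <break time="500ms"/>
  <voice gender="female">but she suddenly replies:</voice>
  <break strength="weak"/>
  <voice gender="female" age="30">Please, be serious!</voice>
</speak>"""


def style_problems(ssml):
    """List emphasis levels and break strengths that are not allowed values."""
    problems = []
    for element in ET.fromstring(ssml).iter():
        name = element.tag.split("}", 1)[-1]
        if name == "emphasis":
            level = element.get("level", "moderate")
            if level not in EMPHASIS_LEVELS:
                problems.append(f"emphasis level={level!r}")
        elif name == "break":
            strength = element.get("strength", "medium")
            if strength not in BREAK_STRENGTHS:
                problems.append(f"break strength={strength!r}")
    return problems


print(style_problems(SSML_STYLE))  # []
```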


SSML 1.0 – Language description (III)

Prosody
– <prosody> element permits control of the pitch, speaking rate and volume of the speech output. The attributes are:
  • volume: the volume for the contained text.
  • rate: the speaking rate in words-per-minute for the contained text.
  • duration: a value in seconds or milliseconds for the desired time to take to read the element contents.
  • pitch: the baseline pitch for the contained text.
  • range: the pitch range (variability) for the contained text in Hertz.
  • contour: sets the actual pitch contour for the contained text.
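A short sketch of the prosody element in use: the SSML text below is an invented example using label values (`slow`, `soft`, `high`) and a duration, and the hypothetical `prosody_settings` helper simply collects the attributes while rejecting any name outside the six listed above.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
PROSODY_ATTRIBUTES = {"volume", "rate", "duration", "pitch", "range", "contour"}

# Illustrative document; the sentences and attribute values are made up.
SSML_PROSODY = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <prosody rate="slow" volume="soft">Please listen carefully.</prosody>
  <prosody pitch="high" duration="2s">Your code is ready.</prosody>
</speak>"""


def prosody_settings(ssml):
    """Return the attributes of each <prosody> element, in document order."""
    settings = []
    for element in ET.fromstring(ssml).iter(f"{{{SSML_NS}}}prosody"):
        unknown = set(element.attrib) - PROSODY_ATTRIBUTES
        if unknown:
            raise ValueError(f"unknown prosody attributes: {sorted(unknown)}")
        settings.append(dict(element.attrib))
    return settings


for setting in prosody_settings(SSML_PROSODY):
    print(setting)
```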

Other elements


Input Speech recognition

Recording

Keypad

Output Audio files