A Double Metaphone Encoding for Bangla and its Application in Spelling Checker

Naushad UzZaman and Mumit Khan
Center for Research on Bangla Language Processing
BRAC University, Bangladesh

2005 IEEE International Conference on Natural Language Processing and Knowledge Engineering
Oct 31, 2005
Donghu Hotel, Wuhan, China

IEEE NLP KE 2005


Topics to be covered

- About Bangla / Bengali language
- Motivation for phonetic encoding
- Phonetic encoding
- Performance of phonetic encoding in spelling checker
- Conclusion

Background of Bengali / Bangla

- Spoken mainly in
  - Bangladesh and the Indian states of West Bengal, Tripura, and Assam
- Native speakers
  - More than 200 million
  - 4th most widely spoken native language according to the Ethnologue survey
- Bengali/Bangla
  - Bengali is the exonym
  - Bangla (বাংলা) is the endonym

Generic Classification of Bangla language


Generic Classification of Bangla script


Example of Bangla script

Consonant | IPA
ক | /kɔ/
খ | /khɔ/
গ | /gɔ/
ঘ | /ghɔ/

Vowel | Vowel sign | With KA (ক) | IPA
অ | (none) | ক | /kɔ/ and ko
আ | ◌া | ক + ◌া = কা | ka
ই | ◌ি | ক + ◌ি = কি | ki
উ | ◌ু | ক + ◌ু = কু | ku

Consonant cluster | Constituents
ক্ষ | ক + ◌্ + ষ
ঞ্চ | ঞ + ◌্ + চ
জ্ঞ | জ + ◌্ + ঞ
ল্ম | ল + ◌্ + ম

Type | Count
Vowel | 11
Consonant | 49
Consonant cluster | More than 200

Motivation

- Complex orthographic rules and a large gap between script and pronunciation in Bangla

Phonetic Encoding

- Encodes a word based on its pronunciation
- Similar-sounding words have the same code

Example of Spell Checking Using Encoding

Dictionary word list | Encoded word list
aকালপk /ɔkalpɔkko/ | “okalpkk”
সকাল /ʃɔkal/ | “skal”
পাষাণ /paʃan/ | “pasan”
দগ্ধ /d̪ɔgd̪ho/ | “dgd”

Test word | Encoded test word
শকাল /ʃɔkal/ | “skal”

Search for the encoded misspelled word in the encoded word list rather than searching for the misspelled word in the dictionary word list.
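The lookup described on this slide can be sketched as a code-to-words index. This is a minimal illustration, not the authors' implementation: the (word, code) pairs are copied from the slide and taken as given rather than computed, and the names `build_index` and `suggest` are hypothetical.

```python
from collections import defaultdict

# (word, code) pairs copied from the slide; the codes are taken as given,
# not computed by an encoder here.
DICTIONARY = [
    ("সকাল", "skal"),
    ("পাষাণ", "pasan"),
    ("দগ্ধ", "dgd"),
]


def build_index(pairs):
    """Map each phonetic code to the dictionary words that share it."""
    index = defaultdict(list)
    for word, code in pairs:
        index[code].append(word)
    return index


def suggest(encoded_test_word, index):
    """Return the dictionary words whose code equals the test word's code."""
    return index.get(encoded_test_word, [])


index = build_index(DICTIONARY)
# The misspelled test word শকাল is shown on the slide with the code "skal".
print(suggest("skal", index))
```

Because the index is keyed by code, the misspelled word is never compared with dictionary spellings directly; only its code is looked up.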

Phonetic encoding in English

- Established phonetic encodings in English:
  - Soundex
  - Metaphone
  - Phonix
  - Double Metaphone

Key concepts from English phonetic encoding

- Soundex: groups letters of similar pronunciation and gives them the same code
  - Realize – 6004020 – 6420
  - Realise – 6004020 – 6420
- Metaphone & Phonix: also consider the context of a letter when encoding it
  - Knight – NT
  - Nite – NT
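The Soundex example above can be reproduced with a short sketch. It follows the slide's illustration, in which every letter (including the first) becomes a digit; standard Soundex instead keeps the first letter as a letter. The function names `raw_code` and `short_code` are hypothetical, and the fixed length of 4 is inferred from the slide's codes.

```python
# Soundex letter groups; vowels and h, w, y map to 0.
GROUPS = {
    "1": "bfpv",
    "2": "cgjkqsxz",
    "3": "dt",
    "4": "l",
    "5": "mn",
    "6": "r",
    "0": "aeiouhwy",
}
LETTER_TO_DIGIT = {ch: digit for digit, letters in GROUPS.items() for ch in letters}


def raw_code(word):
    """Replace every letter with its group digit (the long code on the slide)."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in word.lower() if ch in LETTER_TO_DIGIT)


def short_code(word, length=4):
    """Collapse adjacent repeats, drop zeros, then pad or truncate to a fixed length."""
    collapsed = []
    for digit in raw_code(word):
        if not collapsed or collapsed[-1] != digit:
            collapsed.append(digit)
    kept = [d for d in collapsed if d != "0"]
    return "".join(kept)[:length].ljust(length, "0")


for w in ("Realize", "Realise"):
    print(w, raw_code(w), short_code(w))
```

Both spellings end up with the same code because "z" and "s" sit in the same letter group.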

Key concepts from English phonetic encoding

- Double Metaphone: gives multiple codes to the same word if it can be pronounced in more than one way
  - Basinger can be pronounced either as “Basin-gger” or as “Basin-jer”
  - Basinger - BSNJR
  - Basin-gger - BSNKR
  - Basin-jer - BSNJR
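One way to use multiple codes when matching is to treat two spellings as sounding alike if they share any code. The sketch below assumes that "Basinger" carries both codes listed on the slide; the table `CODES` and the function `sounds_alike` are hypothetical names, and no real Double Metaphone encoder is invoked.

```python
# Codes copied from the slide. Assumption: "Basinger" carries both codes,
# one for each of its two pronunciations.
CODES = {
    "Basinger": {"BSNKR", "BSNJR"},
    "Basin-gger": {"BSNKR"},
    "Basin-jer": {"BSNJR"},
}


def sounds_alike(a, b, codes=CODES):
    """Two spellings match if their code sets have at least one code in common."""
    return bool(codes[a] & codes[b])


print(sounds_alike("Basinger", "Basin-gger"))
print(sounds_alike("Basinger", "Basin-jer"))
print(sounds_alike("Basin-gger", "Basin-jer"))
```

With a single code per word, only one of the two respellings could match "Basinger"; with a set of codes, both do, while the two respellings still do not match each other.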

Existing Encoding in Bangla

- Hoque and Kaykobad’s encoding, 2002
- UzZaman and Khan’s encoding, 2004

Hoque and Kaykobad’s phonetic encoding

Name | Group member
1 | ক, খ, গ, ঘ, ক্ষ
2 | চ, ছ, জ, ঝ, য
3 | ট, ঠ, ড, ঢ
4 | ত, থ, দ, ধ, ৎ
5 | প, ফ, ব, ভ
6 | ঙ, ঞ, ◌ং
7 | শ, স, ষ
8 | র, ড়, ঢ়, ঋ
9 | ন, ণ
α | ম
β | ল

- For example, কর্ম /kɔrmo/ will be converted to a 4-element code “α8a0”, with zero padding.

UzZaman and Khan’s encoding, 2004

Code | Group members
0 | ◌্, ◌ো, ◌ঁ
“a” | আ, ◌া
“i” | ই, ঈ, ◌ি, ◌ী
“u” | উ, ঊ, ◌ু, ◌ূ
“e” | এ, ◌ে, ঐ, ◌ৈ
“o” | অ, ও, ঔ, ◌ৌ
“k” | ক, খ
“g” | গ, ঘ
“m” | ম, ঙ, ◌ং
“c” | চ, ছ
“j” | য, জ, ঝ

A sample of the actual table

Example of UzZaman and Khan’s soundex

Misspelled word | Correct word | Encoding
খুমাড় /khumar/ | কুমার /kumar/ | kumar
পাসান /paʃan/ | পাষাণ /paʃan/ | pasan
দগধ /dɔgdho/ | দগ্ধ (দ গ ◌্ ধ) /dɔgdho/ | dgd
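The three example pairs can be reproduced with a small table-driven encoder. This is a sketch under stated assumptions, not the authors' full table: rows marked "slide" come from the sample table, rows marked "assumed" are hypothetical fill-ins inferred from the example codes, code 0 is treated as "emit nothing", and characters missing from the table are silently skipped.

```python
# Code groups. "slide" rows come from the sample table; "assumed" rows are
# hypothetical fill-ins inferred from the example codes on the slides.
GROUPS = [
    ("", "\u09cd\u0981"),        # slide: code 0 (hasant, candrabindu), emitted as nothing
    ("a", "আ\u09be"),            # slide: আ and its vowel sign
    ("u", "উঊ\u09c1\u09c2"),     # slide: উ, ঊ and their vowel signs
    ("k", "কখ"),                 # slide
    ("g", "গঘ"),                 # slide
    ("m", "মঙ\u0982"),           # slide: ম, ঙ, anusvara
    ("r", "র\u09dc"),            # assumed: র and ড় (U+09DC)
    ("p", "প"),                  # assumed
    ("s", "শসষ"),                # assumed
    ("n", "নণ"),                 # assumed
    ("d", "দধ"),                 # assumed
]
CHAR_TO_CODE = {ch: code for code, chars in GROUPS for ch in chars}


def encode(word):
    """Encode a word one code point at a time; unknown characters are skipped."""
    # ড় may arrive as ড + nukta; fold it into the single code point first.
    word = word.replace("\u09a1\u09bc", "\u09dc")
    return "".join(CHAR_TO_CODE.get(ch, "") for ch in word)


for wrong, right in [("খুমাড়", "কুমার"), ("পাসান", "পাষাণ"), ("দগধ", "দগ্ধ")]:
    print(encode(wrong), encode(right))
```

Each misspelled word and its correction collapse to the same code because the confused letters share a group and the hasant contributes nothing to the code.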

Limitation of existing encodings

- Different pronunciation of constituents in a consonant cluster context.
- Let us consider ক্ষ.
  - ক্ষ = ক /kɔ/ + ◌্ + ষ /ʃɔ/
  - ক্ষ is pronounced as /kh/
  - ক্ষত /khɔt̪o/ is pronounced as খত /khɔt̪o/, where ষ /ʃɔ/ is silent

Limitation of existing encodings

- Different pronunciation of letters or consonant clusters in different contexts: consider again ক্ষ.
  - At the beginning of a word
    - ক্ষত → খত /khɔt̪o/
  - In the middle or at the end of a word
    - দক্ষ → দকখ /d̪okkho/
- Multiple pronunciations of some letters in the same context, such as হ with ব:
  - আহ্বান → আওভান /aovan/
  - আহ্বান → আহভান /ahobhan/

Proposed phonetic encoding

- Double Metaphone phonetic encoding
- No. of transformations: 108
- Includes all vowels, consonants, and consonant clusters (named jukhtakhor in Bangla)

Sample Encoding Rules for ক্ষ

Soundex Encoding

Code | Letter | Name | Unicode
“k” | ক | KA | \u0995
0 (zero) | ◌্ | Virama/Hasant | \u09CD
“s” | ষ | SSA | \u09B7

Double Metaphone Encoding

Cluster | Unicode | Code | Context | Example
ক্ষ | \u0995\u09CD\u09B7 | “k” | @ the beginning | ক্ষত
ক্ষ | \u0995\u09CD\u09B7 | “kk” | @ middle/end | দক্ষ
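The position-dependent rule for ক্ষ can be sketched as a scan that checks for the three-code-point cluster before falling back to single letters. Only the ক্ষ rule comes from the slide; the single-letter codes for ত and দ and the name `encode` are assumptions made so that the two example words can be run.

```python
KSSA = "\u0995\u09cd\u09b7"  # ক্ষ = KA + hasant + SSA

# Hypothetical single-letter codes for the other letters in the two example words.
SIMPLE = {"ত": "t", "দ": "d"}


def encode(word):
    """Encode a word, treating ক্ষ as a unit whose code depends on its position."""
    out = []
    i = 0
    while i < len(word):
        if word.startswith(KSSA, i):
            # Rule from the slide: "k" at the beginning, "kk" in the middle or at the end.
            out.append("k" if i == 0 else "kk")
            i += len(KSSA)
        else:
            # Letters outside the toy table are skipped.
            out.append(SIMPLE.get(word[i], ""))
            i += 1
    return "".join(out)


print(encode(KSSA + "ত"))  # ক্ষত, cluster at the beginning
print(encode("দ" + KSSA))  # দক্ষ, cluster at the end
```

Matching the cluster before its constituents is what lets the rule override the letter-by-letter Soundex codes for ক, ◌্ and ষ.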

Performance in spelling checker

No. of words | 1607*
Correct (Edit Distance 0) | 1473
Error | 134
Rate of accuracy | 91.67%
Rate of error | 8.33%

*Source of words: Bangla Banan Obhidhan, Dr. Khurshid Alam, Mirnava, Dhaka, Bangladesh.

Performance in spelling checker

No. of words | 1607
Correct (Edit Distance 0) | 1473
Error | 134
Rate of accuracy | 91.67%
Rate of error | 8.33%

Breakdown of errors | Words | Share
Error | 134 | 8.33%
Edit Distance 1 | 107 | 6.65%
Edit Distance 2 | 27 | 1.68%
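The edit-distance buckets above can be illustrated with a standard Levenshtein distance applied to phonetic codes. The slides do not say exactly which strings the checker compares, so comparing codes is an assumption here; the (word, code) pairs are taken from the earlier slide, the test code "skl" is a made-up near miss, and `suggestions` is a hypothetical name.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suggestions(test_code, coded_words, max_distance=2):
    """Rank dictionary words by the edit distance between phonetic codes."""
    scored = [(edit_distance(test_code, code), word) for word, code in coded_words]
    return sorted((d, w) for d, w in scored if d <= max_distance)


# (word, code) pairs as shown on the earlier slide.
CODED_WORDS = [("সকাল", "skal"), ("পাষাণ", "pasan"), ("দগ্ধ", "dgd")]

print(suggestions("skal", CODED_WORDS))  # exact code match, distance 0
print(suggestions("skl", CODED_WORDS))   # hypothetical near miss, distance 1
```

Raising `max_distance` from 0 to 2 widens the candidate set in the same way the table moves from exact code matches to the Edit Distance 1 and 2 rows.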

Summary and Conclusion

- Proposed a Double Metaphone phonetic encoding for Bangla
- Handles the complexity of Bangla orthographic rules
- Used the encoding in a spelling checker
- 92% accuracy in the spelling checker with just the phonetic encoding
- For our particular sample set, we get 100% accuracy by using the phonetic encoding together with an edit distance of up to 2

Question?

The Summer Institute of Linguistics (SIL) Ethnologue Survey (1999)

- The SIL Ethnologue Survey (1999) lists the following as the top languages by population (number of native speakers in parentheses):
  1. Chinese* (937,132,000)
  2. Spanish (332,000,000)
  3. English (322,000,000)
  4. Bengali (189,000,000)
  5. Hindi/Urdu (182,000,000)
  6. Arabic* (174,950,000)
  7. Portuguese (170,000,000)
  8. Russian (170,000,000)
  9. Japanese (125,000,000)
  10. German (98,000,000)
  11. French* (79,572,000)

Dr. Bernard Comrie’s article for the Encarta Encyclopedia (1998)

- The following list is from Dr. Bernard Comrie’s article for the Encarta Encyclopedia (1998) (number of native speakers in parentheses):
  1. Mandarin Chinese (836 million)
  2. Hindi (333 million)
  3. Spanish (332 million)
  4. English (322 million)
  5. Bengali (189 million)
  6. Arabic (186 million)
  7. Russian (170 million)
  8. Portuguese (170 million)
  9. Japanese (125 million)
  10. German (98 million)
  11. French (72 million)

http://www.aneki.com/languages.html
Source: University of Washington

Rank | Language | No. of speakers
1 | Chinese (Mandarin) | 1,000,000,000+
2 | English | 508,000,000
3 | Hindustani (Hindi and Urdu) | 497,000,000
4 | Spanish | 392,000,000
5 | Russian | 277,000,000
6 | Arabic | 246,000,000
7 | Bengali | 211,000,000
8 | Portuguese | 191,000,000
9 | Malay-Indonesian | 159,000,000
10 | French | 129,000,000

Found in 15,162,317 words

Error zone length (in no. of char.) | % of words
1 | 41.36
2 | 32.94
3 | 16.58
4 | 7.10
5 | 1.78
6 | 0.24

B. B. Chaudhuri, “Reversed word dictionary and phonetically similar word grouping based spell-checker to Bangla text”, Proc. LESAL Workshop, Mumbai, 2001.
