Effect of jumbling intermediate words in Indian languages: An eye-tracking study

Report by: Bharat Ram Ambati (200502004)

Guide: Bipin Indurkhya

1) Introduction: When we look at something, only some of its key features are sent to the brain, and the entire image is constructed from those features. We try to give less work to our sensory organs and more work to the brain. The same is true of reading text. Children, being new to reading, initially read letter by letter and thus assemble the complete word. As we grow up, with practice and to save time, we stop reading letter by letter and instead jump from one word to the next. We extract some key features and send them to the brain, where the entire word is formed. To observe this, earlier experiments tracked eye movements during reading and found that readers jump over words. To test this further, readers were given text in which the first and last letters of each word were in place and the intermediate letters were jumbled. People were able to read it without any problem. This indicated that the brain does not read every letter by itself but the word as a whole. These experiments were conducted for English. In our experiment we carried out similar experiments for two Indian languages, namely Hindi¹ and Telugu².

2) Hypothesis: We had three hypotheses in mind while doing this experiment.

A. The human mind does not read every letter by itself but the word as a whole, even in Indian languages.

B. Do we jump while reading a text in which the intermediate letters of words are jumbled?

C. The nature of jumbling and the length of the word affect the reading time and the errors made.

¹ Hindi is a verb-final Indo-Aryan language with free word order and a rich case-marking system. It is one of the official languages of India and is spoken by ~422 million people.

² Telugu is a Dravidian agglutinative language. It is one of the official languages of India and is spoken by ~74 million people.

After the experiment, we examine which of these hold true experimentally.

3) Experimental Setup:

3.1) Data: For stimuli, we extracted Hindi and Telugu text from the respective Wikipedias. The content was a brief description of Hyderabad (state capital of Andhra Pradesh, India). The extracted text was manually corrected for spelling and grammatical errors. Table 1 gives some basic characteristics of the text for each language.

                                      Hindi    Telugu
Total no. of words                    169      103
Average no. of syllables per word     2.73     3.95
Average no. of characters per word    4.92     7.81

Table 1: General Statistics.

Level of Jumbling for Hindi and Telugu:

Unlike English, Indian Languages (ILs) are phonetic languages, so the concept of the akshara/syllable³ is very important in ILs. If we keep the first and last letters in place and jumble the intermediate letters, then for ILs the jumbling can occur across akshara boundaries. In the worst case, all the consonants move to one side and all the vowels move to the other. The text becomes a total mess and highly unreadable. For example:

Telugu: వర (vAri) = వ + ఆ + ర + ఇ (v + A + r + I) → వ + ర + ఆ + ఇ (v + r + A + I) = వరఇ (vrAi)

Hindi: अनुसार = अ + न् + उ + स् + आ + र् → अ + न् + स् + आ + उ + र् = अनसाउर्

In the case of Telugu, 'వర (vAri)' is finally transformed to 'వరఇ (vrAi)'. Similarly, in the case of Hindi, 'अनुसार' becomes 'अनसाउर्'. This makes the text highly unpredictable. So, for ILs, instead of jumbling at the letter level (matra level), we jumbled at the akshara/syllable level only. Five experiments were conducted for each language. The data set used for each of these experiments is described in Table 2.

³ A syllable, or more specifically an open syllable, follows the pattern C*V, where C = consonant, * = zero or more occurrences of the preceding element, and V = vowel.
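The C*V pattern from the footnote can be illustrated on romanized words such as the "vAri" example. The following is only a minimal sketch: the vowel inventory, the function name `split_open_syllables`, and the choice to attach trailing consonants to the last unit are assumptions made for illustration, not the segmentation procedure used in the experiments.

```python
import re

# Assumption: words are romanized as in the "vAri" example, with these
# letters acting as vowels (upper case marking a long vowel).
VOWELS = "aeiouAEIOU"

# C*V: zero or more non-vowels followed by exactly one vowel.
OPEN_SYLLABLE = re.compile(rf"[^{VOWELS}]*[{VOWELS}]")


def split_open_syllables(word):
    """Split a romanized word into C*V units.

    Consonants left over after the last vowel are attached to the final
    unit (a simplifying assumption of this sketch).
    """
    units = OPEN_SYLLABLE.findall(word)
    consumed = sum(len(unit) for unit in units)
    tail = word[consumed:]
    if tail:
        if units:
            units[-1] += tail
        else:
            units = [tail]
    return units


print(split_open_syllables("vAri"))
```

Working on a romanized form sidesteps the handling of vowel signs and conjuncts in the native scripts, which a real segmenter would have to address.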

Table 2: Data sets description.

As is clear from Table 2, data sets 2-5 are modifications of the original text (set 1). Different types of jumbling were explored. In sets 2 and 3 we keep the leftmost position fixed while jumbling the vowels and consonants respectively. Note that in both these sets the jumbling is done over a short distance, i.e. the displacement never exceeds 2 units. In set 4 and set 5 the leftmost and rightmost syllables are fixed and the intervening syllables are jumbled over a short (<=2) and long (>2) distance respectively.
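The syllable-level jumbling of sets 4 and 5 can be sketched as follows. This is a hypothetical illustration, not the procedure actually used to build the stimuli: the function names, the seeded random choice, and the reading of "long distance" as "at least one syllable displaced by more than 2 positions" are all assumptions.

```python
import itertools
import random


def max_displacement(perm):
    """Largest distance any unit moves; perm[new_position] = old_position."""
    return max(abs(new - old) for new, old in enumerate(perm))


def jumble_middle(syllables, long_distance=False, seed=0):
    """Keep the first and last syllables fixed and permute the rest.

    long_distance=False keeps every displacement <= 2 (as in set 4);
    long_distance=True requires some displacement > 2 (assumed for set 5).
    The word is returned unchanged when no such permutation exists.
    """
    if len(syllables) < 4:
        return list(syllables)  # fewer than two middle units to jumble
    middle = syllables[1:-1]
    identity = tuple(range(len(middle)))
    candidates = [
        perm
        for perm in itertools.permutations(identity)
        if perm != identity and (max_displacement(perm) > 2) == long_distance
    ]
    if not candidates:
        return list(syllables)
    perm = random.Random(seed).choice(candidates)
    return [syllables[0]] + [middle[i] for i in perm] + [syllables[-1]]


# Placeholder syllable labels, not taken from the stimulus text.
print(jumble_middle(["s1", "s2", "s3", "s4", "s5"]))
```

Enumerating all permutations is only practical because words have a handful of syllables; a constructive shuffle would be needed for much longer sequences.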

3.2) Procedure: A total of 60 subjects participated in the experiments: 35 for Telugu and 25 for Hindi. All the subjects were undergraduate and graduate students between 20 and 27 years of age, with a mean of 22. For Telugu, all the subjects were native speakers of the language. For Hindi, some of the subjects were not native speakers but spoke it as a second language and knew the script well. Each subject was given one set to read while the eye-tracker tracked his/her eye movements. We also recorded the subject's voice for each reading session. For Telugu each set was read by 7 subjects, while for Hindi each set was read by 5 subjects.

4) Results and Analysis: In this section we first present the results for each of the five experiments in turn and discuss their implications. We then discuss standard eye-tracking parameters and evaluate each experiment against them.

4.1) Experiment based

4.1.1) Experiment 1: Set1 being the original text, its average reading time and error rate are the lowest among all the sets for both Hindi and Telugu. Subjects made only very minor mistakes while reading the text. It is interesting to note that most of the mistakes were grammatically and contextually valid, though the lexical items read were absent from the text. In both Hindi and Telugu these errors mostly involved adpositions. In Telugu the adpositions appear as suffixes, whereas in Hindi they are postpositions.

4.1.2) Experiment 2: In set2, only the vowels were jumbled. Prediction was relatively easy, as the subjects mostly guessed the appropriate vowels from the unchanged consonants. A direct correlation was observed between word length and correct prediction: a long word tends to have more consonants, which restrict the search space, while the opposite holds for short words. Short words therefore have a higher error rate than long words (Figures 1a and 1b).

Figure 1a: Telugu error rate w.r.t. syllable length (series: Set1 to Set5)

Figure 1b: Hindi error rate w.r.t. syllable length (series: Set1 to Set5)

4.1.3) Experiment 3: In set3, only the consonants were jumbled. Understandably, correct predictions become scarce, as the information provided by the word pattern does not restrict the search space. Correct identification of the syllable therefore became difficult, which in turn affected correct identification of the word. For both Hindi and Telugu, this experiment showed the highest error rate and the highest reading time.

4.1.4) Experiment 4: As the first and last syllables are fixed and the jumbling is done over a short distance, prediction was easier than in set2 and set3. It is easy to see why. Compared to an individual unit such as a vowel or a consonant, a syllable has more information encoded in it. In set2 and set3, only part of the syllable was available, so the syllable has to be guessed first from the partial information and only then can the word be guessed. In set4, on the other hand, the jumbling is at the syllable level and over a short distance, so one only needs to guess the word directly. This seems a plausible explanation for the subjects taking less time to read the text and making fewer mistakes, for both Hindi and Telugu.

4.1.5) Experiment 5: Set5 is similar to set4 but has long-distance jumbling: the syllables are moved farther from their original positions. This results in a higher reading time and error rate than in set4. Though the displacement is long, entire syllables are jumbled, which makes more information available for predicting the word than in set3. Thus the error rate and reading time are lower than in set3. This can be seen in Table 3.

                           Set1     Set2     Set3     Set4     Set5
Hindi   Time (seconds)     129.7    151.3    186.4    143.8    160.0
Hindi   Error %            0.947    5.030    6.879    4.882    5.444
Telugu  Time (seconds)     94.2     122.4    140.6    103.6    109.4
Telugu  Error %            0.161    4.577    7.282    3.467    4.935

Table 3: Average reading time and average error percentage.
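The comparisons drawn from Table 3 can be re-checked with a few lines of code. The sketch below only restates the table's values; the helper names `slowdown_vs_original` and `rank_sets` are introduced here for illustration and are not part of the original analysis.

```python
SETS = ["Set1", "Set2", "Set3", "Set4", "Set5"]

# Values copied from Table 3.
TABLE3 = {
    "Hindi": {
        "time_s": [129.7, 151.3, 186.4, 143.8, 160.0],
        "error_pct": [0.947, 5.030, 6.879, 4.882, 5.444],
    },
    "Telugu": {
        "time_s": [94.2, 122.4, 140.6, 103.6, 109.4],
        "error_pct": [0.161, 4.577, 7.282, 3.467, 4.935],
    },
}


def slowdown_vs_original(times):
    """Percentage increase in reading time relative to Set1 (the original text)."""
    base = times[0]
    return [round(100 * (t - base) / base, 1) for t in times]


def rank_sets(values):
    """Set names ordered from the lowest to the highest value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    return [SETS[i] for i in order]


for language, rows in TABLE3.items():
    print(language, "by time:", rank_sets(rows["time_s"]))
    print(language, "by error:", rank_sets(rows["error_pct"]))
    print(language, "slowdown %:", slowdown_vs_original(rows["time_s"]))
```

Ranking by each measure separately makes it easy to see where the two languages agree (Set1 lowest, Set3 highest) and where the ordering of the middle sets differs.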

So far we have looked at two general parameters: average reading time and average error percentage. These two parameters give only limited insight into the reading patterns. To get a more complete picture, we carry out a parameter-based analysis.

4.2) Parameter based

There are some standard metrics used in eye-tracking experiments on reading. Some are word-based measures and some are region-based measures. Each metric has a specific purpose and captures particular characteristics of the reading experiment, such as the complexity of the issue dealt with or the validity of the conclusions drawn. Using data viewer software we extracted three .xls files for each subject. The first file gives complete details of the fixations, the second of the saccades, and the third of the interest areas. We wrote Java code to extract the required parameters from these three files. After extracting the parameters for each word, we took the average over each set.

4.2.1) Word based Measures: 



First Fixation Duration: The duration of the first fixation on a word, regardless of whether it is the only fixation on the word or the first of multiple fixations, provided that the word is not skipped. Figure 2 shows the average first fixation duration.

Single Fixation Duration: The duration of the fixation on a word when only one fixation is made on the word. Figure 3 shows the average single fixation duration.
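The two definitions above can be made concrete on a toy fixation record. This is a minimal sketch, not the Java extraction code mentioned earlier: the input format (a time-ordered list of `(word_index, duration_ms)` pairs) and the function name are assumptions.

```python
from collections import defaultdict


def first_and_single_fixation(fixations):
    """Return (first, single) fixation durations per word.

    fixations: time-ordered (word_index, duration_ms) pairs (assumed format).
    first[w]  : duration of the first fixation on word w.
    single[w] : duration of the fixation on w, only if w was fixated once.
    Skipped words appear in neither dictionary.
    """
    per_word = defaultdict(list)
    for word, duration in fixations:
        per_word[word].append(duration)
    first = {w: durations[0] for w, durations in per_word.items()}
    single = {w: durations[0] for w, durations in per_word.items() if len(durations) == 1}
    return first, single


# Hypothetical record: word 1 is fixated twice, word 2 is skipped.
first, single = first_and_single_fixation([(0, 210), (1, 180), (1, 150), (3, 240)])
print(first)
print(single)
```

Averaging the values of each dictionary over all words and subjects of a set would give per-set figures of the kind plotted in Figures 2 and 3.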

Figure 2: Average First Fixation Duration (Hindi and Telugu, Set1 to Set5)

Figure 3: Average Single Fixation Duration (Hindi and Telugu, Set1 to Set5)

Sets 2 and 4 are relatively easier than sets 3 and 5, so more words can be identified on the first or on a single fixation. This is the reason for the higher average single fixation duration for sets 2 and 4 compared to sets 3 and 5.



Total Fixation Duration: The sum of all fixations on a word before moving to another word. Figure 4 shows the average total fixation duration.

Gaze Duration: Total fixation duration plus the saccade duration. Figure 5 shows the average gaze duration.

Figure 4: Average Total Fixation Duration (Hindi and Telugu, Set1 to Set5)

Figure 5: Average Gaze Duration (Hindi and Telugu, Set1 to Set5)

Go-Past Time: This gives the total time spent on the word. It includes fixation durations, saccade durations, and regression durations. Figure 6 shows the average go-past time.

Though set 2 is the easiest among the jumbled sets, it is interesting to see that it has the highest average total fixation and gaze durations. There could be two reasons for this. First, in the other jumbled texts, due to their complexity, most of the words are skipped, resulting in lower durations than in set 2. Second, there could be more regressions in the latter sets than in set 2. Go-past time is also higher for set 2. This shows that the



Skipping: This gives information on which words are fixated and which are skipped. Figure 7 shows the average skipping. Sets 2 and 4 are relatively easier than sets 3 and 5, so in the former sets a word could easily be predicted from its initial few syllables. One need not go to the final syllable and spend a lot of time on it to identify the word. Because of this, most of the succeeding words do not fall into the current window, resulting in a fixation on the next word. So the number of words skipped is relatively low. But in the latter sets (3 and 5), as fixations are made up to the last syllable and more time is spent, several succeeding words fall into the current window. This results in fewer fixations on the succeeding words, and as a result more words are skipped.
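As a rough illustration of the skipping measure, the sketch below treats a word as skipped when it never receives a fixation. Both this simplification and the input format (a list of fixated word indices plus the word count) are assumptions; eye-tracking software may define skipping more narrowly, for example on first pass only.

```python
def skipping_rate(fixated_words, n_words):
    """Return the skipped word indices and the fraction of words skipped.

    fixated_words: word indices that received at least one fixation
                   (repeats allowed); n_words: number of words in the text.
    """
    fixated = set(fixated_words)
    skipped = [w for w in range(n_words) if w not in fixated]
    return skipped, len(skipped) / n_words


# Hypothetical five-word text in which words 2 and 4 are never fixated.
print(skipping_rate([0, 1, 1, 3], 5))
```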

Figure 6: Average Go Past Time (Hindi and Telugu, Set1 to Set5)

Figure 7: Average Skipping (Hindi and Telugu, Set1 to Set5)

Regressions (in and out): Sometimes while reading the text, we go back to earlier words. This information is captured by regressions. Figures 8 and 9 show the average regression-in and regression-out values.
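One common way to count regressions from a sequence of fixated words is sketched below. It assumes that a regression is any saccade landing on an earlier word, counted as "out" for the word it leaves and "in" for the word it lands on; this reading of the two measures and the function name are assumptions, not definitions taken from the report.

```python
from collections import Counter


def count_regressions(word_sequence):
    """Count backward saccades per word.

    word_sequence: word indices in the order they were fixated (assumed format).
    Returns (regressions_in, regressions_out) as Counters keyed by word index.
    """
    regressions_in, regressions_out = Counter(), Counter()
    for source, target in zip(word_sequence, word_sequence[1:]):
        if target < source:  # the eyes moved back to an earlier word
            regressions_out[source] += 1
            regressions_in[target] += 1
    return regressions_in, regressions_out


# Hypothetical scan path with two backward moves: 3 -> 2 and 4 -> 1.
reg_in, reg_out = count_regressions([0, 1, 3, 2, 4, 1])
print(dict(reg_in), dict(reg_out))
```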

Figure 8: Average Regression In (Hindi and Telugu, Set1 to Set5)

Figure 9: Average Regression Out (Hindi and Telugu, Set1 to Set5)

Set 2 is the simplest of the four jumbled sets, so subjects read the text quickly. Because of this they made some mistakes while predicting the words, and in such cases they regressed and corrected them. This is the reason for the high regression-in values for set 2 compared to the others. As sets 3 and 5 are the most complex, they are expected to have more regressions out. This holds true for set 3, but for set 5 the value is lower because of more skipping.

4.2.2) Region Based Measures: 

First pass reading time: The sum of all fixations in a region from first entering the region until leaving it, given that the region is fixated at least once. It is the initial reading time, consisting of all forward fixations. For the easiest of the jumbled texts, more words will be recognized at the first reading itself. Hence first pass reading time should be highest for set 2 and lowest for set 5. Figure 10 shows this.

Figure 10: Average first pass reading time (Hindi and Telugu, Set1 to Set5)

Figure 11: Average total reading time (Hindi and Telugu, Set1 to Set5)

Total reading time: The total time taken to read the text. It is the sum of all fixations in a region, covering both forward and regressive movements. Figure 11 shows this.

Reading time per character: Total reading time / no. of characters in the text. This figure says how much time is taken, on average, to read a character. Figures 12a and 12b show the average reading time per syllable and per character respectively.
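The per-character and per-syllable normalisation can be sketched using the text statistics of Table 1. The sketch estimates the number of units as the word count times the average units per word, which is an assumption; the function names are hypothetical, and the reading time passed in the example (the Set1 Hindi average from Table 3) is used only as an illustrative input, not as the fixation-based total behind Figures 12a and 12b.

```python
# Table 1 statistics: (words, avg syllables per word, avg characters per word).
TEXT_STATS = {
    "Hindi": (169, 2.73, 4.92),
    "Telugu": (103, 3.95, 7.81),
}


def reading_time_per_unit(total_time_s, n_words, avg_units_per_word):
    """Total reading time divided by the estimated number of units in the text."""
    return total_time_s / (n_words * avg_units_per_word)


def per_syllable_and_character(language, total_time_s):
    """Return (seconds per syllable, seconds per character) for one reading."""
    n_words, avg_syllables, avg_characters = TEXT_STATS[language]
    return (
        reading_time_per_unit(total_time_s, n_words, avg_syllables),
        reading_time_per_unit(total_time_s, n_words, avg_characters),
    )


# Illustrative input only: the average Set1 reading time for Hindi from Table 3.
per_syllable, per_character = per_syllable_and_character("Hindi", 129.7)
print(per_syllable, per_character)
```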

Figure 12a: Average reading time per syllable (Hindi and Telugu, Set1 to Set5)

Figure 12b: Average reading time per character (Hindi and Telugu, Set1 to Set5)

These results show that, for intra-syllable jumbling (sets 2 and 3), vowel jumbling (set 2) is easier to predict than consonant jumbling (set 3). For inter-syllable jumbling (sets 4 and 5), short-distance jumbling (set 4) is easier to predict than long-distance jumbling (set 5). Of all four kinds of jumbling (sets 2 to 5), inter-syllable short-distance jumbling (set 4) is the easiest to predict and intra-syllable consonant jumbling (set 3) is the hardest. The same holds for reading time.

Language Specific Observations: Indian languages like Telugu and Hindi diverge from English at various levels. One such level is orthographic. The Devanagari script and the Telugu script, because of their characteristics, allow for different types of jumbling. In the experiments above we jumbled only vowels (Set2), only pure consonants (Set3), and syllables (Sets 4 and 5). We also tried to jumble the words at the phonemic level (Set0), but had to discard this set because the result made no sense. We saw that short-distance syllable jumbling lends itself to easier reading than other kinds of jumbling. For very long words the subjects tend to guess at the morpheme level and try out various combinations.

Due to the agglutinative nature of Telugu, its average word length is greater than that of English. This is reflected in the average time taken to read a word, which is greater than for English: the reading rate is 3 words per second (wps) for English and 1.3 wps for Telugu.

Also relevant is the basic syntactic structure of Telugu and Hindi. Unlike English, these are SOV languages. In both languages the verb agrees (in gender, number and person) with the subject of the sentence, and this information appears on the verb as suffixes or auxiliaries. Note that the distance between the verb and the subject will be greater than in English. We observed earlier that this information is generally skipped while reading. This means that, in spite of the long distance, the subjects are able to retain this information when they encounter the main verb of the sentence. We still need to explore the upper limit of this distance to see how it is related to the subject's retention memory.

Both Telugu and Hindi make extensive use of postpositions and suffixes to mark the grammatical/thematic roles of nouns in a sentence. Like the agreement features, these function words are also frequently skipped by the subjects during saccades. What is interesting about this characteristic is that some of these postpositions/suffixes help in predicting the lexical choice and morphological properties of the verb.

Once a word had been guessed (correctly or incorrectly), the subjects were faster at making their prediction when that word was repeated. Interestingly, if the same word ended up being jumbled in different ways in the same text, the subjects were mostly able to identify it as the same word. Also, the subjects usually stick to their first guess if that guess seems to fit the present context. Only if their first prediction does not seem right (based on the type of context described earlier) do they make a second guess.

While reading the text, the subjects almost consistently skip the function words, i.e. do not fixate on them. The reading also involves different kinds of operations on words, namely merging, addition, and substitution. Below we give some examples of these operations:

Telugu, addition: అంతకదు (not only) when jumbled as ‘అంతకద’ read as ‘అంతకముంద’ (before hand). The ‘ముం’ was added.

Telugu, merging: ష పదహను (Shah 15) when jumbled as ‘ష పదహన’ read as ‘షజహను’ (Shah Jahan).

Hindi, substitution: , (comma) read as ‘और’ (and).

Hindi, merging: ‘जाना जाता’ (is believed) read as ‘जानता’ (knows).

Proof/Disproof of Hypothesis

A. The human mind does not read every letter by itself but the word as a whole, even in Indian languages. This hypothesis holds true in Indian languages as well for familiar words. For unfamiliar words it is not completely true: depending on the complexity of the word, reading was sometimes done letter by letter or syllable by syllable.

B. Do we jump while reading a text with jumbled intermediate letters in a word? We cannot say conclusively. Jumping depends on familiarity: if the word is familiar we jump, but if it is unfamiliar we go akshara by akshara.

C. The nature of jumbling and the length of the word affect the reading time and the errors made. Yes. Our results support this.

Future work: With the insights from this experiment we are planning to do the following experiments in future.

- Experimenting with misspelled words instead of jumbled words.
- Experimenting with misspelled words in place of skipped/easily guessed words.
- Exploring factors affecting agreement in Indian languages.
- Exploring cognitive aspects of grammar and their impact on parsing.

Conclusion: In this work we tried out various reading experiments where texts from two Indian languages were jumbled. It was observed that in spite of the jumbling, subjects consistently read the text correctly, although the time taken by them to read different texts varied. Several interesting results came to the fore, and some of them were because of the characteristics of Indian languages. We plan to perform further experiments to elaborate and consolidate our observations and results in more detail.

Acknowledgements: I would like to thank Anupama Gali for her invaluable help in extracting the .xls files using data viewer software. The entire parameter-based analysis was done using these .xls files. Without her it would have been very difficult to extract the .xls files and hence to carry out the further analysis.


Appendix:

A) Hindi: hin1, hin2, hin3, hin4, hin5

B) Telugu: tel1, tel2, tel3, tel4, tel5
