Powerful Tea and Strong Computers: What do the Egyptians Have to Say?
Emad Soliman Mohamed

I- Introduction The current paper investigates the possible differences between two varieties of English: American English and Egyptian English. While English is not widely spoken in Egypt, there exist some newspapers and publications whose main language is English, and they enjoy wide circulation. The study focuses on measurable differences in pursuit of an answer to the question: Is the English used by the Egyptians different from that used by the Americans? And if so, how different is it?

II- Hypothesis of the Study The study hypothesizes that Egyptian English is different from American English. This general statement can be broken down into the following subhypotheses: (1) Egyptians, as non-native speakers of English, have a smaller vocabulary size than Americans. (2) Egyptians tend to use more formal words than Americans do. (3) Egyptians tend to avoid structures that do not exist in their mother tongue.

III- Methods In order to examine the hypothesis above, I have collected two mini-corpora: a Newsweek corpus and a Weekly corpus. Newsweek is an American magazine that is taken to represent American English, and the Weekly is a weekly English-language newspaper issued in Egypt by the Al-Ahram foundation, which also issues Al-Ahram, the oldest and biggest newspaper in Egypt. Each mini-corpus contains approximately 210,000 words of English. Both corpora were pre-processed: the text was converted to lower case and punctuation was removed. In order to guarantee that the English is really representative of the varieties in question, I have tried to select only writers whose native language is not English, in the Weekly case, and writers whose native language is English, in the Newsweek case. This was not much of a problem in selecting from the Weekly, as I did not include the writings of anyone with a non-Egyptian name. This is especially important as the Weekly has contributors from around the globe. The selection was more difficult for the Newsweek, as American names display all kinds of variation.

I used a Python program which I wrote especially for this task to extract the information needed. All the statistics were carried out by SPSS. The Python code and the two minicorpora are available upon request.

IV- Definition of terms: In the following section I present the important terms in this study and try to define them in light of the literature available. The definitions are in order of the hypotheses above.

IV.1. Lexical Richness Vocabulary richness measures the number of distinct words (types) in a corpus of N word tokens. The Stanford Encyclopedia of Philosophy maintains that ‘The distinction between a type and its tokens is an ontological one between a general sort of thing and its particular concrete instances’ 1 . Vocabulary richness can be used in text classification (Wimmer and Altmann 1999), in sociolinguistic studies, e.g. of the vocabulary of lower class versus upper class speakers (Sankoff and Lessard, 1975), and in authorship attribution (Hoover 2003). It can also be used for judging text quality, as Engber 2 (1993) and Linnarud 3 (1986) found a substantial correlation between lexical variation and a holistic measure of quality. The question I am interested in answering by comparing the type/token ratio of Americans and Egyptians is whether Egyptians have attained the same level of knowledge of the English language that Americans have. A further question is whether Egyptian journalists use different types from those used by their American counterparts. For this purpose I will measure the intersection as well as the difference of the types.

1 http://plato.stanford.edu/entries/types-tokens
2 in Laufer and Nation (1995)
3 in Laufer and Nation (1995)

Although the type token ratio (TTR) “is known to be very sensitive to text length – as a text gets longer, new word types are introduced at a slower rate” (Grieve, 2007), I have tried to make TTR more useful by limiting the corpus size in both cases to 210,000 words and selecting texts that contain very similar information; in this case the corpora are taken from the political sections, with the Middle East being the main theme. Testing the intersection of types may also give an indication of type similarity between both newspapers. This does not, however, solve the two problems Laufer and Nation (1995) pointed out about the use of TTR in the context of language learning: (1) TTR is dependent on the definition of a word, “a learner who used many derived forms of a few families would not be distinguished from a learner who used a lot of different families”, and (2) TTR does not distinguish between high frequency types and low frequency types. Other tests, besides TTR, can be used to measure the lexical richness of a text. Wimmer and Altmann (1999) review the theory behind those tests and what the values of the test indices mean in terms of the correlation between language development in children and lexical richness, the classification of texts based on lexical richness, and whether there is a correlation between lexical richness and intelligence. In their discussion of the different indices (e.g. Chi-square, entropy, repeat rate, and variation coefficient), Wimmer and Altmann (1999) maintain that the value of the test means nothing by itself, although it can rank texts in comparison: “The value of the index is not worth more than an ordinal number showing the rank of the text in a group of others.”
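The TTR described above can be illustrated with a minimal sketch. The function names and the tokenization rule are assumptions made for illustration only; they are not taken from the program used in this study.

```python
import re

def tokenize(text):
    """Lowercase the text and keep alphabetic runs (an assumed, simplified rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def type_token_ratio(tokens):
    """Number of distinct types divided by the number of tokens."""
    if not tokens:
        raise ValueError("cannot compute TTR of an empty token list")
    return len(set(tokens)) / len(tokens)

# Invented sample sentence, not drawn from either corpus.
sample = "The tea is strong. The computer is powerful, and the tea is hot."
tokens = tokenize(sample)
ttr = type_token_ratio(tokens)
print(len(tokens), len(set(tokens)), round(ttr, 4))
```

Because TTR falls as texts get longer, a comparison of this kind is only meaningful when both token lists are of roughly equal length, which is why the two corpora were limited to about 210,000 words each.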

Entropy is another test used for lexical richness. In using entropy for diversity, the point is that higher entropy suggests more uncertainty in the text, which might indicate a less formulaic style, whereas lower entropy might indicate a rather formulaic, or less diverse, style. According to Oakes (1998: 58), “if a situation is wholly organized, not characterized by a high degree of randomness or choice, the information or entropy is low”. It can be seen that entropy is conceptually related to lexical richness and diversity, but Thoiron (1986) advises against using it for this particular purpose because it is inconsistent. Thoiron ran experiments to test both entropy and Simpson’s index, but found them both defective as they had many blemishes when applied to lexical diversity. For the purpose of this paper, I will use TTR as a measure of lexical richness in the hope that controlling the genre and the text length will make it more reliable.

IV.2. Formality Level To test whether Egyptian journalists use more formal words, I have decided to use word length as an indicator of formality. Wright and Hope (2007:208) and Kroll (1990:198) maintain that formal words tend to be longer since they are of Latin origin, while informal words are shorter due to their Anglo-Saxon origin. Word length has, inter alia, been used in a related way for document ranking in information retrieval tasks (Braslavski and Tselischev 4 ). A related measure, the Flesch Reading Ease test 5 , uses the number of syllables per word to measure the readability level of a text. A more reliable option would be to compare the mini-corpus with an annotated corpus that has formality tags, but such a corpus unfortunately does not seem to exist, and even if it did, a model would be needed to disambiguate the formality tags, as one word may be tagged as formal, informal, or possibly neutral. Francis Heylighen and Jean-Marc Dewaele 6 did, however, use another corpus-based measure to test for the formality of language. The test is based on the assumption that certain word categories are more common in formal language while others are more common in informal language. For example, “nouns, adjectives, articles and prepositions are more frequent in formal styles; pronouns, adverbs, verbs and interjections are more frequent in informal styles.” Heylighen and Dewaele define an F-score as a measure of formality which subtracts the frequencies of the informal categories from those of the formal ones. The higher the score, the more formal the text.
The F measure is calculated as follows: F = (noun frequency + adjective freq. + preposition freq. + article freq. – pronoun freq. – verb freq. – adverb freq. – interjection freq. + 100)/2
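The formula above can be written out as a short sketch. It assumes that each category frequency is expressed as a percentage of all tokens in the text, which is what makes the result fall between 0 and 100; the function name and the example figures are hypothetical and invented purely for illustration.

```python
def f_score(freq):
    """Formality score from POS category frequencies (assumed to be percentages)."""
    formal = ("noun", "adjective", "preposition", "article")
    informal = ("pronoun", "verb", "adverb", "interjection")
    plus = sum(freq.get(category, 0.0) for category in formal)
    minus = sum(freq.get(category, 0.0) for category in informal)
    return (plus - minus + 100) / 2

# Hypothetical percentages, not measured on any corpus.
example = {
    "noun": 25.0, "adjective": 8.0, "preposition": 12.0, "article": 9.0,
    "pronoun": 6.0, "verb": 17.0, "adverb": 5.0, "interjection": 0.0,
}
score = f_score(example)
print(score)
```

Under this assumption, a text in which the formal and informal categories are equally frequent scores 50, so values above 50 lean formal and values below 50 lean informal.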

4 WWW document retrieved from the authors’ websites on 11/07/2007: http://www.rcdl2005.uniyar.ac.ru/ru/RCDL2005/papers/sek7_1_paper.pdf
5 http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test
6 WWW document retrieved from the authors’ websites on 11/1/2007: http://pcp.lanl.gov/Papers/Formality.pdf

The authors report successes with Dutch, English, French and Italian. The measure seems to reduce formality to mere frequencies, which the authors admit, but they claim that there is a correlation between word categories and the level of formality. I believe we need to know how many verbs, for example, are intrinsically formal, or informal, and what the contexts of this formality might be. I will not, however, use the F-measure for measuring formality in the Egyptian-American contrast, because the corpora I am working with are no more than plain text, while the test requires a POS-tagged corpus.

V. Analysis and Discussion V.1. Lexical Richness The Weekly corpus has 210375 word tokens and 18218 unique word types with a TTR of 0.0894, while the Newsweek corpus has 209214 word tokens and 20235 unique word types with a TTR of 0.0967. One might conclude here that Americans have a richer vocabulary than the Egyptians, but this would, in fact, be an oversimplification. It turns out that there are 8860 word types in the Weekly corpus that are not in the Newsweek corpus, and 10877 word types in the Newsweek corpus that are not in the Weekly corpus. Since the number of types in the Newsweek corpus is greater than the number of types in the Weekly corpus, and the set difference Newsweek – Weekly is larger than Weekly – Newsweek, one can informally reject the null hypothesis that Egyptians have a vocabulary size that is larger than or equal to that of the Americans. The word informally is important here since this is only a pilot study.
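The intersection and difference of types can be obtained with ordinary set operations. The sketch below uses a hypothetical helper name and a toy example; it also checks that the type counts reported above are mutually consistent, since subtracting each corpus's exclusive types from its total must give the same number of shared types.

```python
def compare_types(types_a, types_b):
    """Count shared types and the types exclusive to each corpus."""
    a, b = set(types_a), set(types_b)
    return {"shared": len(a & b), "only_a": len(a - b), "only_b": len(b - a)}

# Toy example with invented word lists.
toy = compare_types(["tea", "strong", "powerful"], ["tea", "computer"])

# Consistency check on the figures reported in the text.
weekly_types, weekly_only = 18218, 8860
newsweek_types, newsweek_only = 20235, 10877
shared_from_weekly = weekly_types - weekly_only
shared_from_newsweek = newsweek_types - newsweek_only
print(toy, shared_from_weekly, shared_from_newsweek)
```

Both subtractions give 9358 shared types, so roughly half of the types in each corpus also occur in the other.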

V.2. Formality Level We hypothesized earlier that Egyptians use more formal words than Americans do. I have used the ANOVA test to find out whether there is a difference between word lengths in the Newsweek and the Weekly. The average word length in the Newsweek is 4.88 letters with a standard deviation of 2.69. The average word length in the Weekly is 5.00 letters with a standard deviation of 2.76. I obtained a significant F score of 192.94 (p < .001), which means that the difference between the two means is statistically significant: words in the Weekly are, on average, longer than words in the Newsweek. The results support the hypothesis that Egyptians use longer words than Americans do, and hence, taking word length as an indicator of formality, use more formal words than Americans do. The following two tables summarize the results:

ANOVA (freq)

                 Sum of Squares   df       Mean Square   F         Sig.
Between Groups   1435.714         1        1435.714      192.941   .000
Within Groups    3122227          419587   7.441
Total            3123663          419588

Table 1: Results of the ANOVA test for word length

Descriptives (freq)

                                                          95% Confidence Interval for Mean
           N        Mean   Std. Deviation   Std. Error   Lower Bound   Upper Bound   Minimum   Maximum
Weekly     210375   5.00   2.763            .006         4.98          5.01          1         28
Newsweek   209214   4.88   2.692            .006         4.87          4.89          1         34
Total      419589   4.94   2.728            .004         4.93          4.95          1         34

Table 2: Descriptive Statistics about word length
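The statistics in this study were computed with SPSS. As an illustration of what the F value in Table 1 represents, the following is a plain-Python sketch of a one-way ANOVA; the function name and the toy data are assumptions for illustration only.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    means = [sum(g) / len(g) for g in groups]
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Toy word-length samples, invented for illustration.
toy_f, toy_df_b, toy_df_w = one_way_anova([[1, 2, 3], [2, 3, 4]])

# F is the ratio of the two mean squares; using the rounded values in Table 1:
f_from_table = 1435.714 / 7.441
print(toy_f, round(f_from_table, 1))
```

With real data, each group would be the list of word lengths of all tokens in one corpus. The ratio of the rounded mean squares in Table 1 agrees with the reported F to within rounding.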

Although we cannot go through all 419589 tokens to see which words Egyptians tend to use, we can nonetheless look at some examples. In my search for specific structures I noticed that Egyptians tend to use longer, or multiword, subordinators. For example, notwithstanding is used only once in the Newsweek versus 5 times in the Weekly, nevertheless is used once in the Newsweek versus 15 times in the Weekly, nonetheless 4 times versus 17, and insofar as is not used at all in the Newsweek but occurs twice in the Weekly corpus. Overall, subordinators occur 3340 times in the Newsweek versus 3015 times in the Weekly 7 .

7 The counts are as follows:
Newsweek: {'so that': 19, 'because': 302, 'unless': 19, 'in so far as': 0, 'though': 203, 'whenever': 6, 'wherever': 4, 'whereas': 2, 'since': 144, 'when': 574, 'nonetheless': 4, 'nevertheless': 1, 'while': 158, 'whether': 101, 'although': 42, 'notwithstanding': 1, 'insofar as': 0, 'if': 1320, 'after': 309, 'before': 173} There are 3340 subordinators in the text
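Counts of this kind can be produced with a short function. The following is a sketch under my own assumptions (the function name, the sample sentence, and the use of word-boundary matching are illustrative and do not reproduce the program used in this study).

```python
import re

def count_phrases(text, phrases):
    """Count whole-word occurrences of single-word or multiword phrases."""
    text = text.lower()
    counts = {}
    for phrase in phrases:
        # Allow any whitespace between the words of a multiword phrase.
        body = r"\s+".join(re.escape(word) for word in phrase.split())
        counts[phrase] = len(re.findall(r"\b" + body + r"\b", text))
    return counts

# Invented sample text, not drawn from either corpus.
sample = ("Nevertheless, the talks went on, although insofar as anyone could "
          "tell nothing changed because nobody wanted it to. Though tired, they stayed.")
counts = count_phrases(sample, ["nevertheless", "although", "though", "insofar as", "because"])
print(counts)
```

Matching on word boundaries is a design choice: it keeps though from also being counted inside although, which a plain substring search would do.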


Another thing worth noting, which might contribute to the study of formality level, is the use of contracted forms, which are generally accepted to be “primarily a feature of spoken, informal English” (Axelsson, 1995). Contracted forms are forms like “he’s” and “she’s”, in which, for example, “he is” is contracted. Contracted forms are well represented in the Newsweek corpus while almost non-existent in the Weekly corpus. The near absence of this structure lends support to the hypothesis that American writers employ more informal words than Egyptians do. The forms chosen are: "he’s", "he’ll", "he’d", "she’s", "she’ll", "she’d", "I’m", "I’d", "you’ll", "you’re", "you’d", "we’ll", "we’re", "we’d", "it’s", "it’ll", "it’d", "they’ll", "they’d", "they’re", "someone’s", "somebody’s", "everyone’s", "everybody’s", "who’s", "who’ll", "where’s", "when’s", "what’s", "what’ll", "why’s", "here's", "there's", with their respective counts in the Newsweek corpus being: [86, 6, 10, 45, 7, 9, 95, 9, 16, 61, 11, 8, 85, 1, 311, 0, 0, 13, 8, 85, 3, 0, 0, 0, 8, 0, 3, 0, 33, 0, 0, 17, 87]. This shows that those contracted forms occur 1017 times in the Newsweek corpus, with it’s being the most common one (311 occurrences), while they occur only 153 times in the Weekly corpus, again with it’s being the most common one with a count of 67. The respective counts in the Weekly corpus are as follows: [6, 0, 3, 0, 0, 0, 15, 0, 5, 10, 1, 4, 21, 2, 67, 0, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 1, 0, 7, 0, 0, 0, 7]

We find the same phenomenon and a similar distribution with contracted negation forms, e.g. shouldn’t. The contracted negation forms occur 714 times in the Newsweek and only 237 times in the Weekly corpus. The non-contracted negative word “not” occurs 923 times in the Newsweek and 939 times in the Weekly.
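The per-form counts of the pronoun and wh- contractions reported above can be paired with their forms and tallied as follows. The data are copied from the lists in the text; the helper name is hypothetical.

```python
FORMS = [
    "he's", "he'll", "he'd", "she's", "she'll", "she'd", "I'm", "I'd",
    "you'll", "you're", "you'd", "we'll", "we're", "we'd", "it's", "it'll",
    "it'd", "they'll", "they'd", "they're", "someone's", "somebody's",
    "everyone's", "everybody's", "who's", "who'll", "where's", "when's",
    "what's", "what'll", "why's", "here's", "there's",
]
NEWSWEEK = [86, 6, 10, 45, 7, 9, 95, 9, 16, 61, 11, 8, 85, 1, 311, 0, 0,
            13, 8, 85, 3, 0, 0, 0, 8, 0, 3, 0, 33, 0, 0, 17, 87]
WEEKLY = [6, 0, 3, 0, 0, 0, 15, 0, 5, 10, 1, 4, 21, 2, 67, 0, 0,
          0, 0, 2, 0, 0, 1, 0, 1, 0, 1, 0, 7, 0, 0, 0, 7]

def summarize(forms, counts):
    """Return the total count and the most frequent form with its count."""
    top_form, top_count = max(zip(forms, counts), key=lambda pair: pair[1])
    return sum(counts), top_form, top_count

newsweek_summary = summarize(FORMS, NEWSWEEK)
weekly_summary = summarize(FORMS, WEEKLY)
print(newsweek_summary, weekly_summary)
```

The totals match the figures given in the text: 1017 for the Newsweek and 153 for the Weekly, with it's the most frequent form in both.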

This has led me to calculate the correlation between word length and word frequency in both mini-corpora in order to see whether they are different or similar. The correlation in the Newsweek is -.1137 compared to -.107 in the Weekly. Both correlations are significant and negative, which means that longer words tend to be less frequent, with the correlation being a little stronger in the Newsweek. The graphs do, however, show that the Weekly has much more inconsistency than its American counterpart. While there are very few outliers in the Newsweek, the Weekly has many. This might give some evidence that non-native English may be different from native English on distributional grounds, as the Newsweek corpus seems to adhere more closely to Zipf’s law.

7 (continued) Weekly: {'so that': 23, 'because': 188, 'unless': 25, 'in so far as': 0, 'though': 180, 'whenever': 6, 'wherever': 2, 'whereas': 10, 'since': 192, 'when': 358, 'nonetheless': 17, 'nevertheless': 15, 'while': 287, 'whether': 81, 'although': 52, 'notwithstanding': 5, 'insofar as': 2, 'if': 1162, 'after': 296, 'before': 166} There are 3015 subordinators in the text

Since correlation is susceptible to outliers, function words were not included in the correlation formula.

[Chart 1: scatter plot of word length (Len, axis range 0 to 40) against word frequency (Freq, axis range 0 to 600)]

Chart 1: Correlation between word length (Len) and word frequency (Freq) in the Newsweek Corpus.

[Chart 2: scatter plot of word length (wLen, axis range 0 to 30) against word frequency (wFreq, axis range 0 to 500)]

Chart 2: Correlation between word length (wLen) and word frequency (wFreq) in the Weekly Corpus.
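A correlation between word length and word frequency of the kind plotted above can be computed per word type as in the following sketch. The correlations in this study were computed with SPSS; the function names, the stopword handling, and the toy token list here are assumptions for illustration.

```python
from collections import Counter
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def length_frequency_correlation(tokens, stopwords=frozenset()):
    """Correlate each type's length with its frequency, skipping stopwords."""
    freq = Counter(t for t in tokens if t not in stopwords)
    lengths = [len(word) for word in freq]
    counts = [freq[word] for word in freq]
    return pearson(lengths, counts)

# Invented toy token list: short words are frequent, long words are rare.
toy_tokens = "the the the the of of of tea tea computer notwithstanding".split()
r = length_frequency_correlation(toy_tokens)
print(round(r, 3))
```

Passing a set of function words as stopwords corresponds to the exclusion of function words mentioned above.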

V.3. Structural Differences In this section we examine some structures of English that have no equivalent in Arabic, the Weekly journalists’ mother tongue, and see whether the Egyptians use them as much as the Americans do. The section is motivated more by experience as a learner of English, and later as a teacher as well, than by any theoretical grounds.

V.3.1. The modal + have + past participle structure: Since the two mini-corpora are not annotated, I extracted the information by defining a function that does the job. First I shortlisted (all) the irregular past participles, the modals and the negatives. The function then has the job of finding (a) a modal + have + past participle, or (b) a modal + negative + have + past participle. Since been is in the shortlist of irregular past participles, the structure in He may have been doing something is also captured. The program, with proper documentation, is available upon request. The structure occurs 111 times in the Newsweek corpus, with 61 instances of usage with irregular verbs, which may be explained by the fact that most irregular verbs are high frequency ones. The structure occurs 85 times in the Weekly corpus, with 49 instances of usage with irregular verbs. The fact that Egyptians use the structure less frequently than Americans do may be attributed to the non-existence of a similar structure in Arabic, in the same way that phrasal verbs are avoided by Chinese learners but not by Dutch learners, since they occur in Dutch but not in Chinese (Liao and Fukuya, 2004).
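The extraction described in V.3.1 can be approximated with a regular expression. This is a sketch and not the program used in the study: the word lists are deliberately abbreviated, and treating any word ending in -ed as a regular past participle is my own assumption.

```python
import re

# Abbreviated, illustrative lists; a full run would need complete shortlists.
MODALS = ["may", "might", "can", "could", "shall", "should", "will", "would", "must"]
NEGATIVES = ["not", "never"]
IRREGULAR_PP = ["been", "gone", "done", "seen", "taken", "known", "made", "said"]

modal = "|".join(MODALS)
neg = "|".join(NEGATIVES)
irregular = "|".join(IRREGULAR_PP)

# modal + optional negative + have + (irregular participle or word ending in -ed)
PATTERN = re.compile(rf"\b(?:{modal})\s+(?:(?:{neg})\s+)?have\s+({irregular}|\w+ed)\b")

def find_modal_perfects(text):
    """Return the participle of every modal + (negative) + have + participle match."""
    return [m.group(1) for m in PATTERN.finditer(text.lower())]

# Invented sample text.
sample = ("He may have been doing something. They should not have ignored it. "
          "She could have gone home. We have seen it.")
hits = find_modal_perfects(sample)
irregular_hits = [p for p in hits if p in IRREGULAR_PP]
print(hits, len(irregular_hits))
```

The -ed heuristic will also match non-participles such as need, so counts obtained this way on plain text are better read as estimates than as exact figures.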

V.3.2. The by-passive structure The by-passive does not occur in Arabic, as the concept of the passive in Arabic implies that the agent is not known. In fact, the Arabic term can be translated as ‘the verb built for the unknown agent’. Although the passive structure is not difficult per se, it presents a new concept that may be difficult for some Arabic-speaking writers to adopt. In the Newsweek corpus the by-passive structure occurs 405 times, 355 of which are formed with regular verbs and 50 with irregular verbs, while in the Weekly corpus the by-passive structure occurs 620 times, 543 of which are formed with regular verbs while the rest, 77, are formed with irregular verbs. This might be explained by the fact that the passive in English is easy to form, and that the more advanced learners get, the more inclined they are to use target language forms. 8 This is not the case with the modal structure, since it is inherently hard and multi-part.

V.3.3. Pied-piping The Newsweek journalists make use of the pied-piping structure only 94 times, 14 of which are with ‘whom’. The Weekly journalists, on the other hand, use the structure 224 times with ‘which’ and 30 times with ‘whom’. The numbers show that Egyptian English is substantially different from American English in this respect. This further stresses the point that Egyptian English is more formal than American English, since the use of the structure is more common in the written genre (Razai, 2006).
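One way to count pied-piping in plain text is to look for a preposition immediately followed by which or whom. The sketch below makes that assumption explicit; the preposition list is abbreviated and hypothetical, and the paper does not state that the counts above were obtained in exactly this way.

```python
import re

# Abbreviated, illustrative preposition list.
PREPOSITIONS = ["in", "on", "at", "by", "for", "from", "of", "to", "with",
                "through", "under", "about"]

preps = "|".join(PREPOSITIONS)
PATTERN = re.compile(rf"\b({preps})\s+(which|whom)\b")

def count_pied_piping(text):
    """Count preposition + which / preposition + whom sequences."""
    counts = {"which": 0, "whom": 0}
    for match in PATTERN.finditer(text.lower()):
        counts[match.group(2)] += 1
    return counts

# Invented sample: two pied-piped relatives and one stranded preposition.
sample = ("The house in which he lived was sold to the man to whom she spoke. "
          "Which house did he live in?")
pied = count_pied_piping(sample)
print(pied)
```

The stranded preposition in the final sentence of the sample is not counted, which is the contrast the pied-piping measure is meant to capture.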

8 I recall that when I started learning English in Egypt, the instructors focused on the passive structure because, they said, the English language makes use of the passive a lot.

V.3.4. Double-negation For the purpose of this paper, double negation is the case in which a negator (no, none, not, never, nobody, hardly, scarcely) is followed by a negated adjective, i.e. an adjective starting with un-, im-, or ir-, with the adjective following the negator either directly or with a word intervening. This captures phrases like not uncommon and not really uncommon. It turns out that there are only two instances of negator + un- in the Newsweek: not unknown and not necessarily unhappy, one instance of negator + im-: not impossible to obtain, and no instances of negator + ir-. In the Weekly the negator + un- occurs 4 times: not unlike (1) and no uncertain (3), but the negator does not occur with either im- or ir-. The numbers show that the structure is rare in writing in general. It may be a spoken phenomenon.

V.3.5. Too….to In the Newsweek the too … to structure occurs 17 times versus 15 times in the Weekly. There does not seem to be a difference here.

VI. Conclusion The study has shown that there exist some lexical and structural differences between the English used by the Egyptians and American English. This information could be used, for example, in designing syllabi for language education where the focus is on writing like a native speaker of the language. The study does nonetheless suffer from limitations that I intend to remedy in future research. For example, depending on plain text has limited the range of structures that could be extracted. I intend to tag the corpus next time with a tagger like TnT (Brants, 2000) in order to get more valuable information. Also, the small size of the corpus might have been an obstacle to finding more representative lexical and syntactic differences. This can be solved by adding more texts from varying genres.


References

Axelsson, M. W. (1995). Contracted forms in newspaper language: Inter- and intratextual variation. ICAME Journal, No. 20.
Braslavski, Pavel and Tselischev, Andrey. Experiment on Style-Dependent Document Ranking. WWW document retrieved on 10/25/07. http://www.rcdl2005.uniyar.ac.ru/ru/RCDL2005/papers/sek7_1_paper.pdf
Grieve, Jack (2007). Quantitative Authorship Attribution: An Evaluation of Techniques. Literary and Linguistic Computing, Vol. 22, No. 3.
Heylighen, Francis and Dewaele, Jean-Marc. Formality of Language: definition, measurement and behavioral determinants. WWW document retrieved 10/23/2007. http://pespmc1.vub.ac.be/Papers/Formality.pdf
Hoover, L. (2003). Multivariate Analysis and the Study of Style Variation. Literary and Linguistic Computing, Vol. 18, No. 4.
Kroll, Barbara (1990). Second Language Writing: Research Insights for the Classroom. Cambridge University Press.
Laufer, Batia and Nation, Paul (1995). Vocabulary Size and Use: Lexical Richness in L2 Written Production. Applied Linguistics, Vol. 16, No. 3. Oxford University Press.
Liao, Yan and Fukuya, Yoshinori J. (2004). Avoidance of Phrasal Verbs: The Case of Chinese Learners of English. Language Learning, Vol. 54, No. 2, pp. 193-226.
Manning, Christopher and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. The MIT Press.
Oakes, Michael P. (1998). Statistics for Corpus Linguistics (Edinburgh Textbooks in Empirical Linguistics). Edinburgh University Press.
Razai, M. J. (2006). Preposition Stranding and Pied-Piping in Second Language Acquisition. Essex Graduate Student Papers in Language & Linguistics, Vol. VIII.
Sankoff, D. and Lessard, Réjean (1975). Vocabulary Richness: A Sociolinguistic Analysis. Science, New Series, Vol. 190, No. 4215 (Nov. 14, 1975), pp. 689-690.
Thoiron, Philippe (1986). Computers and the Humanities, Vol. 20.
Wimmer, Gejza and Altmann, Gabriel (1999). Review Article: On Vocabulary Richness. Journal of Quantitative Linguistics, Vol. 6, No. 1, pp. 1-9.
Wright, L. and Hope, J. (2007). Stylistics: A Practical Coursebook. Taylor & Francis, first edition.
