anglia ruskin university a study into the use of linguistic ...

Viewer
Transcript

ANGLIA RUSKIN UNIVERSITY A STUDY INTO THE USE OF LINGUISTIC STRUCTURES USED INTER-GENDER AND INTRA-GENDER IN THE TV SHOW ‘FRIENDS’ DAVID AYLIFFE A dissertation in partial fulfilment of the requirements of Anglia Ruskin University for the degree of Master of Arts in Applied Linguistics and TESOL

Submitted: July 2011

Anglia Ruskin University MA Applied Linguistics with TESOL

Acknowledgements It is with gratitude that I would like to thank Dr. Sebastian Rasinger and the rest of the teaching staff at Anglia Ruskin University. This project was supervised by the aforementioned Dr. Rasinger and his advice and support have been invaluable.

I am indebted to the fans of the TV show ‘Friends’ who faithfully and painstakingly transcribed the entire catalogue of episodes. Their accurate transcription considerably accelerated the project and gave me much more time on performing a detailed analysis of the results than would have otherwise been possible. Undoubtedly without these transcribed episodes the project would not have happened.

i

Anglia Ruskin University MA Applied Linguistics with TESOL

ANGLIA RUSKIN UNIVERSITY ABSTRACT FACULTY OF ARTS, LAW AND SOCIAL SCIENCES MASTER OF ARTS A STUDY INTO THE USE OF LINGUISTIC STRUCTURES USED INTER-GENDER AND INTRA-GENDER IN THE TV SHOW ‘FRIENDS’ By David Ayliffe July 2011

In this thesis I present a corpus-based analysis of the use of various linguistic structures and a comparison of their use in inter-gender and intra-gender conversation environments in the TV show ‘Friends’. There is a great body of research which indicates that female language is more polite than males and the overarching conceptual aim of this project was to validate or invalidate that hypothesis. The results of this study have revealed a number of interesting results; the most prominent of which is not that one genders linguistic style is different than the others but that both men and women align themselves closely depending on the company. In other words the styles both men and women employ differ greatly depending on whether or not they are talking to one of their own or inter-gender. There was no support that women use more ‘women’s language’ and the view that the stereotypes of women being linguistically restricted was not upheld, although it was shown that while men generally have a broader vocabulary and use it more creatively. Despite this, from the use of intensifiers to taboo words women were found to be a country mile ahead of masculine counterparts, while in other features, from empty adjectives to the number of questions asked, colours to hedges, both genders were found to operate on remarkably similar levels.

ii

Anglia Ruskin University MA Applied Linguistics with TESOL

TABLE OF CONTENTS 1.

INTRODUCING THIS STUDY........................................................................................... 1 1.1

THE RESEARCH QUESTION ............................................................................................ 1

1.2

DEFINING POLITENESS ................................................................................................ 2

1.2.1 What is polite language?.................................................................................... 3

2.

1.3

THE NEED FOR A CORPUS -BASED STUDY OF LANGUAGE AND GENDER IN US TV .......................... 4

1.4

THE APPROACH TAKEN TO THIS STUDY............................................................................. 7

BACKGROUND INVESTIGATIONS RELEVANT TO THIS STUDY .......................................... 8 2.1

DISCOURSE-RELATED LITERATURE .................................................................................. 8 Birmingham Model ................................................................................................... 9 Systemic Functional Linguistics ................................................................................ 10

2.1.1 Speech transcription ........................................................................................ 11 2.2

SEX AND GENDER .................................................................................................... 14

2.2.1 Politeness........................................................................................................ 17 2.3 3.

CORPUS RELATED LITERATURE..................................................................................... 22

BUILDING AND USING THE CORPUS ............................................................................ 31 3.1

CREATING THE CORPUS ............................................................................................. 31

3.1.1 The database................................................................................................... 33 3.1.2 ‘Tidying’ the data............................................................................................. 34 3.1.3 Tagging the data ............................................................................................. 36 3.1.4 Determining and assigning gender.................................................................... 37 3.1.5 The integrity of the data................................................................................... 38

4.

3.2

OBTAINING THE RESULTS ........................................................................................... 40

3.3

A SUMMARY OF THE DATA AND WHAT IT LOOKS LIKE ......................................................... 44

AN ANALYSIS AND DISCUSSION OF THE RESULTS ........................................................ 47 4.1

HEDGES ................................................................................................................ 47

4.2

TAG QUESTIONS ..................................................................................................... 52

4.3

ASKING QUESTIONS ................................................................................................. 55

4.4

TABOO LANGUAGE .................................................................................................. 57

4.5

EMPTY ADJECTIVES .................................................................................................. 61

4.6

INTENSIFIERS.......................................................................................................... 63

4.7

V OCABULARY ......................................................................................................... 65

4.8

BACK CHANNELLING ................................................................................................. 70 iii

Anglia Ruskin University 5.

MA Applied Linguistics with TESOL CONCLUSIONS............................................................................................................ 71

6.

LIMITATIONS OF THE STUDY....................................................................................... 73 6.1

SEMANTIC ANNOTATION........................................................................................... 74

6.2

CONNECT BY PRIOR ................................................................................................. 75

7.

RECOMMENDATIONS FOR FURTHER STUDY ................................................................ 76

8.

BIBLIOGRAPHY........................................................................................................... 77

9.

APPENDIX .................................................................................................................... 1 9.1

QUERY FOR RETRIEVING TAG QUESTIONS ......................................................................... 1

9.2

QUERY FOR RETRIEVING THE BREAKDOWN OF TAG QUESTIONS............................................... 1

9.3

QUERY FOR RETRIEVING THE NUMBER OF QUESTIONS ASKED ................................................ 2

9.4

QUERY FOR RETRIEVING THE STATISTICS FOR HEDGES.......................................................... 2

9.5

QUERY FOR RETRIEVING THE MOST FREQUENT ADJECTIVES.................................................... 2

9.6

QUERY FOR RETRIEVING THE MOST FREQUENT NOUNS ......................................................... 3

9.7

QUERY FOR RETRIEVING THE MOST FREQUENT VERBS .......................................................... 3

9.8

QUERY FOR RETRIEVING THE AVERAGE UTTERANCE LENGTH .................................................. 3

9.9

QUERY FOR RETRIEVING ALL SINGLE SEX WORDS ................................................................ 3 QUERY FOR RETRIEVING THE ............................................................................................ 4

9.10

DIACHRONIC USE OF ‘REALLY’, ‘VERY’, ’SO’ ....................................................................... 4

Frequency Counts -Simple ......................................................................................... 4 Frequency Counts - Complex ..................................................................................... 4 9.11

QUERY FOR RETRIEVING THE NUMBER OF EMPTY ADJECTIVES ................................................ 5

9.12

QUERY FOR RETRIEVING THE NUMBER OF PRONOUN REFERENCES .......................................... 5

9.13

QUERY FOR RETRIEVING THE NUMBER OF TABOO WORDS ..................................................... 6 General Counts by gender ......................................................................................... 6 General Counts by actor............................................................................................ 6 Counts per actor per scene ........................................................................................ 7 Counts per individual taboo word .............................................................................. 7

9.14

QUERY FOR RETRIEVING THE COLOURS USED BY THE GENDERS ............................................... 8

9.15

QUERY FOR RETRIEVING THE BACK CHANNELLING VOCABULARY.............................................. 9

9.16

THE CORPUS ............................................................................................................ 9

10.

COPYRIGHT ............................................................................................................ 10

iv

Anglia Ruskin University MA Applied Linguistics with TESOL

TABLE OF FIGURES Figure 1: Transcribing a fit of crying (Hepburn, 2004, p.252) ................................................ 12 Figure 2: Diachronic distribution of non-standard was among men and women in York (from Tagliamonte, p.182) X axis represents generation. ............................................................... 15 Figure 3: Wolfson's bulge model (1998)............................................................................... 17 Figure 4: The BNC Spoken Corpus regions (Crowdy, 1993, p.260) .......................................... 23 Figure 5: BNC Spoken Corpus ‘Conversation Log’ (Crowdy, 1993, p.260) ............................... 24 Figure 6: The TEI markup of a punchine (own material) ........................................................ 26 Figure 7: Phrase Detectives (http://www.phrasedetectives.org) ........................................... 29 Figure 8: reCAPTCHA (from Google, 2011) ........................................................................... 29 Figure 9: HTML file as downloaded. (Excerpt from 0102.html) ............................................. 32 Figure 10: Text file sans HTML mark-up (Excerpt from 0102.txt) ........................................... 32 Figure 11: Utterances before being loaded into the database ............................................... 32 Figure 12: Tidy data (not genuine utterances)...................................................................... 34 Figure 13: User interface for the BNC .................................................................................. 43 Figure 14: Custom BNC User interface at http://corpus.byu.edu/bnc/................................... 43 Figure 15: Links between data ............................................................................................ 46 Figure 16: Use of Hedges #1 ............................................................................................... 51 Figure 17: Use of Hedges #2 ............................................................................................... 51 Figure 18: Tag Questions .................................................................................................... 53 Figure 19: Tag Question Details........................................................................................... 54 Figure 20: Asking questions ................................................................................................ 56 Figure 21: Taboo language per actor 1 ................................................................................ 59 Figure 22: Taboo language per actor 2 ................................................................................ 59 Figure 23: Empty Adjectives................................................................................................ 61 Figure 24 Detailed view of negligible 'empty' adjectives ....................................................... 62 Figure 25: Diachronic results use of main elements over a 10 year period. ............................ 64 Figure 26: Vocabulary Breadth. ........................................................................................... 66 Figure 27: Mock query to interrogate both grammatical and semantic representations. ........ 74

v

Anglia Ruskin University MA Applied Linguistics with TESOL

TABLE OF TABLES Table 1: 10 most vocal actors.............................................................................................. 38 Table 2: Data counts .......................................................................................................... 39 Table 3: The first 15 rows of the data .................................................................................. 44 Table 4: Scene Data............................................................................................................ 45 Table 5: Normalising the data. Statistics of “oh” hedge........................................................ 47 Table 6: Corpus Details....................................................................................................... 47 Table 7: Hedges Results...................................................................................................... 48 Table 8: Tag Questions Results............................................................................................ 52 Table 9: Questions ............................................................................................................. 55 Table 10: Taboo words Results ........................................................................................... 57 Table 11: Who uses taboo language? ................................................................................. 57 Table 12: When is taboo language used? ............................................................................. 58 Table 13: Collocates to the left of 'boobs'............................................................................ 60 Table 14: Empty Adjectives Results ..................................................................................... 61 Table 15: Intensifiers Results .............................................................................................. 63 Table 16: Top 10 three-word clusters .................................................................................. 65 Table 17: Vocabulary Results .............................................................................................. 66 Table 18: Unique words per actor (ordered by ‘innovative’ score)......................................... 67 Table 19: Collocates with colours ........................................................................................ 68 Table 20: Gender specific vocabulary (not a definitive list) ................................................... 68 Table 21: Not back channelling ........................................................................................... 70 Table 22: Rachel back channels (scene #24)......................................................................... 70

Word count (excluding all tables, headings and quotes >= 50): 20,809 vi

Anglia Ruskin University MA Applied Linguistics with TESOL

1. INTRODUCING THIS STUDY 1.1

THE RESEARCH QUESTION

This project aims to investigate the gender differences in language use in the TV show “Friends” with an emphasis on the politeness of the language. The transcripts of the episodes will form the corpus around which the project hinges, and the show’s language will be analysed to identify how language both intra-group and intra-group.

The corpus is too large to facilitate detailed conversational discourse analysis at anything but a minute level. However, the use of politeness structures can be analysed from an abstract level. Such politeness structures include, but are not limited to, the use of hedging devices, tag questions, taboo language, the use of ‘empty’ adjectives and an analysis of so-called ‘intensifiers’.

1

Anglia Ruskin University MA Applied Linguistics with TESOL

1.2

DEFINING POLITENESS

Abundant definitions exist of what constitutes politeness, and only three of the most pertinent will be examined here. Lakoff (2004 p.64) states "politeness is developed by societies in order to reduce friction in personal interaction". Her hypothesis that “little girls are *indeed+ taught to talk like little ladies” (p57) ignited the debate between the pragmatic use of language between the genders. Setting the stage for a wealth of further research. Goffman stated that politeness was "the positive social value a person effectively claims for himself by the line others assume he was taken during a particular contact" (1955, p.213). This gave rise to Brown and Levinson’s (1978) ‘politeness theory’ and their definition of FTA’s – face threatening acts has received a great deal of attention due to their concept of both positive and negative face. Negative face is loosely translated at the inhibition of freedom of actions, a restriction on what one can do. Conversely positive face is the desire to be liked, admired, to possess a positive self-image and that crucially this self-image is appreciated and approved of.

Brown and Levinson’s politeness theory has been criticized for assuming that politeness can be expressed in a single act, and Mills (2003) has argued that it is not utterances which are polite or impolite but the context, and the communities of practices which arbitrate over whether something is offensive or not. Politeness, Mills states, needs to be analysed at a discourse level and not a sentence or phrase level. Leech’s view seems to support Mills when he proposed six politeness maxims (1983) which he used to define politeness as forms of behaviour, approaches which can be used to minimize conflict and ensure harmony among conversational participants.

2

Anglia Ruskin University MA Applied Linguistics with TESOL

1.2.1

WHAT IS POLITE LANGUAGE ?

Clearly some language is considered to be more polite than others. Cutting (2002) tells us that Example 1 (below) is more polite than Example 2 due in part to the use of negatives; their use avoids imposing on the hearer and anticipates rejection which helps to make the offer easier to refuse by the hearer. Example two is more direct and hence is considered to be less polite. Polite language is considered to be orthographically longer than impolite language and indeed this was the sole metric Gibson (2009) used in her study 1) I couldn’t borrow $30, could I, if you don’t need it right now? 2) Could I borrow $30? (Cutting, 2002, p.47) “Shall we stop for a drink?” is clearly a request, and linguists (Swan, 2005) would allow us to believe that ‘shall’ is a polite form. However is it so polite that the actual meaning of the utterance is hidden? Did the speaker not want to say “I want to stop for a drink” but the context disallows them? Herein lies the conundrum, the problem of distinguishing language (semantics) from language use (pragmatics) and the search for meaning is a complicated one. In pragmatics, “the meaning is relative to a speaker of user of the language, whereas meaning in semantics is defined purely as a property of expressions in a given language in abstraction form particular situations, speakers of hearers” (Leech, 1983, p6). Therefore if an utterance and situation are inextricably bound up with one another, then the context of the situation is indispensable for the understanding of the words (Malinowski, 1923 cited in Sunderland, 2006). In essence many utterances do or can do, many things at once. Linguists (Holmes, 1983, 1986; Lakoff, 1973; G. Lakoff, 1973) almost routinely butt heads with trying to define the bounds of linguistic structures. Does the use of ‘sorry’ indicate an apology?

The transcript (CNN, 1998) from U.S. President Clinton’s ‘apology’ for his

indiscretions showed that he managed a 4 minute address to the nation to express sorrow for his misdemeanours without ever using the word ‘sorry’. Equally the receptionists’ favourite ‘I’m sorry but he’s in a meeting at the moment’ lacks any sincerity and serves its place as an expected pre-fabricated chunk of language. The pragmatic use of language, (as well as other features such as intonation) can make a semantically impolite sentence polite or the adverse equally true. At a meta-linguistic level politeness is also subjective; subjective by gender, and/or by social occasion - clearly more politeness is required when speaking with one’s peers than to a three year old child, and in this context, it is not the use of discourse markers which makes impoliteness noticeable but the lack of them. 3

Anglia Ruskin University MA Applied Linguistics with TESOL

1.3

THE NEED FOR A CORPUS -BASED STUDY OF LANGUAGE AND GENDER IN US TV

“*The+ non-spontaneous dialogue of a film script tells us little of the way people really talk, but it often represents an ideal to which speakers aspire. It allows us to see what expectations different speech communities have, and what patterns are deemed appropriate for each sex, whether or not particular individuals respect the stereotype, they are familiar with them as competent users of the language, and the ways in which people flout the conventions provide rich material for analysis" (Livia, 2004, p.145, emphasis added).

The prominence of gender and gender-related acts are an effervescent characteristic of today’s society.

The whole advertising industry is built around the premise of women

behaving like ladies while big, burly aggressive men dominate the environment. With Britons now watching more than 30 hours of television a week (Sweney, 2010) the effects of such stereotypes on the impressionable youth are wide-reaching and important. This quote from Livia succinctly highlights the need for this study, the portrayal of genders through the media is an eternal hot potato, an area rife with strong emotions and divided opinions. Many studies have concluded that the genders exemplify their stereotype, representing women as submissive and men as dominant while men rarely exhibit effeminate traits (see Casey et. al, 2008). Such preliminary findings are important given the reach and intrusion into the lounge of these TV shows. Stateside, the episode of Friends immediately following the U.S. Super Bowl attracted in excess of fifty million viewers (Ginsburg, 2004) and the final episode was the most watched [American] TV show of the [2000-2009] decade (Chicago Tribune, 2009).

From Lakoff’s (1973) intuitive assumptions about gender more rounded empirical studies have been conducted. As indicated these have tended to focus on specific areas such as gender prominence in advertisements (Allan & Coltrane, 1996) gender in music videos (SommersFlanagan et. al., 1993), gender stereotypes in children’s television (Browne, 1998), semantic fields of vocabulary in off-the-shelf women’s magazines (del-Teso-Craviotto, 2006), occupational roles (Glascock, 2001) and vocabulary breadth in TV shows (MacFadden et. al., 2009). More works have also looked at linguistically discrete areas such as the use of specific intensifiers (Tagliamonte & Roberts, 2005), swear words (Gao, 2008) or key word clusters (Demmen, 2009). Almost all of these studies (Demmen’s study of Shakespeare the exception) are American-centric and in comparison relatively few studies have been approached from a 4

Anglia Ruskin University MA Applied Linguistics with TESOL British-centric view (for additional exceptions see Furnham, 1993 & Jaworski et. al., 2003). To date, and to my knowledge no other study has been conducted to look specifically into the politeness of language on a TV series; although regarding politeness, Gibson (2009) interestingly analyzed politeness in the fast-food industry.

Importantly it is gender and not sex which is at the centre of this study and it is inter-group and intra-group variation which is being studied – that is the difference in mixed gender and same gender conversation. Intra-group comparisons are useful because they serve to highlight the differences acting almost as a benchmark against which comparisons can be made.

Cameron (2005) has argued for a move away from the “binary difference” (p.490) view of the genders detailing how the construction of gender identity between men and women today has never been more similar. Her references are to social interactions in naturally occurring situations, variables which are void in the glitz of Hollywood’s studios. Therefore this study is not, nor will attempt to be an analysis of language as used in everyday conversations, and few comparisons will be drawn between the two. As Livia has cited, the project hinges around the portrayal of stereotypes and the characters’ alignment with said stereotypes.

Lauzen and Dozier’s (1999) study of the 1995-96 TV season – one year after Friends first aired, reported that women made up just under a third of the behind-the-scenes production team. Just 1 in 5 writers (22%) were female and while this improved marginally for directors and producers (28% and 36% respectively) they were still heavily outnumbered. Female executive producers accounted for less than 20% of the total. In contrast, in Friends just 12 of the 42 writers were female (29%). Two of the 12 (17%) producers were female while just 4 out of the 18 (22%) directors were female responsible for directing only 19 of the 238 episodes (8%).

The Sapir-Whorf hypothesis theorizes that what we do and how we think are influenced by language (Kay & Kempton, 1984). If this is true in its strongest sense, it affords the sinister possibility of a culture controlled by the media. Lippi-Green (1997) studied the language of characters in Disney films and found that heroes and lead characters used received pronunciation (e.g. Robin Hood spoke with a California accent) while foreign accents, often Russian or Arabic, were assigned to those of criminals, villains, servants and country folk amongst others (e.g. King Louis the orangutan in ‘The Jungle Story’ spoke with an African American accent). 5

Anglia Ruskin University MA Applied Linguistics with TESOL Her stance that “beliefs about the way language should be used are passed down and protected in much the same way that religious beliefs are passed along and cherished” (p.15) is important and relevant to this study, it is after all the script writers (along with the odd adlib) who are influential in the language which is broadcast.

As has been identified above (1.2) the problem with defining politeness is a tricky one and any study which attempts it will be flawed in some way. The extent to which the genders “respect their stereotype” (Livia op. cit., p.145) and how or indeed if they push the margins of expected norms will be interesting to discover and this study will analyse the most documented traits of politeness providing indicative results regarding the use of politeness structures.

6

Anglia Ruskin University MA Applied Linguistics with TESOL

1.4

THE APPROACH TAKEN TO THIS STUDY

This approach is largely quantitative. Given the quantity of the data comparatively little of it can be analysed at an utterance level. Linguists (Tannen, 1990, Holmes, 1995, Coates, 2004 amongst others) have demonstrated how men and women use language differently although quantifying this data remains an immense task given the syntactic variations any polite request can take and the influence the pragmatic meaning can have on an utterance.

An utterance is defined here as being synonymous with what many (Schegloff, 1968, Schegloff & Sacks, 1973, Crystal, 2009) have defined a turn. A turn it is essentially a statements by an actor. Using the Friends data, utterances are always scripted and the ambiguity of what constitutes a turn is largely a mute point as each utterance is clearly marked in the transcript. In this study the term utterance will be used instead of turn.

Traditionally, sociolinguists have approached the issue of linguistic and prosodic differences from a mostly non-computational perspective using relatively small and very focused data sets. Separately, where corpora have been used it has typically focused on a lexical view of language, identifying chunks or the most frequent vocabulary (e.g. Cheng, 2004 or MacFadden et. al., 2006). This study is a hybrid of both approaches but with a heavy bias towards the latter. While frameworks exist for annotating stretches of speech the overhead is huge and out of the bounds of this study. Therefore many of the results will be based on a very topdown methodology looking at vocabulary and phrases in isolation largely independent of the surrounding vocabulary or context.

This study is relevant in so far as it can show to what extent the writers’ preconceptions about gender influence their writing for each character. In television, I hypothesise that a stark ‘binary difference’ still exists where taboo topics are avoided by virtue of advertisers evidently (Fisher et. al., 2007) calling the shots.

7

Anglia Ruskin University MA Applied Linguistics with TESOL

2. BACKGROUND INVESTIGATIONS RELEVANT TO THIS STUDY The literature review will be divided into three distinct sections each dealing with a discrete item of the project. 1. The nature of discourse including issues surrounding the transcription of spoken data. 2. The differences in politeness between the genders. 3. Corpus design

2.1

DISCOURSE-RELATED LITERATURE

Detailed discourse analysis cannot be satisfactorily carried out without some knowledge of the situation or context in which the discourse has taken place. Views differ on which contextual features are important. Hymes (1974) for instance identifies almost a dozen variables as being necessary to adequately describe a situation. Hymes’ communicative competence is a key term which refers to the knowledge of social appropriacy regarding language use. Halliday (1985b) claims that our ability to put together well-formed spontaneous utterances effortless in a casual context is a trait of every native speaker, but this view is at odds with that of Chomsky (1965) who argues that all language is “degenerate” from the basic rules of grammar.

Discourse markers and their use are at the heart of this project so it is only apt to define what actually constitutes one. Discourse markers are “a class of expressions, each of which signals how the speaker intends the basic message that follows to relate to the prior discourse” (Fraser, 1990, p.387). Fraser’s more recent work (1999) builds upon the function of discourse markers, such that discourse markers provide commentary on a relationship between two utterances, a specification which is not conceptual, but procedural providing information on the interpretation of messages. Halliday & Hasan (1976) echo this approach forking slightly to suggest that their use helps to create a cohesion between utterances and ideas. Typically, discourse markers add little meaning and identifying them can be done by removing them from the utterance and validating that the meaning remains the same.

A handful of models exist which provide a framework for the deconstruction of dialogue, be they TV sitcoms or spontaneous conversation. These models are important to this study because they serve to underline the importance of treating a conversation as a series of 8

Anglia Ruskin University MA Applied Linguistics with TESOL interlocked turns rather than just an utterance in isolation.

One of these models the

Birmingham model.

Birmingham Model Sinclair and Coulthard’s (1975) approach to classroom discourse is just one of many which allow for the application of a framework to spoken discourse, and being a popular one it has given us a structure which has inspired others. They characterised an exchange as a series of moves from: Initiation  Response  Follow up. For example:

Initiation:

Speaker A:

What time is it?

Response:

Speaker B:

6:30

Follow Up:

Speaker A:

Thanks

They note how such exchanges are not fixed and culturally the sequence may deviate as illustrated here: I

A: Oiga, pot favor, ~ q uhio ra es?

(What time is it?)

R

B: Las cinco y media.

(5:30)

F

A: Gracias.

(Thanks)

?

B: De nada.

(Not at all)

With a little work the formula was modified to accommodate this redefining the structure as: I (R/I) R (Fn)

(Coulthard and Montgomery, 1981, p.112)

Within each move an act can be defined detailing the function or the purpose of the move, and Francis & Hunston (2002) detail 32 acts which can be applied to any one move. This is important because while a move may be a ‘follow up’ the meaning can vary depending on the context: A: What time is it? B: 6:30 A: Thanks. (stranger to stranger), or A: Good! Clever girl! (parent/teacher to child), or A: No it isn't, and you know it isn't; you're late again! (boss to late subordinate)

9

Anglia Ruskin University MA Applied Linguistics with TESOL While the I/R/F model remains intact this illustrates that we can only be “absolutely sure of the function of the initiating utterance when it is contextualised with the response it gets" (McCarthy, 2000, p.120). Such a model is convenient in this rigid two party setting or similar environments such as patient  doctor but it fails to stand up against casual conversation and all of the nuances it brings with it.

Similar models have been developed to try to

accommodate this.

Systemic Functional Linguistics Halliday’s Systemic Functional Linguistics (SFL) is another approach to the analysis of language. His view is that discourse analysis which is not based on grammar is nothing more than a “running commentary on the text” (Halliday, 1985: xvi), and the aim of SFL is simply to be able to make it possible to say sensible and useful things about texts. SFL has four main theoretical claims about language: 1. That language use is functional. 2. That its function is to make meanings. 3. That these meanings are influenced by the social and cultural context in which they are exchanged. 4. That the process of using language is a semiotic process, a process of making meanings by choosing. The functional approach SFL takes to language is two-fold. Firstly, it allows us to see how language is organised and secondly it allows us to make meanings and study how language is used. Halliday proposes that language be inspected from three different points of view 

Ideational meaning (field): i.e. The real world meaning



Interpersonal meaning (tenor): i.e. The relationship between speaker and hearer (or writer and reader)



Textual meaning (mode): i.e. The text is organized as a piece of writing or speech.

These three variables are important claims Halliday because they are the three kinds of meanings language is structured to make. Practically, the use is methodical to apply and has been used primarily in the examination of short texts, typically advertisements (both spoken and written, for example, see Simpson (2001)) the pitfall of it comes in its scalability, and much like Sinclair and Coulthard’s (1975) model the overhead needed to adequately analyse text becomes unreasonable. Hence although frameworks exist their application in corpora is restricted.

10

Anglia Ruskin University MA Applied Linguistics with TESOL

2.1.1

SPEECH TRANSCRIPTION

Labov (2006) believed that the more attention a speaker is forced to attend to their speech the more the style and register of their speech would be affected. With regard to this study, given the environment (TV studio) upon which the recording is made one must assume a high regard to the management of one’s speech and an accuracy to the script. The focus therefore moves to that of transcription. The transcription of the spoken word is key for a number of reasons: i.

The data derived from the transcriptions will be the basis for the results.

ii.

Transcription is a selective and subjective process.

iii.

The level of detail in a transcript will determine, to a certain extent, the acuteness of the results and the ability of those results to be reproduced by othe rs.

“A transcript should not have too much information” Ochs remarks in her seminal paper on transcribing the spoken form (1979, p.44) one which does is cumbersome, too difficult to reliably assess and can leave the researcher with the task of “sorting out the forest from the trees” (op. cit. p.56). Cumbersome notations such as ‘she leaves the room in disgust’ are subjective and imply the observer’s biased (Bucholtz, 2000, DuBois, 1991) or political (Green, et. al., 1997, p.172) opinion of the actor/actress. Such “cooking” (Trappes-Lomax, 2004, p.141) of the data is undesirable when the objective is to be neutral and analyse the data dispassionately without bringing any preconception to the task nor mentally filling in the blanks or jumping to conclusions. Illegibility can also be a problem; the non-verbal context changes an almost finite number of times in a thirty minute broadcast and Ochs (op. cit.) documented a paragraph of tugs, grimaces, frowns, pushes, turns from a three year old upon the utterance of ‘yeah, I’m gonna make a car’. This is also illustrated here: 1 Caller 2: 2 3 4 CPO: 5 6. 7 8 9 10 Caller 2: 11 12 13 14 Caller 1: 15 CPO: 16 Caller 1:

Um:: (0.6) Yeah I’m ok(h)ay. Y’okay ab’t- al:right then. (0.1) HHH so::=um:: (0.1) okay.=so Kathryn was just sayin abou::t (0.2) [ye know th-] [ AHH HH ]Hk iuHHhh uhh (.) I ca(h)n t ta(hh)lk. (1.2) Hello:? Hello::? I’m sorry she’s just li:ke >broke out in 11

Anglia Ruskin University 17 18 CPO:

MA Applied Linguistics with TESOL tears< she can’t spea:k. Ri:::ght.

Figure 1: Transcribing a fit of crying (Hepburn, 2004, p.252)

Standard orthography also forces a literal interpretation of utterances that otherwise may be simple objects of “phonological manipulations” (Ochs, op. cit. p.45). This morphological approach assumes that language is being used for information and not for sound play, and it may be that the researcher is intent on examining the language used, however the language choice may be influences by social factors. Such is the genre of the corpus in this study that assimilations of speech and alliterations are functions used to exhibit humour; a corpus void of such phonological transcriptions makes such analysis [of this area of research] virtually impossible, a point which extends to any corpus void of this annotation. Form and function have to be separated to understand what is happening in discourse. There is no 1-1 relationship between what is being said and the communicative function We are required to look at not just the grammatical form but also the context in order to understand the function of the language. Also a return to the floor by the speaker does not necessarily denote a response and that response may or may not be verbal. An exchange has to be initiated and a response to a previous utterance is indicated by a great number of non-verbal clues - eye contact, gestures, etc.. Without these non-verbal clues there exists an ambiguity as to whether the latest utterance is a semantically related adjacent utterance to the one it follows or if it is ‘off-topic’, an issue which is exemplified with more than one speaker.

While we (generally) express our intentions verbally extended annotation would be suited to dialogue where the communicative load of the spoken word is low relative to the context of the situation. Several agree (Green et. al., 1997) that there are two strands of transcription: 1)

interpretive and 2)representational. Although the distinction between the two becomes

difficult as it is applied. Roberts (1997) alludes that, transcription is still a subjective process and one where transcribers “call up the social roles and relations constituted in language and rely on their own social evaluations of speech in deciding how to write it…. transcribers bring their own language ideology to the task” (Roberts, 1997, pp.167-168). Labov and Fanshel (1977) add that currently it is impossible to “provide a context-free set of interpretations of prosodic clues” (p.46), and many would agree that this is still a significant issue thirty years after they made this statement.

12

Anglia Ruskin University MA Applied Linguistics with TESOL There are three questions (Crowdy, 1995) which need to be asked when designing the criteria for any transcription scheme: 1. Who is the transcription for? 2. How will it be used? 3. What are the important features? Much has been written about the importance of intonation (Coultard, 1995), and loosely defined, intonation, stress and rhythm are the key elements which compose prosody. Itself being the ability to express oneself using linguistic elements aside of grammar or vocabulary. Effects such as sarcasm and irony are readily expressed prosaically and their use in humour has a long history. The intonation and stress pattern of an utterance heavily influence the reply which is expected and recognizing intonational features and stress is a relatively straightforward process; however the issue of transcribing such elements is far from trivial. Jefferson whose transcription system has been widely adopted allows for ‘continuing intonation’, a ‘stopping fall’ and three types of stress. Coultard (op. cit.) laments that a recording of the utterances accompanying the transcription is mandatory as (again) the application of these features is subjective and that there is little consensus on when and when not to apply certain markers. While this is a valid point it must be noted that transcription is done for the purpose of analysis and if it is being analysed for some prosodic purpose then any human-derived level of transcription is beneficial, while features are subjective presumably it is being analysed subjectively too. I would also concur with Coulthard (op. cit.) that an accompanying recording would always be always helpful.

Some estimates put word-level transcription at ~10 minutes for every 1 minute of audio (Crowdy, 1995) while to produce a detailed transcript, complete with overlaps, stress markers and changes in tempo/pitch/speed this increases to ~20:1 (Ervin-Tripp, 2000).

13

Anglia Ruskin University MA Applied Linguistics with TESOL

2.2

SEX AND GENDER

While sex is defined as a biological characteristic, gender is somewhat more subjective. Gender has been defined as “a collection of traits or characteristics which a culture associates with a either men or women” (Kiesling, 2004, p.229, emphasis added) hence de Beauvoir’s adage ‘one is not born a woman, but becomes one’, everyone having been conditioned into one of the gender roles by cultural conventions. Western culture is geared towards the polarization of sex to the extent that biological sexual variance can be surgically corrected to ‘fit’ the binary categories; breast augmentations are on the rise and men strive to keep hold of however much hair they have left both sexes keen to conform to as close to the ideal as possible.. While Cameron’s views (2005) on the linguistic capabilities of the genders becoming closer aligned, holds some water societal factors still heavily underpin and to a certain extent accentuate the typical gender stereotypes. Importantly it is gender and not sex which is under review in this study.

The term women's language is problematic because much like ‘child’s talk’ it draws a comparison again a norm - in the case of gender this is an androcentric norm. Kaye in her two articles (1989a, 1989b) highlights just some of the linguistic biases of the English language, and the verb ‘gossip’ and the heavy female connotation it carries reinforces societies perceptions that women’s talk is ‘idle’ in contrast to men’s talk which is real and important.

1a.

S_r:

A respectful term for addressing males.

1b.

M___m:

A woman in charge of a brothel.

(Kaye, 1989a, p.185) The linguistic patterns women use has a long history, and women have been found to use a higher proportion of prestige forms than their male counterparts; where prestige is defined as “linguistic forms normally used by the social group with the highest social status” (Coates, 2004, p.47). Trudgill’s (1972) Norwich survey validated this hypothesis and women were found to use the /ng/ variant more than the stigmatised /n/ alternative; true against all social classes. Trudgill found that in formal contexts where the possibility for hypercorrection was greatest, women were more sensitive to their speech than men while interestingly the lowermiddle-class women were found to upgrade their pronunciation to that of their peers one class above. Macaulay’s (1978) study into /i/ revealed similar, interestingly, he also found that lower-middle class males identified more with the working class than with the upper-middle class. Overwhelmingly women’s pronunciation was found to be closer to the prestige model 14

Anglia Ruskin University MA Applied Linguistics with TESOL than men’s and where women’s speech didn’t match that of the prestige variant women were prone (Trudgill op. cit.) to report that they did while under-reporting for men was the norm.

In Labov’s 1966 (2006) study in New York he found that Macy’s e mployees (gender not given) stratified themselves with that of the higher-class Saks’s employees; the sales people borrowing the prestige from their customers. There was a distinction that middle-class employees tried to associate with the class above while in this study those at the bottom rung of the ladder (working for Klein’s) seemed resigned to keeping their working-class vernacular possibly at an attempt to identify with their customer base. Much like Trudgill’s study (op. cit) it was the lower middle class who were prone to hypercorrect, possibly more conscious of the prestige forms and the emulation thereof. The hypothesis that women themselves believe that they are perceived by how they appear and say as opposed to what they do ensures they try to elevate their social status by monitoring the use of their language. Their need to maintain this prestige form may influence the number of risks women take with the language, producing fewer truly unique forms and pushing their linguistic boundaries less than their male counterparts, a test which can be validated in this study. As a rule the belief is that it is women who keep within the bounds of their linguistic parameters, but recent research has shown that when it comes to language change it is the women, and not the men who lead the way. This is an important area of research and the linguistic behaviour of the genders is a key component in this study. One area which has been extensively studied is the variation in the non-standard forms of ‘was’ and ‘were’.

Figure 2: Diachronic distribution of non-standard was among men and women in York (from Tagliamonte, p.182) X axis represents generation.

Tagliamonte (1998) investigated. ‘was/were’ variation in the city of York. She found a plethora of grammatical violations (see Figure 2) specifically in relation to existential constructions and

(2)

(1)

non-standard was in

non-standard were in negative tags leading to such

malformations as “They said there were nothing wrong” and “He was shorter and stockier, 15

Anglia Ruskin University MA Applied Linguistics with TESOL weren't he?” (respectively) among others (p.155). Rather worryingly she warns “…if the males adopt the change, the morphological form was in existential constructions may become the norm, at least in the spoken language” (p.184). Interesting to note is how the males are assumed to be the tipping point.

Eisikovits’ (1991) Inner Sydney study also revealed some interesting results with the use of non-standard past tenses but they appear to contradict Tagliamonte’s study. While Eisikovits found that girls did use more non-standard past tense forms than boys, she also found that this was only true of young girls (aged 13 as an average) whereas by adole scence (average age 16) the linguistic patterns of these girls had changed and aligned closely with the boys’ usage which had been relatively stable throughout.

Eckert’s (1989) Detroit study was not related to was/were and did not find a polarisation between the genders per se. In fact Eckert was very specific in the need to account for societal factors as underpinning her results. Her ‘jocks’ and ‘burnouts’ study has received a great deal of attention and highlights Cameron’s (2005) need to move away from a view of gender as we view sex. Her study did not highlight any significant breakaway phonetic mutations but she did find that girls were more experimental using “more variation than the boys” (p.263). Eckert showed that the most extreme users of phonological variables were by those who have to do the greatest amount of work to affirm their membership in groups or communities. Aligning themselves with their peers is seen as integral and this linguistic ‘Community of Practice’ is one way to achieve that.

In brief these studies show that women use more prestige forms, adolescents experiment more – with girls leading the way; Tagliamonte’s diachronic anomalies (see Figure 2) were explained away with generational shifts. Ultimately males were found to be linguistically content although even Trudgill’s study (op. cit.) found that it while it was true that women used more prestige forms it was men who led the way in the use of non-standard forms apparently unconcerned with their low-prestige forms. An issue must be taken with any study which bases its results around adolescent teens – an issue which will come up again shortly. This identify forming period raises its own questions although the finding that gi rls experiment more linguistically remains valid.

16

Anglia Ruskin University MA Applied Linguistics with TESOL

2.2.1

POLITENESS

Figure 3: Wolfson's bulge model (1998)

Wolfson’s bulge model proposes that the greatest level of politeness is given to those whom we know where the relationship is dynamic and open to negotiation (Wolfson, 1998), she hypothesised that when there is a constant need to reaffirm one’s relationship more polite language is used. The contradiction which immediately comes to mind is that of groups of men engaged in team sports. Here the language is frequently coarse and standoffish; insults fly and slang is de rigor. The function of language here, however is one of aligning oneself with a group, they are, by all accounts, linguistically inclusive devices used to express inclusion and camaraderie with a group (a view also echoed by Pilkington, 1992). Here one would expect to find very low frequency counts of any politeness device however the language is still be used to express positive politeness or friendliness. Moreover the sitcom ‘Friends’ is based around 6 relatively well-to-do twenty-somethings (later becoming thirty-somethings) a brother and sister pair and four of the six are referenced as having schooled together – essentially friends who are intimate with one another whom altogether have a low social distance from one another. The show frequently centres around two apartments in which 4 of the stars cohabit and given Wolfson’s model one would come to expect an exceedingly low level of politeness regardless of gender within conversations confined to 1)the 6 characters 2)the four who live in the two apartments and 3)those who actively live together. Interruptions are a device which are often analysed with the view to making hard-and-fast conclusions. Studies such as Zimmerman & West’s (1975) suggested that they are another way in which men maintain their ‘conversational dominance’.

Sacks et al. (1974) model of

conversation defines simultaneous speech as violation of conversational politeness ; their model documents conversations as orderly and predictable events where the participants quick-step in time to one another harmoniously transferring a metaphorical baton between themselves at each legitimate completion point. Tannen (1994) had narrowed the definition slightly to “when a second speaker usurps another speaker’s right to continue speaking by taking the conversational floor in the absence of any evidence that the other speaker intended to relinquish the turn.” (p.33). Essentially interruptions are a device used to impose ones power on the conversation, demeaning the speakers right to talk. 17

Anglia Ruskin University MA Applied Linguistics with TESOL Traditionally men have been seen as the power brokers of society and hence the inference has been that men are more prone to interrupt. To an extent this hypothesis has been validated; Zimmerman and West’s (op. cit.) study bore this out with a resounding 96%/4% ratio and Drass (1986) observed that men were more likely to interrupt when their listening time far exceeded their talk time. Drass graded 91 undergraduates according to their ‘maleness’ (see Drass, op. cit. and/or Burke & Tully, 1977 for the details), and in a nutshell concluded that sex per se was irrelevant but that the gender variable proved influential: “...the more maleness *the person+… the greater the hazard that the person will initiate an overlap or interruption during a conversation” (p.299). This could be because men are simply conditioned to be the ones dominating the conversation, Drass concluded or as Tannen declares the parameters are just different. Tannen (1990) documents how in an hour-long radio phone-in about abortion – a key women’s topic - “all the callers except two were men” (p.88), she goes on to draw attention to how females loathe “‘catapulting’ themselves onto centre stage” (p. 88) or being the centre of attention – should women choose to interrupt in an public environment this is exactly what they would be doing.

Most of these studies have looked at the polarization of gender and assumed that each participant is on a level playing field to start with an equal knowledge of the conversation. However this is rarely the case.

Televised political interviews are a rich source of

conversational analysis and in 1982 Beattie analysed two interviews, one with a male interviewer and the leader of the opposition - Jim Callaghan and a second with a different male interviewer and the then Prime Minister - Margret Thatcher. Despite her powerful position, Margret Thatcher was interrupted significantly more frequently than Jim Callaghan. While Thatcher herself interrupted less than 40% of the time compared to Callaghan’s 55%. The conclusion here was not necessarily one of either power or gender but intelligence. Rim (1977 in Beattie, 1982) states it is the less intelligent subjects who interrupt more frequently than the more intelligent subjects. Broad sweeping statements are foolish, however the evidence suggests that there is a complex relationship between interruptions and dominance and that they are affected by many variables.

English is a flexible language where the lines between polite and impolite are blurred and subjective. North-East Asian languages with their honorific forms are very different and Japanese Women have long been the subjects of research (Smith, 1992) as to how women in power address their [male] subordinates. Politeness here is not marked by the use of inference or intonation but instead by the unidirectional use of ‘keigo’ a ‘honorific langu age’ 18

Anglia Ruskin University MA Applied Linguistics with TESOL used by a wife to her husband, from youngsters to their elders and so on. Such forms uniquely sum up the powerless status females in this society have.

a) “Yobu.”

[Translation: Call (imperative).]

b) “Yobimasu.”

[Translation: Call – polite variant. e.g. 'I will call him'.]

(from Smith, 1992, p.60)

Intriguingly, Smith studied the “linguistic conflict” (p.79) women in authority find themselves in when being forced to chose between a traditional powerless-heavy form or a more appropriate level of speech. A solution she discovered was that far from mimicking the speech of males, the empowered women got creative; they employed linguistic forms mothers use when talking to their children and created “new and powerful strategies” (p.79) where they acted passively but confidently omitted honorific verbs completely or substituted them. At no point did they ‘tread on the toes’ of their male counterparts and use the same language they would use.

On the back of work done by Lakoff (1973), O’Barr & Atkins (1997) embarked on a project to define women’s language, their highly popular paper concluded that it is not so much women’s language as a powerless language and that anyone is a powerless position will use more of these devices. “’Woman’s language’ is not fundamentally about gender but more basically about the lack of power” states Bucholtz (2004, p.6). O’Barr & Atkins’ courtroom study found that similar linguistic patterns were observed between a testifying housewife and a testifying male ambulance driver (both relatively powerless) while well educated professionals exhibited inverse features regardless of gender. Deuchar (1988) also argued that it is the powerless who are more likely to be linguistically articulate than those with power.

O’ Barr & Atkins

determined that power “derive*d+ from either social standing in larger society” (p. 384) is the key determinate and not gender. However they still upheld the assertion that “women tend to be high in women’s language features while more men tend to be low in these same features” (p.385)

Tannen asserts that women use language to achieve create and maintain relationships, defining ’rapport talk’. For women, "talk is the glue that holds relationships together," (1990, p.85) and so conversations are "negotiations for closeness in which people try to seek and give confirmation and support, and to reach consensus." (p.25). Men, on the other hand, use language to convey information, resulting in what Tannen calls ‘report talk’. Because men maintain relationships through other activities, conversation for them becomes a negotiation 19

Anglia Ruskin University MA Applied Linguistics with TESOL for status in which each participant attempts to establish or improve his place in a hierarchical social order. On top of this, the genders also view politeness differently where “what is perceived as rude, impolite and disruptive by women may be acceptable and normal in male interaction” (Holmes, 1995, p.53).

Cameron’s views (2005) on a spectrum of genders aside momentarily; there is no doubt that men and women are fundamentally different. A toy company, tested a playhouse which was being considered as a gender-neutral toy. After tests with children they found that girls and boys did not interact with the structure in the same way; amusingly “…the girls dressed the dolls, kissed them, and played house …the boys catapulted the toy baby carriage from the roof” (Sommers, 2001, p.27). Boys and girls are different concluded the company. Not only are our use of toys different but also our use of language and such distinctions are visible from an early age. Many (Sauntson, 2007 in England & Farris, 2000 in Taiwan) have shown that school boys and girls use language differently; boys are more direct and insistent while girls combine direct and indirect strategies softening their language. Farris concluded that girls were imitating their mothers while Sauntson concluded that girls “tend to contribute towards the accomplishment of collaboration and consensus within the group” ( 2007, p.322) opposed to boys who use language for the “negotiation of status and hierarchy within the group” (2007, p.322)

Therefore, if politeness is the domain of the women is the conclusion that men are impolite, rude and abrasive? Is impoliteness a gendered monopoly, the domain of only one gender? No-one would reasonable suggest it is but does one gender ‘do’ impoliteness better than the other? Consider the following: The BBC2 TV show “The Weakest Link” features a sharptongued host - Anne Robinson dubbed the “Queen of Mean” (The Sun, 2011). The show originated here in the UK however since then it has been exported to a further 42 countries (Wikipedia, 2011). Out of the 43 countries producing the show the production agency found it appropriate to install a female presenter as the host for all but 10 countries – giving women a near 80% stronghold on the role of playing “a cross between Cruella de Vil and Hitler's mother” or any one of a handful of playful comparisons (BBC, 2001) This raises a very intriguing question - Why did production companies the world over pick a woman for this role when they could have as equally picked a male counterpart? I propose no answer to such a question however the facts are interesting and have broad social implications.

20

Anglia Ruskin University MA Applied Linguistics with TESOL Definitions of what constrains a ‘taboo word’ vary and research into this area has faulted at this very hurdle; however the general negotiated consensus is that they are governed by social and intrapersonal parameters; heavily dependent on contextual variables - predominantly ones company. Taboo language has long been associated with males despite their having been little in the way of hard evidence to support such claims (Coates, 1986). Men are perceived as using more swear words than their female counterparts and they are perceived to use them more often. However recent work has cast doubt on this. Gao (2008) has shown that women have the more active vocabulary when it comes to swear words (47 vs. 31). His study of taboo language in another popular prime-time TV series ‘Sex and the City’ is important although it is worth noting that the professions the female actors are engaged in range from an independent businesswoman and a career-minded lawyer to a newspaper columnist and straight A finance student - the epitome of empowered positions. This is in line with Trudgill’s (1983, p.30) assertion that swearwords "occur very frequently in the speech of some sections of the community. This is largely because taboo-words are frequently used as swear-words, which is in turn because they are powerful." Other studies have had interesting results, for example Hughes (1992), found that both men and women swore more in the company of their own sex and that males’ use of swear words dropped significantly in mixed-sex conversations.

If power is the key to the use of swear words then one would expect a correlation where the social status of peers is the same – for example pupils at school. De Klerk (1992) examined exactly this using different groups from different schools (consistently year 6 and year 9 pupils) she found that the hypothesis that females did use taboo words was supported by her data, lamenting “female[s] were never at a loss for derogatory words” (p.10), age correlated more than gender she found. While a study during the identity-forming adolescent period raises its own questions the conclusion that girls were not striving for prestige speech but instead that which their peers use muddies the water somewhat.

21

Anglia Ruskin University MA Applied Linguistics with TESOL

2.3

CORPUS RELATED LITERATURE

The growth of corpus linguistics has bridged the gap between Chomsky’s view (1965) that only ‘competence’, and not ‘performance’, was worthy of study. His view that corpus linguistics “is almost useless… for linguistic analysis of any but the most superficial kind” ( 1961, p.131) has been well and truly extinguished; as Baker et. al. (2000, p.421) proclaim, “corpora have become the touchstone of credibility”.

Building a corpus is a complicated affair but they are needed because they show us how language is used - descriptive accounts rather than prescriptive. Owen (1993) and Francis & Sinclair (1994) engaged in a tête à tête on the merits and wherewithals on corpus linguis tics each vehemently arguing their corner. New insights into language are afforded by large corpora insist Francis & Sinclair while “grammars are founded on their authors' linguistic beliefs” and the “availability of information does not guarantee *an+ accurate description” insists Owen. Chomsky (1962) has argued that, whatever is found in a corpus is restricted to what is in the corpus and is not necessarily representative of the entire potential of a given language; his generative grammar theory has long been fraught with issues and copious amounts of real-world data only serves to reinforce such scepticism. In corpus analysis we are interested in language use as it is used (frequent collocates, idioms etc.), and while the Cartesian Product of syntactically valid sentences given only a handful of nouns and verbs is vast, we fail to exercise this creative extent opting instead for a tiny subset of what is grammatically correct (Pawley & Syder, 1983).

Representiveness is a key term which repeatedly crops up in corpus design. When it comes to corpora it is the holy grail (Biber, 1993). It entails having confidence that the sample of data selected adequately represents the genre of data it is selected to represent, achieving it would somewhat allay Chomsky’s (op. cit.) fears. With reference to this study representiveness is not a large issue since the corpus includes all available data providing a “thorough definition of the target population” (Biber, 1993, p.245) although a discussion is still merited. Having too few texts in a corpus can of course lead to skewed results and any generalisations can be misleading. Biber et. al. (2006) cite how 'burrow’ is the third most common collocate with the adjective ‘great’ (behind ‘deal’ and ‘man’) according to the Longman-Lancaster Corpus. Their explanation for this anomaly was the result of the inclusion of the text from ‘Watership Down’ where ‘the great burrow’ was the central meeting place for the rabbits. 22

Anglia Ruskin University MA Applied Linguistics with TESOL Also of interest is how different genres of text can influence results – words are used differently in different contexts. ‘See’ collocates highly with ‘paragraph’ and ‘chapter’ in the 2.2 million word British National Corpus (BNC) Law subset but with ‘patients’ and ‘table’ (as in on a page) in the 1.4 million word strong BNC Medical subset.

The BNC in their strive for representiveness somewhat arbitrarily divided the UK up into 12 zones in three supra-regions: The North, The Midlands and The South (see Figure 4 below). Most strikingly is the size of each of these areas - the heterogeneously rich West and the East Midlands represented by only two distinct areas. The linguists then set about collecting data from each of these regions; interestingly, within each region demographic sampling was used using “random location sampling procedures” (Crowdy, 1995, p.225). This is not a unique phenomena and subsequent approaches (Baker et. al., 2000) have also had to decide just how many nominal forms their corpus represents.

Figure 4: The BNC Spoken Corpus regions (Crowdy, 1993, p.260)

With written data, Atkins et. al. (1992) cited 25 different variables which must be considered in the design of a corpus; they range from cataloguing the age of the author to categorising the 23

Anglia Ruskin University MA Applied Linguistics with TESOL text status as either original or an adapted reprint. For spoken data the BNC devised a conversation log (see Figure 5 below). The audio was recorded on cumbersome cassette players, hidden from view essentially carried around in people’s pockets (Crowdy, 1993) . Such issues have long gone however as Leech et. al. mention, a new set of problems are now being dealt with.

Figure 5: BNC Spoken Corpus ‘Conversation Log’ (Crowdy, 1993, p.260)

“The use of the computer, in spite of the many advantages and new possibilities which it opens up, does not resolve the problems of the relationship between the original speech event and the transcription, nor does it obviate the problem of representing spoken language in written form.” (Leech et. al., 1995, p15)

At least one subsequent attempt (Baker et. al., 2004) to the BNC tried to replicate this method of concealed recordings only for it to end in abject failure. The architects of the project ended up using recordings from BBC broadcasts after their attempt to get spontaneous spoken data from fourteen South Asian languages all non-indigenous UK minority languages proved problematic. Issues regarding an active resistance to being recorded led to tiny amount of spoken data being recorded and the linguistics ended up harvesting the spoken data from the 2-hour nightly broadcasts from the BBC’s Asian Network – language which not as stifled and 24

Anglia Ruskin University MA Applied Linguistics with TESOL scripted as my Friends corpus still poses significant questions about its reflection of true spontaneous speech an observation also concluded by Baker et. al. (2000). Serendipitously they cited both clearer sound quality and a more distinct sentence structure to th eir advantage.

Resorting to BBC broadcasts did not alleviate their problems. Employed transcribers actively refused to transcribe some dialects declaring that linguists should study “classical… texts and not the bastardized slang” (p.516) the immigrants used. There were also issues over code switching and objections by the transcribers over the non-standard and non-prestige forms being studied. The result of these tribulations was that the 93 million word total was made up of a little more than 2 ½ million spoken words with the rest being harvested straight off the internet. Most of their text from daily news websites, exploiting their archive s of material. Far from just logistical reasons – although this was the overriding force for their selection – the web sites also provided a range of genres from sports to politics with a range of authors another fortuitous side-effect.

XML (eXtensible Markup Language), the widely-used successor to SGML (Standard General Markup Language) are both used for the storage and processing of texts in electronic form (although the latter could be considered obsolete). The big advantage they both hold is that they are interoperable between devices and between operating systems unlike proprietary systems such as an Oracle database or Microsoft’s .doc format. This is important given the variable nature the data will be used for. Essentially the structure of the data is a hierarchical tree (see Figure 6). The BNC used to index their data using a standard mark-up language (SGML) although this has since changed to the more interoperable XML. The reason for the change was simply a decrease in the monetary cost of storage, a key issue in the late 20th century.

XML, despite its name, is not a mark-up language in that it doesn’t mandate any tags (in contrast HTML mandates codes such as , , , <H1> etc.). Instead XML is a meta-language which allows the end-user to define their own mark-up. XML provides the structure and the mark-up is flexible. Herein lies the problem: A non-prescriptive approach is no better than no approach at all, a number of texts randomly thrown into XML documents holds few if any advantages over their native-format equivalent. It was for this reason that the Text Encoding Initiative (TEI) was embarked on. 25 Anglia Ruskin University MA Applied Linguistics with TESOL In the early 90s work started on specifying a mark-up language for corpora. Problems with mark-up and annotation had arisen from an incompatibility across corpora and a standardized format has long been desired. The TEI1 aims to issue a standard regarding corpus design, a checklist of what to and what not to include in corpus annotation The body publishes documents in the format of document type declarations2 (or DTDs). Currently in its fifth incarnation the documents specify how mark-up should be applied. Doing this allowed paralinguistic features such as <shouting>, <singing>, <whispering> etc. to be defined, all of which all present in the BNC. and <pause dur="200"/>uhhmm <pause dur="75"/>then she said <vocal desc="sniff"/> <pause dur="50"/> with <pause dur="20"/> a glint in her eye <shift feature="normal" new="loud"/> that‘s why penguins can’t fly <pause dur="1000"/> do you get it? Figure 6: The TEI mar kup of a punchine (own material) As a corpus linguists the level of mark-up one needs depends on the use and the purpose of the corpus; one can opt for a simple cheese and tomato pizza (TEI Lite DTD – a minimal set) or a meat lovers supreme (TEI full). The specification is currently running into problems as the amount of mark-up present risks obscuring the ‘woods from the trees’ (echoed by Ochs, 1979) hence a layering approach is has been proposed (Witt et. al., 2009) which aims to compartmentalise multiple annotated XML files where each XML file is annotated around a theme (e.g. non-verbal actions, clauses, phonetics etc) and users can select or deselect based on their requirements. Cook (1990, 1995), has criticized this approach, viewing annotation not as an extra topping, but as a key ingredient, crucial to the interpretation of a speech event. Idealistically he is correct although, pragmatically one is restricted either financially or temporally or both and one must draw a line in terms of how detailed texts should be annotated. 1 2 http://www.tei-c.org/index.xml (Accessed 11th April 2011) http://www.tei-c.org/Guidelines/DTD/ (Accessed 11th April 2011) 26 Anglia Ruskin University MA Applied Linguistics with TESOL ‘Mark-up’ and ‘annotation' are two key terms are frequently used in describing corpus linguistics and a distinction is required. Mark-up consists of three strands and structural markup is usually textual and contextual information such as who the speaker is, their age, gender, the location of the utterance and the utterance’s place in the grander scheme of work. Part of speech mark-up is added by a tagger and categorises words; and grammatical mark-up is the annotation of grammatical structures beyond the level of the word (e.g. phrases, clauses). Mark-up defines static factual data and annotation is an umbrella term for the three different kinds of mark-up. Together all three “add value”(McEnery et. al., 2006, p.4) to a corpus by broadening the scope of what can be analysed. 1. Annotation 1.1. Structural mark-up 1.2. Part of speech (POS) mark-up 1.3. Grammatical mark-up (Meyer, 2004) A corpus is not just a series of texts, instead a corpus attempts to represent some state of a language at some point in time (Biber et. al., 2006). The COBUILD corpus, for example, is a 170 million word bank of English; and is “a sample of contemporary English - no more, no less” (Francis & Sinclair, 1994, p.190). The use of the corpus will shape the design since a large multiple-genre online corpus will have vastly different requirements to that of a single-user genre-specific corpus. This ethos is epitomised by the International Corpus of English (ICE). Over twenty countries are participating in this ongoing scheme which began in the 90s. The corpus as an entity consists of a one million word snapshot of each language and is made up of approximately 60% of orthographically transcribed spoken English of various genres3. The corpus aims to represent the state of each language post-1989. Coordinated by UCL here in England each ‘team’ has a very specific set of guidelines4 which they must follow regarding the transcription of such forms as vocalised pauses, overlapping speech and non-verbal utterances (Nelson, 1995, see also The Ice Project, 2009). Interestingly, while the entire BNC is TEI compliant, the ICE project is not since the tags for this corpus were developed in the late 80s. As of April 2011, only ten language files are available for academic analysis while the latest corpora are currently being tagged with the ‘Constituent Likelihood Automatic Word-tagging 3 http://ice-corpora.net/ice/design.htm (Accessed 11th April 2011) http://ice-corpora.net/ice/written.doc & http://ice-corpora.net/ice/spoken.doc (Accessed 11th April 2011) 4 27 Anglia Ruskin University MA Applied Linguistics with TESOL System version 7’ (CLAWS7 tagset hereafter), a grammatical tagset and the ‘UCREL Semantic Analysis System’ (USAS hereafter) tagset, a semantic tagset (The ICE Project, 2009). As has been discussed in the previous section, spontaneous spoken discourse is notoriously difficult to both transcribe and then analyse given the large amount of false starts, repetition, elided phrases ungrammatical forms, interruptions and the overwhelming reliance on context to discern any meaning. Advances in computer technology have made this process somewhat easier and with every year which passes our competence in the field grows. Technologies such as ASR (automatic speech recognition) are evolving at tremendous rates. Just five years ago this author was part of a software engineering team responsible for a system which facilitated the transcription of hundreds of thousands of medical dictations daily. Doctors phoned one of the DVI’s (digital voice interfaces) which they dictated their patient diagnosis (dictations) to; these could then be converted from speech to text using our dedicated ASR engine within minutes and a 30 minute diction could be ‘turned-around’ and emailed back to the doctor in a text-based report-format within 6 hours. The accuracy of the automated transcription was so high that while the dictation was still presented to a human this process changed from one of blank-slate transcription to correcting the output the engine produced. Unfortunately, the pre-requisites for such a system were great and the costs very real but such an approach was longitudinal and suited our environment where we had a finite number of dictators (although still approximately 100,000). There are obvious benefits of such a technology in speech corpus analysis and just as the cost benefits of storage have come down, one must anticipate a increase in the quality of ASR. Semantic annotation remains a pipe dream computational linguistics although some innovative approaches to the problem are starting to emerge huge problems still remain. On the back of this problem, Essex University have devised a “Phrase Detective”, essentially a game where users have to compete with one-another to identify anaphoric, cataphoric or exophoric references in the presented text (see Figure 7 on the next page). The idea is simply to get others to compete to do the time-consuming and expensive annotation. Sinclair (2002), warns that the introduction of a human element results in a decline in the consistency of annotation, undoubtedly true although this issue has been anticipated in this innovative project and the same phrase is detected by many different competitors, and other checks and balances are in place (see Chamberlain et. al., 2008) to wean out the correct answer as a law of averages. 28 Anglia Ruskin University MA Applied Linguistics with TESOL Figure 7: Phrase Detectives (http://www.phrasedetectives.or g) Figure 8: reCAPTCHA (from Google, 2011) reCAPTCHA5 is another innovative solution in the field of distributed computing. Now owned by Google reCAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart (the ‘re’ is unexplained). Google is running a service to digitise the world’s information and separately, websites need a way to be able to distinguish machines from humans. Therefore since Google needs eyes to verify the text which their OCR (optical character recognition software) has been unable to identify they make these images availabl e, for free, to those who need an authentication system. The system reportedly (Google, 2011) displays over 100 million reCAPTCHAs (e.g. Figure 8) a day, and users around the world help Google fill-in-the-blanks in their digital archive. An innovative solution to a big problem and their paper (von Ahn et. al., 2008) goes into the nuts and bolts of it in much more detail. 5 http://www.google.com/recaptcha (Accessed 24th May 2011) 29 Anglia Ruskin University MA Applied Linguistics with TESOL In both of these projects the train of thought is the same and while humans are still integral to the process the shift from transcribers at a computer to a distributed collaborative approach is an interesting one. Almost twenty years ago Sinclair stated: "As the size of corpora moves into the hundreds of millions, the futility of reliance on human intervention becomes clearer, although it can be temporarily obscured by throwing money at it" (Sinclair, 1992, p.382) No-one would argue the effectiveness of humans and these two projects have begun to show the direction in which humans can be utilised. The problem now becomes one of harnessing such a tool; easy enough for a colossal such as Google but difficult for academia. The grammatical mark-up of text is relatively low-tech and is stable allowing efforts to be focused elsewhere and it must be with semantic annotation that the attention now turns. Speech recognition too continues to evolve and much of the literature on this subject is out of date. One hopes the heavyweights in the field (Microsoft, Phillips et. al.) continue to innovate and this is a technology which looks set to be an integral part of corpus linguistics in the near future. 30 Anglia Ruskin University MA Applied Linguistics with TESOL 3. BUILDING AND USING THE CORPUS The choices which underpin the decisions made in the design of this corpus were made with the answers to two questions: 1)What is available? 2)What does the research question set out to find? “An over-ambitious system could strangle a corpus at birth” warns Atkins et. al. (1992, p.14), and while this project would ideally draw comparisons with language of equal size from a nonAmerican source but all attempts to harvest such data proved fruitless. British TV comedies are routinely too short – typically just 6 episodes per series and none of them draw the fan base necessary to produce ready-to-use transcripts - in hindsight this was a blessing. In this study, the medium, genre and mode of delivery are all static variables, speaker details are available for use and a limited amount of contextual information is available some of which was used. Corpus linguist Leech’s stated that “a great deal of spadework has to be done before the research results can be harvested” (Leech 1998, p.17), and such a succinct phrase encapsulates a process of many man hours and much experimenting. The time and effort required in getting the data into the database in and into a reasonable state was a laborious task. As discussed, the design of a corpus is heavily influenced by its intended use, and given that this corpus will not be used in a high-transactional environment nor will be used by different researchers with very different needs the design was made stable and static. This process will be outlined below. 3.1 CREATING THE CORPUS The transcripts for all 234 episodes were downloaded6 free of charge from the internet. These had been manually transcribed by a group of fans of the show. Steps taken to ensure the integrity of the scripts matched the actual spoken word is subject to scrutiny and will be clarified later. One file represented one episode, although in three cases the scripts for two episodes were contained in the same file as they aired together. The files were in hypertext mark-up language (HTML) format – the standard format for web pages. All HTML mark-up was 6 th http://www.friendstranscripts.tk/ (Accessed 28 April 2011) 31 Anglia Ruskin University MA Applied Linguistics with TESOL automatically removed from these files, and this was done using a freely available application “HtmlAsText.exe”7. HTML file including examples of metadata (underlined): [Scene Central Perk, everyone's there.] Monica: What you guys don't understand is kissing is as important as any part of it. Joey: Yeah, right!.......Y'serious? Phoebe: Oh, yeah! Ross: (trying to ignore her) No. No. Figure 9: HTML file as downloaded. (Excerpt from 0102.html) Text file with the HTML mark-up removed: [Scene Central Perk, everyone's there.] Monica: What you guys don't understand is kissing is as important as any part of it. Joey: Yeah, right!.......Y'serious? Phoebe: Oh, yeah! Ross: (trying to ignore her) No. No. Figure 10: Text file sans HTML mark-up (Excerpt from 0102.txt) Once all files were in this state a further stage was run which was to remove all ‘unnecessary’ information from the files (specifically the notes in brackets). This resulted in a series of utterances which were similar in aesthetically to: Monica: What you guys don't understand is kissing is as important as any part of it. Joey: Yeah, right!.......Y'serious? Phoebe: Oh, yeah! Ross: No. No. Figure11: Utterances before being loaded into the dat abase The reasons for removing this “data about data” (McEnery et. al., 2006, p.22) were that a) Its presence may influence the results of the POS tagger. b) Its presence may influence the results of the regular expressions. c) It is entirely subjective, entered at the whim of the transcriber. d) The non-linguistic actions are not the primary focus of the study. One item of metadata which has been retained in some form is the scene demarcations. For example: “*Scene Central Perk, everyone's there.+” 7 http://www.nirsoft.net/utils/htmlastext.html (Accessed 28th April 2011) 32 Anglia Ruskin University MA Applied Linguistics with TESOL These markers preceded any dialogue to which that scene belonged. These indicators have been removed and do not exist in their original form instead (and please refer to Tables 3 & 4 on pages 44 and 45 respectively) each utterance of that scene has been marked up with a scene reference number. This enabled identification of scenes where only males speak (n=306), only females speak (n=241), scenes with both (n=1764) and scenes which feature an unidentified speaker (n=781). There are 3092 unique scenes throughout the entire 10 seasons. Problems remain, again these scene demarcations were marked sporadically and some episodes (n=approximately 4) contain no scene information at all, therefore these episodes where watched and the scene information was marked manually. Being able to identify if a scene contains only males or only females is vital for comparing both inter-group and intra-group linguistic strategies – statistics at the heart of this project. Initially this scene information was disregarded however after the pilot it became apparent that knowledge of this would be valuable. 3.1.1 THE DATABASE MySQL is a popular, high performance database. It is open source which means it is available for free unrestricted use for all but commercial entities, and it was the de facto choice for a number of reasons: 1. The researcher has personal experience with the product having used it extensively at undergraduate level. 2. It is known to be fast. 3. It allows regular expressions to be used to interrogate the data. Regular expressions are an intensely powerful text manipulation language (discussed in detail in 3.2) In light of these factors no other database system was considered. Loading the data from the file system into the database system was straightforward. The command below takes the text file ‘0102.txt’ (the text from season 1 episode 2) and loads it into the column named ‘line’ in the table named ‘friends’. mysql> load data infile "c:\\friends\\0102.txt" into table friends lines terminated by "\r\n" (line) set filename = "0102.txt"; This command was repeated for each file and the result of this was a 60,849 row table where each row represents an utterance The data as an entity consists of over 4 million characters 33 Anglia Ruskin University MA Applied Linguistics with TESOL and over 2 million words. It is worth pointing out that the each utterance is stored three times. This is best illustrated with an example (see Table 3 on page 44). 1. The ‘original line’ column is identical to that contained in the downloaded transcript. 2. The ‘line’ column represents the ‘tidy’ version of the original line. 3. The ‘metadata’ column shows the line data with POS tags. 3.1.2 ‘TIDYING’ THE DATA Once the data, had been loaded into the database the data could be cleaned further. This essentially involved removing repeated words and truncating superfluous characters and capitalizing the first character of each utterance. While the CLAWS tagger does allow for the presence of repeating words it was considered desirable to remove them. Originally, these steps were done purely for the benefit of analyst however they also had the secondary benefits of (1) marginally speeding up query response time, (2) minimizing the margin for error when using regular expressions. The following are examples of changes which have been made (not a definitive list): a) No no no no no no no no no  No no b) Whaaaaaaaaaaaaaaaaaaatsup!!!!!!!!!!  Whatsup! c) It was- I mean he did it  It was I mean he did it d) I was so so so so so so happy to see him  I was so so happy. e) I was soo happy  I was so happy. Figure 12: Tidy data (not genuine utterances) After an initial pilot in which the first twenty thousand utterances were POS tagged, tagging inconsistencies were noticed. This was to be expected since misspellings and the protruding hyphens in words (e.g. “was-“) caused the word to be tagged as an unknown word. There is no good reason these protruding characters should be present and since they were present in the original HTML files one can only attribute transcriber tardiness. This process of removal could only be done part automatically as the hyphen serves a morphological function in words such as “ex-girlfriend”. Regarding misspellings, the BNC holds a ’control list’ of all permissible non-standard forms such as gonna, y’know, ‘cause and so on (Crowdy, 1994). Such considerations are applicable when dealing with a 100 million multiple-genre corpus however the overhead associated with 34 Anglia Ruskin University MA Applied Linguistics with TESOL creating such a list was considered too great in this instance. Consequently the forms ‘uh’, ‘yep’, ‘yeah’, ‘y’know’ ‘cause’ ‘‘cause’ and ‘gonna’ were all left unaltered. These forms represented no challenge for the POS tagger which correctly tagged these words. A semi-rigorous form of normalisation was taken to correct multiply occurring misspellings. Identifying misspellings was a manual exercise aided greatly by Microsoft Word’s ability to automatically underline incorrect pasted text. Priority was given to those misspellings which were important to this study although very few were actually identified, and American spellings remained in place. While not a misspelling One pertinent modification to the data was made; this is shown below: All right  alright After a pilot run trying to extract tag questions of the form e.g. “You’re OK, right?” a large amount of ‘pollution’ was identified as the result of ‘all right’ being two separate strings therefore this string was normalised to the one word equivalent. Numerous other changes were made, for example, all commas were removed from the metadata column (only), again their presence was sporadic and arguably unnecessary in transcribed speech. All changes that were made were done so with the intention of limiting the pollution of the results. It is important to note that where changes were made the original utterance exists unaltered in a separately stored field in the database (the column: ‘original_line’). This is important for a number of reasons:  Future researchers using this same data can see clearly what has been changed.  Any undesirable change/unexpected side effects can be rolled-back, starting afresh with the original data.  The integrity of the data can be verified. This ‘original’ column contains the utterance almost exactly as it was in the downloaded transcripts. ‘Almost exactly’ because some modifications were made to this data to get the data to load into the database. Problems were encountered with non-ASCII characters. A full discussion of Unicode vs. ASCII is not warranted however briefly Unicode is a character set which allows every character in every language to be represented – the holy grail of internationalisation; while ASCII is a 128 character hangover from short-sighted decisions made decades ago – a subset representing only Western English characters. Transcribers had, in places, used a non-English language and non-standard punctuation outside of the 128 range ASCII provides. There are at least two occasions in the series when a language other than 35 Anglia Ruskin University MA Applied Linguistics with TESOL English is used and the transcribers had transcribed this speech in the character set of that foreign language. Support for Unicode (UTF-8 being the character set) is an integral part of every web browser and hence the document would have rendered properly however trying to load such characters into an ASCII compliant database proved futile and hence all non-ASCII characters were either removed or changed to their closest ASCII equivalent. In this situation the information was lost; however given that this only applied to none English characters it was not perceived as a big problem. Grammatically, the POS tagger was intelligent enough to recognize syntactic differences such as “It’s my apple” vs. “Its my apple”. Despite the inaccuracy the POS tagger applied the correct part of speech. This came as a huge relief as the transcription of possessives and contractions, in places, left a lot to be desired and the only way to have corrected such instances would have been very time consuming indeed. 3.1.3 TAGGING THE DATA Garside et al. (1997) define corpus annotation as “the practice of adding interpretative, linguistic information to an electronic corpus of spoken and/or written data.” (p.2) . It is asking a great deal to expect a “standard notation to provide a usable common framework for the actual categories used in grammatical tagging” (Atkins et. al., 1993, p.24), and consequently, there are a variety of different part-of-speech annotation tagsets:  AMALGAM tagset used in the Brown Corpus and the London-Lund Corpus.  CLAWS1 tagset used in the Lancaster-Oslo/Bergen Corpus and the Spoken English Corpus (with minor changes).  CLAWS5 tagset used in the BNC.  CLAWS7 tagset used in ICE (and part of the BNC).  UPenn tagset used in the University of Pennsylvania Corpus. They are all similar with variations on the finite detail they attempt to extrapolate from the text and different codes for parts of speech. Many (e.g. CLAWS) having been spawned from or supersede a previous version. Once the data was loaded into the database the text was annotated. The automatic tagging of text is a service which is rarely free - either in terms of financial cost or in terms of errors in the annotated text. The criteria for the selection of a tagset was as follows. Access to the tagger must (1) be available in an unhampered form free of charge (2) provide the reasonable lowest (3) error rate have ‘stood the test of time’ academically. Ultimately a decision was made to use the CLAWS7. It provides the largest tagset of all CLAWS version (146 categories – although 36 Anglia Ruskin University MA Applied Linguistics with TESOL CLAWS8 is in the pipeline which is set to surpass it). The rationale behind this is that such a large and rich tagset provides “distinct codings for all classes of words having distinct grammatical behaviour" (Garside et. al., 1987, p.167) and it has an established history. The annotation of data using this tagset has been proven to have an error rate of marginally more than 1% in both the processing of written and spoken data (1.14%, 1.17% respectively; UCREL, 2000). The conundrum posed by Atkins et. al (1992) is one of determining if ‘a male frog’ is a noun phrase made up of an adjective and a noun or a straightforward compound noun. I argue that, as long as the behaviour of this classification is consistent and since comparisons with other corpus will be limited, this is an issue which was given due consideration but ultimately was not a show-stopper. In selecting the CLAWS tagger comparisons with other corpora also marked up with the same tagset would be valid. 3.1.4 DETERMINING AND ASSIGNING GENDER In this corpus, gender was assigned manually. A query determined the most vocal actors and priority was given to accurately assigning these people’s gender as it was imperative to the aims of this study. The gender of other the actors was ultimately guessed but using background knowledge of the series and a broader knowledge of which names are male and which are female. Post assignment, 56% of all actors/roles remained un-gendered (n=479, total=788) however this accounted for less than 3.8% of the utterances. Where characters are void of an assigned gender this is due of a number of reasons:  The character name is ambiguous e.g. “Realtor”.  The character name could refer to one or more roles played by one or more genders e.g. “Nurse”, “Student”.  The gender of the name was ambiguous e.g. “Max”, “Sam”, “Jessie”, “Alex” etc. 37 Anglia Ruskin University MA Applied Linguistics with TESOL 3.1.5 THE INTEGRITY OF THE DATA A number of episodes were viewed and checked for accuracy (n=4), and very few discrepancies were found establishing confidence that the transcribed format accurately matched the actual spoken word. While not 100% accurate the corpus was certainly close enough and any attempts to rectify such inaccuracies would be prohibitively time consuming. As Burnard (1996) notes in relation to the BNC, for transcribing spoken data manually typing in the text “was the only option… and proved to be very expensive and time-consuming, in part because of the very high standards set for data capture”. Viewing these four episodes had the added advantage that one could visually identify the gender of the actors which were programmatically marked as unknown. As an example, in episode 3 in season 2 *after viewing the episode+ the waiter’s gender was changed now correctly being marked as male. “WAITER” U - “WAITER IN 203” M The waiter in this episode only was renamed and the gender of this ‘new’ waiter was assigned. There were many “waiters” across the ten seasons and importantly this change did not affect any other occurrences, all of whom remained un-gendered. The 10 most vocal actors are detailed here: Rank Person Gender # of utterances F 9217 1 RACHEL 2 ROSS M 9031 3 CHANDLER M 8370 4 MONICA F 8335 5 JOEY M 8131 6 PHOEBE F 7461 7 MIKE M 359 8 ALL U 345 9 RICHARD M 281 F 217 10 JANICE Table 1: 10 most vocal actors 38 Anglia Ruskin University MA Applied Linguistics with TESOL And a brief overview of the corpus demographics is given here: Raw count % of corpus Total count of utterances from males 29,706 48.8 Total count of utterances from females 28,842 47.4 2,301 3.8 Total count of utterances from unknown spe akers 100% Total count of utterances from males in male only scenes 3,784 6.2 Total count of utterances from females in female only scenes 3,357 5.5 Total count of utterances from either gender in mixed-gender scenes Total count of utterances from anyone in scenes fe aturing an unknown speaker 34,667 56.9 19,040 31.3 100% Table 2: Data counts Vital to note is that in the interests of integrity all scenes which featured an unknown speaker were not included in any of the results. Obviously they were not included in the single-gender results since they have not been determined to be of either male or female gender but the decision was also taken not to include them in the mixed-gender results either. The role of these un-gendered actors is problematic. Consider the following: A scene with all three main male actors, each is happily holding their own in the ensuing discussion and then at some point they all exclaim the same phrase/word together. This synchronous proclamation could have been transcribed as: “ALL: No!” Were this scene to be included in the mixed-gender data a certain level of pollution would occur. Furthermore, the mean and the median scene length are closely aligned (20 and 17 utterances per scene respectively). Fifteen scenes have an utterance count of more than 100, and given that the average episode length is 268 utterances it is naïve to assume that scenes were accurately marked. In light of this sketchy scene demarcation it was anticipated that numerous scenes have been marked as one. This is a problem because if a ‘super-scene’ features a number of single-gender ‘mini-scenes’ and this super-scene were included in the mixed-gender statistics then, again, the data would be subject to a certain amount of pollution. The decision to discard all data featuring an unknown speaker was not one which was taken lightly as it meant effectively throwing away a third of the corpus (31.3%, see Table 2 above). As undesirable as this was it was ultimately deemed necessary in the interests of integrity. It is also unfortunate that the single-gender data is based around a mere 12% of the corpus (roughly 6% for each gender; again see Table 2), more data is always desirable and while many 39 Anglia Ruskin University MA Applied Linguistics with TESOL studies have been done with much less data more data would have been advantageous in establishing greater confidence in the results. The only solution to this problem would have been to manually view every episode and mark the scenes accurately and the gender of the actors individually a process which would have taken over 100 viewing hours. 3.2 OBTAINING THE RESULTS Regular expressions is a intensely powerful pattern matching cum pseudo-programming language in computing. The use provides a concise method for matching very complicated arrangements of strings. This concept is best illustrated with an example: The form of a tag question is relatively static: there is a statement and then the tag is the inversion of the auxiliary verb followed by a pronoun. Syntactically this gives rise to a great number of variations: i. You didn’t like it, did you? ii. He didn’t like it, did he? iii. They didn’t like it, did they? iv. They didn’t like the fact that you left them on their own while you went off gallivanting around town with someone their friend whom they didn’t like, did they? v. You like it, don’t you? vi. You like it, do you not? vii. You don’t like it, do you? viii. You liked it, didn’t you? ix. You liked it, did you not? x. You didn’t like it, did you? (not a definitive list) Also the same pattern can be repeated with other auxiliary tags (and their negatives) such as: 1) is it? 2) are you? 3) have you? 4) will you? Also notice how the tag “right” can be applied instead of any tag for example: “You didn’t like it, right?” 40 Anglia Ruskin University MA Applied Linguistics with TESOL Coupling the power of regular expressions with the power of a database makes retrieving these occurrences easier than it would otherwise be. This is an early incarnation of query used to extract tag questions from the database, and line numbers have been added to aid explanations. 1 # tag questions ("You had sex with her, didn't you?") 2 select id, person, gender, line 3 from friends 4 where metadata regexp ".*VDD.{0,15}(XX.{1,15})?PPHS1.{0,5}[[.question-mark.++” Example of “line” field: You had sex, didn't you? Example of “metadata” field: You_PPY had_VHD sex_NN1 ,_, did_VDD n't_XX you_PPY ?_? Line 1 is a comment line indicating what the query does with an example. It performs no function. Line 2 indicates the items of data the query will return. Line 3 declares the table to be queried. Line 4 is the key line. The field metadata is queried for every one of the sixty thousand utterances. If (and only if) the regular expression matches will the matched data be displayed. The regular expression (between the “” in line 4) indicates: .* = match any character (.) any number of times (*). This is used because tag questions cannot be guaranteed to occur at the start of an utterance. VDD = after the previous criteria has been met find me an utterance with this tag. This is a tag used in the CLAWS 7 tagset8, it represent ‘did’. .{0,15} = then, after no fewer than 0 but no more than 15 characters [after matching ‘did’+ do the following: (XX.{0,15})? = possibly (?) match any negation tag (XX) e.g. “not”, “n’t” and then after no fewer than 0 but no more than 15 characters do the following: PPHS1 = find a second person personal pronoun i.e. you. .{0,5} = then after no more than 5 characters [[.question-mark.]] = match a question mark. The symbol ? is a special character so this alternative form is used. Note how this query pays no attention to the front of the tag, it is purely concerned with how the tag ends. This simple regular expression will match tag questions such as: 8 http://ucrel.lancs.ac.uk/claws7tags.html (Accessed 8th June 2011) 41 Anglia Ruskin University MA Applied Linguistics with TESOL You saw him, did you? Anything can go here, didn’t you? Ross! Did you see? You said you’d be there. Didn’t you? (not an exhaustive list) While the following are missed (again, not an exhaustive list): * You saw him didn’t you! (must end with a ?) * You called her, did you not (grammatically a perfect tag question but the inversion of the subject and the negation means the pattern is not matched) * You called her didn’t you Rachel? (? must be no more than 5 characters away from the pronoun) The regular expression the results are based on is much more complicated as it attempts to deal with some of these issues (please see the Appendix for the exact expression). The MySQL web site9 provides more details about the use of regular expressions. As has been demonstrated, the construction of regular expressions is a complex area and their accurate execution relies on accurate data. It is anticipated that expressions similar grammatically to “I sometimes like you is it?” are not present in the data since no script writer is likely to get such a malformation of words past the editors and onto a primetime television show. Errors in the process of transcription however do make such structures possible but given the fact that this data was publically available for many years, again it is anticipated that these errors have been ironed out. Given this complexity, the margin for error and the amount of manual effort taken to construct such queries it is easy to see why the BNC facilitates a simple one word search (Figures 13 & 14), while third-party interfaces are still fairly rigid. 9 http://dev.mysql.com/doc/refman/5.1/en/regexp.html (Accessed 8th June 2011) 42 Anglia Ruskin University MA Applied Linguistics with TESOL Figure 13: User interface for the BNC Figure 14: Custom BNC User interface at http://corpus.byu.edu/bnc/ 43 3.3 ID Scene 1 A SUMMARY OF THE DATA AND WHAT IT LOOKS LIKE Person Gender Original Line Line Metadata File 1 MONICA F 1 JOEY M 201 1 CHANDLER M 301 1 PHOEBE F There's nothing to tell! He's just some guy I work with! C'mon, you're going out with the guy! There's gotta be something wrong with him! Alright Joey, be nice. So does he have a hump? A hump and a hairpiece? Wait, does he eat chalk? There_EX 's_VBZ nothing_PN1 to_TO tell_VVI !_! He_PPHS1 's_VBZ just_RR some_DD guy_NN1 I_PPIS1 work_VV0 with_IW !_! C'm_VV0 on_RP you_PPY 're_VBR going_VVG out_RP with_IW the_AT guy_NN1 !_! There_EX 's_VHZ got_VVN ta_TO be_VBI something_PN1 wrong_JJ with_IW him_PPHO1 !_! All_RR21 right_RR22 Joey_NP1 be_VBI nice_JJ ._. So_RR does_VDZ he_PPHS1 have_VHI a_AT1 hump_NN1 ?_? A_AT1 hump_NN1 and_CC a_AT1 hairpiece_NN1 ?_? Wait_VV0 does_VDZ he_PPHS1 eat_VVI chalk_NN1 ?_? 0101.txt 101 Monica: There's nothing to tell! He's just some guy I work with! Joey: C'mon, you're going out with the guy! There's gotta be something wrong with him! Chandler: All right Joey, be nice. So does he have a hump? A hump and a hairpiece? Phoebe: Wait, does he eat chalk? 401 1 PHOEBE F 1 MONICA F 601 1 CHANDLER M Just, 'cause, I don't want her to go through what I went through with Carl- oh! Okay, everybody relax. This is not even a date. It's just two people going out to dinner and not having sex. Sounds like a date to me. Just_RR 'cause_CS I_PPIS1 do_VD0 n't_XX want_VVI her_PPHO1 to_TO go_VVI through_II what_DDQ I_PPIS1 went_VVD through_RP with_IW Carl-_NN1 oh_UH !_! Okay_RR everybody_PN1 relax_VV0 ._. This_DD1 is_VBZ not_XX even_RR a_AT1 date_NN1 ._. It_PPH1 's_VBZ just_RR two_MC people_NN going_VVG out_RP to_II dinner_NN1 and-_NN1 not_XX having_VHG sex_NN1 ._. Sounds_VVZ like_II a_AT1 date_NN1 to_II me._NNU 0101.txt 501 Phoebe: Just, 'cause, I don't want her to go through what I went through with Carl- oh! Monica: Okay, everybody relax. This is not even a date. It's just two people going out to dinner and- not having sex. Chandler: Sounds like a date to me. 701 1 CHANDLER M 1 ALL U Alright, so I'm back in high school, I'm standing in the middle of the cafeteria, and I realize I am totally naked. Oh, yeah. Had that dream. Alright_RR so_CS I_PPIS1 'm_VBM back_RP in_II high_JJ school_NN1 I_PPIS1 'm_VBM standing_VVG in_II the_AT middle_NN1 of_IO the_AT cafeteria_NN1 and_CC I_PPIS1 realize_VV0 I_PPIS1 am_VBM totally_RR naked_JJ ._. Oh_UH yeah_UH ._. Had_VHD that_DD1 dream_NN1 ._. 0101.txt 801 Chandler: Alright, so I'm back in high school, I'm standing in the middle of the cafeteria, and I realize I am totally naked. All: Oh, yeah. Had that dream. 901 1 CHANDLER M 1 JOEY M Then I look down, and I realize there's a phone... there. Instead of...? Then_RT I_PPIS1 look_VV0 down_RP and_CC I_PPIS1 realize_VV0 there_EX 's_VBZ a_AT1 phone_NN1 ..._... there_RL ._. Instead_CS21 of_CS22 ..._... ?_? 0101.txt 1001 Chandler: Then I look down, and I realize there's a phone... there. Joey: Instead of...? 1101 1 CHANDLER M Chandler: That's right. That's right. That_DD1 's_VBZ right_JJ ._. 0101.txt 1201 1 JOEY M Joey: Never had that dream. Never had that dream. Never_RR had_VHD that_DD1 dream_NN1 ._. 0101.txt 1301 1 PHOEBE F Phoebe: No. No. No._NN1 0101.txt 1401 1 CHANDLER M Chandler: All of a sudden, the phone starts to ring. Now I don't know what to do, everybody starts looking at me. All of a sudden, the phone starts to ring. Now I don't know what to do, everybody starts looking at me. All_RR41 of_RR42 a_RR43 sudden_RR44 the_AT phone_NN1 starts_VVZ to_TO ring_VVI ._. Now_RT I_PPIS1 do_VD0 n't_XX know_VVI what_DDQ to_TO do_VDI everybody_PN1 starts_VVZ looking_VVG at_II me._NNU 0101.txt Table 3: The first 15 rows of the data 44 0101.txt 0101.txt 0101.txt 0101.txt 0101.txt 0101.txt 0101.txt Anglia Ruskin University MA Applied Linguistics with TESOL SCENE ID DESCRIPTION 1 [Scene: Central Perk, Chandler, Joey, Phoebe, and Monica are there.] 2 [Scene: Monica's Apartment, everyone is there and watching a Spanish Soap on TV and are trying to figure out what is going on.] 3 [Scene: The Subway, Phoebe is singing for change.] 4 [Scene: Ross's Apartment, the guys are there assembling furniture.] 5 [Scene: A Restaurant, Monica and Paul are eating.] 6 [Scene: Monica's Apartment, Rachel is talking on the phone and pacing.] 7 [Scene: Ross's Apartment; Ross is pacing while Joey and Chandler are working on some more furniture.] 8 [Scene: A Restaurant, Monica and Paul are still eating.] 9 [Scene: Monica's Apartment, Rachel is watching Joanne Loves Chaci.] 10 [Scene: Ross's Apartment, they're all sitting around and talking.] Table 4: Scene Dat a 45 START ID 1 5201 END ID LENGTH INTERACTION 5101 52 F,M,U 10701 56 F,M,U 10801 10901 12601 13501 13601 10801 12501 13401 13501 14401 1 17 9 1 9 14501 15901 16101 15801 16001 16501 14 M,F 2 U,F 5 M F M M,F F M In the ‘scenes’ data , interaction marked ‘M,F’ indicates both male and female participants while a scene marked ‘F,U’ indicates a female conversation with a person of undetermined gender; interaction of simply ‘F’ or ‘M’ indicate an all-girl or an all-boy conversation. ‘Length’ indicates the number of utterances in the scene. The number of interlocutors is not stored nor considered important – this information can be gleaned by using the ‘person’ field in the Friends table. The ‘start id’, ‘end id’ and ‘description’ are purely peripheral and served no purpose in this study. They were stored on the premise that they may have proved useful, and future researchers may indeed want this data. IDs in the Friends table have been incremented in steps of 100, this was done in case missing utterances were identified during the viewing of certain episodes. These missing utterances could then be ‘slotted in’ at the appropriate place. Scenes obviously have a one-to-many relationship with utterances as shown below. 10 Figure 15: Links between data 10 th Created using MySQL Workbench (http://wb.mysql.com/) (Accessed 9 June 2011) 46 4. AN ANALYSIS AND DISCUSSION OF THE RESULTS The results of this study have all revolved around what Lakoff (2004) has defined as typifying women’s language. These results have all been normalised to a common base of frequency per one thousand utterances. One thousand was chosen as the common base given the advice of Biber et. al (1998). This data has been calculated using the following formula. For example: Gender Interaction F F F M,F M M M M,F Total Utterance Count Count of “oh” Rate per thousand 3357 553 164.73 34667 2717 78.37 3784 324 85.62 34667 1695 48.89 Table 5: Normalising the data. Statistics of “oh” hedge. A brief overview of the corpus as documented in section 3 is provided again for clarity: Category Corpus Details Percentages Utterances Male Female Same-sex 6.21% 5.51% Mixed sex 28.62% 28.34% Same-sex 3,784 3,357 Mixed sex 17,419 17,248 Male Female Same-sex 38.58 44.38 Mixed sex 16.87 18.14 Same-sex 13.48 11.92 Mixed sex 6.26 6.92 Same-sex 1.32 0.89 Mixed sex 0.87 0.72 Same-sex 0.79 1.79 Mixed sex 0.23 0.66 Same-sex 0.23 0.89 Mixed sex 0 0.14 Same-sex 6.61 8.34 Mixed sex 2.74 1.99 Table 6: Corpus Details 4.1 HEDGES Category “Y’know” + “You know…” Hedges “I think…” “I’m sure…” “… sort of…” + “… sorta…” “I wonder…” “I guess…” 47 Anglia Ruskin University MA Applied Linguistics with TESOL "Could you…" "Well…" "Oh…" Same-sex 0.29 0.6 Mixed sex 0 0.12 Same-sex 51.8 48.85 Mixed sex 21.95 22.59 Same-sex 85.62 164.73 Mixed sex 48.89 78.37 Table 7: Hedges Results Lakoff coined the name ‘hedge’, so it is apt to use her definition of what constitutes one. She states that hedges are "words that convey the sense that the speaker is uncertain about what he (or she) is saying, or cannot vouch for the accuracy of the statement" (2004, p.53). A hedge is a mitigating device used to lessen the impact of an utterance, and typically, they are adjectives or adverbs, but can also consist of clauses, and they could be regarded as a form of euphemism. Hedges secure relationships and collaborative talk as they protect the interlocutors' feelings. Fishman (1978) coined the term ‘interactional shitwork’ to describe the work women have to do to maintain a conversation, and in effect women are thought to use discourse markers such as these hedging devices more in lieu of the minimal responses they get from their male interlocutors. Examples of hedges: a. There might just be a few insignificant problems we need to address. (adjective) b. The party was somewhat spoiled by the return of the parents. (adverb) c. I'm not an expert but you might want to try restarting your computer. (clause) Hedges may intentionally or unintentionally be employed in both spoken and written language since they are crucially important in communication; they also help speakers and writers communicate more precisely the degree of accuracy and truth in assessments. For instance, in “All I know is smoking is harmful to your health”, ‘all I know’ is a hedge that indicates the degree of the speaker’s knowledge instead of only making a statement, “Smoking is harmful to your health”. There are three different types of hedges (Lakoff, 2004) I. Fully legitimate – the speaker is genuinely unsure of the facts II. Justifiable – used for the sake of politeness III. Neither of the above 48 Anglia Ruskin University MA Applied Linguistics with TESOL It is this third case which Lakoff highlights as typifying ‘women’s language’. Herein lies the fundamental problem of qualitative determining which type the hedge belongs to. And even if one can objectively determine the function at an utterance level, repeating this across a 2 million word corpus is a huge problem. The concept of using ‘y’know’ as a marker of solidarity is a popular one (Schiffrin, 2001; Fraser, 1990) however the phrase also serves as an integral part of the syntax which cannot be omitted. Consider the following: a) Well, you know how we’re different. b) If you know somebody who’s there you know if you’re going to stay. c) Whether you know it or not… d) You know Jim Sellars the M.P.? e) It’s not what you know who you know. (Taken from Macaulay, 2002, pp.751-752) To grammatically say that ‘if the constituent can be omitted it is a true discourse marker, otherwise it is syntactically necessary’ is an oversimplification however again quantifying such occurrences over a vast array of data is again impossible. To identify and count each instance where “you” and “know” (or their contracted forms) occur side -by-side is trivial and the statistics presented do just that. ‘Sort of’ is no less problematic, it is and is not a hedge depending on its use; the statement: ‘he is sort of tall’ is not a hedge, it is merely descriptive. Its function as a discourse marker is only ascertainable through conversational analysis and with an intense knowledge of the interlocutors, although even then opinion may well be divided - tallness is, after all, a subjective quality. The discourse marker ‘well’ is also fraught with problems. Firstly it has several homonyms: a) As a manner adverb - She draws well. b) As a degree word - You know that perfectly well. c) As a noun - Everyone digs their own well. d) And as a verb - Tears well in my eyes. (Taken from Jucker, 1993, p.436) 49 Anglia Ruskin University MA Applied Linguistics with TESOL Secondly, Labov and Fanshel (1977) state that using ‘well’ alludes to an item of shared knowledge going on to detail how its use shows a “joint concern” (p.157) of the topic at hand. Quite how a third party can ascertain whether a statement is shared knowledge or not remains unknown. The use has many other different interpretations (see Schiffrin, 1985, Lakoff, 1973, 1973b) from indicating an insufficiency in the answer to acting as “a qualifier and as a frame” (Jucker, 1993, p.437) marking a direct response to the utterance before it. A: Did you kill your wife? B: Well, yes…. (Lakoff, 1973b, p.459) Statistically (see Figure 16 & 17 on the next page), ‘oh’ and ‘well’ are the most heavily used markers and interestingly these are also only two discourse markers which Schiffrin (1987) describes as having no meaning. In line with the observations of Sacks et al. (1974) the data supports the conclusions that occurrences of ‘well’ overwhelmingly begin turns. The Friends data shows that where these markers are used it is found that there is a heavy bias to use them in single-gender conversations, and generally females prefer their use more than males in single gender conversation. The use of almost all of these devices heavily outweighs occurrences in the BNC – an unsurprising fact given the broad genre of speech acts in the BNC. While the use of ‘sort of’ may be of comparable frequency, this is just 0.7 occurrences per thousand. The data shows that females prefer the use of ‘oh’ and ‘sort of’ almost twice as much as males and interestingly they have an overwhelming tendency to use these devices in the company of other women. ‘I think’, ‘Y’know’, ‘I guess’ and ‘Well’ are all used approximately twice as much in single gender conversations as the converse regardless of gender. Rachel and Phoebe frequently use ‘oh’ as a single word interjections. Rachel uses it 50% more than Phoebe who uses it twice as much as any male character, these interjections are in effect back-channels (discussed later) as they do not denote an intention to speak. On the function of hedges, the data disagrees with Lakoff’s definition that hedges are women’s language. 50 Anglia Ruskin University MA Applied Linguistics with TESOL Use of Hedges #1 180 Utterances per thousand 160 140 120 100 80 60 40 20 0 I think… ...Y'know… I guess… Well,… Oh… Female speaker - Female only conversations Female speaker - Mixed sex conversations Male speaker - Male only conversations Male speaker - Mixed sex conversations BNC Spoken Figure 16: Use of Hedges #1 Utterances per thousand Use of Hedges #2 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0 I'm sure… Could you… ….please? I wonder… …sorta/sort of… Female speaker - Female only conversations Female speaker - Mixed sex conversations Male speaker - Male only conversations Male speaker - Mixed sex conversations BNC Spoken Figure 17: Use of Hedges #2 51 Anglia Ruskin University MA Applied Linguistics with TESOL 4.2 TAG QUESTIONS Tag Questions Category Positive stem, negative tag Negative stem, positive tag “, right?” Male Female Same-sex 0.26 1.49 Mixed sex 0.66 0.55 Same-sex 2.64 2.09 Mixed sex 0.78 1.12 Same-sex 6.34 2.09 Mixed sex 2.71 1.67 Table 8: Tag Questions Results Tottie & Hoffmann (2006) highlighted the 15 most frequent tags according to their data from the BNC and the Longman Spoken American Corpus (LSAC), priority was given to ensuring that the pattern matching expression retrieved these tags. "To my knowledge there is no syntactic rule in English that only women may use. But there is at least one rule that women will use in more conversational situations than a man. This is the rule of tag-question formation" (Lakoff, 1973, p.53) The whole premise behind studying tag questions is that are epistemic modal tags express uncertainty and unassertiveness and that in a “male dominated society, women are brought up to think of assertion, authority and forcefulness as masculine qualities which they should avoid” (Cameron et al., 1988, p.76). They have come to characterize women’s speech as highlighted in this quote by Lakoff above. Pedagogically Swan (2005) states that intonation is the key the determining the meaning, differentiating the difference that if we want to know something but are not sure of the answer we use a rising intonation; conversely if the tag is not a real question i.e. we are sure of the answer, we use a falling intonation . Dubois and Crouch have already faced the problem of identifying the different intentions of tag questions. In 1975 they set out to disprove Lakoff’s theory about tag questions and concluded it her hypothesis be invalid, in other words, they found that men used more of this particular structure than women this 1001201 EDDIE You had sex with her didn’t you? conclusion on its own is rather enlightening however more information is needed as their study involved only 33 tag statements all by men in a business conference. If men are vocal on the topic of abortion then in a business environment one would expect also expect a greater tendency to talk. Holmes (1983) has also commented that men may be using this function in an assertive challenging way. 52 Anglia Ruskin University MA Applied Linguistics with TESOL Tag Questions per thousand utterances Longman Spoken American Corpus BNC spoken (context-driven) BNC spoken (spontaneous) BNC spoken (average) "Friends" female mixed conversations only "Friends" female conversations only "Friends" male mixed conversations only "Friends" male conversation only 0 2 4 6 8 10 Tag Questions per thousand utterances Figure 18: Tag Questions This data refutes the association between women and tag questions. Males use tag questions approximately twice as often as women in a single-sex environment and males’ use in mixed gender conversations is marginally more than females’ use. As can be seen from the graph the BNC spontaneous speech aligns closely to both the females single -gender use and the males mixed gender use providing evidence a reliable benchmark and allowing the conclusion that despite the studio setting this feature appears to have been used naturally. Where males do use tag questions, they have a strong prejudice – approximately 2:1 in each conversation setting- to use the “,right?” tag instead of the conventional inverted auxiliary, and this is not a noticeable trait with the females . In these instances, the use of the “’right?” tag is used overwhelmingly as a negative tag with a positive stem (e.g. “There’s more beer, right?”). Of interest is that females do use positive stems with negative tags more than men, approximately 5 times as much although again by minimal amounts. For example: 1673801 Joanne to Rachel …You didn’t tell him not to call me, did you? 53 Anglia Ruskin University MA Applied Linguistics with TESOL Average per thousand utterances Tag Questions Details 6.34 7 6 5 4 3 2 1.49 1.12 1.67 0.55 1 2.71 2.64 2.09 2.09 0.26 0.66 0.78 0 Female speaker, all female conversation Female speaker, mixed-sex conversation Positive stem, negative tag Male speaker, all male conversation Negative stem, positive tag Figure 19: Tag Question Details 54 Male speaker, mixed-sex conversation Right? Anglia Ruskin University MA Applied Linguistics with TESOL 4.3 ASKING QUESTIONS Q's Category Asked Male Female Same-sex 345.7 306.2 Mixed sex 161.9 153.4 Table 9: Questions Macaulay(2001) investigated the questioning strategies of 4 reporters, 2 male and 2 female, 1 working for CNN and 3 working for CBC (Canadian Broadcasting Company). She found that the strategies were the same between the genders across a total of 23 interviews. Males preferred direct questioning (40% & 41% vs. 35% and 35%) while female preferred indirect strategies (37% & 31% vs. 19%, 21%) Coates investigating question use between the sexes wrote that "questions can be used to seek information, to encourage another speaker to participate in talk, to hedge, to introduce a new topic, to avoid the role of expert, to check the views of other participants, to invite someone to tell a story" (1996, p.176). She documented that the use of questions was different depending on the gender with women using far fewer true ‘information seeking’ questions than men. She noted that the “maintenance and development of friendship” (p.176) was the primary goal in asking question unlike men who use them in a more direct ‘tell me something I don’t know’ kind of way. The regular expression which garnered these results is incredibly simple (see the Appendix), checking only for the existence of a question mark somewhere in the utterance. Numerous trials were undertaken but with auxiliaries being dropped and multi-word subjects/objects it was problematic to say the least. In essence any part of speech can be followed by a question mark in the correct context. Consider: a. Hot? (adjective phrase) b. A book? (noun phrase) c. Really? (adverbial phrase) Therefore these results rely wholly on the accurate transcription of the spoken dialogue, and questions from the canonically explicit to the ambiguously indirect are represented in this one statistic without differentiation. 22801 RACHEL Guess what…? 55 Anglia Ruskin University MA Applied Linguistics with TESOL The rationale behind the importance of this statistic is that Lakoff bemoaned that asking questions represents a linguistic insecurity, “resulting from the oppression of women” (Fishman, 1978., p.400) while such propaganda might have had support twenty years ago today such a statement is laughable. Fishman observed that women ask almost 300% more questions than men. This is paradoxical since requests for information – typically realised by questions are, by definition, aggravating or face threatening. Fishman hypothesized that females' greater use is because of their attempt to solve the conversational problem of gaining a response to their utterances (see Rachel’s utterance on page 55). Questions such as this function as a conversational lubricant, greasing the wheels of small talk. Much like the child who has just discovered the phrase ‘…but why...?’; questions invariably “evoke further utterances” keeping the conversation alive (p.400). Freed and Greenwood (1996) found few differences with regard to the number of questions asked in conversation between the genders, however their sanitised equipment-laden interview room and the parameters of the 'spontaneous talk', 'considered talk', and 'collaborative talk' elements raises eyebrows. The data which shows near ubiquity in a mixed-gender setting while it is males who are more inquisitive than their counterparts in single-gender interaction thus denying claim for Fishman’s hypothesis. However, the phrase “guess what…?” was more popular among the women by a normalised ratio of 3:2 and interestingly all instances of this phrase (of which there were just 20) occurred in mixed-gender conversations. Questions per thousand utterances Do men ask more questions? 350 300 250 200 150 100 50 0 Male Female Same-sex Figure 20: Asking que stions 56 Mixed sex Anglia Ruskin University MA Applied Linguistics with TESOL 4.4 TABOO LANGUAGE Taboo Category Top 10 taboo words Male Female Same-sex 24.05 47.36 Mixed sex 22.04 42.27 Table 10: Taboo words Results Searching the corpus looking for specific taboo words is a troublesome area; given that we have more words for boobs than Eskimo’s have for snow (Pullman, 1989) it was a relief to discover that ten words have been responsible for 80% of all swearing consistently over the last two decades (Jay, 2009), and this vocabulary provided a good starting point to investigate taboo language in the corpus. While it will come as no surprise that the corpus is void of such offensive words as ‘shit’, ‘fuck’, ‘hell’ and ‘Jesus Christ’ the other six lexical items were all present with varying degrees of frequency. Table 10 & 11 are irrefutable and show that women outright outperform men when it comes to the use of profanities. Indeed the most profuse male (Ross) is barely half as impolite as the least distasteful female. Potty-mouth Rachel is by far the worst offender and women seem as content using such language in the company of their male peers as they do with their own kind. Actor RACHEL PHOEBE MONICA ROSS CHANDLER JOEY Gender F F F M M M Frequency per 1000 38.19 31.10 28.07 16.72 16.61 14.76 Table 11: Who uses taboo language? Important to note is that the frequencies in Table 12 (on page 58) have been normalized per actor per scene. For example: Rachel contributes a combined 1015 utterances during all of her involvement in all singlegender scenes. In these scenes she utters 75 expletives giving her an frequency of [almost] 74 instances per thousand utterances in single gender conversations. In contrast (Table 11) she contributes a grand 352 swear words in a 9217 utterance show total giving her a frequency of 38.19 per thousand utterances regardless of conversation parameters. 57 Anglia Ruskin University Actor Gender MA Applied Linguistics with TESOL Scene Rudeness Interaction RACHEL F F 73.89 RACHEL F M,F 50.09 PHOEBE F M,F 46.22 MONICA F F 45.95 MONICA F M,F 38.80 PHOEBE F F 32.69 ROSS M M,F 24.64 JOEY M M 24.62 CHANDLER M M 24.00 CHANDLER M M,F 22.73 JOEY M M,F 19.96 ROSS M M 18.36 Frequency per 1000 Table 12: When is taboo language used? Women are, far and away, the ‘bluest’ when it comes to using taboo language there is a very clear distinction between them and the men, and at an utterance level men are almost shy to utter a taboo word: 303401 ROSS Alright, alright. We're all adults here, there's only one way to resolve this. Since you saw her boobies, I think, uh, you're gonna have to show her your peepee. The usage of “god” is ‘off the chart’ to the extent where its frequency far exceeds that of its nearest rival (see Figure 21 & 22) and while Phoebe has a soft spot to use ‘ass’ Joey appears to have a penchant to exclaim using ‘hell’. The frequency of “god” in the friends corpus dwarf the number of occurrences in the BNC spoken subset – by a factor of more than 100 compared to Rachel. The other 5 words also appear more frequently in Friends than they do the BNC. 58 Anglia Ruskin University MA Applied Linguistics with TESOL Frequency per thousand utterances God 43.94 45 40 33.59 33.78 35 30 25 20 17.83 14.81 12.05 15 10 0.42 5 0 Chandler Joey Ross Monica Phoebe Rachel BNC Spoken Figure 21: Taboo language per actor 1 Frequency of taboo words per actor 5 Freuqency per thousand utterances 4.5 4 3.5 3 2.5 2 1.5 1 0.5 0 Damn Chandler Hell Joey Ross Ass Monica Bitch Phoebe Rachel Sucks BNC Spoken Figure 22: Taboo language per actor 2 Courtney Cox, one of the 6 main actors was the first actress to use the taboo word “period” on US television in a Tampax advert in 1985 (Wikipedia, 2011b). Fittingly she is the only female character to utter the same word [in the intended context] throughout the show. Although it is not just in this instance that women push the boundaries. 59 Anglia Ruskin University MA Applied Linguistics with TESOL When ‘boobs’ are used with a pre-modifying adjective it is the females who are more creative. Men predictably obsess about size while women are more critical although almost all connotations are positive. Firth (1935 cited in O’Keeffe et. al., 2007, p.59) argued that the meaning of a word is as much a matter of how it combines with other words (i.e. its collocations) as its own meaning and such collocates have Whorfian implications for those who watch the show and for society generally. No similar statistics are presented for parts of the body as there was no data to support this. Pre-modifying adjective % (gender) Big/bigger/biggest 20% (f), 40% (m) New 20% (f) Fake 10% (f) Nice 10% (f) Table 13: Collocates to the left of 'boobs' 60 Anglia Ruskin University MA Applied Linguistics with TESOL 4.5 EMPTY ADJECTIVES Adjs. Category Empty Male Female Same-sex 59.46 60.17 Mixed sex 33.92 32.65 Table 14: Empty Adjectives Results For the purpose of this study an empty adjective has been defined to represent some abstract property. Whereas ‘hot’ or ‘wet’ are concrete adjectives, independently verifiable as to their existence, so-called empty adjectives are subjective or more subjective than physical properties. The empty adjectives analyzed in this study are limited to: Gorgeous Wonderful Divine Pretty Lovely Great Good Fantastic Charming Sweet Adorable Cool This list was taken from Lakoff (2004, p.45) Empty Adjectives #1 Utterances per thousand 30 25 20 15 10 5 0 pretty good cool great Female speaker - Female only conversations Female speaker - Mixed sex conversations Male speaker - Male only conversations Male speaker - Mixed sex conversations BNC Spoken Figure 23: Empty Adjectives 61 Anglia Ruskin University MA Applied Linguistics with TESOL Empty Adjectives #2 Utterances per thousand 1.2 1 0.8 0.6 0.4 0.2 0 gorgeous wonderful lovely fantastic charming adorable Female speaker - Female only conversations Female speaker - Mixed sex conversations Male speaker - Male only conversations Male speaker - Mixed sex conversations BNC Spoken Figure 24: Detailed view of negligible 'empty' adjectives Much like the data for ‘hedges’, the results per scene type are remarkably similar with a clear single gender bias. Both genders use the same subset of empty adjectives with the highest frequency (the four in Figure 23), and it is interesting to note that it is males who make more use of ‘pretty’ regardless of the company a trend which applies to ‘good’ and ‘cool’ also. The qualification here is that 498501 JOEY Joe Stalin. Y’know, that’s pretty good. Americans use ‘pretty’ like we British would use ‘rather’ (Longman, 2011) and this would also explain the low frequency of ‘pretty’ in the BNC. As a rule both genders use more empty adjectives in the company of their own gender but overall it was males who make slightly more use of these parts of speech than women. There is no clear support for Lakoff’s assumptions. A further caveat in this data is the reported speech aspect. For example (at ID 4417801) Joey utters “…he said he thought you were charming”. Here Joey is reporting what someone else thought not what he thought. This instance was identified when validating the data and it was the only instance which was identified. While the reported speech is of a male and the speaker a male this is purely coincidental, and excluding reported speech items such as this from the data was not attempted. 62 Anglia Ruskin University MA Applied Linguistics with TESOL 4.6 INTENSIFIERS Intensifiers Category That's so cool! That's very cool That's really cool Male Female Same-sex 12.95 35.15 Mixed sex 17.17 31.83 Same-sex 5.81 6.95 Mixed sex 7.41 7.65 Same-sex 13.21 22.73 Mixed sex 16.3 20.26 Table 15: Intensifiers Results Linguistically there is little difference between saying ‘I like you very much’ or ‘I like you so much. Both devices are very similar, they are intensifying degree adverbs and as such their use is interchangeable. The theory behind the use of ‘so’ is that it evades declaring ones strong feelings, a reserved form of ‘very’ where one does not dare make it clear how passionately one feels are; using 'so' Lakoff states "weasels [out] on that intensity" (2004, p.55). The data supports the preferred use of ‘so’ and it is more than twice as prevalent than ‘very’ in males’ speech while females prefer ‘so’ to ‘very’ by a normalised ratio of almost 4:1. Females also use 'so' almost 3 times as much as men in single gender conversation while they use it almost twice as much as men in mixed gender conversation. The usage of 'very' has close parallels and ‘really’ is, again, preferred by women regardless of conversation type. ‘Really’ and ‘so’ are the first two constructs which could reasonably be considered ‘women’s language’. ‘Really’ however is a slightly different construct (a true adverb) while ‘so’ and ‘very’ are degree adverbs so true comparisons are unfair but the point remains that women use it more than men. The data facilitated an easy breakdown of the use of these features by gender and by time; it is interesting to see the patterns of usage (see Figure 25 on the next page). The ebbs and flows in usage cannot be easily explained, the show features many weddings and the birth of a number of children however the diversity of these trends is too broad to categorize as being emotionally related to these events. Tagliamonte & Roberts (2005) also investigated this and also failed to correlate the use with anything meaningful, the best they could do was to highlight the parallels between the viewing figures and emphatic ‘so’. 63 Anglia Ruskin University MA Applied Linguistics with TESOL Use of Intensifiers 46 Occurances per thousand 41 36 31 26 21 16 11 6 1 0 2 4 6 8 10 12 Season Male: so + adj. Female: so + adj. Male: very + adj. Female: very + adj. Male: really + adj. Female: really + adj. Figure 25: Diachronic results use of m ain elements over a 10 year period. The females’ use of ‘so’ and ‘really’ is almost untouched by men’s use. Only in season 9 do men start to use ‘so’ as much or more than women use ‘really’. Where females do use ‘so’ in an utterance it was discovered that they were three times more likely to repeat the use again in the same utterance. E.g. 5866401 RACHEL This is so awesome. College guys are so cute Although this pattern is not consistent: 5908101 MIKE : Phoebe you're so beautiful. You're so kind, you're so generous. You're so wonderfully… And females are also solely responsible for the only two intensifier utterances 1337201 RACHEL I’m so dead sorry 5441401 MONICA I’m so so sorry* (* more repetitions of ‘so’ possibly removed as detailed in section 3.1) 64 Anglia Ruskin University MA Applied Linguistics with TESOL 4.7 VOCABULARY The CANCODE (Cambridge and Nottingham Corpus of Discourse in English) is a five million word spoken English corpus. It provides a good benchmark against which to view the ‘Friends’ data. # Friends corpus11 1 I don't know 2 I know I 571 A lot of 574 3 I mean I 548 I mean I 437 4 I can't believe 434 I don't think 435 5 I think I 346 Do you think 302 6 I don't think 336 Do you want 285 7 I have to 320 One of the 266 8 I love you 288 You have to 260 9 Don't know I 265 It was a 255 10 Know I know 259 You know I 246 per million 1,027 CANCODE12 per million I don't know 1,062 Table 16: Top 10 three-word clusters First and foremost these results (Table 16) provide a great deal of validity to the corpus and there are numerous similarities (marked) both in terms of actual phrases and the alignment of the frequencies. Such evidence provides a great deal of proof for Pawley and Syder’s (1983) assertion that we, as native speakers, share a common core of vocabulary and of prefabricated sequences and lexical bundles are seen as the “basic building blocks of discourse" (Biber et. al., 2004, p.271). There is an argument that these are grammatical structures and not off-theshelf fillers such as ‘if you ask me’ but the point remains that communicative competence is represented by giving the correct responses at the correct time and these chunks serve to lessen the communicative burden on the viewer allowing them to greater appreciate the other content while sharing this common schema of vocabulary and rules. This harmonious speakerviewer relationship allows the viewer to apply their knowledge to the performance and not to the language being used. Socio-culturally the most striking discrepancy between the two corpora is the occurrence of ‘I love you’ in the Friends corpus which occurs almost two hundred times and which the distribution is almost identical between the genders. 11 12 th Produced using WordSmith, http://www.lexically.net/wordsmith/ (Accessed 28 May 2011) Data taken from McCarthy (2006) and normalized. 65 Anglia Ruskin University MA Applied Linguistics with TESOL Male Female Any 10.34 10.26 Same-sex 11.23 10.79 Mixed sex 10.33 10.26 Any 10.34 10.29 Same-sex 11.7 11.36 Mixed sex 10.81 10.73 Approx. size N/A 326,486 314,627 Base List 1 N/A 87.71% 87.49% Base List 2 N/A 2.94% 2.81% Base List 3 N/A 1.29% 1.31% N/A 8.06% 8.39% Mean length of utterance (words) 5% trimmed mean (words) Vocab. Averages Category Not in 1,2 or 3 Table 17: Vocabulary Results The table above shows that males’ utterances are statistically always longer and in their total vocabulary larger although both genders appear to use the same main subset of language for the vast majority of their speech - just 1000 words cover almost 90% of all speech by either gender. One reason for a smaller female vocabulary could be because women take fewer risks with their vocabulary although breaking down the vocabulary used by word list (Figure 26 below) shows that while women’s vocabulary is smaller at a word-list level, this is only true by insignificant yet consistent values. Vocabulary Breadth Vocabulary per word list (maximum 1000) Male Female 973 969 855 816 682 637 493 456 376 361 295 260 208 196 170 160 Base List 1 Base List 2 Base List 3 Base List 4 Base List 5 Base List 6 Base List 7 Base List 8 Figure 26: Vocabulary Breadth 66 Anglia Ruskin University MA Applied Linguistics with TESOL At an actor level there are some important differences. Despite speaking the most Rachel’s vocabulary is the second smallest this gives her the lowest ‘innovation’ score of all the characters. Conversely Chandler who speaks 12% less than Rachel has a broader vocabulary by a difference of some 14% giving him the most varied vocabulary. Phoebe who speaks the least appears to be very inventive with her utterances registering an ‘innovative’ score of 6.4. According to Eckert (1989) Rachel might be speaking the most, not because she has the most to say but because she has the most work to do re-affirming relationships, this is purely speculatory and only the purest form of conversational analysis would affirm such a claim. The girls and Joey have the narrowest vocabulary by unique words, in contrast Chandler and Ross are some 500 unique words ahead of these other four characters. This variety helps to give Chandler the greatest ‘innovative score’ of all 6 actors, a metric which marks him as the most creative. The boys have an average vocabulary of 5,776 unique words while the girls are more than 450 words behind on 5,314 meaning the boys have a 9% broader vocabulary, this is despite speaking only 2% more than the girls. Therefore the data clearly shows that the men in this study do have a bigger vocabulary than the women. Person # of # of spoken # of unique words ‘Innovative’ Score 14 utterances words CHANDLER 8,370 91,355 6,001 6.57 PHOEBE 7,461 86,497 5,539 6.40 JOEY 8,131 91,731 5,414 5.90 MONICA 8,335 87,269 5,148 5.90 ROSS 9,031 100,776 5,914 5.87 RACHEL 9,217 102,707 5,256 5.12 Table 18: Unique words per actor (ordered by ‘innovative’ score) Lakoff (2004) claims that while woman might say a colour is mauve while a man will call it light purple. Emphatically there was no support for this claim. Both genders used the same 12 word subset for the majority of their colours (black, blond, blue, brown, gray, green, pink, purple, red, tan, white, yellow). To this list females added three colours: gold, olive, orange; while males added a different three: amber, maroon, silver. The freque ncy of these colours was so low that each instance was inspected to ensure that it was used in the correct context. Hence there is no support for the belief that women are more descriptive with regards to colour. The only mildly interesting statistic from this was that men’s use of the colour red was more than two times that of the women’s use (26 vs. 12). Further analysis revealed a bias with 14 =(# of unique words/# of words) * 100 67 Anglia Ruskin University MA Applied Linguistics with TESOL regards to collocates, with females much more likely to describe something as little and/or cute while men were likely to describe something as big or huge: Male collocates with colours Big Huge Little Pretty Middle-aged Long-stemmed Wavy Female collocates with colours Little Cute Stunning Bright Favourite Big Pretty Table 19: Collocates with colours Women, of course, have a large stock of words related to their specific interests, generally regarded to them as 'woman’s vocabulary’, similarly men use vocabulary which is semantically masculine. Women's Vocabulary sexier sympathy arrogant craving prince iron integrity groceries cries headache inviting mild elegant poop kittens swelling long-term apologizing contraction g-string cramp relatives Men's Vocabulary courtside independent graduate threesome freedom champion neurosurgeon wrecking hockey legitimate brutal maniac sport playstation laughs swearing silliness scored razor valuable slack cars Table 20: Gender specific vocabulary (not a definitive list) Del-Teso-Craviotto (2006) analysed the vocabulary in four women’s magazines for the semantic categories used to approach their female readers. There are many parallels between her study and this. Just as women’s magazines try to emulate what their editors suppose is 68 Anglia Ruskin University MA Applied Linguistics with TESOL the language of their readership, presumably so too have the editing staff of Friends. Addressing women [viewers/readers] with casual but appropriate language, allows the characters/magazines to present themselves as friends (del-Teso-Craviotto, 2006). The high degree of overlap in the genders vocabulary also signals a shared communal concern for one another’s problems while individual gender-specific vocabulary is presented incorporating both traditional and progressive elements . I propose that part of the appeal of the show was that women were seen as progressive and the choice of vocabulary was integral to this ideology. Cixous (1975) stresses that many binary oppositions are gendered, with men associated with activity, culture, the head and rationality, whereas women are associated with passivity, nature, the heart and emotionality. Certainly such claims have been validated given the preceding groups of vocabulary however quantifying such claims more accurately would have been made easier with this semantic tagging (discussed in the conclusion). 69 Anglia Ruskin University MA Applied Linguistics with TESOL 4.8 BACK CHANNELLING Tottie (1991) defines backchannels as “the sounds (and gestures) made in conversation by the current non-speaker, which grease the wheels of conversation but constitute no claim to take over the turn” (p.255). ID Person Utterance 109901 ROSS …say sevenish? 110001 RACHEL Sure. 353001 JOEY You want to see her again, right? 353100 ROSS Sure. Table 21: Not back channelling No statistics are presented for back channelling as identifying them was exceedingly difficult. Table 21 above shows two examples of what are clearly not back channels but simply responses to previous utterances. The issue arises with knowing if an utterance is a response to a question or a signal of attention neither of which can be easily ascertained without a level of interpretive annotation. Table 22 below shows an example of back channelling which was identified manually. In this instance by Rachel. ID 49301 49401 49501 49601 49701 49801 49901 50001 50101 50201 50301 Person RACHEL BARRY RACHEL BARRY Utterance And you've got lenses! But you hate sticking your finger in your eye! Not for her. Listen, I really wanted to thank you. Okay. See, about a month ago, I wanted to hurt you. More than I've ever wanted to hurt anyone in my life. And I'm an orthodontist. RACHEL Wow. BARRY You know, you were right? I mean, I thought we were happy. We weren't happy. But with Mindy, now I'm happy. Spit. RACHEL What? ROBBIE Me. RACHEL Anyway, um, I guess this belongs to you. And thank you for giving it to me. BARRY Well, thank you for giving it back. ROBBIE Hello?! Table 22: Rachel back channels (scene #24) 70 Anglia Ruskin University MA Applied Linguistics with TESOL 5. CONCLUSIONS This project set out, as I am sure many do, with lofty goals. Without doubt the benefits of a computerized corpus of speech is a valuable asset and what computers have always done flawlessly is objective decision making based on the parameters specified. This study much like many before it has suffered from a “methodological weakness” (Holmes, 1986, p.4) with a tendency to simply quantify linguistic structures in the data with little regard to the context of the items, nor attention to the functional correlations of such use. The example of the function of tag questions is just one such example of this weakness. Syntactically defining all of the possible variations needed to extrapolate such instances from a corpus, while not impossible, is a never-ending task and further classifying such instances based on their pragmatic meaning is fraught with problems. [George] Lakoff came to the same conclusions when he lamented that “natural language concepts have vague boundaries and fuzzy edges; …consequently, natural language sentences will very often be neither true, nor false, nor nonsensical, but rather true to a certain extent and false to a certain extent, true in certain respects and false in other respects” (Lakoff, 1973, p.458). Probing tag questions further, is “isn’t it?” the same as “, right?” the same as “, okay?”? E.g.: 1) You’ll do it, won’t you? 2) You’ll do it, right? 3) You’ll do it, okay? To a native-speaker’s ear the third sentence sounds not like a question proper but like an imperative and with different intonation each sentence could be portrayed similarly. All of these caveats, qualifications and assumptions have, unfortunately, left the data this study has presented with some issues. Given the data as it stands, there is certainly a difference between inter-group and intra-group speech but this study has not found any meaningful data to support Lakoff’s claims about women’s speech. There is no support that women use more ‘women’s language’ and the view that the stereotypes of women being linguistically restricted is not upheld, although it has been shown that while men generally have a broader vocabulary and use it more creatively. Despite this, from the use of intensifiers to taboo words women are a country mile ahead of masculine counterparts, while in other features, from empty adjectives to the number of questions asked, colours to hedges, both genders have been found to operate on remarkably similar levels. The results that people alter their conversational strategy based on the 71 Anglia Ruskin University MA Applied Linguistics with TESOL gender(s) of their partner(s) is not new (e.g. Boulis & Ostendorf, 2005) and that theory has been upheld with this study. Social identities arising from memberships of the same or different communities of practice (McConnell-Ginet, 2003) may begin to explain both the discrepancies and the alignments. A community of practice is defined as a group of people “brought together by some mutual endeavour, or common enterprise… and to which they bring a shared repertoire of resources, including linguistic resources, and for which they are mutually accountable.” (McConnell-Ginet, 2003, p.71, emphasis added). Therefore there is a responsibility of all involved in the conversation to adhere to the cohesion and fluency of the ensuing conversation. It is also understandable that a group of friends are linguistically similar and that, from Eckert & McConnell-Ginet’s (1999) Asian Wall to the sofas of Central Perk, these characteristics may be part of the glue which holds friendships together. 72 Anglia Ruskin University MA Applied Linguistics with TESOL 6. LIMITATIONS OF THE STUDY The corpus is already a valuable asset however a near infinite number of improvements can be suggested. Phonetic representation, a consistent and detailed level of annotation are aims which, as many corpus linguists have attested to, are unfeasible on a wide range of levels. Francis and Hunston’s definitions of ‘the acts of conversation’ (1985) are detailed and comprehensive. Acts come together to form ‘moves’ and these again have members (eliciting, answering) to have had a corpus with anything near this level of annotation would have both validated the results to a greater extent and opened up so many more doors. While it is appreciated that this is a manual and highly subjective area susceptible to wide degrees of error any consistently applied framework would have been beneficial. Critical discourse analysis is as interested in what is not said as what is said and while the ability to programmatically retrieve occurrences of ‘so + adjective’ structures is useful, what a corpus cannot easily tell us is where such structures could exist but don’t. Frameworks such as Halliday’s SFL can be applied methodically although the amount of information harvested becomes insurmountable with only a dissection of a single-paragraph advert. Invariably being forced to look at each utterance as an atomic element was by far the biggest drawback of this project. “Even if a case could be made for the autonomous treatment of some aspects of the language, discourse cannot be satisfactorily analyzed in a vacuum” (Lakoff, 2001, p.200). Corpus annotation has the scope to make such assumptions concrete however the level of annotation needed to satisfactorily while small on an utterance level becomes prohibitive in a seventy thousand utterance corpus. With reference to comparisons to British television, it was outlined in section 1.3 that few studies into British TV specifically have been done. Few British comedies exist which have been transcribed (either officially or by a faithful fan base) or which have run for 200+ episodes. Retrospectively it is with relief that this avenue was not accomplishable, putting together one corpus was difficult enough and the invariable inconsistencies in transcription, I anticipate, would have caused numerous problems. The value of the study would have also been of questionable use using not just one scripted programme but two. 73 Anglia Ruskin University MA Applied Linguistics with TESOL 6.1 SEMANTIC ANNOTATION Semantic tagging is starting to come of age, and Lancaster’s semantic tagger already has a decade of history behind and its large scale pilot in the ICE project may just give academics the incentive to investigate and push its limits. It is easy to see how such an option would be useful (if you are in any doubt please consult the query which was used to glean the colours the genders use from the corpus, found in the appendix). Coupling the power of a grammatical representation with a semantic representation could be easily accommodated – essentially another column in the database (see Figure 27 for a mock query). To all intents and purposes it would have provided the best of both worlds where should grammatical functions be the focus the POS mark-up could be interrogated or should genres of vocabulary be the focus the semantic mark-up could be interrogated as in this example? Utterance: I like a particular shade of lipstick Grammatical mark-up: I_PPIS1 like_VV0 a_AT1 particular_JJ shade_NN1 of_IO lipstick_NN1 Semantic mark-up: I_Z8 like_E2+ a_Z5 particular_A4.2+ shade_O4.3 of_Z5 lipstick_B4 19 As an example E2+ signifies the word belongs to the category `emotional states, actions, events and processes' (E), subcategory `liking and disliking' (E2), and refers to `liking' rather than `disliking' hence ‘E2+’. Grouping by semantically related sets (e.g. ‘lipstick’ belongs to the ‘cleaning and personal care’) would have opened up more opportunities to explore, at a lexical level, the language the genders use both inter-gender and intra-gender. It is somewhat regrettable that this option was not fully explored. select id, person, gender, line from friends # match structure 'pronoun + conj. + noun' e.g. "I like tennis" where GrammaticalPOSData regexp ".*PPIS1.*CS.*NN.*" # refine to match only food nouns e.g. "I like sushi". and SemanticPOSData regexp ".*Z8.*E2+.*F1.*” Figure 27: Mock query to interrogate both grammatical and semantic representations. 19 Full tagset available at: http://ucrel.lancs.ac.uk/usas/semtags.txt (Accessed 12th May 2011) 74 Anglia Ruskin University MA Applied Linguistics with TESOL The process of getting semantically tagged data would have paralleled the steps taken to get grammatically tagged data and far from being a manual process at an utterance level (like the frameworks of Halliday, Francis & Hunston) the process is entirely automatic requiring the analyst to simply ‘marry’ the produced output with the input and then import the data into the database. 6.2 CONNECT BY PRIOR Databases can be fickle systems while they make life easy on the one hand, they complicate things on the other. There has been only one area where the database has been a hindrance rather than a help and that is in relating utterances to one another. Oracle and other competing commercial relational database systems offer a function whereby one can query a row based on the row prior to it. It is called ‘connect by prior’, and in essence it allows something like this pseudo-code example: Give me all the rows where the row is just a one word utterance but where the row prior to it doesn’t end in a question mark. Fundamentally there are sound reasons why this is not allowed, database management systems are flat and typically transactional; rows are either queried in isolation or by column and never in relation to one-another. The absence of this function posed problems when trying to ascertain the amount of back channelling which occurred. It was anticipated that there wouldn’t be much – due to the studio format – a camera change/cut away for a one word non-interruption was considered unlikely however one word responses existed in abundance in the corpus and being able to even to spot check a handful of them manually could have been useful. Even with this facility, the problem of identifying back channelling is still not simply as knowledge of the next speaker does not necessarily denote that the two utterances are related. This issue aside, the process of identifying them would have been one step closer. While the regular expressions are my best attempt there are fundamental problems with the retrieval of such data. Stubbs (1996) analysed two very short texts; one for boys and one for girls. The frequency of ‘happy’ and ‘happiness’ were similar in both texts, suggesting an equal importance. However, through detailed analysis Stubbs found that the speech to boys was instructed them to live happy lives, whereas the speech to girls was telling them to make other people happy. Similar results are possible in this study and (again) only a detailed conversational analysis would have avoided these problems. 75 Anglia Ruskin University MA Applied Linguistics with TESOL 7. RECOMMENDATIONS FOR FURTHER STUDY The possibilities for further study using this corpus are broad. MacFadden et. al. (2006) compiled a TV word list and computed the necessary vocabulary needed for comprehension and this is one possible area. Semantic tagging is still in its relative infancy however it is available for academic use, this is probably the most potent area of future research. The level of politeness is inseparably related to the social distance between the two (or more) parties, and the greater the social distance the higher the degree of linguistic respect is likely to be expressed (Wolfson, 1998) therefore it would have interesting to know to whom polite language is directed and the effect a ‘stranger’ (none of the main 6 actors) has on the politeness of the language. This is an interesting area if study although methodical conversational analysis would be required. 76 Cameron, D. 2005. Language, Gender and Sexuality: Current Issues and New Directions, Applied Linguistics, 26(4), pp. 482-502 8. BIBLIOGRAPHY Cameron, D., McAlinden, F., O’Leary, K. 1988. Lakoff in context: the social and linguistic function of tag questions, in Coates, J. & D. Cameron (eds.), Women in their speech communities. London: Longman, 74-93. Allan, K., Coltrane, S. 1996. Gender Displaying Television Commercials: A Comparative Study of Television Commercials in the 1950s and the 1980s, Sex Roles, 35(3/4), pp. 185-203 Chamberlain, J., Poesio, M., Kruschwitz, U. 2008. Phrase Detectives: A Web-based Collaborative Annotation Game, [online] Available at <http://dces.essex.ac.uk/staff/udo/papers/phrasedetective s.pdf> [Accessed 26 th May 2011]. Atkins, S., Clear, J., Ostler, N. 1992. Corpus Design Criteria, Literary & Linguist Computing, 7(1), pp. 1-16 Baker, P., Hardie, A., McEnery, T., Xiao, R. Bontcheva, K., Cunningham, H., Gaizauskas, R., Hamza, O., Maynard, D., Tablan, V., Ursu, C. Jayaram, B. D., Leisher, M. 2004. Corpus Linguistics and South Asian Languages: Corpus Creation and Tool Development, Literary and Linguist Computing, 19(4), pp. 509-524. Cheng, W. 2004. Some Preliminary Findings from a Corpus of Spoken Public Discourses in Hong Kong, Language and Computers, 18, pp. 35-52 Baker, P., Lie, M., McEnery, T., Sebba, M. 2000.The Construction of a Corpus of Spoken Sylheti, Literary and Linguistic Computing, 15(4), pp.421-431 Chicago Tribune. 2009. Friends finale is decade's mostwatched TV show [online] Available at <http://articles.chicagotribune.com/2009-1204/entertainment/0912030239_1_joe-millionaire-seriesfinale-grey-s-anatomy> [Accessed 23rd March 2011] BBC, 2001, Anne Robinson: TV's rudest woman?, [online] <Available at: http://news.bbc.co.uk/1/hi/uk/1123974.stm rel="nofollow"> [Accessed 17th March 2011] Chomsky, N. 1962. A transformational approach to syntax. In Archibald Hill (ed.), Proceedings of the third Texas conference on problems of linguistic analysis in English. Austin: University of Texas, pp. 124–58. Beattie, G. W. 1982. Turn-taking and interruption in political interviews: Margaret Thatcher and Jim Callaghan compared and contrasted, Semiotica 39(1/2), pp. 93-114. Chomsky, N. 1964. The Development of Grammar in Child Language: Discussion, Monographs of the Society for Research in Child Development, 29(1), pp. 35-42. Biber, D., Conrad, S., Cortes, V. 2004. If you look at...: Lexical Bundles in University Teaching and Textbooks, Applied Linguistics 25(3), pp. 371-405 Chomsky, N. 1965. Aspects of the Theory of Syntax, Mass: MIT Press. Biber, D., Conrad, S., Reppen, R. 2006. Corpus Linguistics: Investigating Language Structure and Use, Fifth Edition, Cambridge University Press Cixous, H. 1975. ‘Sorties.’ In H. Cixous and C. Clément (eds) La Jeune Née. Paris: Union Générale d’Editions, English translation in E. Marks and I. de Courtivron (eds) (1980) New French Feminisms: An Anthology. Amherst, MA: University of Massachussetts Press, pp. 90–98. Biber, D. 1993. Representativeness in Corpus Design, Literary and Linguistic Computing, 8(4), pp. 243-257 Coates, J. 1986. Women, men and language: Sociolinguistic Account of Sex Differences in Language. Blythe, H., Sweet, C. 1983. Using Media to Teach English, Instructional Innovator, 28(6), pp.22-24. A Coates, J. 1996. Women Talk: Conversation between Women Friends, Blackwell. Boulis, C., Ostendorf, M. 2005. A quantitative analysis of lexical differences between genders in telephone conversations, Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 435-442 Coates, J. 2004. Women, Men and Language, Pearson Longman, 3rd Edition. Cook, G. 1990. Transcribing infinity: Problems of context presentation, Journal of Pragmatics, 14(1), pp.1-24. Browne, B. A. 1998. Gender Stereotypes in Advertising on Children's Television in the 1990s: A Cross -National Analysis, Journal of Advertising, 27(1), pp. 83-96. Brown, P. Levinson, S. 1987. Politeness: Some Universals in Language Usage, Cambridge University Press. Cook, G. 1995. Theoretical Issues: Transcribing the Untranscribable. In Leech, G. N., Myers, G., Thomas, J (Eds.). Spoken English on Computer: Transcription, MarkUp, and Application, pp. 35-53. Bucholtz, M. 2004. Foreword, In (Ed.) R. Lakoff, Language and Woman's Place: Text and Commentaries, pp. 5-16. Coulthard, M., Montgomery, M., M. 1981. Developing the description, Studies in Discourse Analysis, pp. 13–30. Bucholtz, M. 2000. The Politics of Transcription, Journal of Pragmatics 32, pp. 1439-1465. Coulthard, M. 1995. The significance of intonation in discourse, In: M. Coultard, ed. 1995. Advances in Spoken Discourse Analysis, Routledge, Chapter 2. Burke, P. J., Tully, J. C. 1977. The Measurement of Role Identity, Social Forces, pp. 881-897. CNN. 1998. President Bill Clinton, [online] Available at <http://edition.cnn.com/ALLPOLITICS/1998/08/17/speech/ transcript.html> [Accessed 23 rd March 2011] Crowdy, S. 1993. Spoken Corpus Transcription, Literary & Linguistic Computing, 9(1), pp. 25-28. Burnard, 1996, How to build a corpus [online] <Available at: http://users.ox.ac.uk/~lou/wip/Boston/howto.htm rel="nofollow"> [Accessed 27th February 2011]. Casey, N., Casey, B., Lewis, J., Calvert, B., French, L. 2008. Television Studies: The Key Concepts, Routledge, 2nd Edition. Crowdy, S. 1994. Spoken Corpus Transcription, Literary & Linguistic Computing, 9(1), pp. 25-28. 77 Anglia Ruskin University MA Applied Linguistics with TESOL Crowdy, S. 1995. The BNC spoken corpus. In G. Leech, G. Myers, J. Thomas (eds.) Spoken English on Computer, Chapter 19, pp. 224-234. Francis, G., Sinclair, J. 1994. 'I Bet He Drinks Carling Black Label': A Riposte to Owen on Corpus Grammar, Applied Linguistics, 15(2), pp. 190-200. Crystal, D. 2009. A dictionary of Language, 2nd Revised Edition, University of Chicago Press. Fraser, B. 1990. An approach to discourse markers. Journal of Pragmatics 14(3), pp. 383–95. Cutting, J. 2002. Pragmatics and Discourse: A resource book for students. London: Routledge. Fraser, B. 1999. What are discourse markers? Journal of Pragmatics, 31(7), pp. 931-952. de Klerk, V. 1992. How Taboo Are Taboo Words for Girls? Language in Society, 21(2), pp. 277-289. Freed, A. F., Greenwood, A. 1996. Women, Men, and Type of Talk: What Makes the Difference? Language in Society, 25(1), pp. 1-26. del-Teso-Craviotto, M. 2006. Words that matter: Lexical choice and gender ideologies in women’s magazines, Journal of Pragmatics, 38, pp. 2003–2021. Furnham, A., Bitar, N. 1993. The stereotyped portrayal of men and women in British television advertisements, Sex Roles, 29(3-4), pp.297-310. Demme, J. E. J. 2009. Charmed and chattering tongues: Investigating the functions and effects of key word clusters in the dialogue of Shakespeare's female characters, [online] Available at <http://www.lexically.net/wordsmith/corpus_linguistics_lin ks/Jane%20Demmen_MA_word%20clusters_in_Shakespear e%27s%20plays.pdf> [Accessed 29th March 2011] Gao, G. 2008. Taboo Language in Sex and the City: An Analysis of Gender Differences in using Taboo Language in Conversation [online] Available at <http://hkr.diva portal.org/smash/get/diva2:224602/FULLTEXT01> [Accessed 29th March 2011]. Garside, R.,Leech, G., Sampson, G. 1987. The computational analysis of English: A corpus-based approach, Longman. Deuchar, M. 1988. A pragmatic account of women's use of standard speech, in Coates, J. & D. Cameron (eds.), Women in their speech communities. London: Longman, 27-32 Garside, R., Leech, G. McEnery A. 1997. Corpus annotation: Linguistic information From Computer Text Corpora, Longman. Drass, K. A. 1986. The Effect of Gender Identity on Conversation, Social Psychology Quarterly, 49(4), pp. 294301. Glascock, J. 2001. Gender Roles on Prime-Time Network Television: Demographics and Behaviours, Journal of Broadcasting & Electronic Media, pp. 656-669. DuBois B. L., Crouch, I. 1975. The question of tag questions in women's speech: they don't really use more of them, do they? Language in Society 4, pp. 289-294. DuBois, J. W. 1991. Transcription Design Principles for Spoken Discourse Research, Pragmatics, 1(1), pp. 71-106. Gibson, E. K. 2009, Would you like manners with that? A study of gender, polite questions and the fast food industry, Griffith Working Papers in Pragmatics and Intercultural Communication 2(1), pp.1-17 Eckert, P. 1989. The whole woman: Sex and gender differences in variation, Language Variation and Change, 1, pp. 245-267. Ginsburg, D. 2004. Friends Ratings [online] Available at <http://newmusicandmore.tripod.com/friendsratings.html > [Accessed 23rd March 2011] Eckert, P., McConnell-Ginet, S. 1999. New generalizations and explanations in language and gender research, Language in Society 28, pp. 185–201. Goffman. E. 1955. On Face-work: An Analysis of Ritual Elements of Social Interaction, Psychiatry: Journal of the Study of Interpersonal Processes, 18(2), pp. 213-231. Eisikovits, E. 1991. Variation in subject-verb agreement in Inner Sydney English, In J. Cheshire (ed.) English Around the World: Sociolinguistic Perspectives, Chapter 16, pp. 235255. Green, J., Franquiz, M., Dixon, C. 1997. The Myth of the Objective Transcript: Transcribing as a Situated Act, TESOL Quarterly, 31(1), pp. 172-176 Google. 2011. RECAPTCHA Frequently Asked Questions, Available at <http://www.google.com/recaptcha/fa q> [Accessed 24th May 2011] Ervin-Tripp, S. 2000. Methods for studying language production, In Menn, L., Ratner, N.B. (Eds), Methods for Studying Language Production, pp. 271-290. Halliday, M. A. K. 1985. An Introduction to Functional Grammar. London: Edward Arnold. Farris, C. S. P. 2000. Cross-sex peer conflict and the discursive production of gender in a Chinese preschool in Taiwan, Journal of Pragmatics, 32(5), pp. 539-568. Halliday, M. A. K. 1985b. Dimensions of discourse analysis: grammar. Teun A. van Dijk (ed.), Handbook of Discourse Analysis. New York: Academic Press. Fisher, D. A., Hill, D. L., Grube, J. W., Gruber, E. L. 2007. Gay, Lesbian, and Bisexual Content on Television: A Quantitative Analysis across Two Seasons, Journal of Homosexuality, 52(3-4), pp. 167–188. Halliday, M. A. K., Hasan, R. 1976. Cohesion in English, Longman. Fishman, P. M. 1978. Interaction: The Work Women Do, Social Problems, 25(4), pp. 397-406. Hepburn, A. 2004. Crying: Notes on Description, Transcription, and Interaction, Research on Language and Social Interaction, 37(3), pp. 251–290. Francis, G., Hunston, S. 2002. Analysing everyday conversation. In R.M. Coulthard (ed.) Advances in Spoken Discourse Analysis, London: Routledge, pp. 123–161. Holmes, J. 1983. The functions of tag questions, English Language Research Journal, 3, pp. 40-65. 78 Anglia Ruskin University MA Applied Linguistics with TESOL Holmes, J. 1986. Functions of You Know in Women's and Men's Speech, Language in Society, 15(1), pp. 1-21. Lauzen, M. M., Dozier, D. M. 1999. Making a difference in prime time: Women on screen and behind the scenes in the 1995-96 television season. Journal of Broadcasting & Electronic Media, 43(1), 1-19. Holmes, J. 1990. Hedges and Boosters in Women's and Men's Speech, Language & Communication. 10(3), pp. 185205. Leech, G. 1983. Principles of Pragmatics, Longman. Holmes, J. 1995. Women, Men and Politeness, Pearson Longman. Leech, G. N., Myers, G., Thomas, J. 1995. Spoken English on Computer: Transcription, Mark-Up, and Application, Longman. Hughes, S. E. 1992. Expletives of lower working-class women, Language in Society 21, pp. 291-303. Leech, G. 1998. Learner corpora: What they are and what can be done with them, in S. Granger (ed.), Learner English on computer, London: Addison Wesley Longman, xiv-xx. Hymes, D. H. 1974. Foundations in Sociolinguistics. University of Pennsylvania Press, Philadelphia. Lippi-Green, R. 1997. English with an accent: Language, ideology, and discrimination in the United States, Routledge. Jaworski, A., Ylanne-McEwen, V., Thurlow, C., Lawson, S. 2003. Social roles and negotiation of status in host tourist interaction: A view from British television holiday programmes, Journal of Sociolinguistics, 7(2), pp. 135-163. Jay, T. 2009. The Utility and Ubiquity of Taboo Words, Perspectives on Psychological Science, 4(2), pp. 153-161. Longman. 2011. Longman Dictionary of Contemporary English [online] Available at: http://www.ldoceonline.com/dictionary/pretty_2 [Accessed 27th May 2011]. Jucker, A. 1993. The discourse marker well: A relevancetheoretical account, Journal of Pragmatics 19, pp. 435-452. Livia, A. 2004. Picking up the gauntlet, in M. Bucholtz, R. Lakoff (eds.) Language and Woman’s Place, Chapter 4. Kay P., Kempton, W. 1984. What Is the Sapir-Whorf Hypothesis? American Anthropologist, New Series, 86(1), pp. 65-79 Macaulay, M. 2001. Tough talk: Indirectness and gender in requests for information, Journal of Pragmatics, 33, pp. 293-316 Kaye, P. 1989a. Laughter, ladies, and linguistics—a lighthearted quiz for language-lovers and language-learners, ELT Journal, 43(3), pp. 185-191. Macaulay, R. 1978. Variation and Consistency in Glaswegian English, In (ed.) P. Trudgill, Sociolinguistic Patterns in British English, Edward Arnold London, pp. 132-143. Kaye, P. 1989b. 'Women are alcoholics and drug addicts', says dictionary, ELT Journal, 43(3), pp. 192-195. Macaulay, R. 2002. You know, it depends, Journal of Pragmatics 34, pp. 749–767 Kiesling, S. F. 2004. What Does a Focus on "Men's Language" Tell Us about Language and Woman's Place? In R. T. Lakoff, Language and Woman's Place: Text and Commentaries, Chapter 16. Oxford University Press , pp. 229-236. MacFadden, K., Barrett, K., Horst, M. 2009. What’s in a Television Word List? A Corpus-Informed Investigation, Concordia Working Papers in Applied Linguistics, 2, pp. 7898. McCarthy, M. 2000. Discourse Analysis for Language Teachers, Tenth Edition, Cambridge University Press . Labov, W. 2006. The Social Stratification of English in New York City, Cambridge University Press, Second Edition. McCarthy, M. 2006. Explorations in Corpus Linguistics, Cambridge University Press. Labov, W., Fanshel, D. 1977. Therapeutic Discourse, New York: Academic Press. McConnell-Ginet, S. 2003. What's in a Name? Social Labelling and Gender Practices, In J. Holmes, M. Meyerhoff (eds), The Handbook of Language and Gender, Chapter 3 Lakoff, G. 1973. Hedges: A Study in Meaning Criteria and the Logic of Fuzzy Concepts, Journal of Philosophical Logic, 2 pp. 458-508. McEnery, A., Xiao, R., Tono, Y. 2006. Corpus -based language Studies: An Advanced Resource Book, London: Routledge. Lakoff, R. T. 1973. Language and Woman's Place, Language in Society, 2(1), pp. 45-80 Meyer, C. F. 2004. English Corpus Linguistics: An Introduction, Cambridge University Press Lakoff, R. T. 1973b. Questionable answers and answerable questions. In: B. Kachru, R.B. Lees, Y. Malkiel, A. Pietrangeli, S. Saporta,(Eds)., Issues in linguistics. Papers in honor of Henry and Rente Kahane, University of Illinois Press. Mills, S. (2003). Gender and Politeness. Cambridge: Cambridge University Press Lakoff, R. T. 2001. The Language War, University of California Press Nelson, G. 1995. The International Corpus of English: markup for spoken language. In G. Leech, G. Myers, J. Thomas (eds.) Spoken English on Computer, Chapter 18, pp. 220-223 Lakoff, R. T. 2001b, Nine Ways of Looking at Apologies: The Necessity for Interdisciplinary Theory and Method in Discourse Analysis, In D. Schiffrin, D. Tannen, H. E. Hamilton (eds.) The Handbook of Discourse Analysis, Blackwell, Chapter 10. O’Barr, W. M., Atkins, K. B. 1997. Women's language or Powerless Language? , In: (ed) J. Coates, Language and gender: A Reader, pp. 377-387. Lakoff, R.T. 2004. Language and Woman’s Place, Oxford University Press, 2nd Edition. O'Keeffe, A., McCarthy, M., Carter, R. 2007. From Corpus to Classroom, Cambridge University Press. 79 Anglia Ruskin University MA Applied Linguistics with TESOL Ochs, E. 1999. Transcription as theory. In A. Jaworski & N. Coupland (Eds.), The Discourse Reader, pp. 167 182, London; New York: Routledge. Sunderland, J. 2006. Language and Gender: An advanced resource book, Routledge. Swan, M. 2005. Practical English Usage, Third Edition, Oxford University Press. Owen, C. 1993. Corpus-Based Grammar and the Heineken Effect: Lexico-grammatical Description for Language Learners, Applied Linguistics, 14(2), pp. 167-187. Sweney, M. 2010. Britons 'watch four hours of TV a day', Online Available at <http://www.guardian.co.uk/media/2010/may/04/thinkbo x-television-viewing> [Accessed 12th April 2011]. Pawley, A., Syder, F. H. 1983. Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In J. C. Richards, R. W. Schmidt (Eds.), Language and communication. London ; New York: Longman. pp. 191-226 Tagliamonte, S. 1998. Was/were variation across the generations: View from the city of York, Language Variation and Change, 10, pp. 153-191. Pilkington, J. 1992. Don't try and make out that I'm nice! The different strategies women and men use when gossiping, WWPIL, 5, pp. 37-60. Tagliamonte, S., Roberts, C. 2005. So Weird; So Cool; So Innovative: The Use of Intensifiers in the Television Series Friends, American Speech, 80(3), pp. 280-300. Pullum, G. K. 1989. The great Eskimo vocabulary hoax, Natural Language & Linguistic Theory, 7(2), pp. 275-281. Roberts, C. 1997. Transcribing Talk: Issues Representation, TESOL Quarterly, 31(1), pp. 167-172. Tannen, D. 1990. You Just Don't Understand: Women and Men in Conversation, Virago Press Ltd. of Tannen, D. 1994. Gender and Discourse, Oxford University Press. Sacks, S. Schegloff, E. A. Jefferson, G. 1974. A Simplest Systematics for the Organization of Turn-Taking for Conversation, Language, 50(4), pp. 696-735. The ICE Project. 2009. Available ICE Corpora @ ICEcorpora.net [online] Available at: http://icecorpora.net/ice/avail.htm [Accessed 27th May 2011] Sauntson, H. 2007. Girls' and Boys' Use of Acknowledging Moves in Pupil Group Classroom Discussions, Language and Education, 21(4), pp. 304-327. The Sun. 2011. Bug-hit Anne is so Weak, [online] Available at <http://www.thesun.co.uk/sol/homepage/showbiz/tv/345 7444/Weakest-Links-Anne-Robinson-takes-first-day-offsick-in-ten-years.html> [Accessed 23rd March 2011] Schegloff, E. A. 1968. Sequencing in Conversational Openings, American Anthropologist, New Series, 70(6), pp. 1075-1095. Schegloff, E. A., Sacks, H. 1973. Opening up Closings, Semiotica, 8, pp. 289–327. Tottie, G. 1991. Conversational style in British and American English: The case of backchannels. In K. Aijmer & B. Altenberg (Eds.), English corpus linguistics: Studies in honour of Jan Svartvik (pp. 254–271). London: Longman Schiffrin, D. 1985. Conversational Coherence: The Role of Well, Language, 61(3), pp. 640-667. Schiffrin, D. 1987. Discourse Markers, Cambridge University Press. Tottie, G., Hoffmann, S. 2006. Tag Questions in British and American English, Journal of English Linguistics, 34(4), pp. 283-311 Schiffrin, D. 2001. Discourse Markers: Language, Meaning, and Context, In D. Schiffrin, D. Tannen, H. E. Hamilton (eds.) The Handbook of Discourse Analysis, Blackwell, Chapter 3. Trappes-Lomax, 2004. Discourse Analysis, in (eds. Davies, A., Elder, C.) The Handbook of Applied Linguistics, Blackwell Handbooks in Linguistics, Chapter 5, pp. 133-164 Simpson, P. 2001. ‘Reason’ and ‘tickle’ as pragmatic constructs in the discourse of advertising, Journal of Pragmatics 33, pp. 589-607. Trudgill, P. 1972. Sex, covert prestige and linguistic change in the urban British English of Norwich, Language in Society, 1(2), pp. 179-195 Sinclair, J., Coulthard, R. M. 1975. Towards an Analysis of Discourse, Oxford University Press. Trudgill, P. 1983. Sociolinguistics: An introduction to language and society. London: Pelican. Sinclair, J. 1992. The automatic analysis of text corpora, In J. Svartvik (ed.) Directions in Corpus Linguistics: Proceedings of the Nobel Symposium 82, Stockholm, pp. 379-397, The Hague: Mouton. UCREL. 2000. POS-tagging Error Rates [online] Available at: <http://ucrel.lancs.ac.uk/bnc2/bnc2error.htm> [Accessed 27th May 2011]. von Ahn, L., Maurer, B., McMillen, C., Abraham, D., Blum, M. 2008. reCAPTCHA: Human-Based Character Recognition via Web Security Measures, Science, 321, pp. 1465-1468 Smith, J. S. 1992. Women in Charge: Politeness and Directives in the Speech of Japanese Women, Language in Society, 21(1), pp. 59-82. Wikipedia. 2011. The Weakest Link [online] Available at: http://en.wikipedia.org/wiki/The_Weakest_Link [Accessed 27th May 2011] Sommers, C. H. 2001. The War Against Boys - How Misguided Feminism Is Harming Our Young Men, American Experiment Quarterly, pp.26-36. Wikipedia, 2011b. Courney Cox [online[ Available at: http://en.wikipedia.org/wiki/Courteney_Cox {Accessed 27th May 2011] Sommers-Flanagan, R., Sommers-Flanagan, J., Davis, B. 1993. What's happening on Music Television? A gender role content analysis, Sex Roles, 28(11-12), pp. 745-753. Stubbs, M. 1996. Texts and Corpus Analysis. Oxford: Blackwell 80 Anglia Ruskin University MA Applied Linguistics with TESOL Witt, A., Rehm, G., Hinrichs, E., Lehmberg, T., Stegmann, J. 2009. SusTEInability of linguistic resources through feature structures, Literary and Linguistic Computing, 24(3), pp. 363-372. Wolfson, N. 1988. The bulge: a theory of speech behaviour and social distance. In J. Fine (ed.) Second Language Discourse: A textbook of Current Research, Norwood, N.J.: Ablex Zimmerman, D. H., West, C. 1975, Sex Roles, Interruptions and Silences in Conversation, In B. Thorne, N. Henley (eds.) Language and Sex: Difference and Dominance, pp. 105-12 81 9. APPENDIX 9.1 QUERY FOR RETRIEVING TAG QUESTIONS select gender, scenes.scene_interaction as si, count(id), len, ROUND((count(id)/len)*1000,2) as per_thousand # id, person, gender, line from friends, scenes, ( select distinct scene_interaction, sum(scene_length) as len from scenes group by 1 order by 1 ) inlineA where (metadata regexp ".*(VBDZ|VBDR|VBG|VBM|VBN|VBR|VBZ|VD0|VDD|VDI|VDN|VDZ|VHO|VHZ|VHG|VHI|VHD|VM).{1,15}(XX. {1,15})(PPY|PPHS1|PPHS2|PPIS2).{0,5}[[.question-mark.]]” or metadata regexp ".*(VBDZ|VBDR|VBG|VBM|VBN|VBR|VBZ|VD0|VDD|VDI|VDN|VDZ|VHO|VHZ|VHG|VHI|VHD|VM).{1,15}(XX) .*( VBDZ|VBDR|VBG|VBM|VBN|VBR|VBZ|VD0|VDD|VDI|VDN|VDZ|VHO|VHZ|VHG|VHI|VHD|VM).{1,15}( PPY|PPHS1|PPHS2|PPIS2).{0,5}[[.question-mark.]]” or metadata regexp ".*( right_RR ).{1,15}[[.question-mark.]]”) and friends.scene_id = scenes.scene_id and scenes.scene_interaction = inlineA.scene_interaction and gender in ("M", "F") and scenes.scene_interaction in (“M", “F”, “M,F”) group by 1, 2 order by 1, 2 into outfile "c:\\tag-questions.txt" lines terminated by "\r\n"; 9.2 QUERY FOR RETRIEVING THE BREAKDOWN OF TAG QUESTIONS select a.gender, a.si, a.isnt_it, ROUND(a.isnt_it/inlineA.total_length*1000,2) as isnt_it_pert, b.is_it, ROUND(b.is_it/inlineA.total_length*1000,2) as is_it_pert, c.rght, ROUND(c.rght/inlineA.total_length*1000,2) as right_pert, inlineA.total_length as len from ( # +‟ve, „-ve select gender, scenes.scene_interaction as si, count(id) as isnt_it from friends, scenes where metadata regexp ".*(VBDZ|VBDR|VBG|VBM|VBN|VBR|VBZ|VD0|VDD|VDI|VDN|VDZ|VHO|VHZ|VHG|VHI|VHD|VM).*(V BDZ|VBDR|VBG|VBM|VBN|VBR|VBZ|VD0|VDD|VDI|VDN|VDZ|VHO|VHZ|VHG|VHI|VHD|VM).{1,15}(X X.{1,15})(PPY|PPHS1|PPHS2|PPIS2).{0,5}[[.question-mark.]]” and friends.scene_id = scenes.scene_id group by 1, 2 order by 1, 2 ) a, ( # -„ve, +‟ve select gender, scenes.scene_interaction as si, count(id) as is_it from friends, scenes where metadata regexp ".*(VBDZ|VBDR|VBG|VBM|VBN|VBR|VBZ|VD0|VDD|VDI|VDN|VDZ|VHO|VHZ|VHG|VHI|VHD|VM).{1, 15}(XX).*(VBDZ|VBDR|VBG|VBM|VBN|VBR|VBZ|VD0|VDD|VDI|VDN|VDZ|VHO|VHZ|VHG|VHI|VHD|V M).{1,15}(PPY|PPHS1|PPHS2|PPIS2).{0,5}[[.question-mark.]]” and friends.scene_id = scenes.scene_id group by 1, 2 order by 1, 2 ) b, ( # -'ve + right? select gender, scenes.scene_interaction as si, count(id) as rght from friends, scenes where metadata regexp ".*( right_RR )[[.question -mark.]]” and friends.scene_id = scenes.scene_id group by 1, 2 order by 1, 2 ) c, ( A1 Anglia Ruskin University MA Applied Linguistics with TESOL select distinct scene_interaction, SUM(scene_length) as total_length from scenes group by 1 order by 1 ) inlineA where a.gender = b.gender and a.gender = c.gender and a.si = b.si and a.si = c.si and a.si in (“M", “F”, “M,F”) and a.si = inlineA.scene_interaction group by 1, 2 order by 1, 2 into outfile "c:\\tag-questions-breakdown.txt" lines terminated by "\r\n"; 9.3 QUERY FOR RETRIEVING THE NUMBER OF QUESTIONS ASKED select gender, scenes.scene_interaction, COUNT(line) as raw_count, ROUND(COUNT(line)/total_length*1000,2) as per_thousand from friends, scenes, ( select distinct scene_interaction, SUM(scene_length) as total_length from scenes group by 1 order by 1 ) inlineA where metadata regexp ".*( [[.question-mark.]]_[[.question-mark.]]).*" questions asked and friends.scene_id = scenes.scene_id and gender in (“M”, “F”) # exclude “U” and scenes.scene_interaction = inlineA.scene_interaction and scenes.scene_interaction in (“M”, “F”, “M,F”) group by 1, 2 order by 1, 2 9.4 # QUERY FOR RETRIEVING THE STATISTICS FOR HEDGES select gender, scenes.scene_interaction, COUNT(line) as raw_count, ROUND(COUNT(line)/total_length*1000,2) as per_thousand from friends, scenes, ( select distinct scene_interaction, SUM(scene_length) as total_length from scenes group by 1 order by 1 ) inlineA where metadata regexp "^Well_RR.*" # hedges ("Well,...") and friends.scene_id = scenes.scene_id and gender in (“M”, “F”) # exclude “U” and scenes.scene_interaction = inlineA.scene_interaction and scenes.scene_interaction in (“M”, “F”, “M,F”) group by 1, 2 order by 1, 2 where metadata regexp where metadata regexp where metadata regexp where metadata regexp where metadata regexp where metadata regexp help me please?") where metadata regexp of/sorta...") where metadata regexp 9.5 ".*I_PPIS1 think_VV0.*" # hedges (“I think...”) ".*I_PPIS1 guess_VV0.*" # hedges (“I guess...”) ".*I_PPIS1 wonder_VV0.*" # hedges (“I wonder...”) ".*I_PPIS1 'm_VBM sure_JJ.*" # hedges (“I‟m sure...”) ".*(Ya know|Y‟know)" # hedges (“y‟know….”) ".*_VM.{1,10}PPY.*please_RR.{1,5}[[.?.]]" # hedges ("Could you ".*(sorta_NN1|sort_RR21 of_RR22).*" # hedges ("...sort "^Well_RR.*" # hedges ("Well,...") QUERY FOR RETRIEVING THE MOST FREQUENT ADJECTIVES # adjectives select gender, UPPER(word) as word, 2 Anglia Ruskin University MA Applied Linguistics with TESOL COUNT(word) as count_word from words where word like "%_JJ" group by 1, 2 having COUNT(word) > 100 # only interested in the most frequent order by 1, 3 desc # highest first 9.6 QUERY FOR RETRIEVING THE MOST FREQUENT NOUNS select gender, UPPER(word) as word, COUNT(only_word) as count_word from words where word regexp ".*_(ND1|NN|NN1|NN2|NNA|NNB|NNL1|NNL2|NNO|NNO2|NNT1|NNT2|NNU|NNU1|NNU2|NP|NP1|NP2|NPD1|N PD2|NPM1|NPM2)" and UPPER(only_word) not in ( select person from friends ) # exclude nouns which are names group by 1, 2 having COUNT(word) > 100 # only interested in the most frequent order by 1, 3 desc # highest first 9.7 QUERY FOR RETRIEVING THE MOST FREQUENT VERBS select gender, UPPER(word) as word, COUNT(only_word) as count_word from words where word regexp ".*_(VB0|VBDR|VBDZ|VBG|VBI|VBM|VBN|VBR|VBZ|VD0|VDD|VDG|VDI|VDN|VDZ|VH0|VHD|VHG|VHI|VHN|V HZ|VM|VMK|VV0|VVD|VVG|VVGK|VVI|VVN|VVNK|VVZ)" group by 1, 2 having COUNT(word) > 100 # only interested in the most frequent order by 1, 3 desc # highest first 9.8 QUERY FOR RETRIEVING THE AVERAGE UTTERANCE LENGTH # averages utterance length select gender, scene_interaction, ROUND(AVG(number_of_words),2) as average_number_of_words from ( select scene_id, gender, LENGTH(line)-LENGTH(REPLACE(line, “ “, “”))+1 as number_of_words from friends order by 3 limit 3042, 57805 # 5% trimmed (remove this line for full average) ) inline_table, scenes where inline_table.scene_id = scenes.scene_id and gender in ("M", "F") group by 1, 2 order by 1, 2 9.9 QUERY FOR RETRIEVING ALL SINGLE SEX WORDS # words females use but males select UPPER(only_word), COUNT(only_word) from words where gender = "M" # performance killer and only_word not in ( select # performance killer and UPPER(only_word) not in ( group by 1 having COUNT(word) > 3 order by 2; don't # use UPPER() so “Good” and “good” are grouped together only_word from words where gender = "F") select distinct person from friends) # arbitrary, only report frequent unique words # adjectives females use but males don't select UPPER(word), COUNT(word) # use UPPER() so “Good” and “good” are grouped together (as “GOOD”) from words where gender = "?" and word like "%_JJ" # performance killer and word not in ( select word from words where gender = "not ?" and word like "%_JJ" ) 3 Anglia Ruskin University MA Applied Linguistics with TESOL group by 1 having COUNT(word) > 3 order by 2; 9.10 QUERY FOR RETRIEVING THE DIACHRONIC USE OF ‘REALLY’, ‘VERY’, ’SO’ select inlineA.season as season, ROUND((inlineA.count_of_variable/inlineC.total_count)*1000,2) as M_per_thousand, ROUND((inlineB.count_of_variable/inlineD.total_count)*1000,2) as F_per_thousand from ( # get number of utterances featuring the pattern so + adj. for females select left(filename, 2) as season, count(line) as count_of_variable from friends, scenes where (metadata regexp ".*so_RG .*_JJ") and gender = "M" and scenes.scene_interaction in ("M") and friends.scene_id = scenes.scene_id group by left(filename, 2) order by 1, 2 ) inlineA, ( # get number of utterances featuring the pattern so + adj. for females select left(filename, 2) as season, count(line) as count_of_variable from friends, scenes where (metadata regexp ".*so_RG .*_JJ") and gender = "F" and scenes.scene_interaction in ("F") and friends.scene_id = scenes.scene_id group by left(filename, 2) order by 1, 2 ) inlineB, ( # get total utterances for males (all scenes) select left(filename, 2) as season, count(line) as total_count from friends, scenes where scenes.scene_interaction in ("M") and gender = "M" and friends.scene_id = scenes.scene_id group by 1 ) inlineC, ( # get total utterances for males (all scenes) select left(filename, 2) as season, count(line) as total_count from friends, scenes where scenes.scene_interaction in ("F") and gender = "F" and friends.scene_id = scenes.scene_id group by 1 ) inlineD where inlineD.season = inlineA.season and inlineD.season = inlineB.season and inlineD.season = inlineC.season; Frequency Counts -Simple # get number of utterances featuring the pattern so + adj. for females select gender, scene_interaction, count(line) as count_of_variable from friends, scenes where (metadata regexp ".*so_RG .*_JJ") and friends.scene_id = scenes.scene_id and gender in (“M”, “F”) and scene_interaction in (“M”, “F”, “M,F”) group by 1, 2 order by 1, 2; Frequency Counts - Complex select inlineA.gender, inlineA.scene_interaction as SI, ROUND((inlineA.count_of_variable/inlineB.total_count)*1000,2) as AA from ( # get number of utterances featuring the pattern so + adj. for females Select gender, scene_interaction, count(line) as count_of_variable from friends, scenes where metadata regexp ".*`RR .*_JJ" and scenes.scene_interaction in ("M", “F”, “M,F”) 4 Anglia Ruskin University MA Applied Linguistics with TESOL and friends.scene_id = scenes.scene_id group by 1,2 order by 1 ) inlineA, ( # get total utterances for males (all scenes) Select gender, scene_interaction, count(line) as total_count from friends, scenes where scenes.scene_interaction in ("M", "F", "M,F") and friends.scene_id = scenes.scene_id group by 1, 2 order by 1 ) inlineB where inlineA.gender = inlineB.gender and inlineA.scene_interaction = inlineB.scene_interaction order by 1; # so_RG # very_RG # really_RR 9.11 QUERY FOR RETRIEVING THE NUMBER OF EMPTY ADJECTIVES select a.gender, inlineA.scene_interaction as si, empty_adj, ROUND(empty_adj/inlineA.total_length*1000,2) as empty_pt from ( select distinct scene_interaction, SUM(scene_length) as total_length from scenes group by 1 order by 1 ) inlineA, ( select gender, scene_interaction, COUNT(id) as empty_adj from friends, scenes where line regexp ".*(great|cool|gorgeous|wonderful|divine|pretty|lovely|good|fantastic|charming|sweet|ado rable).*" and friends.scene_id = scenes.scene_id group by 1, 2 order by 1, 3 desc ) a where inlineA.scene_interaction in ("M", "F", "M,F") and inlineA.scene_interaction = a.scene_interaction group by 1, 2 order by 1, 2; # replace with: (great|cool|gorgeous|wonderful|divine|pretty|lovely|good|fantastic|charming|sweet|adorab le) 9.12 QUERY FOR RETRIEVING THE NUMBER OF PRONOUN REFERENCES select gender, scenes.scene_interaction, COUNT(line) as raw_count, ROUND(COUNT(line)/total_length*1000,2) as per_thousand from friends, scenes, ( select distinct scene_interaction, SUM(scene_length) as total_length from scenes group by 1 order by 1 ) inlineA where metadata regexp ".*(PPIS2|PPIO2).*" # PPIS1|PPIO1 = I/my PPIS2|PPIO2 = We/our and friends.scene_id = scenes.scene_id and gender in (“M”, “F”) # exclude “U” and scenes.scene_interaction = inlineA.scene_interaction and scenes.scene_interaction in (“M”, “F”, “M,F”) group by 1, 2 order by 1, 2 5 # Anglia Ruskin University MA Applied Linguistics with TESOL 9.13 QUERY FOR RETRIEVING THE NUMBER OF TABOO WORDS General Counts by gender # taboo vocabulary list taken from Jay (2009) select friends.gender, scenes.scene_interaction as si, no_of_utt, round(count(id)/no_of_utt*1000, 2) as per_1000, count(id) from friends, scenes, ( select gender, scene_interaction, count(id) as no_of_utt from friends, scenes where friends.scene_id = scenes.scene_id and friends.gender in ("M", "F") and scenes.scene_interaction in ("M", "F", "M,F") group by 1, 2 order by 1 ) inlineA where (metadata regexp ".* fuck_.*" or metadata regexp ".* shit_.*" or metadata regexp ".* hell_.*" or metadata regexp ".* damn_.*" or metadata regexp ".* goddamn_.*" or metadata regexp ".* Christ_.*" # Jesus Christ or metadata regexp ".* ass_.*" or metadata regexp ".* god_.*" # Oh my god or metadata regexp ".* bitch_.*" or metadata regexp ".* sucks_.*") and friends.gender = inlineA.gender and friends.scene_id = scenes.scene_id and scenes.scene_interaction in ("M", "F", "M,F") and scenes.scene_interaction = inlineA.scene_interaction group by 1, 2, 3 order by per_1000 General Counts by actor # taboo vocabulary list taken from Jay (2009) select friends.person, gender, no_of_utt, round(count(id)/no_of_utt*1000, 2) as per_1000, count(id) from friends, scenes, ( select distinct person, count(id) as no_of_utt from friends where friends.person in ("ROSS", "RACHEL", "MONICA", "JOEY", "CHANDLER", "PHOEBE") group by 1 order by 1 ) inlineA where (metadata regexp ".* fuck_.*" or metadata regexp ".* shit_.*" or metadata regexp ".* hell_.*" or metadata regexp ".* damn_.*" or metadata regexp ".* goddamn_.*" or metadata regexp ".* Christ_.*" # Jesus Christ or metadata regexp ".* ass_.*" or metadata regexp ".* god_.*" # Oh my god or metadata regexp ".* bitch_.*" or metadata regexp ".* sucks_.*") and friends.person in ("ROSS", "RACHEL", "MONICA", "JOEY", "CHANDLER", "PHOEBE") and friends.person = inlineA.person and friends.scene_id = scenes.scene_id and scenes.scene_interaction in ("M", "F", "M,F") group by 1, 2 order by per_1000 into outfile “c:\\taboo-general.txt” lines terminated by “\r\n”; 6 Anglia Ruskin University MA Applied Linguistics with TESOL Counts per actor per scene # !!! this will normalize the frequency based on the actors number of utterances for each scene type !!! # taboo vocabulary list taken from Jay (2009) select friends.person, gender, scenes.scene_interaction as SI, no_of_utt, round(count(id)/no_of_utt*1000, 2) as per_1000, count(id) from friends, scenes, ( select distinct person, scene_interaction, count(id) as no_of_utt from friends, scenes where friends.scene_id = scenes.scene_id and friends.person in ("ROSS", "RACHEL", "M ONICA", "JOEY", "CHANDLER", "PHOEBE") and scene_interaction in ("M", "F", "M,F") group by 1, 2 order by 1 ) inlineA where (metadata regexp ".* fuck_.*" or metadata regexp ".* shit_.*" or metadata regexp ".* hell_.*" or metadata regexp ".* damn_.*" or metadata regexp ".* goddamn_.*" or metadata regexp ".* Christ_.*" # Jesus Christ or metadata regexp ".* ass_.*" or metadata regexp ".* god_.*" # Oh my god or metadata regexp ".* bitch_.*" or metadata regexp ".* sucks_.*") and friends.person in ("ROSS", "RACHEL", "MONICA", "JOEY", "CHANDLER", "PHOEBE") and friends.person = inlineA.person and friends.scene_id = scenes.scene_id and scenes.scene_interaction in ("M", "F", "M,F") and scenes.scene_interaction = inlineA.scene_interaction group by 1, 2, 3 order by per_1000 into outfile “c:\\taboo-general.txt” lines terminated by “\r\n”; Counts per individual taboo word # !!! this will normalize the frequency based on the actors number of utterances for each scene type !!! # taboo vocabulary list taken from Jay (2009) select friends.person, gender, no_of_utt, round(count(id)/no_of_utt*1000, 2) as per_1000, count(id) from friends, ( select distinct person, count(id) as no_of_utt from friends where friends.person in ("ROSS", "RACHEL", "MONICA", "JOEY", "CHANDLER", "PHOEBE") group by 1 order by 1 ) inlineA where metadata regexp ".* god_.*" and friends.person in ("ROSS", "RACHEL", "MONICA", "JOEY", "CHANDLER", "PHOEBE") and friends.person = inlineA.person group by 1, 2 order by per_1000 ".* ".* ".* ".* ".* ".* ".* ".* ".* ".* fuck_.*" shit_.*" hell_.*" damn_.*" goddamn_.*" Christ_.*" # Jesus Christ ass_.*" god_.*" # Oh my god bitch_.*" sucks_.*") 7 Anglia Ruskin University MA Applied Linguistics with TESOL 9.14 QUERY FOR RETRIEVING THE COLOURS USED BY THE GENDERS # word list was taken from Wikipedia: http://en.wikipedia. org/wiki/List_of_colors select distinct lower(word), gender, count(word) from words where only_word in (“Anti-flash white”, “Beige”, “Cosmic latte”, “Cream”, “Eggshell”, “Ghost white”, “Isabelline”, “Ivory”, “Magnolia”, “Old lace”, ““, “Pearl”, “Seashell”, “Splashed white”, “Vanilla”, “White”, “Amaranth”, “Amaranth pink”, “Brink pink”, “Carmine pink”, “Carnation pink”, “Cerise”, “Cerise pink”, “Cherry blossom pink”, “Coral pink”, “Dark pink”, “Deep carmine pink”, “Deep pink”, “Fandango”, “French rose”, “Fuchsia”, “Fuchsia pink”, “Hollywood cerise”, “Hot magenta”, “Hot pink”, “Lavender pink”, “Light pink”, “Light thulian pink”, “Magenta”, “Mountbatten pink”, “Nadeshiko pink”, “Persian pink”, “Persian rose”, “Pink”, “Puce”, “Rose”, “Rose pink”, “Ruby”, “Salmon pink”, “Shocking pink”, “Tea rose”, “Thulian pink”, “Ultra pink”, “Variations of pink”, “Alizarin crimson”, “Amaranth”, “American Rose”, “Auburn”, “Burgundy”, “Burntsienna”, “Candy apple red”, “Cardinal”, “Carmine”, “Carnelian”, “Cerise”, “Chestnut”, “Coquelicot”, “Coral red”, “Crimson”, “Dark pink”, “Falu red”, “Fire brick”, “Fire engine red”, “Flame”, “Fuchsia”, “Lava”, “Lust”, “Magenta”, “Maroon”, “Mauve”, “Mauve taupe”, “Orange-red”, “Persian red”, “Persimmon”, “Pink”, “Raspberry”, “Red”, “Redviolet”, “Redwood”, “Rose”, “Rose madder”, “Rosewood”, “Rosso corsa”, “Ruby”, “Rufous”, “Rust”, “Sangria”, “Scarlet”, “Sinopia”, “Terra cotta”, “Tuscan red”, “Upsdell red”, “Venetian red”, “Vermilion”, “Wine”, “Amber”, “Apricot”, “Atomic tangerine”, “Brown”, “Burnt orange”, “Carrot orange”, “Champagne”, “Coral”, “Dark salmon”, “Deep carrot orange”, “ECE/SAE Amber”, “Flame”, “Gamboge”, “Gold”, “Gold (metallic)”, “International orange”, “Mahogany”, “Orange”, “Orange-red”, “Orange peel”, “Papaya whip”, “Peach”, “Peach-orange”, “Peach-yellow”, “Persian orange”, “Persimmon”, “Pink-orange”, “Portland Orange”, “Princeton orange”, “Pumpkin”, “Rust”, “Safety orange”, “Salmon”, “Sunset”, “Tangelo”, “Tangerine”, “Tea rose”, “Tenné”, “Tomato”, “Vermilion”, “Auburn”, “Beige”, “Bistre”, “Bole”, “Bronze”, “Brown”, “Buff”, “Burgundy”, “Burnt sienna”, “Burnt umber”, “Camel”, “Chamoisee”, “Chestnut”, “Chocolate”, “Citrine”, “Copper”, “Cordovan”, “Desert sand”, “Earth yellow”, “Ecru”, “Fallow”, “Fawn”, “Fulvous”, “Isabelline”, “Khaki”, “Liver”, “Mahogany”, “Maroon”, “Ochre”, “Raw umber”, “Redwood”, “Rufous”, “Russet”, “Rust”, “Sandy brown”, “Seal brown”, “Sepia”, “Sienna”, “Sinopia”, “Tan”, “Taupe”, “Tawny”, “Umber”, “Wenge”, “Wheat”, “Amber”, “Apricot”, “Arylide yellow”, “Aureolin”, “Beige”, “Blond”, “Buff”, “Chartreuse yellow”, “Chrome yellow”, “Citrine”, “Cream”, “Dark goldenrod”, “Ecru”, “Flavescent”, “Flax”, “Fulvous”, “Gamboge”, “Gold”, “Gold (metallic)”, “Goldenrod”, “Golden poppy”, “Golden yellow”, “Green-yellow”, “Hansa yellow”, “Icterine”, “Isabelline”, “Jasmine”, “Jonquil”, “Khaki”, “Lemon”, “Lemon chiffon”, “Lime”, “Maize”, “Mikado yellow”, “Mustard”, “Naples yellow”, “Navajo white”, “Old gold”, “Olive”, “Pale gold”, “Papaya whip”, “Peach-yellow”, “Pear”, “Saffron”, “School bus yellow”, “Selective yellow”, “Stil de grain yellow”, “Sunglow”, “Tangerine yellow”, “Titanium yellow”, ““, “Urobilin”, ““, “Vanilla”, “Vegas gold”, “Yellow”, “Gray”, “Arsenic”, “Ash gray”, “Battleship gray”, “Bistre”, “Black”, “Cadet gray”, “Charcoal”, “Cinereous”, “Cool gray”, “Davy's gray”, “Feldgrau”, “Glaucous”, “Isabelline”, “Liver”, “Payne's gray”, “Platinum”, “Seal brown”, “Silver”, “Slate gray”, “Taupe”, “Purple taupe”, “Medium taupe”, “Taupe gray”, “Pale taupe”, “Rose quartz”, “White”, “Xanadu”, “Army green”, “Asparagus”, “Bright green”, “British racing green”, “Cal Poly Pomona green”, “Camouflage green”, “Celadon”, “Chartreuse”, “Clover”, “Dartmouth green”, “Electric green”, “Emerald”, “Fern green”, “Forest green”, “Grayasparagus”, “Green”, “Green-yellow”, “Harlequin”, “Honeydew”, “Hooker's green”, “Hunter green”, “India green”, “Islamic green”, “Jade”, “Jungle green”, “Kelly green”, “Lime”, “Lime green”, “Midnight green”, “Mint cream”, “Moss green”, “MSU Green”, “Myrtle”, “Neon green”, “Office green”, “Olive”, “Olive drab”, “Pakistan green”, “Paris Green”, “Pear”, “Persian green”, “Phthalo green”, “Pigment green”, “Pine green”, “Rifle green”, “Sacramento State green”, “Sap green”, “Sea green”, “Shamrock green”, “Spring bud”, “Spring green”, “Tea green”, “Teal”, “UP Forest green”, “Viridian”, “Yellow-green”, “Variations of green”, “Alice blue”, “Aqua”, “Aquamarine”, “Baby blue”, “Bondi blue”, “Cerulean”, “Cyan”, “Electric blue”, “Midnight green”, “Pine green”, “Robin egg blue”, “Teal”, “Turquoise”, “Verdigris”, “Viridian”, “Air Force blue”, “Alice blue”, “Azure”, “Baby blue”, “Bleu de France”, “Blue”, “Bondi blue”, “Brandeis blue”, “Cambridge Blue”, “Carolina blue”, “Ceil”, “Cerulean”, “Cobalt blue”, “Columbia blue”, “Cornflower blue”, “Cyan”, “Dark blue”, “Deep sky blue”, “Denim”, “Dodger blue”, “Duke blue”, “Egyptian blue”, “Electric blue”, “Eton blue”, “Federal blue”, “Glaucous”, “Han blue”, “Iceberg”, “Indigo”, “International Klein Blue”, “Iris”, “Light blue”, “Majorelle Blue”, “Maya blue”, “Midnight blue”, “Navy blue”, “Non-photo blue”, “Palatinate blue”, “Periwinkle”, “Persian blue”, “Phthalo blue”, “Powder blue”, “Prussian blue”, “Royal blue”, “Sapphire”, “Sky blue”, “Steel blue”, “Teal”, “Tiffany Blue”, “True Blue”, “Tufts Blue”, “Turquoise”, “UCLA Blue”, “Ultramarine”, “Yale Blue”, “Amethyst”, “Byzantium”, “Cerise”, “Eggplant”, “Fandango”, “Fuchsia”, “Han purple”, “Heliotrope”, “Indigo”, “Iris”, “Lavender (floral)”, “Lavender”, “Lavender blush”, “Lilac”, “Magenta”, “Mauve”, “Orchid”, “Palatinate purple”, “Periwinkle”, “Persian blue”, “Purple”, “Red-violet”, “Regalia”, “Rose”, “Sangria”, “Thistle”, “Tyrian purple”, “Violet”, “Wisteria”, “black”, “gray”, “silver”, “white”, “maroon”, “red”, “purple”, “fuchsia”, “green”, “lime”, “olive”, “yellow”, “navy”, “blue”, “teal”, “aqua”) and right(word, 2) = “JJ” # adjectives only please! # and gender = “F” group by 1, 2 order by 3 8 Anglia Ruskin University MA Applied Linguistics with TESOL 9.15 QUERY FOR RETRIEVING THE BACK CHANNELLING VOCABULARY select line, count(id) from friends where length(line) < 15 group by 1 having count(id) > 10 order by 2; # arbitrary 9.16 THE CORPUS The corpus is available for those who wish use it in further research. The text files are approximately 15 megabytes in size but compress to less than 4 megabytes. Corpus files and SQL statements for the creation of the relevant SQL tables are available free of charge at: https://sites.google.com/site/friendstvcorpus/ or via email at ayliffe.david@gmail.com 9 Anglia Ruskin University MA Applied Linguistics with TESOL 10. COPYRIGHT Attention is drawn to the fact that copyright of this Dissertation rests with: (i) Anglia Ruskin University for one year and thereafter with (ii) Mr. David Ayliffe This copy of the Dissertation has been supplied on condition that anyone who consults it is bound by copyright. This work may (i) be made available for consultation within Anglia Ruskin University Library or (ii) be lent to other libraries for the purpose of consultation or may be photocopied for such purposes. 10 </div> </div> </div> </div> </div> </div> </div> <div class="col-lg-3 col-md-4 col-xs-12"> <div class="panel-meta panel panel-info"> <div class="panel-heading"> <h2 class="text-center panel-title">anglia ruskin university a study into the use of linguistic ...</h2> </div> <div class="panel-body"> <div class="row"> <div class="col-md-12"> originated here in the UK however since then it has been exported to a further 42 countries. (Wikipedia, 2011). .... internet. Most of their text from daily news websites, exploiting their archives of material. Far from just .... The idea is simply ...... in a business environment one would expect also expect a greater tendency to talk. </div> <div class="col-md-12"> <div class="doc"> <hr /> <div class="download-button" style="margin-right: 3px; margin-bottom: 6px;"> <a href="https://p.pdfkul.com/download/anglia-ruskin-university-a-study-into-the-use-of-linguistic-_59ff27e21723dd1ed3abc691.html" class="btn btn-success btn-block"> Download PDF </a> </div> <div class="share-box pull-left" style="margin-right: 3px;">  <a href="http://www.facebook.com/sharer.php?u=https://p.pdfkul.com/anglia-ruskin-university-a-study-into-the-use-of-linguistic-_59ff27e21723dd1ed3abc691.html" target="_blank" class="btn btn-social-icon btn-facebook"> </a>  <a href="http://www.linkedin.com/shareArticle?mini=true&url=https://p.pdfkul.com/anglia-ruskin-university-a-study-into-the-use-of-linguistic-_59ff27e21723dd1ed3abc691.html" target="_blank" class="btn btn-social-icon btn-twitter"> </a> </div> <div class="fb-like pull-left" data-href="https://p.pdfkul.com/anglia-ruskin-university-a-study-into-the-use-of-linguistic-_59ff27e21723dd1ed3abc691.html" data-layout="button_count" data-action="like" data-size="large" data-show-faces="false" data-share="false"></div> <div class="clearfix"></div> <div class="row"> <div class="col-md-12" style="margin-top: 6px;"> 2MB Sizes 0 Downloads 128 Views </div> </div> <div class="clearfix"></div> <div class="row"> <div class="col-md-12"> <a data-toggle="modal" data-target="#report" style="color: #f44336;"> Report</a> </div> </div> </div> </div> </div> <h4 id="comment"></h4> <div id="fb-root"></div> <script> (function (d, s, id) { var js, fjs = d.getElementsByTagName(s)[0]; if (d.getElementById(id)) return; js = d.createElement(s); js.id = id; js.src = "//connect.facebook.net/en_GB/sdk.js#xfbml=1&version=v2.9&appId=266776430439748"; fjs.parentNode.insertBefore(js, fjs); }(document, 'script', 'facebook-jssdk')); </script> <div class="fb-comments" data-href="https://p.pdfkul.com/anglia-ruskin-university-a-study-into-the-use-of-linguistic-_59ff27e21723dd1ed3abc691.html" data-width="100%" data-numposts="6"></div> </div> </div> <div class="panel-recommend panel panel-success"> <div class="panel-heading"> <h4 class="text-center panel-title">Recommend Documents</h4> </div> <div class="panel-body"> No documents </div> </div> </div> </div> </div> <div class="modal fade" id="report" tabindex="-1" role="dialog" aria-hidden="true"> <div class="modal-dialog"> <div class="modal-content"> <form role="form" method="post" action="https://p.pdfkul.com/report/59ff27e21723dd1ed3abc691" style="border: none;"> <div class="modal-header"> <button type="button" class="close" data-dismiss="modal" aria-hidden="true">×</button> <h4 class="modal-title">Report anglia ruskin university a study into the use of linguistic ...</h4> </div> <div class="modal-body"> <div class="form-group"> <label>Your name</label> <input type="text" name="name" required="required" class="form-control" /> </div> <div class="form-group"> <label>Email</label> <input type="email" name="email" required="required" class="form-control" /> </div> <div class="form-group"> <label>Reason</label> <select name="reason" required="required" class="form-control"> <option value="">-Select Reason-</option> <option value="pornographic" selected="selected">Pornographic</option> <option value="defamatory">Defamatory</option> <option value="illegal">Illegal/Unlawful</option> <option value="spam">Spam</option> <option value="others">Other Terms Of Service Violation</option> <option value="copyright">File a copyright complaint</option> </select> </div> <div class="form-group"> <label>Description</label> <textarea name="description" required="required" rows="3" class="form-control"></textarea> </div> <div class="form-group"> <div style="display: inline-block;"> <div class="g-recaptcha" data-sitekey="6LeP2DsUAAAAAABvCByMZRCE253cahUVoC_jPUkq"></div> </div> </div> <script src='https://www.google.com/recaptcha/api.js'></script> </div> <div class="modal-footer"> <button type="button" class="btn btn-default" data-dismiss="modal">Close</button> <button type="submit" class="btn btn-primary">Save changes</button> </div> </form> </div> </div> </div>  <div class="modal fade" id="login" tabindex="-1" role="dialog" aria-labelledby="myModalLabel"> <div class="modal-dialog" role="document"> <div class="modal-content"> <div class="modal-header"> <button type="button" class="close" data-dismiss="modal" aria-label="Close" on="tap:login.close">×</button> <h3 class="modal-title">Sign In</h3> </div> <div class="modal-body"> <form action="https://p.pdfkul.com/login" method="post"> <div class="form-group form-group-lg"> <label class="sr-only" for="email">Email</label> <input class="form-input form-control" type="text" name="email" id="email" value="" placeholder="Email" /> </div> <div class="form-group form-group-lg"> <label class="sr-only" for="password">Password</label> <input class="form-input form-control" type="password" name="password" id="password" value="" placeholder="Password" /> </div> <div class="form-group form-group-lg"> <div class="checkbox"> <label class="form-checkbox"> <input type="checkbox" name="remember" value="1" /> Remember Password </label> <label class="pull-right"><a href="https://p.pdfkul.com/forgot">Forgot Password?</a></label> </div> </div> <button class="btn btn-lg btn-primary btn-block" type="submit">Sign In</button> </form> </div> </div> </div> </div>  <div class="footer-container" style="background: #fff;display: block;padding: 10px 0 20px 0;margin-top: 30px;"> <hr /> <div class="footer-container-inner"> <footer id="footer" class="container"> <div class="row">  <section class="block col-md-4 col-xs-12 col-sm-3" id="block_various_links_footer"> <h4>Information</h4> <ul class="toggle-footer" style=""> <li><a href="https://p.pdfkul.com/about">About Us</a></li> <li><a href="https://p.pdfkul.com/privacy">Privacy Policy</a></li> <li><a href="https://p.pdfkul.com/term">Terms and Service</a></li> <li><a href="https://p.pdfkul.com/copyright">Copyright</a></li> <li><a href="https://p.pdfkul.com/contact">Contact Us</a></li> </ul> </section>  <section id="social_block" class="col-md-4 col-xs-12 col-sm-3 block"> <h4>Follow us</h4> <ul> <li class="facebook"> <a target="_blank" href="" title="Facebook"> Facebook </a> </li> <li class="twitter"> <a target="_blank" href="" title="Twitter"> Twitter </a> </li> <li class="google-plus"> <a target="_blank" href="" title="Google Plus"> Google Plus </a> </li> </ul> </section>  <div id="newsletter" class="col-md-4 col-xs-12 col-sm-3 block"> <h4>Newsletter</h4> <div class="block_content"> <form action="https://p.pdfkul.com/newsletter" method="post"> <div class="form-group"> <input id="newsletter-input" type="text" name="email" size="18" placeholder="Entrer Email" /> <button type="submit" name="submit_newsletter" class="btn btn-default"> </button> <input type="hidden" name="action" value="0"> </div> </form> </div> </div>  </div> <div class="row"> <div class="bottom-footer"> <div class="container"> Copyright © 2025 P.PDFKUL.COM. All rights reserved. </div> </div> </div> </footer> </div> </div>  <script> $(function () { $("#document_search").autocomplete({ source: function (request, response) { $.ajax({ url: "https://p.pdfkul.com/suggest", dataType: "json", data: { term: request.term }, success: function (data) { response(data); } }); }, autoFill: true, select: function (event, ui) { $(this).val(ui.item.value); $(this).parents("form").submit(); } }); }); </script>  <script async src="https://www.googletagmanager.com/gtag/js?id=G-VPK2MQK127"></script> <script> window.dataLayer = window.dataLayer || []; function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'G-VPK2MQK127'); </script> </body> </html>