etc.). Instead, XML is a meta-language which allows the end-user to define their own mark-up: XML provides the structure, and the mark-up is flexible. Herein lies the problem: a non-prescriptive approach is no better than no approach at all, and a number of texts randomly thrown into XML documents holds few if any advantages over their native-format equivalents. It was for this reason that the Text Encoding Initiative (TEI) was embarked upon.
Anglia Ruskin University MA Applied Linguistics with TESOL
In the early 90s work started on specifying a mark-up language for corpora. Problems with mark-up and annotation had arisen from incompatibility across corpora, and a standardized format had long been desired. The TEI1 aims to issue a standard regarding corpus design: a checklist of what to include, and what not to include, in corpus annotation. The body publishes documents in the format of document type declarations2 (or DTDs). Currently in its fifth incarnation, the documents specify how mark-up should be applied, allowing paralinguistic features to be captured alongside the text (see Figure 6).

Figure 6: The TEI mark-up of a punchline (own material)
For a corpus linguist, the level of mark-up one needs depends on the use and purpose of the corpus; one can opt for a simple cheese and tomato pizza (TEI Lite DTD – a minimal set) or a meat lovers supreme (TEI full). The specification is currently running into problems as the amount of mark-up present risks obscuring the 'wood for the trees' (echoed by Ochs, 1979), hence a layering approach has been proposed (Witt et al., 2009) which aims to compartmentalise annotation into multiple XML files, where each XML file is annotated around a theme (e.g. non-verbal actions, clauses, phonetics etc.) and users can select or deselect based on their requirements. Cook (1990, 1995) has criticized this approach, viewing annotation not as an extra topping but as a key ingredient, crucial to the interpretation of a speech event. Idealistically he is correct although, pragmatically, one is restricted either financially or temporally or both, and one must draw a line in terms of how detailed the annotation of texts should be.
1 http://www.tei-c.org/index.xml (Accessed 11th April 2011)
2 http://www.tei-c.org/Guidelines/DTD/ (Accessed 11th April 2011)
'Mark-up' and 'annotation' are two key terms frequently used in describing corpus linguistics, and a distinction is required. Mark-up consists of three strands. Structural mark-up is usually textual and contextual information such as who the speaker is, their age, gender, the location of the utterance and the utterance's place in the grander scheme of the work. Part-of-speech mark-up is added by a tagger and categorises words. Grammatical mark-up is the annotation of grammatical structures beyond the level of the word (e.g. phrases, clauses). Mark-up defines static factual data, and annotation is an umbrella term for the three different kinds of mark-up. Together all three "add value" (McEnery et al., 2006, p.4) to a corpus by broadening the scope of what can be analysed.
1. Annotation
   1.1. Structural mark-up
   1.2. Part of speech (POS) mark-up
   1.3. Grammatical mark-up
(Meyer, 2004)
A corpus is not just a series of texts; rather, a corpus attempts to represent some state of a language at some point in time (Biber et al., 2006). The COBUILD corpus, for example, is a 170-million-word bank of English and is "a sample of contemporary English - no more, no less" (Francis & Sinclair, 1994, p.190). The use of the corpus will shape its design, since a large multiple-genre online corpus will have vastly different requirements to those of a single-user genre-specific corpus. This ethos is epitomised by the International Corpus of English (ICE). Over twenty countries are participating in this ongoing scheme, which began in the 90s. The corpus as an entity consists of a one-million-word snapshot of each language and is made up of approximately 60% orthographically transcribed spoken English of various genres3. The corpus aims to represent the state of each language post-1989. Coordinated by UCL here in England, each 'team' has a very specific set of guidelines4 which they must follow regarding the transcription of such forms as vocalised pauses, overlapping speech and non-verbal utterances (Nelson, 1995; see also The ICE Project, 2009). Interestingly, while the entire BNC is TEI compliant, the ICE project is not, since the tags for this corpus were developed in the late 80s.
As of April 2011, only ten language files are available for academic analysis while the latest corpora are currently being tagged with the 'Constituent Likelihood Automatic Word-tagging System version 7' (CLAWS7 tagset hereafter), a grammatical tagset, and the 'UCREL Semantic Analysis System' (USAS hereafter) tagset, a semantic tagset (The ICE Project, 2009).

3 http://ice-corpora.net/ice/design.htm (Accessed 11th April 2011)
4 http://ice-corpora.net/ice/written.doc & http://ice-corpora.net/ice/spoken.doc (Accessed 11th April 2011)

As has been discussed in the previous section, spontaneous spoken discourse is notoriously difficult to both transcribe and then analyse given the large amount of false starts, repetition, elided phrases, ungrammatical forms, interruptions and the overwhelming reliance on context to discern meaning. Advances in computer technology have made this process somewhat easier and with every year that passes our competence in the field grows. Technologies such as ASR (automatic speech recognition) are evolving at tremendous rates. Just five years ago this author was part of a software engineering team responsible for a system which facilitated the transcription of hundreds of thousands of medical dictations daily. Doctors phoned one of the DVIs (digital voice interfaces) to which they dictated their patient diagnoses (dictations); these could then be converted from speech to text using our dedicated ASR engine within minutes, and a 30-minute dictation could be 'turned around' and emailed back to the doctor in a text-based report format within 6 hours. The accuracy of the automated transcription was so high that, while the dictation was still presented to a human, this process changed from one of blank-slate transcription to correcting the output the engine produced. Unfortunately, the pre-requisites for such a system were great and the costs very real, but such an approach was longitudinal and suited our environment where we had a finite number of dictators (although still approximately 100,000). There are obvious benefits of such a technology in speech corpus analysis and, just as the costs of storage have come down, one must anticipate an increase in the quality of ASR.
Semantic annotation remains a pipe dream in computational linguistics; although some innovative approaches to the problem are starting to emerge, huge problems still remain. On the back of this problem, Essex University have devised 'Phrase Detectives', essentially a game where users have to compete with one another to identify anaphoric, cataphoric or exophoric references in the presented text (see Figure 7 on the next page). The idea is simply to get others to compete to do the time-consuming and expensive annotation. Sinclair (2002) warns that the introduction of a human element results in a decline in the consistency of annotation; undoubtedly true, although this issue has been anticipated in this innovative project: the same phrase is detected by many different competitors, and other checks and balances are in place (see Chamberlain et al., 2008) to tease out the correct answer by a law of averages.
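The aggregation by 'law of averages' can be sketched as a simple majority vote over competitors' judgements – a minimal illustration under the assumption of independent votes, not a description of the actual Phrase Detectives validation pipeline (see Chamberlain et al., 2008, for the real mechanism):

```python
from collections import Counter

def majority_annotation(judgements):
    """Return the annotation chosen by the most competitors,
    together with the proportion of agreement."""
    if not judgements:
        raise ValueError("no judgements supplied")
    counts = Counter(judgements)
    label, n = counts.most_common(1)[0]
    return label, n / len(judgements)

# Hypothetical judgements for one anaphoric link:
votes = ["antecedent-A", "antecedent-A", "antecedent-B", "antecedent-A"]
label, agreement = majority_annotation(votes)
# label == "antecedent-A", agreement == 0.75
```

With enough players per phrase, a high agreement score gives reasonable confidence in the winning annotation without any single trusted annotator.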
Figure 7: Phrase Detectives (http://www.phrasedetectives.org)
Figure 8: reCAPTCHA (from Google, 2011)
reCAPTCHA5 is another innovative solution in the field of distributed computing. Now owned by Google, reCAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart (the 're' is unexplained). Google is running a service to digitise the world's information and, separately, websites need a way to distinguish machines from humans. Therefore, since Google needs eyes to verify the text which their OCR (optical character recognition) software has been unable to identify, they make these images available, for free, to those who need an authentication system. The system reportedly (Google, 2011) displays over 100 million reCAPTCHAs (e.g. Figure 8) a day, and users around the world help Google fill in the blanks in their digital archive. It is an innovative solution to a big problem, and their paper (von Ahn et al., 2008) goes into the nuts and bolts of it in much more detail.
5 http://www.google.com/recaptcha (Accessed 24th May 2011)
In both of these projects the train of thought is the same, and while humans are still integral to the process the shift from transcribers at a computer to a distributed collaborative approach is an interesting one. Almost twenty years ago Sinclair stated: "As the size of corpora moves into the hundreds of millions, the futility of reliance on human intervention becomes clearer, although it can be temporarily obscured by throwing money at it" (Sinclair, 1992, p.382). No-one would dispute the effectiveness of humans, and these two projects have begun to show the direction in which humans can be utilised. The problem now becomes one of harnessing such a tool; easy enough for a colossus such as Google but difficult for academia. The grammatical mark-up of text is relatively low-tech and stable, allowing efforts to be focused elsewhere, and it must be to semantic annotation that attention now turns. Speech recognition too continues to evolve, and much of the literature on this subject is out of date. One hopes the heavyweights in the field (Microsoft, Philips et al.) continue to innovate, and this is a technology which looks set to be an integral part of corpus linguistics in the near future.
3. BUILDING AND USING THE CORPUS

The decisions underpinning the design of this corpus were made with the answers to two questions: 1) What is available? 2) What does the research question set out to find?
"An over-ambitious system could strangle a corpus at birth," warns Atkins et al. (1992, p.14), and while this project would ideally have drawn comparisons with a body of language of equal size from a non-American source, all attempts to harvest such data proved fruitless. British TV comedies are routinely too short (typically just 6 episodes per series) and none of them draws the fan base necessary to produce ready-to-use transcripts; in hindsight this was a blessing. In this study, the medium, genre and mode of delivery are all static variables, speaker details are available for use, and a limited amount of contextual information is available, some of which was used.
Corpus linguist Leech stated that "a great deal of spadework has to be done before the research results can be harvested" (Leech, 1998, p.17), and such a succinct phrase encapsulates a process of many man-hours and much experimenting. Getting the data into the database and into a reasonable state was a laborious task. As discussed, the design of a corpus is heavily influenced by its intended use, and given that this corpus will not be used in a high-transactional environment, nor will it be used by different researchers with very different needs, the design was made stable and static. This process is outlined below.
3.1 CREATING THE CORPUS
The transcripts for all 234 episodes were downloaded6 free of charge from the internet. These had been manually transcribed by a group of fans of the show. The steps taken to ensure that the scripts matched the actual spoken word are subject to scrutiny and will be clarified later.
One file represented one episode, although in three cases the scripts for two episodes were contained in the same file as they aired together. The files were in hypertext mark-up language (HTML) format – the standard format for web pages. All HTML mark-up was automatically removed from these files using a freely available application, "HtmlAsText.exe"7.

6 http://www.friendstranscripts.tk/ (Accessed 28th April 2011)

HTML file including examples of metadata (underlined): [Scene Central Perk, everyone's there.]
Text file with the HTML mark-up removed:

[Scene Central Perk, everyone's there.]
Monica: What you guys don't understand is kissing is as important as any part of it.
Joey: Yeah, right!.......Y'serious?
Phoebe: Oh, yeah!
Ross: (trying to ignore her) No. No.

Figure 10: Text file sans HTML mark-up (Excerpt from 0102.txt)
Once all files were in this state a further stage was run to remove all 'unnecessary' information from the files (specifically the notes in brackets). This resulted in a series of utterances aesthetically similar to:

Monica: What you guys don't understand is kissing is as important as any part of it.
Joey: Yeah, right!.......Y'serious?
Phoebe: Oh, yeah!
Ross: No. No.

Figure 11: Utterances before being loaded into the database
The reasons for removing this "data about data" (McEnery et al., 2006, p.22) were that:
a) Its presence may influence the results of the POS tagger.
b) Its presence may influence the results of the regular expressions.
c) It is entirely subjective, entered at the whim of the transcriber.
d) The non-linguistic actions are not the primary focus of the study.
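The two removal stages (HTML tags, then bracketed notes) can be sketched in a few lines of Python – an illustrative reconstruction, not the actual tool used (the real HTML stripping was performed with HtmlAsText.exe):

```python
import re

def strip_markup(line):
    """Remove HTML tags and parenthesised stage directions
    from a transcript line, then collapse leftover whitespace."""
    line = re.sub(r"<[^>]+>", "", line)       # drop HTML tags
    line = re.sub(r"\([^)]*\)", "", line)     # drop (stage directions)
    return re.sub(r"\s+", " ", line).strip()  # tidy whitespace

print(strip_markup("<b>Ross:</b> (trying to ignore her) No. No."))
# Ross: No. No.
```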
One item of metadata which has been retained in some form is the scene demarcations. For example: "[Scene Central Perk, everyone's there.]"
7 http://www.nirsoft.net/utils/htmlastext.html (Accessed 28th April 2011)
These markers preceded any dialogue to which that scene belonged. The indicators have been removed and do not exist in their original form; instead (please refer to Tables 3 & 4 on pages 44 and 45 respectively) each utterance of that scene has been marked up with a scene reference number. This enabled identification of scenes where only males speak (n=306), only females speak (n=241), scenes with both (n=1764) and scenes which feature an unidentified speaker (n=781).
There are 3092 unique scenes throughout the entire 10 seasons. Problems remain: these scene demarcations were marked sporadically, and some episodes (n=approximately 4) contain no scene information at all, therefore these episodes were watched and the scene information was marked manually. Being able to identify whether a scene contains only males or only females is vital for comparing both inter-group and intra-group linguistic strategies – statistics at the heart of this project. Initially this scene information was disregarded; however after the pilot it became apparent that knowledge of this would be valuable.
3.1.1 THE DATABASE
MySQL is a popular, high-performance database. It is open source, meaning it is available free for unrestricted use by all but commercial entities, and it was the de facto choice for a number of reasons:
1. The researcher has personal experience with the product, having used it extensively at undergraduate level.
2. It is known to be fast.
3. It allows regular expressions to be used to interrogate the data. Regular expressions are an intensely powerful text-manipulation language (discussed in detail in 3.2).
In light of these factors no other database system was considered. Loading the data from the file system into the database system was straightforward. The command below takes the text file ‘0102.txt’ (the text from season 1 episode 2) and loads it into the column named ‘line’ in the table named ‘friends’.
mysql> load data infile "c:\\friends\\0102.txt" into table friends lines terminated by "\r\n" (line) set filename = "0102.txt";
This command was repeated for each file, and the result was a 60,849-row table where each row represents an utterance. The data as an entity consists of over 4 million characters and over 2 million words. It is worth pointing out that each utterance is stored three times. This is best illustrated with an example (see Table 3 on page 44).
1. The 'original_line' column is identical to that contained in the downloaded transcript.
2. The 'line' column represents the 'tidy' version of the original line.
3. The 'metadata' column shows the line data with POS tags.
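Since the LOAD DATA command was repeated for each of the 234 files, the repetition can be scripted. The sketch below is a hypothetical helper, not part of the original workflow; it generates one statement per episode file, following the ssee naming convention (e.g. 0102.txt = season 1, episode 2):

```python
def load_statement(filename):
    """Build the repeated MySQL LOAD DATA statement for one
    transcript file (path and convention as described in the text)."""
    return (
        f'load data infile "c:\\\\friends\\\\{filename}" '
        f'into table friends lines terminated by "\\r\\n" '
        f'(line) set filename = "{filename}";'
    )

print(load_statement("0102.txt"))
```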
3.1.2 'TIDYING' THE DATA
Once the data had been loaded into the database it could be cleaned further. This essentially involved removing repeated words, truncating superfluous characters and capitalizing the first character of each utterance. While the CLAWS tagger does allow for the presence of repeated words, it was considered desirable to remove them. Originally these steps were done purely for the benefit of the analyst; however they also had the secondary benefits of (1) marginally speeding up query response time and (2) minimizing the margin for error when using regular expressions.
The following are examples of changes which have been made (not a definitive list):
a) No no no no no no no no no → No no
b) Whaaaaaaaaaaaaaaaaaaatsup!!!!!!!!!! → Whatsup!
c) It was- I mean he did it → It was I mean he did it
d) I was so so so so so so happy to see him → I was so so happy.
e) I was soo happy → I was so happy.
Figure 12: Tidy data (not genuine utterances)
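Rules (a), (b) and (d) in Figure 12 can be approximated with regular expressions. The sketch below is an illustrative reconstruction – the exact thresholds used are assumptions, and case (e) ('soo' → 'so') would require dictionary knowledge not shown here:

```python
import re

def tidy(utterance):
    """Collapse runs of a repeated word to at most two, trim runs of
    repeated characters, and capitalize the first character."""
    # "so so so so happy" -> "so so happy"
    utterance = re.sub(r"\b(\w+)(\s+\1\b){2,}", r"\1\2", utterance,
                       flags=re.IGNORECASE)
    # "Whaaaaat" -> "What"
    utterance = re.sub(r"(\w)\1{2,}", r"\1", utterance)
    # "!!!!" -> "!"
    utterance = re.sub(r"([!?.])\1+", r"\1", utterance)
    return utterance[:1].upper() + utterance[1:]

print(tidy("no no no no"))      # No no
print(tidy("Whaaaaatsup!!!!"))  # Whatsup!
```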
After an initial pilot in which the first twenty thousand utterances were POS tagged, tagging inconsistencies were noticed. This was to be expected, since misspellings and protruding hyphens in words (e.g. "was-") caused the word to be tagged as an unknown word. There is no good reason these protruding characters should be present, and since they were present in the original HTML files one can only attribute this to transcriber tardiness. This process of removal could only be partly automated, as the hyphen serves a morphological function in words such as "ex-girlfriend".
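A sketch of the partly-automated hyphen removal follows; the heuristic (stripping only hyphens that dangle before whitespace or the end of the line) is an assumed plausible rule, not the documented procedure:

```python
import re

def strip_protruding_hyphens(text):
    """Remove hyphens that dangle at the end of a word (e.g. "was-")
    while leaving intra-word hyphens such as "ex-girlfriend" intact."""
    return re.sub(r"(\w)-(?=\s|$)", r"\1", text)

print(strip_protruding_hyphens("It was- I mean he did it"))
# It was I mean he did it
print(strip_protruding_hyphens("my ex-girlfriend"))
# my ex-girlfriend
```

Any remaining ambiguous cases (a genuine morphological hyphen at a line end, say) would still need manual review, which is why the process was only partly automatic.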
Regarding misspellings, the BNC holds a 'control list' of all permissible non-standard forms such as gonna, y'know, 'cause and so on (Crowdy, 1994). Such considerations are applicable when dealing with a 100-million-word multiple-genre corpus; however the overhead associated with creating such a list was considered too great in this instance. Consequently the forms 'uh', 'yep', 'yeah', 'y'know', ''cause' and 'gonna' were all left unaltered. These forms represented no challenge for the POS tagger, which tagged them correctly.
A semi-rigorous form of normalisation was taken to correct multiply-occurring misspellings. Identifying misspellings was a manual exercise, aided greatly by Microsoft Word's ability to automatically underline incorrectly spelled pasted text. Priority was given to those misspellings which were important to this study, although very few were actually identified, and American spellings remained in place. While not a misspelling, one pertinent modification to the data was made:

All right → alright

After a pilot run trying to extract tag questions of the form "You're OK, right?", a large amount of 'pollution' was identified as the result of 'all right' being two separate strings, therefore this string was normalised to the one-word equivalent. Numerous other changes were made; for example, all commas were removed from the metadata column (only) – again, their presence was sporadic and arguably unnecessary in transcribed speech. All changes were made with the intention of limiting the pollution of the results.
It is important to note that where changes were made the original utterance exists unaltered in a separately stored field in the database (the column: ‘original_line’). This is important for a number of reasons:
Future researchers using this same data can see clearly what has been changed.
Any undesirable changes or unexpected side effects can be rolled back, starting afresh with the original data.
The integrity of the data can be verified.
This 'original' column contains the utterance almost exactly as it was in the downloaded transcripts. 'Almost exactly' because some modifications were made to this data to get it to load into the database. Problems were encountered with non-ASCII characters. A full discussion of Unicode vs. ASCII is not warranted, but briefly: Unicode is a character set which allows every character in every language to be represented – the holy grail of internationalisation – while ASCII is a 128-character hangover from short-sighted decisions made decades ago, a subset representing only Western English characters. Transcribers had, in places, used non-English languages and non-standard punctuation outside of the 128-character range ASCII provides. There are at least two occasions in the series when a language other than English is used, and the transcribers had transcribed this speech in the character set of that foreign language. Support for Unicode (UTF-8 being the encoding) is an integral part of every web browser and hence the documents would have rendered properly; however trying to load such characters into an ASCII-compliant database proved futile, and hence all non-ASCII characters were either removed or changed to their closest ASCII equivalent. In this situation the information was lost; however, given that this only applied to non-English characters, it was not perceived as a big problem.
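The fold-to-closest-ASCII step can be illustrated with Python's standard library – an illustration of the general technique, as the actual conversion tool used is not specified:

```python
import unicodedata

def to_ascii(text):
    """Replace accented characters with their closest ASCII
    equivalent and silently drop anything that has none."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("café au lait"))  # cafe au lait
```

Characters with no decomposition (e.g. dashes outside ASCII, or whole non-Latin scripts) are simply lost, which matches the information loss described above.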
Grammatically, the POS tagger was intelligent enough to recognize syntactic context: given a mistyped "Its my apple" in place of "It's my apple", it applied the correct part of speech despite the inaccuracy. This came as a huge relief, as the transcription of possessives and contractions in places left a lot to be desired, and the only way to have corrected such instances would have been very time-consuming indeed.
3.1.3 TAGGING THE DATA
Garside et al. (1997) define corpus annotation as "the practice of adding interpretative, linguistic information to an electronic corpus of spoken and/or written data" (p.2). It is asking a great deal to expect a "standard notation to provide a usable common framework for the actual categories used in grammatical tagging" (Atkins et al., 1993, p.24), and consequently there are a variety of different part-of-speech annotation tagsets:
AMALGAM tagset used in the Brown Corpus and the London-Lund Corpus.
CLAWS1 tagset used in the Lancaster-Oslo/Bergen Corpus and the Spoken English Corpus (with minor changes).
CLAWS5 tagset used in the BNC.
CLAWS7 tagset used in ICE (and part of the BNC).
UPenn tagset used in the University of Pennsylvania Corpus.
They are all similar, with variations in the level of detail they attempt to extrapolate from the text and different codes for the parts of speech; many (e.g. CLAWS) have been spawned from, or supersede, a previous version. Once the data was loaded into the database the text was annotated. The automatic tagging of text is a service which is rarely free – either in terms of financial cost or in terms of errors in the annotated text. The criteria for the selection of a tagger were that it must (1) be available in an unhampered form free of charge, (2) provide a reasonably low error rate, and (3) have 'stood the test of time' academically. Ultimately a decision was made to use CLAWS7. It provides the largest tagset of all CLAWS versions (146 categories – although
CLAWS8 is in the pipeline, which is set to surpass it). The rationale behind this choice is that such a large and rich tagset provides "distinct codings for all classes of words having distinct grammatical behaviour" (Garside et al., 1987, p.167), and it has an established history. The annotation of data using this tagset has been proven to have an error rate of marginally more than 1% in the processing of both written and spoken data (1.14% and 1.17% respectively; UCREL, 2000).
The conundrum posed by Atkins et al. (1992) is one of determining whether 'a male frog' is a noun phrase made up of an adjective and a noun, or a straightforward compound noun. I argue that, as long as the behaviour of this classification is consistent, and since comparisons with other corpora will be limited, this is an issue which was given due consideration but ultimately was not a show-stopper. Moreover, in selecting the CLAWS tagger, comparisons with other corpora marked up with the same tagset would be valid.
3.1.4 DETERMINING AND ASSIGNING GENDER
In this corpus, gender was assigned manually. A query determined the most vocal actors, and priority was given to accurately assigning these people's gender, as this was imperative to the aims of this study. The gender of the other actors was ultimately guessed, but using background knowledge of the series and a broader knowledge of which names are male and which are female.
Post-assignment, 56% of all actors/roles remained un-gendered (n=479, total=788); however this accounted for less than 3.8% of the utterances. Where characters are void of an assigned gender, this is due to a number of reasons:
The character name is ambiguous e.g. “Realtor”.
The character name could refer to one or more roles played by one or more genders e.g. “Nurse”, “Student”.
The gender of the name was ambiguous e.g. “Max”, “Sam”, “Jessie”, “Alex” etc.
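The assignment logic can be imagined as a lookup table with an explicit unknown category; the name lists below are illustrative assumptions, not the study's actual mapping:

```python
# Hypothetical name-to-gender lookup; ambiguous or role-based
# names deliberately fall through to "U" (unknown).
GENDER = {"RACHEL": "F", "MONICA": "F", "PHOEBE": "F",
          "ROSS": "M", "CHANDLER": "M", "JOEY": "M"}
AMBIGUOUS = {"MAX", "SAM", "JESSIE", "ALEX", "NURSE", "STUDENT", "REALTOR"}

def assign_gender(speaker):
    """Return 'M', 'F', or 'U' for a speaker label."""
    if speaker in AMBIGUOUS:
        return "U"
    return GENDER.get(speaker, "U")

print(assign_gender("PHOEBE"))  # F
print(assign_gender("SAM"))     # U
```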
3.1.5 THE INTEGRITY OF THE DATA
A number of episodes (n=4) were viewed and checked for accuracy, and very few discrepancies were found, establishing confidence that the transcribed format accurately matched the actual spoken word. While not 100% accurate, the corpus was certainly close enough, and any attempt to rectify the remaining inaccuracies would have been prohibitively time-consuming. As Burnard (1996) notes in relation to the BNC, manually typing in the text when transcribing spoken data "was the only option… and proved to be very expensive and time-consuming, in part because of the very high standards set for data capture".
Viewing these four episodes had the added advantage that one could visually identify the gender of the actors who were programmatically marked as unknown. As an example, in episode 3 of season 2 [after viewing the episode] the waiter's gender was changed, now correctly being marked as male:

"WAITER" (U) → "WAITER IN 203" (M)
The waiter in this episode only was renamed and the gender of this 'new' waiter was assigned. There were many "waiters" across the ten seasons and, importantly, this change did not affect any other occurrences, all of whom remained un-gendered. The 10 most vocal actors are detailed here:

Rank   Person      Gender   # of utterances
1      RACHEL      F        9217
2      ROSS        M        9031
3      CHANDLER    M        8370
4      MONICA      F        8335
5      JOEY        M        8131
6      PHOEBE      F        7461
7      MIKE        M        359
8      ALL         U        345
9      RICHARD     M        281
10     JANICE      F        217

Table 1: 10 most vocal actors
A brief overview of the corpus demographics is given here:

                                                                Raw count   % of corpus
Total count of utterances from males                               29,706          48.8
Total count of utterances from females                             28,842          47.4
Total count of utterances from unknown speakers                     2,301           3.8
                                                                                 (100%)
Total count of utterances from males in male-only scenes            3,784           6.2
Total count of utterances from females in female-only scenes        3,357           5.5
Total count of utterances from either gender in mixed-gender
scenes                                                             34,667          56.9
Total count of utterances from anyone in scenes featuring an
unknown speaker                                                    19,040          31.3
                                                                                 (100%)

Table 2: Data counts
It is vital to note that, in the interests of integrity, all scenes which featured an unknown speaker were excluded from the results. Obviously they were not included in the single-gender results, since the speakers have not been determined to be of either male or female gender, but the decision was also taken not to include them in the mixed-gender results. The role of these un-gendered actors is problematic. Consider the following: a scene with all three main male actors, each happily holding his own in the ensuing discussion, until at some point they all exclaim the same phrase/word together. This synchronous proclamation could have been transcribed as: "ALL: No!" Were this scene to be included in the mixed-gender data, a certain level of pollution would occur. Furthermore, the mean and the median scene length are closely aligned (20 and 17 utterances per scene respectively). Fifteen scenes have an utterance count of more than 100, and given that the average episode length is 268 utterances, it is naïve to assume that scenes were accurately marked. In light of this sketchy scene demarcation it was anticipated that numerous scenes had been marked as one. This is a problem because, if a 'super-scene' features a number of single-gender 'mini-scenes' and this super-scene were included in the mixed-gender statistics, then, again, the data would be subject to a certain amount of pollution.
The decision to discard all data featuring an unknown speaker was not one taken lightly, as it meant effectively throwing away a third of the corpus (31.3%, see Table 2 above). As undesirable as this was, it was ultimately deemed necessary in the interests of integrity. It is also unfortunate that the single-gender data is based on a mere 12% of the corpus (roughly 6% for each gender; again see Table 2). More data is always desirable, and while many studies have been done with much less, more data would have been advantageous in establishing greater confidence in the results. The only solution to this problem would have been to manually view every episode, marking the scenes accurately and the gender of the actors individually – a process which would have taken over 100 viewing hours.
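The exclusion rule amounts to a filter that discards any scene containing a 'U' speaker. A minimal sketch follows (illustrative only; the real filtering was done with SQL queries against the scene reference numbers):

```python
def usable_scenes(scenes):
    """Keep only scenes in which every speaker has a known gender;
    scenes containing any 'U' speaker are discarded wholesale."""
    return {scene: genders for scene, genders in scenes.items()
            if "U" not in genders}

# Hypothetical scene-number -> set-of-speaker-genders mapping:
scenes = {1: {"M", "F"}, 2: {"M", "U"}, 3: {"F"}}
print(sorted(usable_scenes(scenes)))  # [1, 3]
```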
3.2 OBTAINING THE RESULTS
Regular expressions are an intensely powerful pattern-matching cum pseudo-programming language in computing. Their use provides a concise method for matching very complicated arrangements of strings. This concept is best illustrated with an example:
The form of a tag question is relatively static: there is a statement, and then the tag is the inversion of the auxiliary verb followed by a pronoun. Syntactically this gives rise to a great number of variations:

i. You didn't like it, did you?
ii. He didn't like it, did he?
iii. They didn't like it, did they?
iv. They didn't like the fact that you left them on their own while you went off gallivanting around town with their friend whom they didn't like, did they?
v. You like it, don't you?
vi. You like it, do you not?
vii. You don't like it, do you?
viii. You liked it, didn't you?
ix. You liked it, did you not?
x. You didn't like it, did you?

(not a definitive list)
The same pattern can also be repeated with other auxiliary tags (and their negatives) such as:

1) is it?
2) are you?
3) have you?
4) will you?

Notice also how the tag "right" can be applied in place of any tag, for example: "You didn't like it, right?"
Coupling the power of regular expressions with the power of a database makes retrieving these occurrences easier than it would otherwise be. Below is an early incarnation of the query used to extract tag questions from the database; line numbers have been added to aid the explanation.
1  # tag questions ("You had sex with her, didn't you?")
2  select id, person, gender, line
3  from friends
4  where metadata regexp ".*VDD.{0,15}(XX.{0,15})?PPY.{0,5}[[.question-mark.]]"
Example of “line” field:
You had sex, didn't you?
Example of “metadata” field:
You_PPY had_VHD sex_NN1 ,_, did_VDD n't_XX you_PPY ?_?
Line 1 is a comment line indicating what the query does, with an example; it performs no function. Line 2 indicates the items of data the query will return. Line 3 declares the table to be queried. Line 4 is the key line: the field metadata is queried for every one of the sixty thousand utterances, and if (and only if) the regular expression matches will the matched data be displayed. The regular expression (between the "" in line 4) indicates:

.* = match any character (.) any number of times (*). This is used because tag questions cannot be guaranteed to occur at the start of an utterance.
VDD = after the previous criteria have been met, find an utterance with this tag. This is a tag used in the CLAWS 7 tagset8; it represents ‘did’.
.{0,15} = then, after no fewer than 0 but no more than 15 characters [after matching ‘did’] do the following:
(XX.{0,15})? = possibly (?) match any negation tag (XX), e.g. “not”, “n’t”, and then after no fewer than 0 but no more than 15 characters do the following:
PPHS1 = find a second person personal pronoun, i.e. you.
.{0,5} = then after no more than 5 characters
[[.question-mark.]] = match a question mark. The symbol ? is a special character so this alternative form is used.
Note how this query pays no attention to the front of the tag; it is purely concerned with how the tag ends. This simple regular expression will match tag questions such as:
8 http://ucrel.lancs.ac.uk/claws7tags.html (Accessed 8th June 2011)
You saw him, did you?
Anything can go here, didn’t you?
Ross! Did you see? You said you’d be there. Didn’t you?
(not an exhaustive list)
While the following are missed (again, not an exhaustive list):
* You saw him didn’t you! (must end with a ?)
* You called her, did you not (grammatically a perfect tag question, but the inversion of the subject and the negation means the pattern is not matched)
* You called her didn’t you Rachel? (the ? must be no more than 5 characters away from the pronoun)
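For illustration, the behaviour of this pattern can be sketched in Python (an assumption for demonstration only – the study itself ran the expression inside MySQL, whose dialect uses [[.question-mark.]] where Python's uses an escaped \?):

```python
import re

# Python rendering of the MySQL pattern from line 4 of the query above.
# MySQL's [[.question-mark.]] collating element becomes \? in Python.
tag_question = re.compile(r".*VDD.{0,15}(XX.{0,15})?PPHS1.{0,5}\?")

# Illustrative CLAWS 7-tagged strings, constructed for this sketch
# (not drawn from the corpus itself).
hit = "He_PPHS1 did_VDD n't_XX like_VVI it_PPH1 ,_, did_VDD he_PPHS1 ?_?"
miss = "You_PPY called_VVD her_PPHO1 ,_, did_VDD you_PPY not_XX"  # no final '?'

print(bool(tag_question.search(hit)))   # True  - 'did ... he ?' is matched
print(bool(tag_question.search(miss)))  # False - a question mark is required
```

Running the expression over a handful of strings like this is a quick way to check what a candidate pattern catches and misses before committing it to a query.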
The regular expression the results are based on is much more complicated, as it attempts to deal with some of these issues (please see the Appendix for the exact expression). The MySQL web site9 provides more details about the use of regular expressions.
As has been demonstrated, the construction of regular expressions is a complex area and their accurate execution relies on accurate data. It is anticipated that expressions grammatically similar to “I sometimes like you is it?” are not present in the data, since no script writer is likely to get such a malformation of words past the editors and onto a primetime television show. Errors in the process of transcription do, however, make such structures possible, but given that this data was publicly available for many years, it is again anticipated that these errors have been ironed out. Given this complexity, the margin for error and the amount of manual effort taken to construct such queries, it is easy to see why the BNC facilitates a simple one-word search (Figures 13 & 14), while third-party interfaces are still fairly rigid.
9 http://dev.mysql.com/doc/refman/5.1/en/regexp.html (Accessed 8th June 2011)
Figure 13: User interface for the BNC
Figure 14: Custom BNC User interface at http://corpus.byu.edu/bnc/
3.3 A SUMMARY OF THE DATA AND WHAT IT LOOKS LIKE

ID | Scene | Person | Gender | Original Line | Line | Metadata | File
101 | 1 | MONICA | F | Monica: There's nothing to tell! He's just some guy I work with! | There's nothing to tell! He's just some guy I work with! | There_EX 's_VBZ nothing_PN1 to_TO tell_VVI !_! He_PPHS1 's_VBZ just_RR some_DD guy_NN1 I_PPIS1 work_VV0 with_IW !_! | 0101.txt
201 | 1 | JOEY | M | Joey: C'mon, you're going out with the guy! There's gotta be something wrong with him! | C'mon, you're going out with the guy! There's gotta be something wrong with him! | C'm_VV0 on_RP you_PPY 're_VBR going_VVG out_RP with_IW the_AT guy_NN1 !_! There_EX 's_VHZ got_VVN ta_TO be_VBI something_PN1 wrong_JJ with_IW him_PPHO1 !_! | 0101.txt
301 | 1 | CHANDLER | M | Chandler: All right Joey, be nice. So does he have a hump? A hump and a hairpiece? | Alright Joey, be nice. So does he have a hump? A hump and a hairpiece? | All_RR21 right_RR22 Joey_NP1 be_VBI nice_JJ ._. So_RR does_VDZ he_PPHS1 have_VHI a_AT1 hump_NN1 ?_? A_AT1 hump_NN1 and_CC a_AT1 hairpiece_NN1 ?_? | 0101.txt
401 | 1 | PHOEBE | F | Phoebe: Wait, does he eat chalk? | Wait, does he eat chalk? | Wait_VV0 does_VDZ he_PPHS1 eat_VVI chalk_NN1 ?_? | 0101.txt
501 | 1 | PHOEBE | F | Phoebe: Just, 'cause, I don't want her to go through what I went through with Carl- oh! | Just, 'cause, I don't want her to go through what I went through with Carl- oh! | Just_RR 'cause_CS I_PPIS1 do_VD0 n't_XX want_VVI her_PPHO1 to_TO go_VVI through_II what_DDQ I_PPIS1 went_VVD through_RP with_IW Carl-_NN1 oh_UH !_! | 0101.txt
601 | 1 | MONICA | F | Monica: Okay, everybody relax. This is not even a date. It's just two people going out to dinner and- not having sex. | Okay, everybody relax. This is not even a date. It's just two people going out to dinner and not having sex. | Okay_RR everybody_PN1 relax_VV0 ._. This_DD1 is_VBZ not_XX even_RR a_AT1 date_NN1 ._. It_PPH1 's_VBZ just_RR two_MC people_NN going_VVG out_RP to_II dinner_NN1 and-_NN1 not_XX having_VHG sex_NN1 ._. | 0101.txt
701 | 1 | CHANDLER | M | Chandler: Sounds like a date to me. | Sounds like a date to me. | Sounds_VVZ like_II a_AT1 date_NN1 to_II me._NNU | 0101.txt
801 | 1 | CHANDLER | M | Chandler: Alright, so I'm back in high school, I'm standing in the middle of the cafeteria, and I realize I am totally naked. | Alright, so I'm back in high school, I'm standing in the middle of the cafeteria, and I realize I am totally naked. | Alright_RR so_CS I_PPIS1 'm_VBM back_RP in_II high_JJ school_NN1 I_PPIS1 'm_VBM standing_VVG in_II the_AT middle_NN1 of_IO the_AT cafeteria_NN1 and_CC I_PPIS1 realize_VV0 I_PPIS1 am_VBM totally_RR naked_JJ ._. | 0101.txt
901 | 1 | ALL | U | All: Oh, yeah. Had that dream. | Oh, yeah. Had that dream. | Oh_UH yeah_UH ._. Had_VHD that_DD1 dream_NN1 ._. | 0101.txt
1001 | 1 | CHANDLER | M | Chandler: Then I look down, and I realize there's a phone... there. | Then I look down, and I realize there's a phone... there. | Then_RT I_PPIS1 look_VV0 down_RP and_CC I_PPIS1 realize_VV0 there_EX 's_VBZ a_AT1 phone_NN1 ..._... there_RL ._. | 0101.txt
1101 | 1 | JOEY | M | Joey: Instead of...? | Instead of...? | Instead_CS21 of_CS22 ..._... ?_? | 0101.txt
1201 | 1 | CHANDLER | M | Chandler: That's right. | That's right. | That_DD1 's_VBZ right_JJ ._. | 0101.txt
1301 | 1 | JOEY | M | Joey: Never had that dream. | Never had that dream. | Never_RR had_VHD that_DD1 dream_NN1 ._. | 0101.txt
1401 | 1 | PHOEBE | F | Phoebe: No. | No. | No._NN1 | 0101.txt
– | 1 | CHANDLER | M | Chandler: All of a sudden, the phone starts to ring. Now I don't know what to do, everybody starts looking at me. | All of a sudden, the phone starts to ring. Now I don't know what to do, everybody starts looking at me. | All_RR41 of_RR42 a_RR43 sudden_RR44 the_AT phone_NN1 starts_VVZ to_TO ring_VVI ._. Now_RT I_PPIS1 do_VD0 n't_XX know_VVI what_DDQ to_TO do_VDI everybody_PN1 starts_VVZ looking_VVG at_II me._NNU | 0101.txt
Table 3: The first 15 rows of the data
SCENE ID | DESCRIPTION
1 | [Scene: Central Perk, Chandler, Joey, Phoebe, and Monica are there.]
2 | [Scene: Monica's Apartment, everyone is there and watching a Spanish Soap on TV and are trying to figure out what is going on.]
3 | [Scene: The Subway, Phoebe is singing for change.]
4 | [Scene: Ross's Apartment, the guys are there assembling furniture.]
5 | [Scene: A Restaurant, Monica and Paul are eating.]
6 | [Scene: Monica's Apartment, Rachel is talking on the phone and pacing.]
7 | [Scene: Ross's Apartment; Ross is pacing while Joey and Chandler are working on some more furniture.]
8 | [Scene: A Restaurant, Monica and Paul are still eating.]
9 | [Scene: Monica's Apartment, Rachel is watching Joanne Loves Chaci.]
10 | [Scene: Ross's Apartment, they're all sitting around and talking.]
Table 4: Scene Data
START ID | END ID | LENGTH | INTERACTION
1 | 5101 | 52 | F,M,U
5201 | 10701 | 56 | F,M,U
10801 | 10801 | 1 | F
10901 | 12501 | 17 | M
12601 | 13401 | 9 | M,F
13501 | 13501 | 1 | F
13601 | 14401 | 9 | M
14501 | 15801 | 14 | M,F
15901 | 16001 | 2 | U,F
16101 | 16501 | 5 | M
In the ‘scenes’ data, an interaction marked ‘M,F’ indicates both male and female participants, while a scene marked ‘F,U’ indicates a female conversation with a person of undetermined gender; an interaction of simply ‘F’ or ‘M’ indicates an all-girl or an all-boy conversation. ‘Length’ indicates the number of utterances in the scene. The number of interlocutors is not stored, nor considered important – this information can be gleaned by using the ‘person’ field in the Friends table. The ‘start id’, ‘end id’ and ‘description’ fields are purely peripheral and served no purpose in this study; they were stored on the premise that they may have proved useful, and future researchers may indeed want this data. IDs in the Friends table have been incremented in steps of 100; this was done so that, if missing utterances were identified during the viewing of certain episodes, they could be ‘slotted in’ at the appropriate place.
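The effect of the step-of-100 ID scheme can be sketched briefly (a hypothetical illustration, not code from the study): a recovered utterance simply takes an unused ID between its neighbours, and the ordering stays intact without renumbering anything.

```python
import bisect

# Existing utterance IDs in a scene rise in steps of 100 (hypothetical values)
ids = [101, 201, 301, 401]

# A missed utterance identified on viewing can be 'slotted in' between
# 101 and 201 by giving it any unused intermediate ID, e.g. 151
bisect.insort(ids, 151)
print(ids)  # [101, 151, 201, 301, 401] - order preserved, nothing renumbered
```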
Scenes obviously have a one-to-many relationship with utterances, as shown below.

Figure 15: Links between data10

10 Created using MySQL Workbench (http://wb.mysql.com/) (Accessed 9th June 2011)
4. AN ANALYSIS AND DISCUSSION OF THE RESULTS

The results of this study have all revolved around what Lakoff (2004) has defined as typifying women’s language. These results have all been normalised to a common base of frequency per one thousand utterances; one thousand was chosen as the common base given the advice of Biber et al. (1998). This data has been calculated using the following formula:

rate per thousand = (raw frequency ÷ total utterance count) × 1,000
For example:

Gender | Interaction | Total Utterance Count | Count of “oh” | Rate per thousand
F | F | 3,357 | 553 | 164.73
F | M,F | 34,667 | 2,717 | 78.37
M | M | 3,784 | 324 | 85.62
M | M,F | 34,667 | 1,695 | 48.89
Table 5: Normalising the data. Statistics of “oh” hedge.
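The normalisation in Table 5 can be reproduced in a few lines (a sketch; the function name is our own, but the figures are those of the table):

```python
def per_thousand(count, total_utterances):
    """Normalise a raw frequency to a rate per 1,000 utterances."""
    return round(count / total_utterances * 1000, 2)

# The four "oh" rows of Table 5
print(per_thousand(553, 3357))    # 164.73 (female, same-sex)
print(per_thousand(2717, 34667))  # 78.37  (female, mixed-sex)
print(per_thousand(324, 3784))    # 85.62  (male, same-sex)
print(per_thousand(1695, 34667))  # 48.89  (male, mixed-sex)
```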
A brief overview of the corpus as documented in section 3 is provided again for clarity:

Category | Scene type | Male | Female
Percentages | Same-sex | 6.21% | 5.51%
Percentages | Mixed-sex | 28.62% | 28.34%
Utterances | Same-sex | 3,784 | 3,357
Utterances | Mixed-sex | 17,419 | 17,248
Table 6: Corpus Details

4.1 HEDGES

Category | Scene type | Male | Female
“Y’know” + “You know…” | Same-sex | 38.58 | 44.38
“Y’know” + “You know…” | Mixed-sex | 16.87 | 18.14
“I think…” | Same-sex | 13.48 | 11.92
“I think…” | Mixed-sex | 6.26 | 6.92
“I’m sure…” | Same-sex | 1.32 | 0.89
“I’m sure…” | Mixed-sex | 0.87 | 0.72
“… sort of…” + “… sorta…” | Same-sex | 0.79 | 1.79
“… sort of…” + “… sorta…” | Mixed-sex | 0.23 | 0.66
“I wonder…” | Same-sex | 0.23 | 0.89
“I wonder…” | Mixed-sex | 0 | 0.14
“I guess…” | Same-sex | 6.61 | 8.34
“I guess…” | Mixed-sex | 2.74 | 1.99
“Could you…” | Same-sex | 0.29 | 0.6
“Could you…” | Mixed-sex | 0 | 0.12
“Well…” | Same-sex | 51.8 | 48.85
“Well…” | Mixed-sex | 21.95 | 22.59
“Oh…” | Same-sex | 85.62 | 164.73
“Oh…” | Mixed-sex | 48.89 | 78.37
Table 7: Hedges Results
Lakoff coined the name ‘hedge’, so it is apt to use her definition of what constitutes one. She states that hedges are "words that convey the sense that the speaker is uncertain about what he (or she) is saying, or cannot vouch for the accuracy of the statement" (2004, p.53).
A hedge is a mitigating device used to lessen the impact of an utterance. Typically hedges are adjectives or adverbs, but they can also consist of clauses, and they could be regarded as a form of euphemism. Hedges secure relationships and collaborative talk as they protect the interlocutors' feelings. Fishman (1978) coined the term ‘interactional shitwork’ to describe the work women have to do to maintain a conversation; in effect, women are thought to use discourse markers such as these hedging devices in lieu of the minimal responses they get from their male interlocutors.
Examples of hedges:
a. There might just be a few insignificant problems we need to address. (adjective)
b. The party was somewhat spoiled by the return of the parents. (adverb)
c. I'm not an expert but you might want to try restarting your computer. (clause)
Hedges may intentionally or unintentionally be employed in both spoken and written language since they are crucially important in communication; they also help speakers and writers communicate more precisely the degree of accuracy and truth in their assessments. For instance, in “All I know is smoking is harmful to your health”, ‘all I know’ is a hedge that indicates the degree of the speaker’s knowledge rather than simply making the statement “Smoking is harmful to your health”. There are three different types of hedges (Lakoff, 2004):
I. Fully legitimate – the speaker is genuinely unsure of the facts
II. Justifiable – used for the sake of politeness
III. Neither of the above
It is this third case which Lakoff highlights as typifying ‘women’s language’. Herein lies the fundamental problem of qualitatively determining which type a given hedge belongs to; and even if one can objectively determine the function at an utterance level, repeating this across a 2 million word corpus is a huge problem.
The concept of using ‘y’know’ as a marker of solidarity is a popular one (Schiffrin, 2001; Fraser, 1990); however, the phrase also serves as an integral part of the syntax which cannot be omitted. Consider the following:
a) Well, you know how we’re different.
b) If you know somebody who’s there you know if you’re going to stay.
c) Whether you know it or not…
d) You know Jim Sellars the M.P.?
e) It’s not what you know who you know.
(Taken from Macaulay, 2002, pp.751-752)
To say that ‘if the constituent can be omitted it is a true discourse marker, otherwise it is syntactically necessary’ is a grammatical oversimplification; in any case, quantifying such occurrences over a vast array of data is again impossible. To identify and count each instance where “you” and “know” (or their contracted forms) occur side-by-side is, however, trivial, and the statistics presented do just that.
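That purely mechanical count can be sketched as follows (a Python illustration under our own assumptions about spelling variants; the study's actual counting was done with MySQL regular expressions):

```python
import re

# Count every side-by-side "you know" / "y'know", blind to discourse function
you_know = re.compile(r"\b(?:you|y')\s?know\b", re.IGNORECASE)

utterances = [
    "Well, you know how we're different.",  # syntactically necessary
    "Y'know, that's pretty good.",          # plausibly a discourse marker
    "Do you know Jim?",                     # genuine verb phrase
]

# All three are counted alike - the regex cannot tell the functions apart
total = sum(len(you_know.findall(u)) for u in utterances)
print(total)  # 3
```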
‘Sort of’ is no less problematic: it is and is not a hedge depending on its use. The statement ‘he is sort of tall’ is not a hedge, it is merely descriptive. Its function as a discourse marker is only ascertainable through conversation analysis and with an intense knowledge of the interlocutors, although even then opinion may well be divided – tallness is, after all, a subjective quality.
The discourse marker ‘well’ is also fraught with problems. Firstly, it has several homonyms:
a) As a manner adverb - She draws well.
b) As a degree word - You know that perfectly well.
c) As a noun - Everyone digs their own well.
d) And as a verb - Tears well in my eyes.
(Taken from Jucker, 1993, p.436)
Secondly, Labov and Fanshel (1977) state that using ‘well’ alludes to an item of shared knowledge, going on to detail how its use shows a “joint concern” (p.157) with the topic at hand. Quite how a third party can ascertain whether a statement is shared knowledge or not remains unknown. The word has many other interpretations (see Schiffrin, 1985; Lakoff, 1973, 1973b), from indicating an insufficiency in the answer to acting as “a qualifier and as a frame” (Jucker, 1993, p.437) marking a direct response to the utterance before it.
A: Did you kill your wife? B: Well, yes…. (Lakoff, 1973b, p.459)
Statistically (see Figures 16 & 17 below), ‘oh’ and ‘well’ are the most heavily used markers; interestingly, these are also the only two discourse markers which Schiffrin (1987) describes as having no meaning. In line with the observations of Sacks et al. (1974), the data supports the conclusion that occurrences of ‘well’ overwhelmingly begin turns. The Friends data shows a heavy bias towards using these markers in single-gender conversations, and generally females prefer their use more than males in single-gender conversation. The use of almost all of these devices heavily outweighs occurrences in the BNC – an unsurprising fact given the broad genre of speech acts in the BNC. While the use of ‘sort of’ may be of comparable frequency, this is just 0.7 occurrences per thousand. The data shows that females prefer the use of ‘oh’ and ‘sort of’ almost twice as much as males, and interestingly they have an overwhelming tendency to use these devices in the company of other women. ‘I think’, ‘Y’know’, ‘I guess’ and ‘Well’ are all used approximately twice as much in single-gender conversations as in mixed ones, regardless of gender. Rachel and Phoebe frequently use ‘oh’ as a single-word interjection; Rachel uses it 50% more than Phoebe, who uses it twice as much as any male character. These interjections are in effect back-channels (discussed later) as they do not denote an intention to speak. On the function of hedges, the data disagrees with Lakoff’s definition of hedges as women’s language.
Figure 16: Use of Hedges #1 (utterances per thousand of ‘I think…’, ‘…Y’know…’, ‘I guess…’, ‘Well,…’ and ‘Oh…’ for female and male speakers in single-sex and mixed-sex conversations, plus BNC Spoken)
Figure 17: Use of Hedges #2 (utterances per thousand of ‘I’m sure…’, ‘Could you… …please?’, ‘I wonder…’ and ‘…sorta/sort of…’ for female and male speakers in single-sex and mixed-sex conversations, plus BNC Spoken)
4.2 TAG QUESTIONS

Category | Scene type | Male | Female
Positive stem, negative tag | Same-sex | 0.26 | 1.49
Positive stem, negative tag | Mixed-sex | 0.66 | 0.55
Negative stem, positive tag | Same-sex | 2.64 | 2.09
Negative stem, positive tag | Mixed-sex | 0.78 | 1.12
“, right?” | Same-sex | 6.34 | 2.09
“, right?” | Mixed-sex | 2.71 | 1.67
Table 8: Tag Questions Results
Tottie & Hoffmann (2006) highlighted the 15 most frequent tags according to their data from the BNC and the Longman Spoken American Corpus (LSAC); priority was given to ensuring that the pattern-matching expression retrieved these tags.
"To my knowledge there is no syntactic rule in English that only women may use. But there is at least one rule that women will use in more conversational situations than a man. This is the rule of tag-question formation" (Lakoff, 1973, p.53)
The whole premise behind studying tag questions is that epistemic modal tags express uncertainty and unassertiveness, and that in a “male dominated society, women are brought up to think of assertion, authority and forcefulness as masculine qualities which they should avoid” (Cameron et al., 1988, p.76). They have come to characterize women’s speech, as highlighted in the quote by Lakoff above. Pedagogically, Swan (2005) states that intonation is the key to determining the meaning: if we want to know something but are not sure of the answer we use a rising intonation; conversely, if the tag is not a real question, i.e. we are sure of the answer, we use a falling intonation. Dubois and Crouch have already faced the problem of identifying the different intentions of tag questions. In 1975 they set out to disprove Lakoff’s theory about tag questions and concluded her hypothesis to be invalid; in other words, they found that men used more of this particular structure than women.

1001201  EDDIE  You had sex with her didn’t you?

This conclusion on its own is rather enlightening; however, more information is needed, as their study involved only 33 tag statements, all by men, in a business conference. If men are vocal on the topic of abortion, then in a business environment one would also expect a greater tendency to talk. Holmes (1983) has also commented that men may be using this function in an assertive, challenging way.
Figure 18: Tag Questions (tag questions per thousand utterances for the Longman Spoken American Corpus; BNC spoken context-driven, spontaneous and average; and “Friends” female/male, mixed and single-gender conversations)
This data refutes the association between women and tag questions. Males use tag questions approximately twice as often as women in a single-sex environment, and males’ use in mixed-gender conversations is marginally greater than females’. As can be seen from the graph, the BNC spontaneous speech aligns closely to both the females’ single-gender use and the males’ mixed-gender use, providing a reliable benchmark and allowing the conclusion that, despite the studio setting, this feature appears to have been used naturally.
Where males do use tag questions, they have a strong preference – approximately 2:1 in each conversation setting – for the “, right?” tag instead of the conventional inverted auxiliary, and this is not a noticeable trait with the females. In these instances, the “, right?” tag is used overwhelmingly as a negative tag with a positive stem (e.g. “There’s more beer, right?”). Of interest is that females do use positive stems with negative tags more than men, approximately 5 times as much, although again by minimal amounts. For example:

1673801  Joanne to Rachel  …You didn’t tell him not to call me, did you?
Figure 19: Tag Question Details (average per thousand utterances of ‘positive stem, negative tag’, ‘negative stem, positive tag’ and ‘right?’ tags for female and male speakers in single-sex and mixed-sex conversations; values as in Table 8)
4.3 ASKING QUESTIONS

Category | Scene type | Male | Female
Q’s asked | Same-sex | 345.7 | 306.2
Q’s asked | Mixed-sex | 161.9 | 153.4
Table 9: Questions
Macaulay (2001) investigated the questioning strategies of 4 reporters, 2 male and 2 female, 1 working for CNN and 3 working for the CBC (Canadian Broadcasting Company). She found that the strategies were broadly the same between the genders across a total of 23 interviews: males preferred direct questioning (40% & 41% vs. 35% & 35%) while females preferred indirect strategies (37% & 31% vs. 19% & 21%).
Coates, investigating question use between the sexes, wrote that "questions can be used to seek information, to encourage another speaker to participate in talk, to hedge, to introduce a new topic, to avoid the role of expert, to check the views of other participants, to invite someone to tell a story" (1996, p.176). She documented that the use of questions differed depending on gender, with women using far fewer true ‘information seeking’ questions than men. She noted that the “maintenance and development of friendship” (p.176) was women's primary goal in asking questions, unlike men, who use them in a more direct ‘tell me something I don’t know’ kind of way.
The regular expression which garnered these results is incredibly simple (see the Appendix), checking only for the existence of a question mark somewhere in the utterance. Numerous trials were undertaken, but with auxiliaries being dropped and multi-word subjects/objects it was problematic to say the least. In essence any part of speech can be followed by a question mark in the correct context. Consider:
a. Hot? (adjective phrase)
b. A book? (noun phrase)
c. Really? (adverbial phrase)
Therefore these results rely wholly on the accurate transcription of the spoken dialogue, and questions from the canonically explicit to the ambiguously indirect are represented in this one statistic without differentiation.
22801  RACHEL  Guess what…?
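Reduced to its essentials, the check can be sketched like this (a Python paraphrase of the idea; the exact MySQL expression is in the Appendix):

```python
# An utterance counts as a question if a question mark appears anywhere in it
def is_question(utterance):
    return "?" in utterance

for line in ["Guess what…?", "Hot?", "A book?", "You said you'd be there."]:
    print(is_question(line), "-", line)
# True for the first three, False for the last
```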
The rationale behind the importance of this statistic is that Lakoff bemoaned that asking questions represents a linguistic insecurity “resulting from the oppression of women” (Fishman, 1978, p.400); while such propaganda might have had support twenty years ago, today such a statement is laughable. Fishman observed that women ask almost 300% more questions than men. This is paradoxical since requests for information – typically realised by questions – are, by definition, aggravating or face threatening. Fishman hypothesized that females' greater use stems from their attempt to solve the conversational problem of gaining a response to their utterances (see Rachel’s utterance above). Questions such as this function as a conversational lubricant, greasing the wheels of small talk. Much like the child who has just discovered the phrase ‘…but why...?’, questions invariably “evoke further utterances”, keeping the conversation alive (p.400). Freed and Greenwood (1996) found few differences between the genders with regard to the number of questions asked in conversation; however, their sanitised, equipment-laden interview room and the parameters of the 'spontaneous talk', 'considered talk' and 'collaborative talk' elements raise eyebrows.
The data shows near parity in a mixed-gender setting, while it is males who are more inquisitive than their counterparts in single-gender interaction, thus denying Fishman’s hypothesis. However, the phrase “guess what…?” was more popular among the women by a normalised ratio of 3:2, and interestingly all instances of this phrase (of which there were just 20) occurred in mixed-gender conversations.
Figure 20: Asking questions (questions per thousand utterances for male and female speakers in same-sex and mixed-sex conversations)
4.4 TABOO LANGUAGE

Category | Scene type | Male | Female
Top 10 taboo words | Same-sex | 24.05 | 47.36
Top 10 taboo words | Mixed-sex | 22.04 | 42.27
Table 10: Taboo words Results
Searching the corpus for specific taboo words is a troublesome area; given that we have more words for boobs than Eskimos have for snow (Pullum, 1989), it was a relief to discover that ten words have been responsible for 80% of all swearing consistently over the last two decades (Jay, 2009), and this vocabulary provided a good starting point to investigate taboo language in the corpus. While it will come as no surprise that the corpus is void of such offensive words as ‘shit’, ‘fuck’, ‘hell’ and ‘Jesus Christ’, the other six lexical items were all present with varying degrees of frequency.
Tables 10 & 11 are irrefutable and show that women outright outperform men when it comes to the use of profanities. Indeed the most profuse male (Ross) is barely half as impolite as the least distasteful female. Potty-mouth Rachel is by far the worst offender, and women seem as content using such language in the company of their male peers as they do with their own kind.

Actor | Gender | Frequency per 1000
RACHEL | F | 38.19
PHOEBE | F | 31.10
MONICA | F | 28.07
ROSS | M | 16.72
CHANDLER | M | 16.61
JOEY | M | 14.76
Table 11: Who uses taboo language?
It is important to note that the frequencies in Table 12 (below) have been normalized per actor per scene. For example: Rachel contributes a combined 1,015 utterances during her involvement in all single-gender scenes. In these scenes she utters 75 expletives, giving her a frequency of [almost] 74 instances per thousand utterances in single-gender conversations. In contrast (Table 11), she contributes a grand 352 swear words in a 9,217-utterance show total, giving her a frequency of 38.19 per thousand utterances regardless of conversation parameters.
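As a worked check of the two normalisations (figures taken from the text above):

```python
# Rachel, single-gender scenes only: 75 expletives in 1,015 utterances
print(round(75 / 1015 * 1000, 2))   # 73.89 - the per-scene figure
# Rachel, whole show: 352 expletives in 9,217 utterances
print(round(352 / 9217 * 1000, 2))  # 38.19 - the overall figure
```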
Actor | Gender | Scene Interaction | Rudeness (frequency per 1000)
RACHEL | F | F | 73.89
RACHEL | F | M,F | 50.09
PHOEBE | F | M,F | 46.22
MONICA | F | F | 45.95
MONICA | F | M,F | 38.80
PHOEBE | F | F | 32.69
ROSS | M | M,F | 24.64
JOEY | M | M | 24.62
CHANDLER | M | M | 24.00
CHANDLER | M | M,F | 22.73
JOEY | M | M,F | 19.96
ROSS | M | M | 18.36
Table 12: When is taboo language used?
Women are, far and away, the ‘bluest’ when it comes to using taboo language; there is a very clear distinction between them and the men, and at an utterance level men are almost shy to utter a taboo word:

303401  ROSS  Alright, alright. We're all adults here, there's only one way to resolve this. Since you saw her boobies, I think, uh, you're gonna have to show her your peepee.
The usage of “god” is ‘off the chart’, to the extent that its frequency far exceeds that of its nearest rival (see Figures 21 & 22), and while Phoebe has a soft spot for ‘ass’, Joey appears to have a penchant for exclaiming with ‘hell’. The frequency of “god” in the Friends corpus dwarfs the number of occurrences in the BNC spoken subset – by a factor of more than 100 in Rachel's case. The other 5 words also appear more frequently in Friends than they do in the BNC.
Figure 21: Taboo language per actor 1 (frequency of “god” per thousand utterances for Chandler, Joey, Ross, Monica, Phoebe, Rachel and BNC Spoken; values 43.94, 33.78, 33.59, 17.83, 14.81, 12.05 and 0.42)
Figure 22: Taboo language per actor 2 (frequency per thousand utterances of ‘damn’, ‘hell’, ‘ass’, ‘bitch’ and ‘sucks’ for each actor and the BNC Spoken subset)
Courteney Cox, one of the 6 main actors, was the first actress to use the taboo word “period” on US television, in a Tampax advert in 1985 (Wikipedia, 2011b). Fittingly, she is the only female character to utter the same word [in the intended context] throughout the show. And it is not just in this instance that women push the boundaries.
When ‘boobs’ is used with a pre-modifying adjective, it is the females who are more creative. Men predictably obsess about size while women are more critical, although almost all connotations are positive. Firth (1935, cited in O’Keeffe et al., 2007, p.59) argued that the meaning of a word is as much a matter of how it combines with other words (i.e. its collocations) as its own meaning, and such collocates have Whorfian implications for those who watch the show and for society generally. No similar statistics are presented for other parts of the body as there was no data to support this.
Pre-modifying adjective | % (gender)
Big/bigger/biggest | 20% (f), 40% (m)
New | 20% (f)
Fake | 10% (f)
Nice | 10% (f)
Table 13: Collocates to the left of 'boobs'
4.5 EMPTY ADJECTIVES

Category | Scene type | Male | Female
Empty adjectives | Same-sex | 59.46 | 60.17
Empty adjectives | Mixed-sex | 33.92 | 32.65
Table 14: Empty Adjectives Results
For the purpose of this study an empty adjective has been defined as one representing some abstract property. Whereas ‘hot’ or ‘wet’ are concrete adjectives, independently verifiable as to their existence, so-called empty adjectives are subjective, or at least more subjective than physical properties. The empty adjectives analyzed in this study are limited to: gorgeous, wonderful, divine, pretty, lovely, great, good, fantastic, charming, sweet, adorable and cool. This list was taken from Lakoff (2004, p.45).
Figure 23: Empty Adjectives (utterances per thousand of ‘pretty’, ‘good’, ‘cool’ and ‘great’ for female and male speakers in single-sex and mixed-sex conversations, plus BNC Spoken)
Figure 24: Detailed view of negligible 'empty' adjectives (utterances per thousand of ‘gorgeous’, ‘wonderful’, ‘lovely’, ‘fantastic’, ‘charming’ and ‘adorable’ for female and male speakers in single-sex and mixed-sex conversations, plus BNC Spoken)
Much like the data for ‘hedges’, the results per scene type are remarkably similar, with a clear single-gender bias. Both genders use the same subset of empty adjectives with the highest frequency (the four in Figure 23), and it is interesting to note that it is males who make more use of ‘pretty’ regardless of the company, a trend which also applies to ‘good’ and ‘cool’. The qualification here is that Americans use ‘pretty’ much as we British would use ‘rather’ (Longman, 2011), as in:

498501 | JOEY | Joe Stalin. Y’know, that’s pretty good.

This would also explain the low frequency of ‘pretty’ in the BNC. As a rule both genders use more empty adjectives in the company of their own gender, but overall it was males who made slightly more use of these parts of speech than women. There is no clear support for Lakoff’s assumptions.
A further caveat in this data is reported speech. For example (at ID 4417801) Joey utters “…he said he thought you were charming”. Here Joey is reporting what someone else thought, not what he himself thought. This was the only such instance identified when validating the data. While both the reported speaker and the speaker are male, this is purely coincidental, and no attempt was made to exclude reported-speech items such as this from the data.
4.6 INTENSIFIERS
Intensifier | Category | Male | Female
That's so cool! | Same-sex | 12.95 | 35.15
That's so cool! | Mixed sex | 17.17 | 31.83
That's very cool | Same-sex | 5.81 | 6.95
That's very cool | Mixed sex | 7.41 | 7.65
That's really cool | Same-sex | 13.21 | 22.73
That's really cool | Mixed sex | 16.3 | 20.26
Table 15: Intensifiers Results
Linguistically there is little difference between saying ‘I like you very much’ and ‘I like you so much’. Both devices are very similar: they are intensifying degree adverbs and as such their use is interchangeable. The theory behind the use of ‘so’ is that it evades declaring one’s strong feelings; it is a reserved form of ‘very’ where one does not dare make clear how passionately one feels; using ‘so’, Lakoff states, "weasels [out] on that intensity" (2004, p.55). The data supports the preferred use of ‘so’: it is more than twice as prevalent as ‘very’ in males’ speech, while females prefer ‘so’ to ‘very’ by a normalised ratio of almost 4:1. Females also use ‘so’ almost three times as much as men in single-gender conversation and almost twice as much as men in mixed-gender conversation. The usage of ‘very’ has close parallels, and ‘really’ is, again, preferred by women regardless of conversation type. ‘Really’ and ‘so’ are the first two constructs which could reasonably be considered ‘women’s language’. ‘Really’, however, is a slightly different construct (a true adverb) while ‘so’ and ‘very’ are degree adverbs, so direct comparisons are unfair, but the point remains that women use it more than men.
The data facilitated an easy breakdown of the use of these features by gender and by time, and it is interesting to see the patterns of usage (see Figure 25). The ebbs and flows in usage cannot be easily explained; the show features many weddings and the births of a number of children, but the diversity of these trends is too broad to categorize as being emotionally related to these events. Tagliamonte & Roberts (2005) also investigated this and likewise failed to correlate the use with anything meaningful; the best they could do was to highlight the parallels between the viewing figures and emphatic ‘so’.
Figure 25: Use of intensifiers – diachronic use of ‘so + adj.’, ‘very + adj.’ and ‘really + adj.’ by gender (occurrences per thousand) across the show’s ten seasons.
The females’ use of ‘so’ and ‘really’ remains almost untouched by the men’s; only in season 9 do men start to use ‘so’ as much as or more than women use ‘really’. Where females do use ‘so’ in an utterance it was discovered that they were three times more likely to repeat the use again in the same utterance, e.g.:

5866401 | RACHEL | This is so awesome. College guys are so cute

Although this pattern is not consistent:

5908101 | MIKE | Phoebe you're so beautiful. You're so kind, you're so generous. You're so wonderfully…

Females are also solely responsible for the only two utterances in which intensifiers are stacked:

1337201 | RACHEL | I’m so dead sorry
5441401 | MONICA | I’m so so sorry*

(* more repetitions of ‘so’ possibly removed as detailed in section 3.1)
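The within-utterance repetition described above can be checked mechanically. A minimal sketch, using a deliberately loose regex in place of the corpus’s POS column; the utterance list here is illustrative, taken from the examples above:

```python
import re

# Illustrative utterances from the examples above; the real check would
# iterate over the corpus table's utterance column.
utterances = [
    "This is so awesome. College guys are so cute",
    "Joe Stalin. Y'know, that's pretty good.",
]

# Match 'so' followed by another word; two or more hits in a single
# utterance indicate a repeated intensifier.
pattern = re.compile(r"\bso\s+\w+", re.IGNORECASE)
repeats = [u for u in utterances if len(pattern.findall(u)) >= 2]
print(repeats)
```

The regex is loose (‘so’ + any word, not ‘so’ + adjective proper); with the grammatical mark-up column available, the following word’s `_JJ` tag could be matched instead.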
4.7 VOCABULARY
The CANCODE (Cambridge and Nottingham Corpus of Discourse in English) is a five million word spoken English corpus. It provides a good benchmark against which to view the ‘Friends’ data.

# | Friends corpus 11 | per million | CANCODE 12 | per million
1 | I don't know | 1,027 | I don't know | 1,062
2 | I know I | 571 | A lot of | 574
3 | I mean I | 548 | I mean I | 437
4 | I can't believe | 434 | I don't think | 435
5 | I think I | 346 | Do you think | 302
6 | I don't think | 336 | Do you want | 285
7 | I have to | 320 | One of the | 266
8 | I love you | 288 | You have to | 260
9 | Don't know I | 265 | It was a | 255
10 | Know I know | 259 | You know I | 246
Table 16: Top 10 three-word clusters
First and foremost these results (Table 16) lend a great deal of validity to the corpus: there are numerous similarities, both in terms of the actual phrases and the alignment of the frequencies. Such evidence provides strong support for Pawley and Syder’s (1983) assertion that we, as native speakers, share a common core of vocabulary and of prefabricated sequences, and lexical bundles are seen as the “basic building blocks of discourse" (Biber et al., 2004, p.271). There is an argument that these are grammatical structures and not off-the-shelf fillers such as ‘if you ask me’, but the point remains that communicative competence is represented by giving the correct responses at the correct time, and these chunks serve to lessen the communicative burden on the viewer, allowing them to better appreciate the other content while sharing this common schema of vocabulary and rules. This harmonious speaker-viewer relationship allows the viewer to apply their knowledge to the performance and not to the language being used. Socio-culturally the most striking discrepancy between the two corpora is the occurrence of ‘I love you’ in the Friends corpus, which occurs almost two hundred times and whose distribution is almost identical between the genders.
11 Produced using WordSmith, http://www.lexically.net/wordsmith/ (Accessed 28th May 2011)
12 Data taken from McCarthy (2006) and normalized.
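The clusters in Table 16 were produced with WordSmith; the underlying computation is a simple sliding-window count, which can be sketched as follows (the utterance list is illustrative):

```python
from collections import Counter

# Illustrative utterances; the real input would be every line of the corpus.
utterances = [
    "I don't know what you mean",
    "I don't know",
    "you know I know",
]

counts = Counter()
for u in utterances:
    words = u.lower().split()
    # Slide a three-word window across the utterance.
    for i in range(len(words) - 2):
        counts[" ".join(words[i:i + 3])] += 1

print(counts.most_common(3))
```

Normalizing to a per-million figure, as in Table 16, is then a matter of dividing each raw count by the corpus word count and multiplying by 1,000,000.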
Measure | Category | Male | Female
Mean length of utterance (words) | Any | 10.34 | 10.26
Mean length of utterance (words) | Same-sex | 11.23 | 10.79
Mean length of utterance (words) | Mixed sex | 10.33 | 10.26
5% trimmed mean (words) | Any | 10.34 | 10.29
5% trimmed mean (words) | Same-sex | 11.7 | 11.36
5% trimmed mean (words) | Mixed sex | 10.81 | 10.73
Vocab. approx. size | N/A | 326,486 | 314,627
Vocab. in Base List 1 | N/A | 87.71% | 87.49%
Vocab. in Base List 2 | N/A | 2.94% | 2.81%
Vocab. in Base List 3 | N/A | 1.29% | 1.31%
Vocab. not in 1, 2 or 3 | N/A | 8.06% | 8.39%
Table 17: Vocabulary Results
The table above shows that males’ utterances are consistently longer and their total vocabulary larger, although both genders appear to use the same main subset of language for the vast majority of their speech: just 1,000 words cover almost 90% of all speech by either gender. One reason for a smaller female vocabulary could be that women take fewer risks with their vocabulary, although breaking down the vocabulary used by word list (Figure 26 below) shows that while women’s vocabulary is smaller at a word-list level, this is only true by small yet consistent margins.
Figure 26: Vocabulary Breadth – vocabulary per word list (maximum 1,000), male vs. female: Base List 1: 973/969; List 2: 855/816; List 3: 682/637; List 4: 493/456; List 5: 376/361; List 6: 295/260; List 7: 208/196; List 8: 170/160.
At an actor level there are some important differences. Despite speaking the most, Rachel’s vocabulary is the second smallest, giving her the lowest ‘innovation’ score of all the characters. Conversely Chandler, who speaks 12% less than Rachel, has a broader vocabulary by some 14%, giving him the most varied vocabulary. Phoebe, who speaks the least, appears to be very inventive with her utterances, registering an ‘innovative’ score of 6.4. According to Eckert (1989) Rachel might be speaking the most not because she has the most to say but because she has the most work to do re-affirming relationships; this is purely speculative and only the purest form of conversational analysis could affirm such a claim. The girls and Joey have the narrowest vocabularies by unique words; in contrast Chandler and Ross are some 500 unique words ahead of these other four characters. This variety helps to give Chandler the greatest ‘innovative’ score of all six actors, a metric which marks him as the most creative. The boys have an average vocabulary of 5,776 unique words while the girls are more than 450 words behind on 5,314, meaning the boys have a 9% broader vocabulary, despite speaking only 2% more than the girls. The data therefore clearly shows that the men in this study do have a bigger vocabulary than the women.

Person | # of utterances | # of spoken words | # of unique words | 'Innovative' score 14
CHANDLER | 8,370 | 91,355 | 6,001 | 6.57
PHOEBE | 7,461 | 86,497 | 5,539 | 6.40
JOEY | 8,131 | 91,731 | 5,414 | 5.90
MONICA | 8,335 | 87,269 | 5,148 | 5.90
ROSS | 9,031 | 100,776 | 5,914 | 5.87
RACHEL | 9,217 | 102,707 | 5,256 | 5.12
Table 18: Unique words per actor (ordered by ‘innovative’ score)
Lakoff (2004) claims that a woman might say a colour is mauve while a man will call it light purple. Emphatically, there was no support for this claim. Both genders used the same 12-word subset for the majority of their colours (black, blond, blue, brown, gray, green, pink, purple, red, tan, white, yellow). To this list females added three colours: gold, olive, orange; males added a different three: amber, maroon, silver. The frequency of these colours was so low that each instance was inspected to ensure that it was used in the correct context. Hence there is no support for the belief that women are more descriptive with regard to colour. The only mildly interesting statistic was that men’s use of the colour red was more than twice that of the women’s (26 vs. 12). Further analysis revealed a bias with
14 'Innovative' score = (# of unique words / # of spoken words) × 100
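Footnote 14’s formula can be verified directly against Table 18; a minimal sketch, with the figures copied from the table:

```python
# (# of spoken words, # of unique words) per actor, from Table 18.
table = {
    "CHANDLER": (91355, 6001),
    "PHOEBE": (86497, 5539),
    "JOEY": (91731, 5414),
    "MONICA": (87269, 5148),
    "ROSS": (100776, 5914),
    "RACHEL": (102707, 5256),
}

# 'Innovative' score = (# of unique words / # of spoken words) * 100
scores = {p: round(u / s * 100, 2) for p, (s, u) in table.items()}
print(scores)  # CHANDLER 6.57 ... RACHEL 5.12, matching Table 18
```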
regards to collocates, with females much more likely to describe something as little and/or cute, while men were more likely to describe something as big or huge:
Male collocates with colours | Female collocates with colours
Big | Little
Huge | Cute
Little | Stunning
Pretty | Bright
Middle-aged | Favourite
Long-stemmed | Big
Wavy | Pretty
Table 19: Collocates with colours
Women, of course, have a large stock of words related to their specific interests, generally referred to as ‘women’s vocabulary’; similarly men use vocabulary which is semantically masculine.

Women's vocabulary: sexier, sympathy, arrogant, craving, prince, iron, integrity, groceries, cries, headache, inviting, mild, elegant, poop, kittens, swelling, long-term, apologizing, contraction, g-string, cramp, relatives
Men's vocabulary: courtside, independent, graduate, threesome, freedom, champion, neurosurgeon, wrecking, hockey, legitimate, brutal, maniac, sport, playstation, laughs, swearing, silliness, scored, razor, valuable, slack, cars
Table 20: Gender specific vocabulary (not a definitive list)
Del-Teso-Craviotto (2006) analysed the vocabulary in four women’s magazines for the semantic categories used to approach their female readers. There are many parallels between her study and this one. Just as women’s magazines try to emulate what their editors suppose is the language of their readership, presumably so too have the editing staff of Friends. Addressing women [viewers/readers] with casual but appropriate language allows the characters/magazines to present themselves as friends (del-Teso-Craviotto, 2006). The high degree of overlap in the genders’ vocabulary also signals a shared communal concern for one another’s problems, while individual gender-specific vocabulary is presented incorporating both traditional and progressive elements. I propose that part of the appeal of the show was that women were seen as progressive and the choice of vocabulary was integral to this ideology. Cixous (1975) stresses that many binary oppositions are gendered, with men associated with activity, culture, the head and rationality, whereas women are associated with passivity, nature, the heart and emotionality. Certainly such claims have been validated given the preceding groups of vocabulary; however, quantifying such claims more accurately would have been made easier with semantic tagging (discussed in the conclusion).
4.8 BACK CHANNELLING
Tottie (1991) defines backchannels as “the sounds (and gestures) made in conversation by the current non-speaker, which grease the wheels of conversation but constitute no claim to take over the turn” (p.255).

ID | Person | Utterance
109901 | ROSS | …say sevenish?
110001 | RACHEL | Sure.
353001 | JOEY | You want to see her again, right?
353100 | ROSS | Sure.
Table 21: Not back channelling
No statistics are presented for back channelling as identifying instances was exceedingly difficult. Table 21 above shows two examples of what are clearly not back channels but simply responses to previous utterances. The issue arises in knowing whether an utterance is a response to a question or a signal of attention, neither of which can be easily ascertained without a level of interpretive annotation. Table 22 below shows an example of back channelling, by Rachel, which was identified manually.

ID | Person | Utterance
49301 | RACHEL | And you've got lenses! But you hate sticking your finger in your eye!
49401 | BARRY | Not for her.
49501 | RACHEL | Listen, I really wanted to thank you.
49601 | BARRY | Okay. See, about a month ago, I wanted to hurt you. More than I've ever wanted to hurt anyone in my life. And I'm an orthodontist.
49701 | RACHEL | Wow.
49801 | BARRY | You know, you were right? I mean, I thought we were happy. We weren't happy. But with Mindy, now I'm happy. Spit.
49901 | RACHEL | What?
50001 | ROBBIE | Me.
50101 | RACHEL | Anyway, um, I guess this belongs to you. And thank you for giving it to me.
50201 | BARRY | Well, thank you for giving it back.
50301 | ROBBIE | Hello?!
Table 22: Rachel back channels (scene #24)
5. CONCLUSIONS

This project set out, as I am sure many do, with lofty goals. Without doubt a computerized corpus of speech is a valuable asset, and what computers have always done flawlessly is objective decision making based on the parameters specified.
This study, much like many before it, has suffered from a “methodological weakness” (Holmes, 1986, p.4): a tendency to simply quantify linguistic structures in the data with little regard to the context of the items, nor attention to the functional correlations of such use. The function of tag questions is just one example of this weakness. Syntactically defining all of the possible variations needed to extrapolate such instances from a corpus, while not impossible, is a never-ending task, and further classifying such instances based on their pragmatic meaning is fraught with problems. [George] Lakoff came to the same conclusions when he lamented that “natural language concepts have vague boundaries and fuzzy edges; …consequently, natural language sentences will very often be neither true, nor false, nor nonsensical, but rather true to a certain extent and false to a certain extent, true in certain respects and false in other respects” (Lakoff, 1973, p.458). Probing tag questions further, is “isn’t it?” the same as “, right?” the same as “, okay?”? E.g.:

1) You’ll do it, won’t you?
2) You’ll do it, right?
3) You’ll do it, okay?
To a native speaker’s ear the third sentence sounds not like a question proper but like an imperative, and with different intonation each sentence could be read similarly. All of these caveats, qualifications and assumptions have, unfortunately, left the data this study has presented with some issues.
Given the data as it stands, there is certainly a difference between inter-group and intra-group speech, but this study has not found any meaningful data to support Lakoff’s claims about women’s speech. There is no support for the claim that women use more ‘women’s language’, and the stereotype of women as linguistically restricted is not upheld; indeed, it has been shown that men generally have a broader vocabulary and use it more creatively. Despite this, from the use of intensifiers to taboo words, women are a country mile ahead of their masculine counterparts, while in other features, from empty adjectives to the number of questions asked, colours to hedges, both genders have been found to operate on remarkably similar levels.
The finding that people alter their conversational strategy based on the gender(s) of their partner(s) is not new (e.g. Boulis & Ostendorf, 2005) and that theory has been upheld by this study. Social identities arising from memberships of the same or different communities of practice (McConnell-Ginet, 2003) may begin to explain both the discrepancies and the alignments. A community of practice is defined as a group of people “brought together by some mutual endeavour, or common enterprise… and to which they bring a shared repertoire of resources, including linguistic resources, and for which they are mutually accountable” (McConnell-Ginet, 2003, p.71, emphasis added). There is therefore a responsibility on all involved in the conversation to maintain the cohesion and fluency of the ensuing conversation. It is also understandable that a group of friends are linguistically similar and that, from Eckert & McConnell-Ginet’s (1999) Asian Wall to the sofas of Central Perk, these characteristics may be part of the glue which holds friendships together.
6. LIMITATIONS OF THE STUDY

The corpus is already a valuable asset; however, a near-infinite number of improvements can be suggested. Phonetic representation and a consistent, detailed level of annotation are aims which, as many corpus linguists have attested, are unfeasible on a wide range of levels.
Francis and Hunston’s definitions of ‘the acts of conversation’ (1985) are detailed and comprehensive. Acts come together to form ‘moves’, and these again have members (eliciting, answering); to have had a corpus with anything near this level of annotation would have both validated the results to a greater extent and opened many more doors. While it is appreciated that this is a manual and highly subjective area, susceptible to wide degrees of error, any consistently applied framework would have been beneficial.
Critical discourse analysis is as interested in what is not said as in what is said, and while the ability to programmatically retrieve occurrences of ‘so + adjective’ structures is useful, what a corpus cannot easily tell us is where such structures could exist but don’t. Frameworks such as Halliday’s SFL can be applied methodically, although the amount of information harvested becomes insurmountable even in the dissection of a single-paragraph advert. Invariably, being forced to look at each utterance as an atomic element was by far the biggest drawback of this project. “Even if a case could be made for the autonomous treatment of some aspects of the language, discourse cannot be satisfactorily analyzed in a vacuum” (Lakoff, 2001, p.200). Corpus annotation has the scope to make such assumptions concrete; however, the level of annotation needed to do so satisfactorily, while small at the utterance level, becomes prohibitive in a seventy-thousand-utterance corpus.
With reference to comparisons with British television, it was outlined in section 1.3 that few studies into British TV specifically have been done. Few British comedies exist which have been transcribed (either officially or by a faithful fan base) or which have run for 200+ episodes. Retrospectively it is with relief that this avenue was not pursued; putting together one corpus was difficult enough, and the inevitable inconsistencies in transcription would, I anticipate, have caused numerous problems. The value of a study using not one scripted programme but two would also have been questionable.
6.1 SEMANTIC ANNOTATION
Semantic tagging is starting to come of age: Lancaster’s semantic tagger already has a decade of history behind it, and its large-scale pilot in the ICE project may just give academics the incentive to investigate and push its limits. It is easy to see how such an option would be useful (if in any doubt, please consult the query which was used to glean the colours the genders use from the corpus, found in the appendix). Coupling a grammatical representation with a semantic representation could be easily accommodated – essentially another column in the database (see Figure 27 for a mock query). To all intents and purposes it would provide the best of both worlds: should grammatical functions be the focus, the POS mark-up could be interrogated; should genres of vocabulary be the focus, the semantic mark-up could be interrogated, as in this example:

Utterance: I like a particular shade of lipstick
Grammatical mark-up: I_PPIS1 like_VV0 a_AT1 particular_JJ shade_NN1 of_IO lipstick_NN1
Semantic mark-up: I_Z8 like_E2+ a_Z5 particular_A4.2+ shade_O4.3 of_Z5 lipstick_B4 19
As an example, E2+ signifies that the word belongs to the category ‘emotional states, actions, events and processes’ (E), subcategory ‘liking and disliking’ (E2), and refers to ‘liking’ rather than ‘disliking’, hence ‘E2+’. Grouping by semantically related sets (e.g. ‘lipstick’ belongs to ‘cleaning and personal care’) would have opened up more opportunities to explore, at a lexical level, the language the genders use both inter-gender and intra-gender. It is somewhat regrettable that this option was not fully explored.
select id, person, gender, line
from friends
# match structure 'pronoun + conj. + noun' e.g. "I like tennis"
where GrammaticalPOSData regexp ".*PPIS1.*CS.*NN.*"
# refine to match only food nouns e.g. "I like sushi".
and SemanticPOSData regexp ".*Z8.*E2+.*F1.*"

Figure 27: Mock query to interrogate both grammatical and semantic representations.
19 Full tagset available at: http://ucrel.lancs.ac.uk/usas/semtags.txt (Accessed 12th May 2011)
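The word_TAG notation above is mechanical enough to parse without special tooling. A minimal sketch, splitting a USAS-style tagged string into (word, tag) pairs and filtering on the major category letter; the tagged string is the example utterance above:

```python
# Semantic mark-up of the example utterance, in word_TAG format.
tagged = "I_Z8 like_E2+ a_Z5 particular_A4.2+ shade_O4.3 of_Z5 lipstick_B4"

# Split each token on its final underscore to get (word, tag) pairs.
pairs = [tuple(token.rsplit("_", 1)) for token in tagged.split()]

# Filter by major category: 'E' = emotional states, actions, events
# and processes, so E2+ ('liking') is caught here.
emotional = [word for word, tag in pairs if tag.startswith("E")]
print(emotional)  # ['like']
```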
The process of getting semantically tagged data would have paralleled the steps taken to get grammatically tagged data, and far from being a manual process at the utterance level (like the frameworks of Halliday or Francis & Hunston) the process is entirely automatic, requiring the analyst simply to ‘marry’ the produced output with the input and then import the data into the database.
6.2 CONNECT BY PRIOR
Databases can be fickle systems: while they make life easy on the one hand, they complicate things on the other. There has been only one area where the database has been a hindrance rather than a help, and that is in relating utterances to one another. Oracle and other competing commercial relational database systems offer a function whereby one can query a row based on the row prior to it. It is called ‘connect by prior’, and in essence it allows something like this pseudo-code example:

Give me all the rows where the row is just a one-word utterance but where the row prior to it doesn't end in a question mark.

Fundamentally there are sound reasons why this is not normally allowed: database management systems are flat and typically transactional; rows are either queried in isolation or by column, and never in relation to one another. The absence of this function posed problems when trying to ascertain the amount of back channelling which occurred. It was anticipated that there wouldn’t be much – due to the studio format, a camera change/cut-away for a one-word non-interruption was considered unlikely – however one-word responses existed in abundance in the corpus, and being able even to spot-check a handful of them manually could have been useful. Even with this facility, the problem of identifying back channelling is still not simple, as knowledge of the next speaker does not necessarily denote that the two utterances are related. This issue aside, the process of identifying them would have been one step closer.
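It is worth noting that the prior-row comparison described above is no longer exclusive to Oracle: modern engines, including SQLite from version 3.25, expose it via the LAG() window function. A hedged sketch, with an illustrative in-memory table built from utterances in Tables 21 and 22 (table and column names are assumptions, not the study's actual schema):

```python
import sqlite3

# Illustrative in-memory table; the real corpus table is far larger.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE utterances (id INTEGER, person TEXT, line TEXT);
    INSERT INTO utterances VALUES
        (49601,  'BARRY',  'And I''m an orthodontist.'),
        (49701,  'RACHEL', 'Wow.'),
        (109901, 'ROSS',   '...say sevenish?'),
        (110001, 'RACHEL', 'Sure.');
""")

# One-word utterances whose preceding utterance does NOT end in a
# question mark: candidate back channels rather than answers.
rows = conn.execute("""
    SELECT id, person, line FROM (
        SELECT id, person, line,
               LAG(line) OVER (ORDER BY id) AS prev_line
        FROM utterances
    )
    WHERE line NOT LIKE '% %'        -- single word, no internal space
      AND prev_line IS NOT NULL
      AND prev_line NOT LIKE '%?'
""").fetchall()
print(rows)  # only Rachel's 'Wow.' survives; 'Sure.' follows a question
```

Ordering by id stands in for conversational order; as noted above, passing this filter still does not prove the utterance is a back channel, only that it is worth manual inspection.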
While the regular expressions used are my best attempt, there are fundamental problems with the retrieval of such data. Stubbs (1996) analysed two very short texts, one for boys and one for girls. The frequencies of ‘happy’ and ‘happiness’ were similar in both texts, suggesting an equal importance. However, through detailed analysis Stubbs found that the speech to boys instructed them to live happy lives, whereas the speech to girls told them to make other people happy. Similar results are possible in this study and (again) only a detailed conversational analysis would have avoided these problems.
7. RECOMMENDATIONS FOR FURTHER STUDY

The possibilities for further study using this corpus are broad. MacFadden et al. (2006) compiled a TV word list and computed the vocabulary necessary for comprehension, and this is one possible area. Semantic tagging is still in its relative infancy; however, it is available for academic use, and this is probably the most potent area of future research.

The level of politeness is inseparably related to the social distance between the two (or more) parties, and the greater the social distance the higher the degree of linguistic respect likely to be expressed (Wolfson, 1998); it would therefore have been interesting to know to whom polite language is directed and the effect a ‘stranger’ (none of the main six actors) has on the politeness of the language. This is an interesting area of study, although methodical conversational analysis would be required.
Cameron, D. 2005. Language, Gender and Sexuality: Current Issues and New Directions, Applied Linguistics, 26(4), pp. 482-502
8. BIBLIOGRAPHY
Cameron, D., McAlinden, F., O’Leary, K. 1988. Lakoff in context: the social and linguistic function of tag questions, in Coates, J. & D. Cameron (eds.), Women in their speech communities. London: Longman, 74-93.
Allan, K., Coltrane, S. 1996. Gender Displaying Television Commercials: A Comparative Study of Television Commercials in the 1950s and the 1980s, Sex Roles, 35(3/4), pp. 185-203
Chamberlain, J., Poesio, M., Kruschwitz, U. 2008. Phrase Detectives: A Web-based Collaborative Annotation Game, [online] Available at
Atkins, S., Clear, J., Ostler, N. 1992. Corpus Design Criteria, Literary & Linguist Computing, 7(1), pp. 1-16 Baker, P., Hardie, A., McEnery, T., Xiao, R. Bontcheva, K., Cunningham, H., Gaizauskas, R., Hamza, O., Maynard, D., Tablan, V., Ursu, C. Jayaram, B. D., Leisher, M. 2004. Corpus Linguistics and South Asian Languages: Corpus Creation and Tool Development, Literary and Linguist Computing, 19(4), pp. 509-524.
Cheng, W. 2004. Some Preliminary Findings from a Corpus of Spoken Public Discourses in Hong Kong, Language and Computers, 18, pp. 35-52
Baker, P., Lie, M., McEnery, T., Sebba, M. 2000.The Construction of a Corpus of Spoken Sylheti, Literary and Linguistic Computing, 15(4), pp.421-431
Chicago Tribune. 2009. Friends finale is decade's mostwatched TV show [online] Available at
BBC, 2001, Anne Robinson: TV's rudest woman?, [online]
Chomsky, N. 1962. A transformational approach to syntax. In Archibald Hill (ed.), Proceedings of the third Texas conference on problems of linguistic analysis in English. Austin: University of Texas, pp. 124–58.
Beattie, G. W. 1982. Turn-taking and interruption in political interviews: Margaret Thatcher and Jim Callaghan compared and contrasted, Semiotica 39(1/2), pp. 93-114.
Chomsky, N. 1964. The Development of Grammar in Child Language: Discussion, Monographs of the Society for Research in Child Development, 29(1), pp. 35-42.
Biber, D., Conrad, S., Cortes, V. 2004. If you look at...: Lexical Bundles in University Teaching and Textbooks, Applied Linguistics 25(3), pp. 371-405
Chomsky, N. 1965. Aspects of the Theory of Syntax, Mass: MIT Press.
Biber, D., Conrad, S., Reppen, R. 2006. Corpus Linguistics: Investigating Language Structure and Use, Fifth Edition, Cambridge University Press
Cixous, H. 1975. ‘Sorties.’ In H. Cixous and C. Clément (eds) La Jeune Née. Paris: Union Générale d’Editions, English translation in E. Marks and I. de Courtivron (eds) (1980) New French Feminisms: An Anthology. Amherst, MA: University of Massachussetts Press, pp. 90–98.
Biber, D. 1993. Representativeness in Corpus Design, Literary and Linguistic Computing, 8(4), pp. 243-257
Coates, J. 1986. Women, men and language: Sociolinguistic Account of Sex Differences in Language.
Blythe, H., Sweet, C. 1983. Using Media to Teach English, Instructional Innovator, 28(6), pp.22-24.
A
Coates, J. 1996. Women Talk: Conversation between Women Friends, Blackwell.
Boulis, C., Ostendorf, M. 2005. A quantitative analysis of lexical differences between genders in telephone conversations, Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 435-442
Coates, J. 2004. Women, Men and Language, Pearson Longman, 3rd Edition. Cook, G. 1990. Transcribing infinity: Problems of context presentation, Journal of Pragmatics, 14(1), pp.1-24.
Browne, B. A. 1998. Gender Stereotypes in Advertising on Children's Television in the 1990s: A Cross -National Analysis, Journal of Advertising, 27(1), pp. 83-96. Brown, P. Levinson, S. 1987. Politeness: Some Universals in Language Usage, Cambridge University Press.
Cook, G. 1995. Theoretical Issues: Transcribing the Untranscribable. In Leech, G. N., Myers, G., Thomas, J (Eds.). Spoken English on Computer: Transcription, MarkUp, and Application, pp. 35-53.
Bucholtz, M. 2004. Foreword, In (Ed.) R. Lakoff, Language and Woman's Place: Text and Commentaries, pp. 5-16.
Coulthard, M., Montgomery, M., M. 1981. Developing the description, Studies in Discourse Analysis, pp. 13–30.
Bucholtz, M. 2000. The Politics of Transcription, Journal of Pragmatics 32, pp. 1439-1465.
Coulthard, M. 1995. The significance of intonation in discourse, In: M. Coultard, ed. 1995. Advances in Spoken Discourse Analysis, Routledge, Chapter 2.
Burke, P. J., Tully, J. C. 1977. The Measurement of Role Identity, Social Forces, pp. 881-897.
CNN. 1998. President Bill Clinton, [online] Available at
Burnard, 1996, How to build a corpus [online]
Crowdy, S. 1994. Spoken Corpus Transcription, Literary & Linguistic Computing, 9(1), pp. 25-28.
77
Anglia Ruskin University MA Applied Linguistics with TESOL Crowdy, S. 1995. The BNC spoken corpus. In G. Leech, G. Myers, J. Thomas (eds.) Spoken English on Computer, Chapter 19, pp. 224-234.
Francis, G., Sinclair, J. 1994. 'I Bet He Drinks Carling Black Label': A Riposte to Owen on Corpus Grammar, Applied Linguistics, 15(2), pp. 190-200.
Crystal, D. 2009. A dictionary of Language, 2nd Revised Edition, University of Chicago Press.
Fraser, B. 1990. An approach to discourse markers. Journal of Pragmatics 14(3), pp. 383–95.
Cutting, J. 2002. Pragmatics and Discourse: A resource book for students. London: Routledge.
Fraser, B. 1999. What are discourse markers? Journal of Pragmatics, 31(7), pp. 931-952.
de Klerk, V. 1992. How Taboo Are Taboo Words for Girls? Language in Society, 21(2), pp. 277-289.
Freed, A. F., Greenwood, A. 1996. Women, Men, and Type of Talk: What Makes the Difference? Language in Society, 25(1), pp. 1-26.
del-Teso-Craviotto, M. 2006. Words that matter: Lexical choice and gender ideologies in women’s magazines, Journal of Pragmatics, 38, pp. 2003–2021.
Furnham, A., Bitar, N. 1993. The stereotyped portrayal of men and women in British television advertisements, Sex Roles, 29(3-4), pp.297-310.
Demme, J. E. J. 2009. Charmed and chattering tongues: Investigating the functions and effects of key word clusters in the dialogue of Shakespeare's female characters, [online] Available at
Gao, G. 2008. Taboo Language in Sex and the City: An Analysis of Gender Differences in using Taboo Language in Conversation [online] Available at
Deuchar, M. 1988. A pragmatic account of women's use of standard speech, in Coates, J. & D. Cameron (eds.), Women in their speech communities. London: Longman, 27-32
Garside, R., Leech, G. McEnery A. 1997. Corpus annotation: Linguistic information From Computer Text Corpora, Longman.
Drass, K. A. 1986. The Effect of Gender Identity on Conversation, Social Psychology Quarterly, 49(4), pp. 294-301.
Glascock, J. 2001. Gender Roles on Prime-Time Network Television: Demographics and Behaviours, Journal of Broadcasting & Electronic Media, pp. 656-669.
DuBois, B. L., Crouch, I. 1975. The question of tag questions in women's speech: they don't really use more of them, do they? Language in Society, 4, pp. 289-294.
DuBois, J. W. 1991. Transcription Design Principles for Spoken Discourse Research, Pragmatics, 1(1), pp. 71-106.
Gibson, E. K. 2009, Would you like manners with that? A study of gender, polite questions and the fast food industry, Griffith Working Papers in Pragmatics and Intercultural Communication 2(1), pp.1-17
Eckert, P. 1989. The whole woman: Sex and gender differences in variation, Language Variation and Change, 1, pp. 245-267.
Ginsburg, D. 2004. Friends Ratings [online] Available at
Eckert, P., McConnell-Ginet, S. 1999. New generalizations and explanations in language and gender research, Language in Society 28, pp. 185–201.
Goffman. E. 1955. On Face-work: An Analysis of Ritual Elements of Social Interaction, Psychiatry: Journal of the Study of Interpersonal Processes, 18(2), pp. 213-231.
Eisikovits, E. 1991. Variation in subject-verb agreement in Inner Sydney English, In J. Cheshire (ed.) English Around the World: Sociolinguistic Perspectives, Chapter 16, pp. 235-255.
Green, J., Franquiz, M., Dixon, C. 1997. The Myth of the Objective Transcript: Transcribing as a Situated Act, TESOL Quarterly, 31(1), pp. 172-176.
Google. 2011. RECAPTCHA Frequently Asked Questions, Available at
Ervin-Tripp, S. 2000. Methods for studying language production, In Menn, L., Ratner, N.B. (Eds), Methods for Studying Language Production, pp. 271-290.
Halliday, M. A. K. 1985. An Introduction to Functional Grammar. London: Edward Arnold.
Farris, C. S. P. 2000. Cross-sex peer conflict and the discursive production of gender in a Chinese preschool in Taiwan, Journal of Pragmatics, 32(5), pp. 539-568.
Halliday, M. A. K. 1985b. Dimensions of discourse analysis: grammar. Teun A. van Dijk (ed.), Handbook of Discourse Analysis. New York: Academic Press.
Fisher, D. A., Hill, D. L., Grube, J. W., Gruber, E. L. 2007. Gay, Lesbian, and Bisexual Content on Television: A Quantitative Analysis across Two Seasons, Journal of Homosexuality, 52(3-4), pp. 167–188.
Halliday, M. A. K., Hasan, R. 1976. Cohesion in English, Longman.
Fishman, P. M. 1978. Interaction: The Work Women Do, Social Problems, 25(4), pp. 397-406.
Hepburn, A. 2004. Crying: Notes on Description, Transcription, and Interaction, Research on Language and Social Interaction, 37(3), pp. 251–290.
Francis, G., Hunston, S. 2002. Analysing everyday conversation. In R.M. Coulthard (ed.) Advances in Spoken Discourse Analysis, London: Routledge, pp. 123–161.
Holmes, J. 1983. The functions of tag questions, English Language Research Journal, 3, pp. 40-65.
Holmes, J. 1986. Functions of You Know in Women's and Men's Speech, Language in Society, 15(1), pp. 1-21.
Lauzen, M. M., Dozier, D. M. 1999. Making a difference in prime time: Women on screen and behind the scenes in the 1995-96 television season. Journal of Broadcasting & Electronic Media, 43(1), 1-19.
Holmes, J. 1990. Hedges and Boosters in Women's and Men's Speech, Language & Communication, 10(3), pp. 185-205.
Leech, G. 1983. Principles of Pragmatics, Longman.
Holmes, J. 1995. Women, Men and Politeness, Pearson Longman.
Leech, G. N., Myers, G., Thomas, J. 1995. Spoken English on Computer: Transcription, Mark-Up, and Application, Longman.
Hughes, S. E. 1992. Expletives of lower working-class women, Language in Society 21, pp. 291-303.
Leech, G. 1998. Learner corpora: What they are and what can be done with them, in S. Granger (ed.), Learner English on computer, London: Addison Wesley Longman, xiv-xx.
Hymes, D. H. 1974. Foundations in Sociolinguistics. University of Pennsylvania Press, Philadelphia.
Lippi-Green, R. 1997. English with an accent: Language, ideology, and discrimination in the United States, Routledge.
Jaworski, A., Ylanne-McEwen, V., Thurlow, C., Lawson, S. 2003. Social roles and negotiation of status in host tourist interaction: A view from British television holiday programmes, Journal of Sociolinguistics, 7(2), pp. 135-163.
Jay, T. 2009. The Utility and Ubiquity of Taboo Words, Perspectives on Psychological Science, 4(2), pp. 153-161.
Longman. 2011. Longman Dictionary of Contemporary English [online] Available at: http://www.ldoceonline.com/dictionary/pretty_2 [Accessed 27th May 2011].
Jucker, A. 1993. The discourse marker well: A relevancetheoretical account, Journal of Pragmatics 19, pp. 435-452.
Livia, A. 2004. Picking up the gauntlet, in M. Bucholtz, R. Lakoff (eds.) Language and Woman’s Place, Chapter 4.
Kay P., Kempton, W. 1984. What Is the Sapir-Whorf Hypothesis? American Anthropologist, New Series, 86(1), pp. 65-79
Macaulay, M. 2001. Tough talk: Indirectness and gender in requests for information, Journal of Pragmatics, 33, pp. 293-316
Kaye, P. 1989a. Laughter, ladies, and linguistics—a lighthearted quiz for language-lovers and language-learners, ELT Journal, 43(3), pp. 185-191.
Macaulay, R. 1978. Variation and Consistency in Glaswegian English, In (ed.) P. Trudgill, Sociolinguistic Patterns in British English, Edward Arnold London, pp. 132-143.
Kaye, P. 1989b. 'Women are alcoholics and drug addicts', says dictionary, ELT Journal, 43(3), pp. 192-195.
Macaulay, R. 2002. You know, it depends, Journal of Pragmatics 34, pp. 749–767
Kiesling, S. F. 2004. What Does a Focus on "Men's Language" Tell Us about Language and Woman's Place? In R. T. Lakoff, Language and Woman's Place: Text and Commentaries, Chapter 16. Oxford University Press, pp. 229-236.
MacFadden, K., Barrett, K., Horst, M. 2009. What's in a Television Word List? A Corpus-Informed Investigation, Concordia Working Papers in Applied Linguistics, 2, pp. 78-98.
McCarthy, M. 2000. Discourse Analysis for Language Teachers, Tenth Edition, Cambridge University Press.
Labov, W. 2006. The Social Stratification of English in New York City, Cambridge University Press, Second Edition.
McCarthy, M. 2006. Explorations in Corpus Linguistics, Cambridge University Press.
Labov, W., Fanshel, D. 1977. Therapeutic Discourse, New York: Academic Press.
McConnell-Ginet, S. 2003. What's in a Name? Social Labelling and Gender Practices, In J. Holmes, M. Meyerhoff (eds), The Handbook of Language and Gender, Chapter 3
Lakoff, G. 1973. Hedges: A Study in Meaning Criteria and the Logic of Fuzzy Concepts, Journal of Philosophical Logic, 2, pp. 458-508.
McEnery, A., Xiao, R., Tono, Y. 2006. Corpus -based language Studies: An Advanced Resource Book, London: Routledge.
Lakoff, R. T. 1973. Language and Woman's Place, Language in Society, 2(1), pp. 45-80
Meyer, C. F. 2004. English Corpus Linguistics: An Introduction, Cambridge University Press
Lakoff, R. T. 1973b. Questionable answers and answerable questions. In: B. Kachru, R.B. Lees, Y. Malkiel, A. Pietrangeli, S. Saporta,(Eds)., Issues in linguistics. Papers in honor of Henry and Rente Kahane, University of Illinois Press.
Mills, S. (2003). Gender and Politeness. Cambridge: Cambridge University Press
Lakoff, R. T. 2001. The Language War, University of California Press
Nelson, G. 1995. The International Corpus of English: markup for spoken language. In G. Leech, G. Myers, J. Thomas (eds.) Spoken English on Computer, Chapter 18, pp. 220-223
Lakoff, R. T. 2001b, Nine Ways of Looking at Apologies: The Necessity for Interdisciplinary Theory and Method in Discourse Analysis, In D. Schiffrin, D. Tannen, H. E. Hamilton (eds.) The Handbook of Discourse Analysis, Blackwell, Chapter 10.
O’Barr, W. M., Atkins, K. B. 1997. Women's language or Powerless Language? , In: (ed) J. Coates, Language and gender: A Reader, pp. 377-387.
Lakoff, R.T. 2004. Language and Woman’s Place, Oxford University Press, 2nd Edition.
O'Keeffe, A., McCarthy, M., Carter, R. 2007. From Corpus to Classroom, Cambridge University Press.
Ochs, E. 1999. Transcription as theory. In A. Jaworski & N. Coupland (Eds.), The Discourse Reader, pp. 167-182, London; New York: Routledge.
Sunderland, J. 2006. Language and Gender: An advanced resource book, Routledge.
Swan, M. 2005. Practical English Usage, Third Edition, Oxford University Press.
Owen, C. 1993. Corpus-Based Grammar and the Heineken Effect: Lexico-grammatical Description for Language Learners, Applied Linguistics, 14(2), pp. 167-187.
Sweney, M. 2010. Britons 'watch four hours of TV a day', [online] Available at
Pawley, A., Syder, F. H. 1983. Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In J. C. Richards, R. W. Schmidt (Eds.), Language and communication. London; New York: Longman, pp. 191-226.
Tagliamonte, S. 1998. Was/were variation across the generations: View from the city of York, Language Variation and Change, 10, pp. 153-191.
Pilkington, J. 1992. Don't try and make out that I'm nice! The different strategies women and men use when gossiping, WWPIL, 5, pp. 37-60.
Tagliamonte, S., Roberts, C. 2005. So Weird; So Cool; So Innovative: The Use of Intensifiers in the Television Series Friends, American Speech, 80(3), pp. 280-300.
Pullum, G. K. 1989. The great Eskimo vocabulary hoax, Natural Language & Linguistic Theory, 7(2), pp. 275-281.
Roberts, C. 1997. Transcribing Talk: Issues of Representation, TESOL Quarterly, 31(1), pp. 167-172.
Tannen, D. 1990. You Just Don't Understand: Women and Men in Conversation, Virago Press Ltd.
Tannen, D. 1994. Gender and Discourse, Oxford University Press.
Sacks, H., Schegloff, E. A., Jefferson, G. 1974. A Simplest Systematics for the Organization of Turn-Taking for Conversation, Language, 50(4), pp. 696-735.
The ICE Project. 2009. Available ICE Corpora @ ICEcorpora.net [online] Available at: http://icecorpora.net/ice/avail.htm [Accessed 27th May 2011]
Sauntson, H. 2007. Girls' and Boys' Use of Acknowledging Moves in Pupil Group Classroom Discussions, Language and Education, 21(4), pp. 304-327.
The Sun. 2011. Bug-hit Anne is so Weak, [online] Available at
Schegloff, E. A. 1968. Sequencing in Conversational Openings, American Anthropologist, New Series, 70(6), pp. 1075-1095.
Schegloff, E. A., Sacks, H. 1973. Opening up Closings, Semiotica, 8, pp. 289-327.
Tottie, G. 1991. Conversational style in British and American English: The case of backchannels. In K. Aijmer & B. Altenberg (Eds.), English corpus linguistics: Studies in honour of Jan Svartvik (pp. 254–271). London: Longman
Schiffrin, D. 1985. Conversational Coherence: The Role of Well, Language, 61(3), pp. 640-667.
Schiffrin, D. 1987. Discourse Markers, Cambridge University Press.
Tottie, G., Hoffmann, S. 2006. Tag Questions in British and American English, Journal of English Linguistics, 34(4), pp. 283-311
Schiffrin, D. 2001. Discourse Markers: Language, Meaning, and Context, In D. Schiffrin, D. Tannen, H. E. Hamilton (eds.) The Handbook of Discourse Analysis, Blackwell, Chapter 3.
Trappes-Lomax, H. 2004. Discourse Analysis, in (eds. Davies, A., Elder, C.) The Handbook of Applied Linguistics, Blackwell Handbooks in Linguistics, Chapter 5, pp. 133-164.
Simpson, P. 2001. ‘Reason’ and ‘tickle’ as pragmatic constructs in the discourse of advertising, Journal of Pragmatics 33, pp. 589-607.
Trudgill, P. 1972. Sex, covert prestige and linguistic change in the urban British English of Norwich, Language in Society, 1(2), pp. 179-195
Sinclair, J., Coulthard, R. M. 1975. Towards an Analysis of Discourse, Oxford University Press.
Trudgill, P. 1983. Sociolinguistics: An introduction to language and society. London: Pelican.
Sinclair, J. 1992. The automatic analysis of text corpora, In J. Svartvik (ed.) Directions in Corpus Linguistics: Proceedings of the Nobel Symposium 82, Stockholm, pp. 379-397, The Hague: Mouton.
UCREL. 2000. POS-tagging Error Rates [online] Available at:
Smith, J. S. 1992. Women in Charge: Politeness and Directives in the Speech of Japanese Women, Language in Society, 21(1), pp. 59-82.
Wikipedia. 2011. The Weakest Link [online] Available at: http://en.wikipedia.org/wiki/The_Weakest_Link [Accessed 27th May 2011]
Sommers, C. H. 2001. The War Against Boys - How Misguided Feminism Is Harming Our Young Men, American Experiment Quarterly, pp.26-36.
Wikipedia. 2011b. Courteney Cox [online] Available at: http://en.wikipedia.org/wiki/Courteney_Cox [Accessed 27th May 2011]
Sommers-Flanagan, R., Sommers-Flanagan, J., Davis, B. 1993. What's happening on Music Television? A gender role content analysis, Sex Roles, 28(11-12), pp. 745-753.
Stubbs, M. 1996. Texts and Corpus Analysis. Oxford: Blackwell.
Witt, A., Rehm, G., Hinrichs, E., Lehmberg, T., Stegmann, J. 2009. SusTEInability of linguistic resources through feature structures, Literary and Linguistic Computing, 24(3), pp. 363-372.
Wolfson, N. 1988. The bulge: a theory of speech behaviour and social distance. In J. Fine (ed.) Second Language Discourse: A Textbook of Current Research, Norwood, N.J.: Ablex.
Zimmerman, D. H., West, C. 1975. Sex Roles, Interruptions and Silences in Conversation, In B. Thorne, N. Henley (eds.) Language and Sex: Difference and Dominance, pp. 105-12.
9. APPENDIX
9.1
QUERY FOR RETRIEVING TAG QUESTIONS
select gender, scenes.scene_interaction as si, count(id), len,
       ROUND((count(id)/len)*1000,2) as per_thousand  # id, person, gender, line
from friends, scenes,
     ( select distinct scene_interaction, sum(scene_length) as len
       from scenes group by 1 order by 1 ) inlineA
where (metadata regexp ".*(VBDZ|VBDR|VBG|VBM|VBN|VBR|VBZ|VD0|VDD|VDI|VDN|VDZ|VH0|VHZ|VHG|VHI|VHD|VM).{1,15}(XX.{1,15})(PPY|PPHS1|PPHS2|PPIS2).{0,5}[[.question-mark.]]"
   or metadata regexp ".*(VBDZ|VBDR|VBG|VBM|VBN|VBR|VBZ|VD0|VDD|VDI|VDN|VDZ|VH0|VHZ|VHG|VHI|VHD|VM).{1,15}(XX).*(VBDZ|VBDR|VBG|VBM|VBN|VBR|VBZ|VD0|VDD|VDI|VDN|VDZ|VH0|VHZ|VHG|VHI|VHD|VM).{1,15}(PPY|PPHS1|PPHS2|PPIS2).{0,5}[[.question-mark.]]"
   or metadata regexp ".*( right_RR ).{1,15}[[.question-mark.]]")
  and friends.scene_id = scenes.scene_id
  and scenes.scene_interaction = inlineA.scene_interaction
  and gender in ("M", "F")
  and scenes.scene_interaction in ("M", "F", "M,F")
group by 1, 2
order by 1, 2
into outfile "c:\\tag-questions.txt" lines terminated by "\r\n";
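As an illustrative aside (not part of the original study), the first tag-question pattern in the query above can be sanity-checked outside MySQL. The sketch below re-expresses it in Python's re syntax ([[.question-mark.]] is MySQL's collating-element notation for a literal question mark); the CLAWS-tagged sample utterance is invented for the demonstration.

```python
import re

# Auxiliary/modal verb tags and pronoun tags, copied from the query above.
AUX = r"(?:VBDZ|VBDR|VBG|VBM|VBN|VBR|VBZ|VD0|VDD|VDI|VDN|VDZ|VH0|VHZ|VHG|VHI|VHD|VM)"
PRON = r"(?:PPY|PPHS1|PPHS2|PPIS2)"

# Auxiliary, then a negative (XX) within 15 characters, then a pronoun,
# then a question mark: the "+ve statement, -ve tag" heuristic.
TAG_QUESTION = re.compile(AUX + r".{1,15}XX.{1,15}" + PRON + r".{0,5}\?")

def is_tag_question(tagged_line: str) -> bool:
    """Return True if a CLAWS-tagged line matches the tag-question heuristic."""
    return TAG_QUESTION.search(tagged_line) is not None

# Invented CLAWS-style mark-up for "It's nice, isn't it?"
sample = "it_PPH1 's_VBZ nice_JJ ,_, is_VBZ n't_XX it_PPY ?_?"
```

Note that, like the SQL original, this is a heuristic: the `.{1,15}` windows bound how far apart the auxiliary, negative, and pronoun tags may sit, so unusually long intervening material defeats the match.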
9.2
QUERY FOR RETRIEVING THE BREAKDOWN OF TAG QUESTIONS
select a.gender, a.si,
       a.isnt_it, ROUND(a.isnt_it/inlineA.total_length*1000,2) as isnt_it_pert,
       b.is_it, ROUND(b.is_it/inlineA.total_length*1000,2) as is_it_pert,
       c.rght, ROUND(c.rght/inlineA.total_length*1000,2) as right_pert,
       inlineA.total_length as len
from (
  # +ve statement, -ve tag
  select gender, scenes.scene_interaction as si, count(id) as isnt_it
  from friends, scenes
  where metadata regexp ".*(VBDZ|VBDR|VBG|VBM|VBN|VBR|VBZ|VD0|VDD|VDI|VDN|VDZ|VH0|VHZ|VHG|VHI|VHD|VM).*(VBDZ|VBDR|VBG|VBM|VBN|VBR|VBZ|VD0|VDD|VDI|VDN|VDZ|VH0|VHZ|VHG|VHI|VHD|VM).{1,15}(XX.{1,15})(PPY|PPHS1|PPHS2|PPIS2).{0,5}[[.question-mark.]]"
    and friends.scene_id = scenes.scene_id
  group by 1, 2 order by 1, 2
) a, (
  # -ve statement, +ve tag
  select gender, scenes.scene_interaction as si, count(id) as is_it
  from friends, scenes
  where metadata regexp ".*(VBDZ|VBDR|VBG|VBM|VBN|VBR|VBZ|VD0|VDD|VDI|VDN|VDZ|VH0|VHZ|VHG|VHI|VHD|VM).{1,15}(XX).*(VBDZ|VBDR|VBG|VBM|VBN|VBR|VBZ|VD0|VDD|VDI|VDN|VDZ|VH0|VHZ|VHG|VHI|VHD|VM).{1,15}(PPY|PPHS1|PPHS2|PPIS2).{0,5}[[.question-mark.]]"
    and friends.scene_id = scenes.scene_id
  group by 1, 2 order by 1, 2
) b, (
  # -ve + right?
  select gender, scenes.scene_interaction as si, count(id) as rght
  from friends, scenes
  where metadata regexp ".*( right_RR )[[.question-mark.]]"
    and friends.scene_id = scenes.scene_id
  group by 1, 2 order by 1, 2
) c, (
  select distinct scene_interaction, SUM(scene_length) as total_length
  from scenes group by 1 order by 1
) inlineA
where a.gender = b.gender and a.gender = c.gender
  and a.si = b.si and a.si = c.si
  and a.si in ("M", "F", "M,F")
  and a.si = inlineA.scene_interaction
group by 1, 2
order by 1, 2
into outfile "c:\\tag-questions-breakdown.txt" lines terminated by "\r\n";
9.3
QUERY FOR RETRIEVING THE NUMBER OF QUESTIONS ASKED
select gender, scenes.scene_interaction, COUNT(line) as raw_count,
       ROUND(COUNT(line)/total_length*1000,2) as per_thousand
from friends, scenes,
     ( select distinct scene_interaction, SUM(scene_length) as total_length
       from scenes group by 1 order by 1 ) inlineA
where metadata regexp ".*( [[.question-mark.]]_[[.question-mark.]]).*"  # questions asked
  and friends.scene_id = scenes.scene_id
  and gender in ("M", "F")  # exclude "U"
  and scenes.scene_interaction = inlineA.scene_interaction
  and scenes.scene_interaction in ("M", "F", "M,F")
group by 1, 2
order by 1, 2
9.4
QUERY FOR RETRIEVING THE STATISTICS FOR HEDGES
select gender, scenes.scene_interaction, COUNT(line) as raw_count,
       ROUND(COUNT(line)/total_length*1000,2) as per_thousand
from friends, scenes,
     ( select distinct scene_interaction, SUM(scene_length) as total_length
       from scenes group by 1 order by 1 ) inlineA
where metadata regexp "^Well_RR.*"  # hedges ("Well,...")
  and friends.scene_id = scenes.scene_id
  and gender in ("M", "F")  # exclude "U"
  and scenes.scene_interaction = inlineA.scene_interaction
  and scenes.scene_interaction in ("M", "F", "M,F")
group by 1, 2
order by 1, 2
# alternative patterns for the regexp above:
# where metadata regexp ".*I_PPIS1 think_VV0.*"                    # hedges ("I think...")
# where metadata regexp ".*I_PPIS1 guess_VV0.*"                    # hedges ("I guess...")
# where metadata regexp ".*I_PPIS1 wonder_VV0.*"                   # hedges ("I wonder...")
# where metadata regexp ".*I_PPIS1 'm_VBM sure_JJ.*"               # hedges ("I'm sure...")
# where metadata regexp ".*(Ya know|Y'know)"                       # hedges ("y'know...")
# where metadata regexp ".*_VM.{1,10}PPY.*please_RR.{1,5}[[.?.]]"  # hedges ("Could you help me please?")
# where metadata regexp ".*(sorta_NN1|sort_RR21 of_RR22).*"        # hedges ("...sort of/sorta...")
# where metadata regexp "^Well_RR.*"                               # hedges ("Well,...")
9.5
QUERY FOR RETRIEVING THE MOST FREQUENT ADJECTIVES
# adjectives
select gender, UPPER(word) as word, COUNT(word) as count_word
from words
where word like "%_JJ"
group by 1, 2
having COUNT(word) > 100  # only interested in the most frequent
order by 1, 3 desc  # highest first
9.6
QUERY FOR RETRIEVING THE MOST FREQUENT NOUNS
select gender, UPPER(word) as word, COUNT(only_word) as count_word
from words
where word regexp ".*_(ND1|NN|NN1|NN2|NNA|NNB|NNL1|NNL2|NNO|NNO2|NNT1|NNT2|NNU|NNU1|NNU2|NP|NP1|NP2|NPD1|NPD2|NPM1|NPM2)"
  and UPPER(only_word) not in ( select person from friends )  # exclude nouns which are names
group by 1, 2
having COUNT(word) > 100  # only interested in the most frequent
order by 1, 3 desc  # highest first
9.7
QUERY FOR RETRIEVING THE MOST FREQUENT VERBS
select gender, UPPER(word) as word, COUNT(only_word) as count_word
from words
where word regexp ".*_(VB0|VBDR|VBDZ|VBG|VBI|VBM|VBN|VBR|VBZ|VD0|VDD|VDG|VDI|VDN|VDZ|VH0|VHD|VHG|VHI|VHN|VHZ|VM|VMK|VV0|VVD|VVG|VVGK|VVI|VVN|VVNK|VVZ)"
group by 1, 2
having COUNT(word) > 100  # only interested in the most frequent
order by 1, 3 desc  # highest first
9.8
QUERY FOR RETRIEVING THE AVERAGE UTTERANCE LENGTH
# average utterance length
select gender, scene_interaction, ROUND(AVG(number_of_words),2) as average_number_of_words
from (
  select scene_id, gender,
         LENGTH(line)-LENGTH(REPLACE(line, " ", ""))+1 as number_of_words
  from friends
  order by 3
  limit 3042, 57805  # 5% trimmed (remove this line for full average)
) inline_table, scenes
where inline_table.scene_id = scenes.scene_id
  and gender in ("M", "F")
group by 1, 2
order by 1, 2
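The LENGTH/REPLACE expression in the query above counts words by counting spaces. As an illustrative aside (not part of the original queries), the same idiom can be mirrored in Python to see exactly what it measures:

```python
def word_count_via_spaces(line: str) -> int:
    """Mirror of LENGTH(line) - LENGTH(REPLACE(line, ' ', '')) + 1
    from the SQL query above: word count = space count + 1."""
    return len(line) - len(line.replace(" ", "")) + 1

print(word_count_via_spaces("How you doing"))  # 3
```

One caveat this makes visible: the idiom counts delimiters, not tokens, so doubled spaces (or leading/trailing spaces) inflate the count slightly.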
9.9
QUERY FOR RETRIEVING ALL SINGLE SEX WORDS
# words females use but males don't
# use UPPER() so "Good" and "good" are grouped together
select UPPER(only_word), COUNT(only_word)
from words
where gender = "M"
  # performance killer
  and only_word not in ( select only_word from words where gender = "F" )
  # performance killer
  and UPPER(only_word) not in ( select distinct person from friends )
group by 1
having COUNT(word) > 3  # arbitrary, only report frequent unique words
order by 2;

# adjectives females use but males don't
# use UPPER() so "Good" and "good" are grouped together (as "GOOD")
select UPPER(word), COUNT(word)
from words
where gender = "?"
  and word like "%_JJ"
  # performance killer
  and word not in ( select word from words where gender = "not ?" and word like "%_JJ" )
group by 1
having COUNT(word) > 3
order by 2;
9.10 QUERY FOR RETRIEVING THE DIACHRONIC USE OF ‘REALLY’, ‘VERY’, ‘SO’
select inlineA.season as season,
       ROUND((inlineA.count_of_variable/inlineC.total_count)*1000,2) as M_per_thousand,
       ROUND((inlineB.count_of_variable/inlineD.total_count)*1000,2) as F_per_thousand
from (
  # get number of utterances featuring the pattern so + adj. for males
  select left(filename, 2) as season, count(line) as count_of_variable
  from friends, scenes
  where (metadata regexp ".*so_RG .*_JJ")
    and gender = "M"
    and scenes.scene_interaction in ("M")
    and friends.scene_id = scenes.scene_id
  group by left(filename, 2)
  order by 1, 2
) inlineA, (
  # get number of utterances featuring the pattern so + adj. for females
  select left(filename, 2) as season, count(line) as count_of_variable
  from friends, scenes
  where (metadata regexp ".*so_RG .*_JJ")
    and gender = "F"
    and scenes.scene_interaction in ("F")
    and friends.scene_id = scenes.scene_id
  group by left(filename, 2)
  order by 1, 2
) inlineB, (
  # get total utterances for males (all scenes)
  select left(filename, 2) as season, count(line) as total_count
  from friends, scenes
  where scenes.scene_interaction in ("M")
    and gender = "M"
    and friends.scene_id = scenes.scene_id
  group by 1
) inlineC, (
  # get total utterances for females (all scenes)
  select left(filename, 2) as season, count(line) as total_count
  from friends, scenes
  where scenes.scene_interaction in ("F")
    and gender = "F"
    and friends.scene_id = scenes.scene_id
  group by 1
) inlineD
where inlineD.season = inlineA.season
  and inlineD.season = inlineB.season
  and inlineD.season = inlineC.season;
Frequency Counts - Simple
# get number of utterances featuring the pattern so + adj.
select gender, scene_interaction, count(line) as count_of_variable
from friends, scenes
where (metadata regexp ".*so_RG .*_JJ")
  and friends.scene_id = scenes.scene_id
  and gender in ("M", "F")
  and scene_interaction in ("M", "F", "M,F")
group by 1, 2
order by 1, 2;
Frequency Counts - Complex
select inlineA.gender, inlineA.scene_interaction as SI,
       ROUND((inlineA.count_of_variable/inlineB.total_count)*1000,2) as AA
from (
  # get number of utterances featuring the pattern
  select gender, scene_interaction, count(line) as count_of_variable
  from friends, scenes
  where metadata regexp ".*really_RR .*_JJ"  # substitute: so_RG / very_RG / really_RR
    and scenes.scene_interaction in ("M", "F", "M,F")
    and friends.scene_id = scenes.scene_id
  group by 1, 2
  order by 1
) inlineA, (
  # get total utterances (all scenes)
  select gender, scene_interaction, count(line) as total_count
  from friends, scenes
  where scenes.scene_interaction in ("M", "F", "M,F")
    and friends.scene_id = scenes.scene_id
  group by 1, 2
  order by 1
) inlineB
where inlineA.gender = inlineB.gender
  and inlineA.scene_interaction = inlineB.scene_interaction
order by 1;
# so_RG
# very_RG
# really_RR
9.11 QUERY FOR RETRIEVING THE NUMBER OF EMPTY ADJECTIVES
select a.gender, inlineA.scene_interaction as si, empty_adj,
       ROUND(empty_adj/inlineA.total_length*1000,2) as empty_pt
from (
  select distinct scene_interaction, SUM(scene_length) as total_length
  from scenes group by 1 order by 1
) inlineA, (
  select gender, scene_interaction, COUNT(id) as empty_adj
  from friends, scenes
  where line regexp ".*(great|cool|gorgeous|wonderful|divine|pretty|lovely|good|fantastic|charming|sweet|adorable).*"
    and friends.scene_id = scenes.scene_id
  group by 1, 2
  order by 1, 3 desc
) a
where inlineA.scene_interaction in ("M", "F", "M,F")
  and inlineA.scene_interaction = a.scene_interaction
group by 1, 2
order by 1, 2;
# replace with:
# (great|cool|gorgeous|wonderful|divine|pretty|lovely|good|fantastic|charming|sweet|adorable)
9.12 QUERY FOR RETRIEVING THE NUMBER OF PRONOUN REFERENCES
select gender, scenes.scene_interaction, COUNT(line) as raw_count,
       ROUND(COUNT(line)/total_length*1000,2) as per_thousand
from friends, scenes,
     ( select distinct scene_interaction, SUM(scene_length) as total_length
       from scenes group by 1 order by 1 ) inlineA
where metadata regexp ".*(PPIS2|PPIO2).*"  # PPIS1|PPIO1 = I/my, PPIS2|PPIO2 = We/our
  and friends.scene_id = scenes.scene_id
  and gender in ("M", "F")  # exclude "U"
  and scenes.scene_interaction = inlineA.scene_interaction
  and scenes.scene_interaction in ("M", "F", "M,F")
group by 1, 2
order by 1, 2
9.13 QUERY FOR RETRIEVING THE NUMBER OF TABOO WORDS
General Counts by gender
# taboo vocabulary list taken from Jay (2009)
select friends.gender, scenes.scene_interaction as si, no_of_utt,
       round(count(id)/no_of_utt*1000, 2) as per_1000, count(id)
from friends, scenes, (
  select gender, scene_interaction, count(id) as no_of_utt
  from friends, scenes
  where friends.scene_id = scenes.scene_id
    and friends.gender in ("M", "F")
    and scenes.scene_interaction in ("M", "F", "M,F")
  group by 1, 2
  order by 1
) inlineA
where (metadata regexp ".* fuck_.*"
    or metadata regexp ".* shit_.*"
    or metadata regexp ".* hell_.*"
    or metadata regexp ".* damn_.*"
    or metadata regexp ".* goddamn_.*"
    or metadata regexp ".* Christ_.*"  # Jesus Christ
    or metadata regexp ".* ass_.*"
    or metadata regexp ".* god_.*"  # Oh my god
    or metadata regexp ".* bitch_.*"
    or metadata regexp ".* sucks_.*")
  and friends.gender = inlineA.gender
  and friends.scene_id = scenes.scene_id
  and scenes.scene_interaction in ("M", "F", "M,F")
  and scenes.scene_interaction = inlineA.scene_interaction
group by 1, 2, 3
order by per_1000
General Counts by actor
# taboo vocabulary list taken from Jay (2009)
select friends.person, gender, no_of_utt,
       round(count(id)/no_of_utt*1000, 2) as per_1000, count(id)
from friends, scenes, (
  select distinct person, count(id) as no_of_utt
  from friends
  where friends.person in ("ROSS", "RACHEL", "MONICA", "JOEY", "CHANDLER", "PHOEBE")
  group by 1
  order by 1
) inlineA
where (metadata regexp ".* fuck_.*"
    or metadata regexp ".* shit_.*"
    or metadata regexp ".* hell_.*"
    or metadata regexp ".* damn_.*"
    or metadata regexp ".* goddamn_.*"
    or metadata regexp ".* Christ_.*"  # Jesus Christ
    or metadata regexp ".* ass_.*"
    or metadata regexp ".* god_.*"  # Oh my god
    or metadata regexp ".* bitch_.*"
    or metadata regexp ".* sucks_.*")
  and friends.person in ("ROSS", "RACHEL", "MONICA", "JOEY", "CHANDLER", "PHOEBE")
  and friends.person = inlineA.person
  and friends.scene_id = scenes.scene_id
  and scenes.scene_interaction in ("M", "F", "M,F")
group by 1, 2
order by per_1000
into outfile "c:\\taboo-general.txt" lines terminated by "\r\n";
Counts per actor per scene
# !!! this will normalize the frequency based on the actor's number of utterances for each scene type !!!
# taboo vocabulary list taken from Jay (2009)
select friends.person, gender, scenes.scene_interaction as SI, no_of_utt,
       round(count(id)/no_of_utt*1000, 2) as per_1000, count(id)
from friends, scenes, (
  select distinct person, scene_interaction, count(id) as no_of_utt
  from friends, scenes
  where friends.scene_id = scenes.scene_id
    and friends.person in ("ROSS", "RACHEL", "MONICA", "JOEY", "CHANDLER", "PHOEBE")
    and scene_interaction in ("M", "F", "M,F")
  group by 1, 2
  order by 1
) inlineA
where (metadata regexp ".* fuck_.*"
    or metadata regexp ".* shit_.*"
    or metadata regexp ".* hell_.*"
    or metadata regexp ".* damn_.*"
    or metadata regexp ".* goddamn_.*"
    or metadata regexp ".* Christ_.*"  # Jesus Christ
    or metadata regexp ".* ass_.*"
    or metadata regexp ".* god_.*"  # Oh my god
    or metadata regexp ".* bitch_.*"
    or metadata regexp ".* sucks_.*")
  and friends.person in ("ROSS", "RACHEL", "MONICA", "JOEY", "CHANDLER", "PHOEBE")
  and friends.person = inlineA.person
  and friends.scene_id = scenes.scene_id
  and scenes.scene_interaction in ("M", "F", "M,F")
  and scenes.scene_interaction = inlineA.scene_interaction
group by 1, 2, 3
order by per_1000
into outfile "c:\\taboo-general.txt" lines terminated by "\r\n";
Counts per individual taboo word
# !!! this will normalize the frequency based on the actor's number of utterances for each scene type !!!
# taboo vocabulary list taken from Jay (2009)
select friends.person, gender, no_of_utt,
       round(count(id)/no_of_utt*1000, 2) as per_1000, count(id)
from friends, (
  select distinct person, count(id) as no_of_utt
  from friends
  where friends.person in ("ROSS", "RACHEL", "MONICA", "JOEY", "CHANDLER", "PHOEBE")
  group by 1
  order by 1
) inlineA
where metadata regexp ".* god_.*"
  and friends.person in ("ROSS", "RACHEL", "MONICA", "JOEY", "CHANDLER", "PHOEBE")
  and friends.person = inlineA.person
group by 1, 2
order by per_1000
# substitute the regexp with each taboo word in turn:
# ".* fuck_.*"
# ".* shit_.*"
# ".* hell_.*"
# ".* damn_.*"
# ".* goddamn_.*"
# ".* Christ_.*"  # Jesus Christ
# ".* ass_.*"
# ".* god_.*"  # Oh my god
# ".* bitch_.*"
# ".* sucks_.*"
9.14 QUERY FOR RETRIEVING THE COLOURS USED BY THE GENDERS
# word list was taken from Wikipedia: http://en.wikipedia.org/wiki/List_of_colors
select distinct lower(word), gender, count(word)
from words
where only_word in (
  "Anti-flash white", "Beige", "Cosmic latte", "Cream", "Eggshell", "Ghost white", "Isabelline", "Ivory", "Magnolia", "Old lace", "", "Pearl", "Seashell", "Splashed white", "Vanilla", "White",
  "Amaranth", "Amaranth pink", "Brink pink", "Carmine pink", "Carnation pink", "Cerise", "Cerise pink", "Cherry blossom pink", "Coral pink", "Dark pink", "Deep carmine pink", "Deep pink", "Fandango", "French rose", "Fuchsia", "Fuchsia pink", "Hollywood cerise", "Hot magenta", "Hot pink", "Lavender pink", "Light pink", "Light thulian pink", "Magenta", "Mountbatten pink", "Nadeshiko pink", "Persian pink", "Persian rose", "Pink", "Puce", "Rose", "Rose pink", "Ruby", "Salmon pink", "Shocking pink", "Tea rose", "Thulian pink", "Ultra pink", "Variations of pink",
  "Alizarin crimson", "Amaranth", "American Rose", "Auburn", "Burgundy", "Burnt sienna", "Candy apple red", "Cardinal", "Carmine", "Carnelian", "Cerise", "Chestnut", "Coquelicot", "Coral red", "Crimson", "Dark pink", "Falu red", "Fire brick", "Fire engine red", "Flame", "Fuchsia", "Lava", "Lust", "Magenta", "Maroon", "Mauve", "Mauve taupe", "Orange-red", "Persian red", "Persimmon", "Pink", "Raspberry", "Red", "Red-violet", "Redwood", "Rose", "Rose madder", "Rosewood", "Rosso corsa", "Ruby", "Rufous", "Rust", "Sangria", "Scarlet", "Sinopia", "Terra cotta", "Tuscan red", "Upsdell red", "Venetian red", "Vermilion", "Wine",
  "Amber", "Apricot", "Atomic tangerine", "Brown", "Burnt orange", "Carrot orange", "Champagne", "Coral", "Dark salmon", "Deep carrot orange", "ECE/SAE Amber", "Flame", "Gamboge", "Gold", "Gold (metallic)", "International orange", "Mahogany", "Orange", "Orange-red", "Orange peel", "Papaya whip", "Peach", "Peach-orange", "Peach-yellow", "Persian orange", "Persimmon", "Pink-orange", "Portland Orange", "Princeton orange", "Pumpkin", "Rust", "Safety orange", "Salmon", "Sunset", "Tangelo", "Tangerine", "Tea rose", "Tenné", "Tomato", "Vermilion",
  "Auburn", "Beige", "Bistre", "Bole", "Bronze", "Brown", "Buff", "Burgundy", "Burnt sienna", "Burnt umber", "Camel", "Chamoisee", "Chestnut", "Chocolate", "Citrine", "Copper", "Cordovan", "Desert sand", "Earth yellow", "Ecru", "Fallow", "Fawn", "Fulvous", "Isabelline", "Khaki", "Liver", "Mahogany", "Maroon", "Ochre", "Raw umber", "Redwood", "Rufous", "Russet", "Rust", "Sandy brown", "Seal brown", "Sepia", "Sienna", "Sinopia", "Tan", "Taupe", "Tawny", "Umber", "Wenge", "Wheat",
  "Amber", "Apricot", "Arylide yellow", "Aureolin", "Beige", "Blond", "Buff", "Chartreuse yellow", "Chrome yellow", "Citrine", "Cream", "Dark goldenrod", "Ecru", "Flavescent", "Flax", "Fulvous", "Gamboge", "Gold", "Gold (metallic)", "Goldenrod", "Golden poppy", "Golden yellow", "Green-yellow", "Hansa yellow", "Icterine", "Isabelline", "Jasmine", "Jonquil", "Khaki", "Lemon", "Lemon chiffon", "Lime", "Maize", "Mikado yellow", "Mustard", "Naples yellow", "Navajo white", "Old gold", "Olive", "Pale gold", "Papaya whip", "Peach-yellow", "Pear", "Saffron", "School bus yellow", "Selective yellow", "Stil de grain yellow", "Sunglow", "Tangerine yellow", "Titanium yellow", "", "Urobilin", "", "Vanilla", "Vegas gold", "Yellow",
  "Gray", "Arsenic", "Ash gray", "Battleship gray", "Bistre", "Black", "Cadet gray", "Charcoal", "Cinereous", "Cool gray", "Davy's gray", "Feldgrau", "Glaucous", "Isabelline", "Liver", "Payne's gray", "Platinum", "Seal brown", "Silver", "Slate gray", "Taupe", "Purple taupe", "Medium taupe", "Taupe gray", "Pale taupe", "Rose quartz", "White", "Xanadu",
  "Army green", "Asparagus", "Bright green", "British racing green", "Cal Poly Pomona green", "Camouflage green", "Celadon", "Chartreuse", "Clover", "Dartmouth green", "Electric green", "Emerald", "Fern green", "Forest green", "Gray-asparagus", "Green", "Green-yellow", "Harlequin", "Honeydew", "Hooker's green", "Hunter green", "India green", "Islamic green", "Jade", "Jungle green", "Kelly green", "Lime", "Lime green", "Midnight green", "Mint cream", "Moss green", "MSU Green", "Myrtle", "Neon green", "Office green", "Olive", "Olive drab", "Pakistan green", "Paris Green", "Pear", "Persian green", "Phthalo green", "Pigment green", "Pine green", "Rifle green", "Sacramento State green", "Sap green", "Sea green", "Shamrock green", "Spring bud", "Spring green", "Tea green", "Teal", "UP Forest green", "Viridian", "Yellow-green", "Variations of green",
  "Alice blue", "Aqua", "Aquamarine", "Baby blue", "Bondi blue", "Cerulean", "Cyan", "Electric blue", "Midnight green", "Pine green", "Robin egg blue", "Teal", "Turquoise", "Verdigris", "Viridian",
  "Air Force blue", "Alice blue", "Azure", "Baby blue", "Bleu de France", "Blue", "Bondi blue", "Brandeis blue", "Cambridge Blue", "Carolina blue", "Ceil", "Cerulean", "Cobalt blue", "Columbia blue", "Cornflower blue", "Cyan", "Dark blue", "Deep sky blue", "Denim", "Dodger blue", "Duke blue", "Egyptian blue", "Electric blue", "Eton blue", "Federal blue", "Glaucous", "Han blue", "Iceberg", "Indigo", "International Klein Blue", "Iris", "Light blue", "Majorelle Blue", "Maya blue", "Midnight blue", "Navy blue", "Non-photo blue", "Palatinate blue", "Periwinkle", "Persian blue", "Phthalo blue", "Powder blue", "Prussian blue", "Royal blue", "Sapphire", "Sky blue", "Steel blue", "Teal", "Tiffany Blue", "True Blue", "Tufts Blue", "Turquoise", "UCLA Blue", "Ultramarine", "Yale Blue",
  "Amethyst", "Byzantium", "Cerise", "Eggplant", "Fandango", "Fuchsia", "Han purple", "Heliotrope", "Indigo", "Iris", "Lavender (floral)", "Lavender", "Lavender blush", "Lilac", "Magenta", "Mauve", "Orchid", "Palatinate purple", "Periwinkle", "Persian blue", "Purple", "Red-violet", "Regalia", "Rose", "Sangria", "Thistle", "Tyrian purple", "Violet", "Wisteria",
  "black", "gray", "silver", "white", "maroon", "red", "purple", "fuchsia", "green", "lime",
“Hooker's green”, “Hunter green”, “India green”, “Islamic green”, “Jade”, “Jungle green”, “Kelly green”, “Lime”, “Lime green”, “Midnight green”, “Mint cream”, “Moss green”, “MSU Green”, “Myrtle”, “Neon green”, “Office green”, “Olive”, “Olive drab”, “Pakistan green”, “Paris Green”, “Pear”, “Persian green”, “Phthalo green”, “Pigment green”, “Pine green”, “Rifle green”, “Sacramento State green”, “Sap green”, “Sea green”, “Shamrock green”, “Spring bud”, “Spring green”, “Tea green”, “Teal”, “UP Forest green”, “Viridian”, “Yellow-green”, “Variations of green”, “Alice blue”, “Aqua”, “Aquamarine”, “Baby blue”, “Bondi blue”, “Cerulean”, “Cyan”, “Electric blue”, “Midnight green”, “Pine green”, “Robin egg blue”, “Teal”, “Turquoise”, “Verdigris”, “Viridian”, “Air Force blue”, “Alice blue”, “Azure”, “Baby blue”, “Bleu de France”, “Blue”, “Bondi blue”, “Brandeis blue”, “Cambridge Blue”, “Carolina blue”, “Ceil”, “Cerulean”, “Cobalt blue”, “Columbia blue”, “Cornflower blue”, “Cyan”, “Dark blue”, “Deep sky blue”, “Denim”, “Dodger blue”, “Duke blue”, “Egyptian blue”, “Electric blue”, “Eton blue”, “Federal blue”, “Glaucous”, “Han blue”, “Iceberg”, “Indigo”, “International Klein Blue”, “Iris”, “Light blue”, “Majorelle Blue”, “Maya blue”, “Midnight blue”, “Navy blue”, “Non-photo blue”, “Palatinate blue”, “Periwinkle”, “Persian blue”, “Phthalo blue”, “Powder blue”, “Prussian blue”, “Royal blue”, “Sapphire”, “Sky blue”, “Steel blue”, “Teal”, “Tiffany Blue”, “True Blue”, “Tufts Blue”, “Turquoise”, “UCLA Blue”, “Ultramarine”, “Yale Blue”, “Amethyst”, “Byzantium”, “Cerise”, “Eggplant”, “Fandango”, “Fuchsia”, “Han purple”, “Heliotrope”, “Indigo”, “Iris”, “Lavender (floral)”, “Lavender”, “Lavender blush”, “Lilac”, “Magenta”, “Mauve”, “Orchid”, “Palatinate purple”, “Periwinkle”, “Persian blue”, “Purple”, “Red-violet”, “Regalia”, “Rose”, “Sangria”, “Thistle”, “Tyrian purple”, “Violet”, “Wisteria”, “black”, “gray”, “silver”, “white”, “maroon”, “red”, “purple”, “fuchsia”, “green”, “lime”, 
“olive”, “yellow”, “navy”, “blue”, “teal”, “aqua”) and right(word, 2) = “JJ” # adjectives only please! # and gender = “F” group by 1, 2 order by 3
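A minimal sketch of how the query above behaves, run against a toy SQLite database rather than the corpus itself. The table and column names (words, word, only_word, gender) follow the query; the rows are invented for illustration, and SQLite's substr(word, -2) stands in for MySQL's RIGHT(word, 2):

```python
# Illustrative sketch only: a toy "words" table mirroring the query's schema.
# In the corpus, "word" carries its POS tag as a suffix (hence the JJ filter)
# while "only_word" is the bare form matched against the colour list.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table words (word text, only_word text, gender text)")
conn.executemany("insert into words values (?, ?, ?)", [
    ("pinkJJ", "Pink", "F"),
    ("pinkJJ", "Pink", "F"),
    ("blueJJ", "Blue", "M"),
    ("pinkNN", "Pink", "M"),  # noun use; excluded by the adjective filter
])

# SQLite has no RIGHT(); substr(word, -2) plays the same role here.
result = conn.execute(
    "select distinct lower(word), gender, count(word) from words "
    "where only_word in ('Pink', 'Blue') "
    "and substr(word, -2) = 'JJ' "
    "group by 1, 2 order by 3"
).fetchall()
print(result)  # one row per (colour adjective, gender), ordered by frequency
```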
9.15 QUERY FOR RETRIEVING THE BACK-CHANNELLING VOCABULARY

select line, count(id)
from friends
where length(line) < 15 # arbitrary threshold
group by 1
having count(id) > 10 # arbitrary threshold
order by 2;
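The logic of the back-channelling query can be sketched against a toy SQLite table. The table and column names (friends, line, id) follow the query above; the rows are invented, and the thresholds are lowered so the toy data triggers them (the real thresholds are, as noted, arbitrary):

```python
# Illustrative sketch only: short lines that recur often are candidate
# back-channels ("Yeah.", "Uh-huh."); long turns are filtered out by length.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table friends (id integer primary key, line text)")
rows = (["Yeah."] * 4 + ["Uh-huh."] * 3
        + ["What you guys don't understand is kissing matters."])
conn.executemany("insert into friends (line) values (?)", [(r,) for r in rows])

# Same shape as the query above, with toy-sized thresholds (< 15 chars,
# > 2 occurrences instead of > 10).
result = conn.execute(
    "select line, count(id) from friends "
    "where length(line) < 15 "
    "group by 1 having count(id) > 2 order by 2"
).fetchall()
print(result)  # [('Uh-huh.', 3), ('Yeah.', 4)]
```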
9.16 THE CORPUS The corpus is available to those who wish to use it in further research. The text files are approximately 15 megabytes in size but compress to less than 4 megabytes. The corpus files, together with the SQL statements for creating the relevant tables, are available free of charge at: https://sites.google.com/site/friendstvcorpus/ or via email at ayliffe.david@gmail.com
10. COPYRIGHT Attention is drawn to the fact that copyright of this Dissertation rests with: (i) Anglia Ruskin University for one year and thereafter with (ii) Mr. David Ayliffe
This copy of the Dissertation has been supplied on condition that anyone who consults it is bound by copyright.
This work may (i) be made available for consultation within Anglia Ruskin University Library or (ii) be lent to other libraries for the purpose of consultation or may be photocopied for such purposes.