Indirect and direct use of corpora

Table of contents: The Kazakh-American Free University Academic Journal №7 - 2015

Zhanuzakova Nadezhda, Kazakh-American Free University, Kazakhstan
Chzhan Yelena, Kazakh-American Free University, Kazakhstan

The use of corpora in language teaching and learning has been more indirect than direct. This is perhaps because the direct use of corpora in language pedagogy is restricted by a number of factors, including the level and experience of learners, time constraints, curricular requirements, the knowledge and skills required of teachers for corpus analysis and pedagogical mediation, and access to resources such as computers and appropriate software tools and corpora, or a combination of these.

1. Reference publishing

Corpora can be said to have revolutionized reference publishing (at least for English), be it a dictionary or a reference grammar, to such an extent that dictionaries published since the 1990s have typically used corpus data in one way or another, so that ‘even people who have never heard of a corpus are using the product of corpus-based investigation’ (Hunston 2002: 96) [6].

Corpora are useful in several ways for lexicographers. The greatest advantage of using corpora in lexicography lies in their machine-readable nature, which allows dictionary makers to extract all authentic, typical examples of the usage of a lexical item from a large body of text in a few seconds. The second advantage of the corpus-based approach, which is not readily available when using citation slips, is the frequency information and quantification of collocation which a corpus can readily provide. Some dictionaries, e.g. Cobuild 1995 and Longman 1995, include such frequency information. Frequency data plays an even more important role in the so-called frequency dictionaries, which define core vocabulary to help learners of different modern languages, e.g. Davies (2005) for Spanish, Jones and Tschirner (2005) for German, Davies and de Oliveira Preto-Bay (2007) for Portuguese, Lonsdale and Le Bras (2009) for French, and Xiao, Rayson and McEnery (2009) for Chinese. Information of this sort is particularly useful for materials writers and language learners alike [12].

A further benefit of using corpora is related to corpus mark-up and annotation. Many available corpora (e.g. the British National Corpus, BNC) are encoded with textual (e.g. register, genre and domain) and sociolinguistic (e.g. user gender and age) metadata, which allows lexicographers to give a more accurate description of the usage of a lexical item. Corpus annotations such as part-of-speech tagging and word sense disambiguation also enable a more sensible grouping of polysemous words and homographs. Furthermore, a monitor corpus, which is constantly updated, allows lexicographers to track subtle changes in the meaning and usage of a lexical item so as to keep their dictionaries up to date.

Last but not least, corpus evidence can complement or refute the intuitions of individual lexicographers, which are not always reliable (Sinclair 1991: 112; Atkins and Levin 1995; Murison-Bowie 1996: 184) so that dictionary entries are more accurate. Hunston (2002: 96) [6] summarizes the changes brought about by corpora to dictionaries and other reference books in terms of five ‘emphases’: an emphasis on frequency, an emphasis on collocation and phraseology, an emphasis on variation, an emphasis on lexis in grammar, and an emphasis on authenticity [12].

It has been noted that non-corpus-based grammars can contain biases while corpora can help to improve grammatical descriptions (McEnery and Xiao 2005) [10]. The Longman Grammar of Spoken and Written English (Biber et al 1999) can be considered a new milestone in reference publishing following Quirk et al’s (1985) Comprehensive Grammar of the English Language. Based entirely on the 40-million-word Longman Spoken and Written English Corpus, the book gives ‘a thorough description of English grammar, which is illustrated throughout with real corpus examples, and which gives equal attention to the ways speakers and writers actually use these linguistic resources’ (Biber et al 1999: 45) [2]. The new corpus-based grammar is unique in many different ways, for example, by taking account of register variations and exploring the differences between written and spoken grammars.

While lexical information forms, to some extent, an integral part of the grammatical description in Biber et al (1999), it is the Collins COBUILD series (Sinclair 1990, 1992; Francis et al 1997, 1998) that focuses on lexis in grammatical descriptions (the so-called ‘pattern grammar’, Hunston and Francis 2000). In fact, Sinclair et al (1990) flatly reject the distinction between lexis and grammar. While pattern grammars focusing on the connection between pattern and meaning challenge the traditional distinction between lexis and grammar, they are undoubtedly useful in language learning as they provide ‘a resource for vocabulary building in which the word is treated as part of a phrase rather than in isolation’ (Hunston 2002: 106) [6].

For language pedagogy the most important developments in lexicography relate to the learner dictionary. Yet corpus-based learner dictionaries have quite a short history. It was only in 1987 that the Collins COBUILD English Language Dictionary (Sinclair 1987) [14] was published as the first ‘fully corpus-based’ dictionary. The impact of this corpus-based dictionary was such that most other publishers in the ELT market followed Collins’ lead. By 1995, the new editions of major learner dictionaries such as the Longman Dictionary of Contemporary English (3rd edition), the Oxford Advanced Learner’s Dictionary (5th edition, Hornby and Crowther 1995), and a newcomer, the Cambridge International Dictionary of English (Procter 1995), all claimed to be based on corpus evidence in one way or another.

Syllabus design and materials development

While corpora have been used extensively to provide more accurate descriptions of language use, a number of scholars have also used corpus data directly to look critically at existing TEFL (Teaching English as a Foreign Language) syllabuses and teaching materials. Mindt (1996), for example, finds that the use of grammatical structures in textbooks for teaching English differs considerably from the use of these structures in L1 English. He observes that one common failure of English textbooks is that they teach ‘a kind of school English which does not seem to exist outside the foreign language classroom’ (Mindt 1996: 232) [13]. As such, learners often find it difficult to communicate successfully with native speakers. A simple yet important role of corpora in language education is to provide more realistic examples of language usage that reflect the complexities and nuances of natural language.

A focus of the lexical approach to language pedagogy is teaching collocations (i.e. habitual co-occurrences of lexical items) and the related concept of prefabricated units. There is a consensus that collocational knowledge is important for developing L1/L2 language skills. Collocational knowledge indicates which lexical items co-occur frequently with others and how they combine within a sentence. Such knowledge is evidently more important than individual words themselves (Kita and Ogata 1997: 230) and is needed for effective sentence generation (Smadja and McKeown 1990) [12].
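To make the idea of measuring co-occurrence concrete, the following minimal Python sketch scores adjacent word pairs in a small text by pointwise mutual information (PMI). PMI is only one of several association measures used in corpus linguistics and is chosen here as an assumption for illustration; the function names `tokenize` and `collocation_scores`, the `min_count` threshold and the sample sentences are likewise hypothetical and are not taken from the studies cited above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def collocation_scores(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    PMI compares how often two words occur next to each other with how
    often they would be expected to do so if they were independent.
    Pairs seen fewer than min_count times are ignored, because PMI is
    unreliable for very rare events.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in pair_counts.items():
        if count < min_count:
            continue
        p_pair = count / n_pairs
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    # Highest-scoring (most strongly associated) pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = (
        "She made a strong tea. He likes strong tea in the morning. "
        "The strong wind blew. A cup of tea and a powerful engine."
    )
    for pair, score in collocation_scores(tokenize(sample)):
        print(pair, round(score, 2))
```

On a real corpus the threshold and the choice of measure would need tuning, since PMI tends to favour rare word pairs.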

Corpora are useful in this respect, not only because collocations can only reliably be measured quantitatively, but also because the KWIC (key word in context) view of corpus data exposes learners to a great deal of authentic data in a structured way. Kennedy (2003) [8] discusses the relationship between corpus data and the nature of language learning, focusing on the teaching of collocations. The author argues that second or foreign language learning is a process of learning ‘explicit knowledge’ with awareness, which requires a great deal of exposure to language data.
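The KWIC view mentioned above can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not a description of any particular concordancer: the function name `kwic`, the `width` parameter and the sample text are hypothetical, and real tools work on tokenized and often annotated corpora, not on a raw string.

```python
import re


def kwic(text, keyword, width=30):
    """Return key-word-in-context lines for each occurrence of keyword.

    Each line shows up to `width` characters of left context, the matched
    word in square brackets, and up to `width` characters of right context.
    The left context is right-aligned so that the keywords line up.
    """
    # Collapse line breaks and repeated spaces so each hit fits on one line.
    text = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group(0)}]{right}")
    return lines


if __name__ == "__main__":
    sample = (
        "The committee made a decision quickly. You must make a decision "
        "soon. Her decision was final."
    )
    for line in kwic(sample, "decision"):
        print(line)
```

Aligning the keyword in a single column is what lets learners scan the words immediately to its left and right and notice recurrent patterns.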

In addition to the lexical focus, corpus-based teaching materials try to demonstrate how the target language is actually used in different contexts, as exemplified in Biber et al’s (2002) [3] Longman Student Grammar of Spoken and Written English, which pays special attention to how English is used differently in various spoken and written registers.

2. Language testing

Another emerging area of language pedagogy which has started to use the corpus-based approach is language testing. Alderson (1996) envisaged the following possible uses of corpora in this area: test construction, compilation and selection, test presentation, response capture, test scoring, and calculation and delivery of results. He concludes that ‘the potential advantages of basing our tests on real language data, of making data-based judgments about candidates’ abilities, knowledge and performance are clear enough. A crucial question is whether the possible advantages are born out in practice’ (Alderson 1996: 258-259) [1]. The concern raised in Alderson’s conclusion appears to have been addressed satisfactorily, so that computer-based tests are now recognized as being comparable to paper-based tests (e.g. computer-based versus paper-based TOEFL tests).

A number of corpus-based studies of language testing have been reported. For example, Coniam (1997) demonstrated how to use word frequency data extracted from corpora to generate cloze tests automatically. Kaszubski and Wojnowska (2003) presented a corpus-driven computer program, Test Builder, for building sentence-based ELT exercises. The program can process raw corpora of plain texts or corpora annotated with part-of-speech information, using another linked computer program that assigns the part-of-speech category to each word in the corpus automatically in real time. The annotated data is used in turn as input for test material selection. Indeed, corpora have recently been used by major providers of test services for a number of purposes [12]:

- as an archive of examination scripts;

- to develop test materials;

- to optimize test procedures;

- to improve the quality of test marking;

- to validate tests; and

- to standardize tests.
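The idea of using word frequency data to generate cloze tests, as in the work of Coniam (1997) mentioned above, can be sketched in a few lines of Python. The sketch does not reproduce Coniam’s actual procedure: the selection rule (blank out words that are rare in a reference frequency list), the function name `make_cloze`, the parameters `max_count` and `min_length`, and the toy frequency figures are all assumptions made for illustration.

```python
import re


def make_cloze(passage, reference_counts, max_count=5, min_length=4):
    """Turn a passage into a cloze test using a reference frequency list.

    A word is replaced by a numbered gap when it is at least `min_length`
    letters long and occurs no more than `max_count` times in
    `reference_counts` (a mapping from lowercase word to frequency).
    Returns the gapped text and the list of removed words (the answer key).
    """
    answers = []

    def blank_if_rare(match):
        word = match.group(0)
        frequency = reference_counts.get(word.lower(), 0)
        if len(word) >= min_length and frequency <= max_count:
            answers.append(word)
            return f"({len(answers)}) ____"
        return word

    cloze_text = re.sub(r"[A-Za-z]+", blank_if_rare, passage)
    return cloze_text, answers


if __name__ == "__main__":
    # Hypothetical frequencies standing in for counts from a real corpus.
    frequencies = {
        "the": 50, "to": 45, "a": 40, "use": 20,
        "study": 8, "learners": 3, "corpus": 2, "collocations": 1,
    }
    text, key = make_cloze(
        "Learners use a corpus to study collocations.", frequencies
    )
    print(text)
    print(key)
```

A test writer could equally target a mid-frequency band or a particular part of speech; the threshold here only shows how frequency information can drive item selection.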

3. Teacher development

For learners to benefit from the use of corpora, language teachers must first of all be equipped with a sound knowledge of the corpus-based approach. It is unsurprising then to discover that corpora have been used in training language teachers (Allan 1999, 2002; Conrad 1999; Seidlhofer 2000, 2002; O’Keeffe and Farr 2003). Allan (1999), for example, demonstrates how to use corpus data to raise the language awareness of English teachers in Hong Kong secondary schools. Conrad (1999) presents a corpus-based study of linking adverbials (e.g. therefore and in other words), on the basis of which she suggests that it is important for a language teacher to do more than use classroom concordancing and lexical or lexico-grammatical analyses if language teaching is to take full advantage of the corpus-based approach. Conrad’s concern with teacher education is echoed by O’Keeffe and Farr (2003), who argue that corpus linguistics should be included in initial language teacher education so as to enhance teachers’ research skills and language awareness [12].

While indirect uses such as syllabus design and materials development are closely associated with what to teach, corpora have also provided valuable insights into how to teach. Direct uses of corpora cover Leech’s (1997) [9] three focuses, ‘teaching about’, ‘teaching to exploit’, and ‘exploiting to teach’, with the latter two relating to how to use corpora. Owing to a number of restricting factors, direct uses have so far been confined largely to learning at more advanced levels, for example, in tertiary education, whereas in general English language teaching, especially in secondary education (Braun 2007) [4], the direct use of corpora is ‘still conspicuously absent’ (Kaltenböck and Mehlmauer-Larcher 2005) [7].

‘Teaching about’ means teaching corpus linguistics as an academic subject like other sub-disciplines of linguistics such as syntax and pragmatics. Corpus linguistics has now found its way into the curricula for linguistics and language-related degree programmes at both postgraduate and undergraduate levels in many universities around the world. ‘Teaching to exploit’ means providing students with ‘hands-on’ know-how, as emphasized in McEnery, Xiao and Tono (2006) [11], so that they can exploit corpora for their own purposes. Once the student has acquired the necessary knowledge and techniques of corpus-based language study, the learning activity may become student-centred. ‘Exploiting to teach’ means using a corpus-based approach to teaching language and linguistics courses (e.g. sociolinguistics and discourse analysis), which would otherwise be taught using non-corpus-based methods.

If the focuses of ‘teaching about’ and ‘exploiting to teach’ are viewed as being associated typically with students of linguistics and language programmes, ‘teaching to exploit’ relates to students of all subjects which involve language study and learning, who are expected to benefit from the so-called data-driven learning (DDL) or ‘discovery learning’.

The issue of how to use corpora in the language classroom has been discussed extensively in the literature. With the corpus-based approach to language pedagogy, the traditional ‘three P’s’ (Presentation – Practice – Production) approach to teaching may not be entirely suitable. Instead, the more exploratory approach of ‘three I’s’ (Illustration – Interaction – Induction) may be more appropriate, where ‘illustration’ means looking at real data, ‘interaction’ means discussing and sharing opinions and observations, and ‘induction’ means making one’s own rule for a particular feature, which ‘will be refined and honed as more and more data is encountered’ (Carter and McCarthy 1995: 155) [5]. This progressive induction approach is what Murison-Bowie (1996: 191) would call the interlanguage approach: namely, partial and incomplete generalizations are drawn from limited data as a stage on the way towards a fully satisfactory rule. While the ‘three I’s’ approach was originally proposed by Carter and McCarthy (1995) to teach spoken grammar, it may also apply to language education as a whole [12].

It is clear that the exploratory teaching approach focusing on ‘three I’s’ is in line with Johns’ (1991) concept of ‘data-driven learning (DDL)’. Johns was perhaps among the first to realize the potential of corpora for language learners (Higgins and Johns 1984). In his opinion, ‘research is too serious to be left to the researchers’ (Johns 1991: 2). As such, he argues that the language learner should be encouraged to become ‘a research worker whose learning needs to be driven by access to linguistic data’ (ibid) [12].

Data-driven learning can be either teacher-directed or learner-led (i.e. discovery learning) to suit the needs of learners at different levels, but it is basically learner-centred. This autonomous learning process ‘gives the student the realistic expectation of breaking new ground as a “researcher”, doing something which is a unique and individual contribution’ (Leech 1997: 10) [9]. It is important to note, however, that the key to successful data-driven learning, even if it is student-centred, is the appropriate level of teacher guidance or pedagogical mediation depending on the learners’ age, experience, and proficiency level, because ‘a corpus is not a simple object, and it is just as easy to derive nonsensical conclusions from the evidence as insightful ones’ (Sinclair 2004: 2). In this sense, it is even more important for language teachers to be equipped with the necessary training in corpus analysis [12].

Johns (1991) identifies three stages of inductive reasoning with corpora in the DDL approach: observation (of concordance evidence), classification (of salient features) and generalization (of rules). The three stages roughly correspond to Carter and McCarthy’s (1995) ‘three I’s’. The DDL approach is fundamentally different from the ‘three P’s’ approach in that the former involves bottom-up induction whereas the latter involves top-down deduction. The direct use of corpora and concordancing in the language classroom has been discussed extensively in the literature, covering a wide range of issues including, for example, underlying theories, methods and techniques, and problems and solutions [12].


1. Alderson, J.C. (1996) ‘Do corpora have a role in language assessment?’ in J. Thomas and M. Short (eds.) Using Corpora for Language Research, pp. 248-259. London: Longman.

2. Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. (1999) Longman Grammar of Spoken and Written English. London: Longman.

3. Biber, D., Leech, G. and Conrad, S. (2002) Longman Student Grammar of Spoken and Written English. Harlow: Longman.

4. Braun, S. (2007) ‘Integrating corpus work into secondary education: From data-driven learning to needs-driven corpora’. ReCALL 19/3: 307-328.

5. Carter, R. and McCarthy, M. (1995) ‘Grammar and the spoken language’. Applied Linguistics 16/2: 141-158.

6. Hunston, S. (2002) Corpora in Applied Linguistics. Cambridge: Cambridge University Press.

7. Kaltenböck, G. and Mehlmauer-Larcher, B. (2005) ‘Computer corpora and the language classroom: On the potential and limitations of computer corpora in language teaching’. ReCALL 17: 65-84.

8. Kennedy, G. (2003) ‘Amplifier collocations in the British National Corpus: Implications for English language teaching’. TESOL Quarterly 37/3: 467-487.

9. Leech, G. (1997) ‘Teaching and language corpora: A convergence’ in A. Wichmann, S. Fligelstone, T. McEnery and G. Knowles (eds.) Teaching and Language Corpora, pp. 1-23. New York: Addison Wesley Longman.

10. McEnery, A. and Xiao, R. (2005) ‘Help or help to: What do corpora have to say?’ English Studies 86(2): 161-187.

11. McEnery, T., Xiao, R. and Tono, Y. (2006) Corpus-Based Language Studies: An Advanced Resource Book. London: Routledge.

12. McEnery, T. and Xiao, R. (2011) What Corpora Can Offer in Language Teaching and Learning.

13. Mindt, D. (1996) ‘English corpus linguistics and the foreign language teaching syllabus’ in J. Thomas and M. Short (eds.) Using Corpora for Language Research, pp. 232-247. Harlow: Longman.

14. Sinclair, J. (1987) Collins COBUILD English Language Dictionary. London: HarperCollins.
