Indirect and direct use of corpora
Table of contents: The Kazakh-American Free University Academic Journal №7 - 2015
Zhanuzakova Nadezhda, Kazakh-American Free University, Kazakhstan
Chzhan Yelena, Kazakh-American Free University, Kazakhstan
The use of corpora in language teaching and learning has been more
indirect than direct. This is perhaps because the direct use of corpora in
language pedagogy is restricted by a number of factors including, for example,
the level and experience of learners, time constraints, curricular
requirements, knowledge and skills required of teachers for corpus analysis and
pedagogical mediation, and access to resources such as computers and
appropriate software tools and corpora, or a combination of these.
1. Reference publishing
Corpora can be
said to have revolutionized reference publishing (at least for English), be it
a dictionary or a reference grammar, in such a way that dictionaries published
since the 1990s have typically used corpus data in one way or another, so
that ‘even people who have never heard of a corpus are using the product of
corpus-based investigation’ (Hunston 2002: 96).
Corpora are useful in several ways for lexicographers. The greatest advantage of using
corpora in lexicography lies in their machine-readable nature, which
allows dictionary makers to extract all authentic, typical examples of the
usage of a lexical item from a large body of text in a few seconds. The second
advantage of the corpus-based approach, which is not readily available when
using citation slips, is the frequency information and quantification of
collocation which a corpus can readily provide. Some dictionaries, e.g. Cobuild 1995 and Longman 1995, include
such frequency information. Frequency data plays an even more important role in
the so-called frequency dictionaries, which define core vocabulary to help
learners of different modern languages, e.g. Davies (2005) for Spanish, Jones
and Tschirner (2005) for German, Davies and de Oliveira Preto-Bay (2007) for
Portuguese, Lonsdale and Bras (2009) for French, and Xiao, Rayson and McEnery
(2009) for Chinese. Information of this sort is particularly useful for
materials writers and language learners alike.
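The frequency information and collocation counts described above can be illustrated with a minimal Python sketch. It is not taken from any of the dictionaries cited: the sample sentences, the function names and the two-word window are assumptions made purely for illustration, and raw co-occurrence counts stand in for the statistical association measures a lexicographer would normally apply.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(tokens, node, window=2):
    """Count the words found within `window` tokens either side of `node`.

    Sentence boundaries are ignored here, which a real tool would not do.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts


# Toy "corpus" invented for this illustration only.
sample = (
    "She made a strong case. He drinks strong tea every morning. "
    "They serve strong tea and strong coffee."
)
tokens = tokenize(sample)
frequencies = Counter(tokens)  # word -> number of occurrences

print(frequencies.most_common(3))
print(collocates(tokens, "strong").most_common(3))
```

On a real corpus the same two counts, one per word and one per word pair, are the raw material for the frequency bands and collocation notes found in corpus-based dictionaries.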
A further benefit of using corpora is related to corpus mark-up and annotation.
Many available corpora (e.g. the British National Corpus, BNC) are encoded with
textual (e.g. register, genre and domain) and sociolinguistic (e.g. user gender
and age) metadata which allows lexicographers to give a more accurate
description of the usage of a lexical item. Corpus annotations such as
part-of-speech tagging and word sense disambiguation also enable a more
sensible grouping of polysemous words and homographs. Furthermore, a
monitor corpus, which is constantly updated, allows lexicographers to track
subtle changes in the meaning and usage of a lexical item so as to keep their
descriptions up to date.
Last but not
least, corpus evidence can complement or refute the intuitions of individual
lexicographers, which are not always reliable (Sinclair 1991: 112; Atkins and
Levin 1995; Murison-Bowie 1996: 184) so that dictionary entries are more accurate.
Hunston (2002: 96) summarizes the changes brought about by corpora
to dictionaries and other reference books in terms of five ‘emphases’: an
emphasis on frequency, an emphasis on collocation and phraseology, an emphasis
on variation, an emphasis on lexis in grammar, and an emphasis on authenticity.
It has been
noted that non-corpus-based grammars can contain biases while corpora can help
to improve grammatical descriptions (McEnery and Xiao 2005). The Longman
Grammar of Spoken and Written English (Biber et al 1999) can be considered
a new milestone in reference publishing following Quirk et al’s (1985) Comprehensive
Grammar of the English Language. Based entirely on the 40-million-word
Longman Spoken and Written English Corpus, the book gives ‘a thorough
description of English grammar, which is illustrated throughout with real
corpus examples, and which gives equal attention to the ways speakers and
writers actually use these linguistic resources’ (Biber et al 1999: 45).
The new corpus-based grammar is unique in many different ways, for example, by
taking account of register variations and exploring the differences between
written and spoken grammars.
While lexical information forms, to
some extent, an integral part of the grammatical description in Biber et al
(1999), it is the Collins COBUILD series (Sinclair 1990, 1992; Francis et al
1997, 1998) that focuses on lexis in grammatical descriptions (the so-called
‘pattern grammar’, Hunston and Francis 2000). In fact, Sinclair et al (1990)
flatly reject the distinction between lexis and grammar. While pattern grammars
focusing on the connection between pattern and meaning challenge the
traditional distinction between lexis and grammar, they are undoubtedly useful
in language learning as they provide ‘a resource for vocabulary building in
which the word is treated as part of a phrase rather than in isolation’
(Hunston 2002: 106).
For language pedagogy the most
important developments in lexicography relate to the learner dictionary. Yet
corpus-based learner dictionaries have quite a short history. It was only in
1987 that the Collins COBUILD English Language Dictionary (Sinclair 1987) was
published as the first
‘fully corpus-based’ dictionary. Yet the impact of this corpus-based dictionary
was such that most other publishers in the ELT market followed Collins’ lead.
By 1995, the new editions of major learner dictionaries such as the Longman
Dictionary of Contemporary English (3rd edition), the Oxford
Advanced Learner’s Dictionary (5th edition, Hornby and Crowther 1995), and a newcomer, the Cambridge
International Dictionary of English (Procter 1995), all claimed to be based
on corpus evidence in one way or another.
Syllabus design and materials development
While corpora have been used extensively to provide more accurate descriptions of
language use, a number of scholars have also used corpus data directly to look
critically at existing TEFL (Teaching English as a Foreign Language) syllabuses
and teaching materials. Mindt (1996), for example, finds that the use of
grammatical structures in textbooks for teaching English differs considerably
from the use of these structures in L1 English. He observes that one common
failure of English textbooks is that they teach ‘a kind of school English which
does not seem to exist outside the foreign language classroom’ (Mindt 1996:
232). As such, learners often find it difficult to communicate
successfully with native speakers. A simple yet important role of corpora in
language education is to provide more realistic examples of language usage that
reflect the complexities and nuances of natural language.
A focus of
the lexical approach to language pedagogy is teaching collocations (i.e. habitual
co-occurrences of lexical items) and the related concept of prefabricated
units. There is a consensus that collocational knowledge is important for
developing L1/L2 language skills. Collocational knowledge indicates which
lexical items co-occur frequently with others and how they combine within a
sentence. Such knowledge is evidently more important than individual words
themselves (Kita and Ogata 1997: 230) and is needed for effective sentence
generation (Smadja and McKeown 1990).
Corpora are useful in this respect, not only because collocations can only reliably be
measured quantitatively, but also because the KWIC (key word in context) view
of corpus data exposes learners to a great deal of authentic data in a
structured way. Kennedy (2003) discusses the relationship between corpus
data and the nature of language learning, focusing on the teaching of
collocations. The author argues that second or foreign language learning is a
process of learning ‘explicit knowledge’ with awareness, which requires a great
deal of exposure to language data.
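The KWIC view mentioned above can be sketched in a few lines of Python. This is an illustrative assumption rather than the output format of any particular concordancer: the function name `kwic`, the context width and the sample sentences are all invented here.

```python
import re


def kwic(text, keyword, width=30):
    """Return one line per occurrence of `keyword`, with the keyword aligned.

    Matching is case-insensitive and limited to whole words.
    """
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so every keyword starts in the same column.
        lines.append(f"{left:>{width}}[{m.group(0)}]{right}")
    return lines


# Toy text invented for this illustration only.
sample = (
    "Heavy rain fell all night. The heavy traffic delayed us, "
    "and heavy rain returned at dawn."
)
for line in kwic(sample, "heavy", width=20):
    print(line)
```

Because the keyword sits in a fixed column, recurring neighbours such as ‘rain’ line up vertically, which is what makes collocational patterns visible to learners at a glance.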
In addition to the lexical focus, corpus-based teaching materials try to demonstrate how
the target language is actually used in different contexts, as exemplified in
Biber et al’s (2002) Longman Student Grammar of Spoken and Written
English, which pays special attention to how English is used differently in
various spoken and written registers.
2. Language testing
Another emerging area of language pedagogy which has started to use the corpus-based
approach is language testing. Alderson (1996) envisaged the following possible
uses of corpora in this area: test construction, compilation and selection,
test presentation, response capture, test scoring, and calculation and delivery
of results. He concludes that ‘the potential advantages of basing our tests on
real language data, of making data-based judgments about candidates’ abilities,
knowledge and performance are clear enough. A crucial question is whether the
possible advantages are born out in practice’ (Alderson 1996: 258-259). The
concern raised in Alderson’s conclusion appears to have been addressed
satisfactorily, so that computer-based tests are now recognized as
being comparable to paper-based tests (e.g. computer-based versus paper-based TOEFL).
A number of
corpus-based studies of language testing have been reported. For example,
Coniam (1997) demonstrated how to use word frequency data extracted from
corpora to generate cloze tests automatically. Kaszubski and Wojnowska (2003)
presented a corpus-driven computer program, Test Builder, for building
sentence-based ELT exercises. The program can process raw corpora of plain
texts or corpora annotated with part-of-speech information, using another
linked computer program that assigns the part-of-speech category to each word
in the corpus automatically in real time. The annotated data is used in turn as
input for test material selection. Indeed, corpora have recently been used by
major providers of test services for a number of purposes:
- as an archive of examination scripts;
- to develop test materials;
- to optimize test procedures;
- to improve the quality of test marking;
- to validate tests.
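The idea of generating cloze tests from word frequency data can be sketched as follows. This is a simplified illustration and not Coniam’s (1997) actual procedure: the selection rule (blank any word at or below a frequency threshold), the parameters `max_freq` and `min_gap`, and the tiny reference text are all assumptions made here.

```python
import re
from collections import Counter


def make_cloze(passage, freq, max_freq, min_gap=3):
    """Blank out words whose reference frequency is at most `max_freq`.

    `min_gap` is the number of words that must be left intact between two
    blanks. Words missing from `freq` are never blanked.
    Returns the gapped text and the list of answers.
    """
    gapped, answers = [], []
    since_last = min_gap  # lets the first eligible word be blanked
    for word in passage.split():
        core = re.sub(r"^\W+|\W+$", "", word)  # strip surrounding punctuation
        count = freq.get(core.lower()) if core else None
        if count is not None and count <= max_freq and since_last >= min_gap:
            answers.append(core)
            gapped.append(word.replace(core, f"({len(answers)}) ____", 1))
            since_last = 0
        else:
            gapped.append(word)
            since_last += 1
    return " ".join(gapped), answers


# Toy reference text standing in for a real corpus frequency list.
reference = "the cat sat on the mat the dog sat on the rug the cat saw the dog"
freq = Counter(reference.split())

text, key = make_cloze("The cat sat on the rug.", freq, max_freq=1)
print(text)
print(key)
```

Raising `max_freq` blanks more common words and so produces an easier, denser test, while `min_gap` keeps enough context between blanks for the item to remain answerable.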
3. Teacher development
For learners to benefit from the use of corpora, language teachers must
first of all be equipped with a sound knowledge of the corpus-based approach.
It is unsurprising then to discover that corpora have been used in training
language teachers (Allan 1999, 2002; Conrad 1999; Seidlhofer 2000, 2002;
O’Keeffe and Farr 2003). Allan (1999), for example, demonstrates how to use
corpus data to raise the language awareness of English teachers in Hong Kong
secondary schools. Conrad (1999) presents a corpus-based study of
linking adverbials (e.g. therefore and in other words), on the
basis of which she suggests that it is important for a language teacher to do
more than using classroom concordancing and lexical or lexico-grammatical
analyses if language teaching is to take full advantage of the corpus-based
approach. Conrad’s concern with teacher education is echoed by O’Keeffe and
Farr (2003), who argue that corpus linguistics should be included in initial
language teacher education so as to enhance teachers’ research skills and language awareness.
While indirect uses such as syllabus design and materials development are
closely associated with what to teach, corpora have also provided valuable
insights into how to teach. Of Leech’s (1997) three focuses, direct uses of
corpora include ‘teaching about’, ‘teaching to exploit’, and ‘exploiting to
teach’, with the latter two relating to how to use corpora. Owing to a number of
restricting factors, direct uses have so far been confined largely to learning
at more advanced levels, for example, in tertiary education, whereas in general
English language teaching, especially in secondary education (Braun 2007),
the direct use of corpora is ‘still conspicuously absent’ (Kaltenböck and
Mehlmauer-Larcher 2005).
‘Teaching about’ means teaching corpus linguistics as an academic subject like other
sub-disciplines of linguistics such as syntax and pragmatics. Corpus
linguistics has now found its way into the curricula for linguistics and
language-related degree programmes at both postgraduate and undergraduate
levels in many universities around the world. ‘Teaching to exploit’ means
providing students with ‘hands-on’ know-how, as emphasized in McEnery, Xiao and
Tono (2006), so that they can exploit corpora for their own purposes. Once
the student has acquired the necessary knowledge and techniques of corpus-based
language study, the learning activity may become student-centred. ‘Exploiting to
teach’ means using a corpus-based approach to teaching language and linguistics
courses (e.g. sociolinguistics and discourse analysis), which would otherwise
be taught using non-corpus-based methods.
While the focuses of ‘teaching about’ and ‘exploiting to teach’ are viewed as being
associated typically with students of linguistics and language programmes,
‘teaching to exploit’ relates to students of all subjects which involve
language study and learning, who are expected to benefit from the so-called
data-driven learning (DDL) or ‘discovery learning’.
The question of how to use corpora in the language classroom has been discussed extensively
in the literature. With the corpus-based approach to language pedagogy, the
traditional ‘three P’s’ (Presentation – Practice – Production) approach to
teaching may not be entirely suitable. Instead, the more exploratory approach
of ‘three I’s’ (Illustration – Interaction – Induction) may be more
appropriate, where ‘illustration’ means looking at real data, ‘interaction’
means discussing and sharing opinions and observations, and ‘induction’ means
making one’s own rule for a particular feature, which ‘will be refined and
honed as more and more data is encountered’ (Carter and McCarthy 1995: 155).
This progressive induction approach is what Murison-Bowie (1996: 191)
would call the interlanguage approach: namely, partial and incomplete
generalizations are drawn from limited data as a stage on the way towards a
fully satisfactory rule. While the ‘three I’s’ approach was originally proposed
by Carter and McCarthy (1995) to teach spoken grammar, it may also apply to
language education as a whole.
It is clear
that the exploratory teaching approach focusing on ‘three I’s’ is in line with
Johns’ (1991) concept of ‘data-driven learning (DDL)’. Johns was perhaps among
the first to realize the potential of corpora for language learners (Higgins
and Johns 1984). In his opinion, ‘research is too serious to be left to the
researchers’ (Johns 1991: 2). As such, he argues that the language learner
should be encouraged to become ‘a research worker whose learning needs to be
driven by access to linguistic data’ (ibid).
Data-driven learning can be either teacher-directed or learner-led (i.e.
discovery learning) to suit the needs of learners at different levels, but it
is basically learner-centred. This autonomous learning process ‘gives the
student the realistic expectation of breaking new ground as a “researcher”,
doing something which is a unique and individual contribution’ (Leech 1997: 10).
It is important to note, however, that the key to successful data-driven
learning, even if it is student-centred, is the appropriate level of teacher
guidance or pedagogical mediation depending on the learners’ age, experience,
and proficiency level, because ‘a corpus is not a simple object, and it is just
as easy to derive nonsensical conclusions from the evidence as insightful ones’
(Sinclair 2004: 2). In this sense, it is even more important for language
teachers to be equipped with the necessary training in corpus analysis.
Johns (1991) identifies three stages of inductive reasoning with corpora
in the DDL approach: observation (of concordance evidence), classification (of
salient features) and generalization (of rules). The three stages roughly
correspond to Carter and McCarthy’s (1995) ‘three I’s’. The DDL approach is
fundamentally different from the ‘three P’s’ approach in that the former
involves bottom-up induction whereas the latter involves top-down deduction.
The direct use of corpora and concordancing in the language classroom has been
discussed extensively in the literature, covering a wide range of issues
including, for example, underlying theories, methods and techniques, and
problems and solutions.
References
1. Alderson, J. C. (1996) ‘Do corpora have a role in language assessment?’ in J. A.
Thomas and M. H. Short (eds.) Using Corpora for Language Research, pp. 248-259.
2. Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E.
(1999) Longman Grammar of Spoken and Written English. London: Longman.
3. Biber, D., Leech, G. and Conrad, S. (2002) Longman Student Grammar
of Spoken and Written English. Harlow: Longman.
4. Braun, S. (2007) ‘Integrating corpus work into secondary education: From
data-driven learning to needs-driven corpora’. ReCALL 19/3: 307-328.
5. Carter, R. and McCarthy, M. (1995) ‘Grammar and the spoken language’. Applied
Linguistics 16/2: 141-158.
6. Hunston, S. (2002) Corpora in Applied Linguistics. Cambridge: Cambridge
7. Kaltenböck, G. and Mehlmauer-Larcher, B.
(2005) ‘Computer corpora and the language classroom: On the potential and
limitations of computer corpora in language teaching’. ReCALL 17: 65-84.
8. Kennedy, G. (2003) ‘Amplifier collocations in the British National Corpus:
Implications for English language teaching’. TESOL Quarterly 37/3:
9. Leech, G. (1997) ‘Teaching and language corpora: A convergence’ in A. Wichmann,
S. Fligelstone, T. McEnery and G. Knowles (eds.) Teaching and Language
Corpora, pp. 1-23. New York: Addison Wesley Longman.
10. McEnery, A. and Xiao, R. (2005) ‘Help or help to: What do corpora
have to say?’ English Studies 86/2: 161-187.
11. McEnery, T., Xiao, R. and Tono, Y. (2006) Corpus-Based Language
Studies: An Advanced Resource Book. London: Routledge.
12. McEnery, T. and Xiao, R. (2011) ‘What corpora can offer
in language teaching and learning’.
13. Mindt, D. (1996) ‘English corpus linguistics and the foreign language teaching
syllabus’ in J. Thomas and M. Short (eds.) Using Corpora for Language
Research, pp. 232-247. Harlow: Longman.
14. Sinclair, J. (1987) Collins COBUILD English Language Dictionary. London: