CEEAUS
Corpus of English Essays Written by Asian University Students









Updated 2010/4/10

The English learner corpus CEEAUS2010 β Edition is freely downloadable from here.
Please unzip the downloaded file with the password "rokko".
You will need a concordancer such as Wordsmith or AntConc to analyze CEEAUS.





CEEAUS2010 Beta Edition Release Note in English (V 201004)
Dr Shin Ishikawa (Kobe University)





1. What is a learner corpus?

 When we use the term 'corpus,' we usually refer to a language database produced by native speakers. For example, the British National Corpus, which includes 100 million words of written and spoken English produced by English (or rather "British") native speakers, has been widely used as a reliable sample of real English. Currently many dictionaries and TESOL materials are based on the BNC or similar native speaker corpora. However, increasing numbers of researchers have recently become interested in the "learner corpus" as well as the conventional native speaker corpus. Why do we need to examine learner English, which may include errors and awkward expressions?

 In the study of second language acquisition (SLA), the special type of language existing between one's first language (L1) and second language (L2) is called interlanguage (IL). For example, when Japanese learners of English speak or write in English, their spoken or written output is regarded as an IL. It is neither Japanese, their L1, nor native-like English, their L2, but a distinct variety lying between the two. Analysis of a learner corpus helps elucidate the features of the interlanguage and the psychological processes involved in acquiring a second language.

 Learner corpus studies are also beneficial for foreign language teaching. By comparing a learner corpus with a native speaker corpus, we can identify learners' typical errors and overused or underused patterns. Most recent ESL dictionaries compiled in the UK draw on findings obtained from learner corpus analysis.
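 The comparison of overused and underused patterns can be sketched in a few lines of Python. The example below uses the log-likelihood (G2) statistic, which is widely used in corpus linguistics for comparing word frequencies in two corpora; this release note does not say which statistic was used with CEEAUS, so treat the choice as an assumption. The function names `log_likelihood` and `keyness` are illustrative only, and the token lists are expected to be prepared by the reader.

```python
import math
from collections import Counter


def log_likelihood(freq_a: int, size_a: int, freq_b: int, size_b: int) -> float:
    """Log-likelihood (G2) for one word observed in two corpora."""
    total = size_a + size_b
    combined = freq_a + freq_b
    expected_a = size_a * combined / total
    expected_b = size_b * combined / total
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def keyness(learner_tokens, native_tokens):
    """Rank words by G2 and label each as overused or underused by learners."""
    if not learner_tokens or not native_tokens:
        raise ValueError("both token lists must be non-empty")
    learner, native = Counter(learner_tokens), Counter(native_tokens)
    n_learner, n_native = len(learner_tokens), len(native_tokens)
    rows = []
    for word in set(learner) | set(native):
        f_l, f_n = learner[word], native[word]
        rel_l, rel_n = f_l / n_learner, f_n / n_native
        if rel_l > rel_n:
            direction = "overused"
        elif rel_l < rel_n:
            direction = "underused"
        else:
            direction = "same"
        rows.append((word, direction, log_likelihood(f_l, n_learner, f_n, n_native)))
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

 Relative frequencies decide the direction (overuse or underuse), while G2 indicates how strongly the two corpora differ for that word. With corpora of very different sizes, as in CEEAUS, comparing raw counts alone would be misleading.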





2. Various learner corpora

 Various learner corpora have been compiled to date. The largest and most influential is the International Corpus of Learner English (ICLE) (Granger et al., 2002; Granger et al., 2009).

 Also, several corpora focus exclusively on Japanese learners of English (JLE). The Japanese EFL Learner Corpus (JEFLL) (Tono, 2007) includes essays written by Japanese junior high and high school students, the Nagoya Interlanguage Corpus of English (NICE) (Sugiura, 2007) includes those by Japanese college students, and the NICT JLE Corpus (Izumi, Uchimoto, & Isahara, 2004) collects transcribed data of speech utterances in the English oral proficiency interview (OPI) test.

  Each of the existing learner corpora has its own merits. However, when using them for what Granger (2002) calls contrastive interlanguage analysis (CIA), namely a comparison between native speakers (NS) and non-native speakers (NNS) or between NNSs with different L1s, we must be careful about how we interpret the findings, since differences in writing conditions might influence the language used in the essays (Sugiura, 2007). For example, if we compare an academic essay on the global economic depression written by an NS with a short casual essay about a summer vacation written by an NNS, what looks like an NS-NNS gap might actually be just a topic gap.





3. What is CEEAUS?

  The Corpus of English Essays Written by Asian University Students (CEEAUS) (Ishikawa, 2008; Ishikawa, forthcoming) is a new learner corpus compiled and released by the author. CEEAUS currently consists of four modules: the CEEJUS module, collecting essays written by Japanese university students; the CEECUS module, collecting those by Chinese university students; the CEENAS module, collecting those by English native speakers; and the CJEJUS module, collecting Japanese essays written by Japanese university students.

             

CEEAUS is intended to be used as a database for multi-layered contrastive interlanguage analysis (MCIA) (Ishikawa, 2010). With CEEAUS, you can compare (i) Japanese learners of English and English native speakers, (ii) Chinese learners of English and English native speakers, (iii) Japanese and Chinese learners of English, and (iv) English essays and Japanese essays written by Japanese native speakers.

  Unlike other existing learner corpora, CEEAUS has many unique features:

            
[1] Large-scale data collection
 CEEAUS has collected more than 200 thousand words in total. Though the volume of data from Chinese learners and English native speakers is relatively small, CEEJUS is one of the largest corpora ever compiled of Japanese learners of English at intermediate and advanced levels.
  

# of Words   CEEJUS   CEECUS   CEENAS   CJEJUS
Samples         770       92      146       50
Tokens       169654    20367    37173    15344
Types          4800     1818     3797     1653
Lemmas         3602     1472     2884    -----
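
 As a rough illustration of how the figures listed above can be used, the following sketch computes the type/token ratio (TTR) and Guiraud's index (types divided by the square root of tokens) for each module. Neither measure is mentioned in this release note; they are shown here only as common lexical diversity measures. Because TTR falls as a corpus grows, values for modules of very different sizes are not directly comparable, and the CJEJUS figures refer to Japanese text.

```python
import math

# Figures copied from the word-count table: module -> (tokens, types)
COUNTS = {
    "CEEJUS": (169654, 4800),
    "CEECUS": (20367, 1818),
    "CEENAS": (37173, 3797),
    "CJEJUS": (15344, 1653),
}


def type_token_ratio(tokens: int, types: int) -> float:
    """Plain TTR: number of types divided by number of tokens."""
    return types / tokens


def guiraud_index(tokens: int, types: int) -> float:
    """Guiraud's index: types divided by the square root of tokens."""
    return types / math.sqrt(tokens)


total_tokens = sum(tokens for tokens, _ in COUNTS.values())

for module, (tokens, types) in COUNTS.items():
    ttr = type_token_ratio(tokens, types)
    guiraud = guiraud_index(tokens, types)
    print(f"{module}: TTR={ttr:.4f}  Guiraud={guiraud:.2f}")

print("Total tokens:", total_tokens)  # 242538, i.e. more than 200 thousand words
```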


[2] Strict control of writing conditions
 As mentioned before, control of writing conditions is vital when compiling a learner corpus. In CEEAUS, the number of topics is restricted to two.

(A) “It is important for college students to have a part time job.”
(B) “Smoking should be completely banned at all the restaurants in the country.”

Writers must clearly show whether they agree or disagree with the statement and support their position with appropriate reasons and illustrative examples. Half the essays included in CEEAUS are written about topic A, and the remainder about topic B. In addition, the length of the essays (200 to 400 tokens) and the time allowed for writing (20 to 40 mins) are also controlled (Ishikawa, 2008). Dictionary use was not permitted.
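
 If you collect comparable essays yourself, the length condition above can be checked with a small script. The sketch below assumes that a token is a whitespace-separated string; the tokenization actually used when compiling CEEAUS is not described here, so counts may differ slightly. The names `count_tokens` and `within_length_limit` are illustrative.

```python
MIN_TOKENS = 200  # lower length limit stated for CEEAUS essays
MAX_TOKENS = 400  # upper length limit stated for CEEAUS essays


def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens (an assumed, simplified definition)."""
    return len(text.split())


def within_length_limit(text: str) -> bool:
    """Return True if the essay length falls within 200 to 400 tokens."""
    return MIN_TOKENS <= count_tokens(text) <= MAX_TOKENS
```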

[3] Demographic data collection
 The data balance is carefully considered in CEEJUS, the main module of CEEAUS. All the writers took the TOEIC® test or a mock TOEIC® test, and their essays are stratified into four L2 proficiency groups according to the actual or estimated test scores.

  Upper (700+; 9,008 tokens in total)
  Semi-upper (600+; 57,452 tokens)
  Middle (500+; 85,614 tokens)
  Lower (500-; 17,580 tokens)

Thus, CEEJUS can also be used to compare non-native speakers at varying levels of English proficiency.
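
 The stratification can be expressed as a simple lookup. The sketch below assumes that "700+" means a score of 700 or above and that "500-" means a score below 500; the exact treatment of boundary scores is not stated here. The function name `proficiency_band` is illustrative.

```python
# Token totals per proficiency group, copied from the list above
BAND_TOKENS = {
    "Upper": 9008,
    "Semi-upper": 57452,
    "Middle": 85614,
    "Lower": 17580,
}


def proficiency_band(score: int) -> str:
    """Map an actual or estimated TOEIC score to a CEEJUS proficiency group.

    Boundary handling (inclusive lower bounds) is an assumption.
    """
    if score >= 700:
        return "Upper"
    if score >= 600:
        return "Semi-upper"
    if score >= 500:
        return "Middle"
    return "Lower"


# The four groups together account for all 169654 CEEJUS tokens.
assert sum(BAND_TOKENS.values()) == 169654

for band, tokens in BAND_TOKENS.items():
    share = 100 * tokens / sum(BAND_TOKENS.values())
    print(f"{band}: {tokens} tokens ({share:.1f}%)")
```

 The token shares show that the Middle and Semi-upper groups dominate CEEJUS, so comparisons involving the Upper or Lower group rest on much less data.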

             
[Figure: The regression model for estimating the TOEIC® score from the mock test score]

 The demographic balance is also considered in CEENAS. Approximately 40% of the writers are American, and British and Australian writers account for 15% each. Age is also controlled: approximately 35% of the writers are in their 30s and 28% in their 20s.

       
[Figures: Nationality of the writers; Age of the writers]

[4] Tagging
 Some of the data in CEEAUS is annotated with the POS tags and semantic tags developed by Lancaster University, UK, which facilitates a closer analysis of the vocabulary and grammar used by native and non-native speakers.





4. Future of CEEAUS

 CEEAUS is now being enlarged into a newer and bigger corpus, ICNALE, the International Corpus Network of Asian Learners of English. We plan to collect data in the following countries and areas.

 Korea
 China
 Chinese Taipei
 Singapore
 Hong Kong
 Thailand
 Malaysia
 Indonesia

 We are now seeking research collaborators. If you live in one of the areas above and can help with data collection, please contact Dr. Ishikawa.





5. Using CEEAUS

 CEEAUS is freely distributed under a Creative Commons (CC) license. Please follow the CC guidelines when using CEEAUS for your study.



 

Shin'ichiro Ishikawa Laboratory, Kobe University (C)2006-2011