Acquiring Reading Skills in a Foreign Language in a Multilingual, Corpus-Based Environment
Dragoş Ciobanu, Anthony Hartley, Serge Sharoff
Centre for Translation Studies
University of Leeds
Leeds LS2 9JT, UK
{smldc, a.hartley, s.sharoff}@leeds.ac.uk
Abstract
Despite several attempts to create an effective methodology to assist adult learners in acquiring reading skills and a working knowledge of a foreign language (L3) that is typologically similar to, or cognate with, a second language (L2) they already have some knowledge of, results have so far not matched expectations. The deliverables of projects in this field are often difficult to port to other learning environments, and their implementation often involves many hours of work from a diverse team of experts in linguistics, pedagogy and computing.
At the Leeds University Centre for Translation Studies we have developed a methodology that combines the use of multilingual, comparable ad-hoc corpora, processed with the latest natural language processing (NLP) tools, with our own research into the automatic addition of more complex annotations that can be easily used in a variety of ways to support the rapid acquisition of reading skills in L3.
We also present the initial results of a practical study conducted using an environment we have built according to our own specifications. We argue that our methodology can be adapted and implemented with minimum effort in the case of other combinations of L1, L2 and L3 and that our target users not only acquire reading skills in L3, but also strengthen their command of L2.
1 Introduction
During the relatively short life of computer assisted language learning (CALL) research to date, a certain number of issues have been constantly addressed. Among these, the most popular are the need to keep searching for new language teaching methodologies that build on current advances in language technologies, without ever losing sight of their initial educational goals (Barrière and Duquette, 2002; Borin, 2002); the need to extend experiments beyond the laboratory environment (Chapelle, 2004); the lack of appropriate resources to support effective real-life CALL classes (Grabe and Stoller, 2002: 21); the fact that language tutors, as well as students, often require extensive training in order to master CALL applications (Kol and Schcolnik, 2000); the lack of communication between computer specialists and language tutors (Borin, 2002) and the important effort – both financial and time-wise – that generally needs to be invested in developing CALL applications (Tomoda, 2005).
Our initial research highlighted important limitations regarding the usability of resources delivered at the end of large-scale European projects such as EuroComRom (Klein, Meißner et al., 2002), which targeted reading in the Romance languages, but used and produced paper-based materials and rather limited lists of words. Despite claims to the contrary, these did not, at least in the case of Romanian, provide users with enough knowledge to “read all the Romance languages right away”, as the motto of the EuroComRom project states.
Moreover, too little attention has been paid solely to the process of reading. The general approach is to observe it in conjunction with at least one of the other three processes: listening, writing and speaking. Consequently, we are not aware of any multilingual environment in which learners can benefit from an effective approach and methodology for acquiring only reading skills in an L3.
In fact, not only is research into reading in a third language (L3) still in its early stages – i.e. identifying common points between learning to read in L2 and L3 –, but researchers in the broader domain of language acquisition often state that the research world is still unclear about what reading really is. This leads to uncertainty as to what helps and what hinders the acquisition of reading skills (Hammadou, 2000; Plass, Chun et al., 2003).
However, it is generally accepted that “extensive word exposure is necessary in order to ensure a deep and solid embedding of new words in the mental lexicon” (Gamper and Knapp, 2001; Grabe and Stoller, 2002: 12) and that the “linguistic characteristics of target language input need to be made salient” (Gamper and Knapp, 2001). Furthermore, despite the fact that “... extensive reading” is deemed to “promote fluency, vocabulary, and background knowledge...” (Pressley cited in Grabe and Stoller, 2002: 91) and that “students learn to read by reading a lot” (Grabe and Stoller, 2002: 90), it seems that not enough effort goes into developing sound methodologies and useful text resources, and that “reading a lot is not the emphasis of most reading curricula” (ibid.).
We are aware that “for the non-native speaker, the invitation to read and write more in English is usually as welcome as a long-distance telephone bill” (Ward, 2004) and we believe the same applies to any non-native speaker invited to read in any unfamiliar language, all the more so when the language in question is completely new. Yet we trust that, with enough variety regarding the authentic resources involved in the process of reading, sufficient comprehensible annotations of these resources, a structured presentation according to several relevant criteria, and initial task-based exercises aimed at boosting the users’ confidence regarding their language skills, reading strategies and their familiarity with the learning environment, adult language learners will take on this challenge and perform well. Although the testing phase of our project has only just started, the initial results so far support our claims.
2 The project
Using ad-hoc trilingual comparable corpora for studying the acquisition of reading skills in a foreign language (L3) represents an original approach to language teaching and learning. In our review of the state of the art in both L3 teaching methodology and CALL applications targeting L3 learners, we have found no references to studies dealing with this subject. The same is equally true for the field of research into second language (L2) acquisition, so our methodology is likely to benefit a wider audience than initially expected.
We also aim to address the issue of scalability, which most of the available CALL applications avoid, and the experiments conducted to date use more authentic materials than many language learning projects. In short, we are developing and testing a methodology that employs NLP tools to help adult learners acquire reading skills in a foreign language (L3, in our case Romanian) using comparable corpora in their first language (L1, English) and in another language they know to some extent (L2, French). L2 is typologically similar to, or cognate with, L3; that is, L2 and L3 belong to the same language family.
2.1.1 The target audience
Identifying the target audience and its needs was the first step in the development of the project. Apart from a wide range of individuals who are interested both in new information and in the way in which events they are partly familiar with are presented in other parts of the world, there is also an important body of professionals who would benefit from becoming able to read in a foreign language. Translators are among the first that come to mind: learning quickly to read in a previously unknown language can significantly improve their marketability, as well as help them gain a clearer, more accurate picture of the situation in various regions of the world. Moreover, there is an ever-growing body of academics looking for new sources of information in their field of research. Given access to a methodology that can effectively help them acquire reading skills in foreign languages, they would be able to overcome the language barrier which often keeps academic work at a standstill until relevant research has been translated into one of the few popular languages, namely English or French.
Finally, we chose adult learners who are familiar with the process of learning a second language, mainly because they are likely to make full use of our morphological and semantic annotations, since they “have access to two linguistic systems when acquiring a third language” (Cenoz, 2003) and are also likely to have developed some reading strategies for their L1 and L2.
2.1.2 The methodology
We assembled ad-hoc trilingual comparable corpora of news items available online. We collected 132 articles in Romanian, 100 in French and 105 in English, amounting to individual corpus wordcounts of approximately 50k, 85k and 65k words respectively. During the corpora-building process we had to discard an initial larger set of corpora put together from similarly easily-accessible sources because the Romanian component had been published without diacritics. At the time we believed that we could automatically restore these diacritics, but the results proved insufficiently reliable.
Since the outset of this project, we have aimed to present users with as much useful information as possible in Romanian, French and English. We wanted to avoid the pitfalls of the majority of CALL applications, which, instead of exposing learners to large amounts of authentic linguistic materials, offer them smaller quantities of artificial resources enriched with labour-intensive multimedia annotations.
Consequently, we developed robust Perl scripts that enabled the automatic annotation of our corpora with a wide range of significant information: part-of-speech (POS) tags and lemmas, identified at the Romanian Academy Centre for Artificial Intelligence (RACAI) in Bucharest in the case of the Romanian corpus and at Leeds using TreeTagger in the case of the English and French corpora; synonyms, translation equivalents, definitions and related words, obtained using the aligned English and Romanian WordNets, also provided by RACAI, and a freely available list of cognates between English and French; and the relative frequencies of words in any one article with respect to the entire corpus.
A string-similarity script identifies structurally similar words between Romanian and French and between Romanian and English. A subsequent study involving 100 random Romanian words and their automatically identified structurally similar French and English tokens indicated that in 65% of cases these were cognates, so we decided to include these data in our annotations and then test how useful the readers would find them and how they would cope with the misleading information also presented.
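The paper does not say which string-similarity measure the script uses. As a minimal sketch of the general idea only, the following uses the ratio from Python's standard `difflib.SequenceMatcher`; the function names and the 0.75 cut-off are hypothetical choices made for illustration, not values taken from the project.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two lower-cased word forms."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similar_words(word, candidates, threshold=0.75):
    """Return the candidates whose similarity to `word` meets the threshold, best first.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    scored = [(similarity(word, c), c) for c in candidates]
    return [c for score, c in sorted(scored, reverse=True) if score >= threshold]


if __name__ == "__main__":
    # Romanian word compared against a small, hypothetical French word list.
    print(similar_words("universitate", ["université", "voiture"]))
```

A purely surface-based measure of this kind cannot tell true cognates from accidental look-alikes, which is consistent with the observation above that only some of the automatically identified pairs were cognates.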
We also built our own multilingual concordancer because we were unable to find one that would work with Romanian diacritics in the first place and also because we wanted to capitalize on our extensive annotations.
This way, from a simple HTML file saved locally (Figure 1), we developed the annotations for our corpora in order to go beyond POS tagging and lemmatization (Figure 2) and capture more complex information (Figure 3).
Figure 1: HTML page in Romanian saved locally
Figure 2: Text extracted from the HTML file, POS tagged and lemmatised
Figure 3: Our full token annotation
With these resources, we were able to write scripts that identify related articles across all project languages: English (L1), French (L2) and Romanian (L3). Using 4.5 as the lower threshold for tf.idf scores, we identified key words for each L3 article.
The formula we used was:
$\mathrm{tf.idf}(i,j) = (1 + \log(\mathrm{tf}_{i,j})) \cdot \log(N / \mathrm{df}_i)$, where:
• $\mathrm{tf}_{i,j}$ = number of occurrences of word $w_i$ in document $d_j$
• $\mathrm{df}_i$ = number of documents in the corpus in which the word $w_i$ occurs
• $N$ = total number of documents in the corpus (Jones, 1972; Babych and Hartley, 2004)
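The key-word selection step can be sketched as follows. This is an illustration under stated assumptions, not the project's Perl code: the base of the logarithm is not given in the text, so the natural logarithm is assumed here, and the function names and the input format (a dictionary from article identifier to token list) are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Compute (1 + log tf) * log(N / df) for every word of every document.

    docs: dict mapping a document id to its list of tokens.
    Natural logarithms are an assumption; the paper does not state the base.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each word once per document
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        scores[doc_id] = {
            word: (1 + math.log(count)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        }
    return scores


def key_words(scores, threshold=4.5):
    """Keep the words whose score matches or exceeds the threshold."""
    return {
        doc_id: {word for word, score in word_scores.items() if score >= threshold}
        for doc_id, word_scores in scores.items()
    }
```

Note that a word occurring in every document receives a score of zero, whatever its frequency, so function words never become key words.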
We then compared them with those of other L3 articles in order to choose pairs with significant overlap. We used the formula
$\mathrm{REL} = (R1_{imp} + R2_{imp}) / R_{com}$, where:
• $R1_{imp}$ represents the total number of important words for the L3 article R1;
• $R2_{imp}$ represents the total number of important words for the L3 article R2;
• $R_{com}$ represents the number of words common to the group of important words for R1 and the group for R2.
Our subsequent empirical study indicated that 50 was an acceptable uppermost REL score threshold for identifying pairs of L3 related articles. By ‘acceptable’, we mean that the Romanian native speaker involved in the project checked random pairs and generally found them to have related content.
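A small sketch of the pairing step, reading the formula as the sum of the two key-word counts divided by the number of shared key words (the grouping is an assumption about the intended layout of the formula). The function names, the treatment of pairs with no shared key words, and the use of "less than or equal to" for the threshold of 50 are likewise illustrative assumptions.

```python
from itertools import combinations


def rel_score(keys_a, keys_b):
    """REL = (|keys_a| + |keys_b|) / |keys_a & keys_b|; lower means more related."""
    common = len(keys_a & keys_b)
    if common == 0:
        return float("inf")  # assumption: no shared key words means unrelated
    return (len(keys_a) + len(keys_b)) / common


def related_pairs(key_words, max_rel=50.0):
    """Return (article_a, article_b, score) for every pair at or below max_rel."""
    pairs = []
    for a, b in combinations(sorted(key_words), 2):
        score = rel_score(key_words[a], key_words[b])
        if score <= max_rel:
            pairs.append((a, b, score))
    return sorted(pairs, key=lambda pair: pair[2])
```

Under this reading, two articles with identical key-word sets score 2, the lowest possible value, and the score grows as the overlap shrinks.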
Every time the tf.idf score of an L3 word matched or exceeded the 4.5 threshold, we also looked to see what linguistic information in L1 or L2 we could store related to that word in that article, as follows: we stored the available L1 equivalents of the L3 word, as well as the L2 equivalents, related and structurally similar words - identified using the English and Romanian WordNets together with the list of English-French cognates.
Then, in order to find related articles between Romanian-French (L3-L2) and Romanian-English (L3-L1), we identified (using 4.5 as the bottom threshold for tf.idf scores again) the key words for each article in L1 and L2; we compared the resulting groups with the groups of L1 and L2 words stored for each L3 important word in each L3 article; we chose only the pairs of articles that scored below 75 when applying the formula
$\mathrm{REL}' = (R1_{imp} + R2_{imp}) / R_{com}$, where:
• $R1_{imp}$ represents the total number of important words for the L3 article R1;
• $R2_{imp}$ represents the total number of important words for the L1 or L2 article R2;
• $R_{com}$ represents the number of words common to the group of important words for R1 and the group for R2.
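The cross-lingual comparison can be sketched in the same way. The text does not spell out how common words are counted across languages; the sketch below assumes that an L3 key word counts as common when at least one of its stored L1 or L2 words is a key word of the other article, and it reads the formula as the sum of the two counts divided by the number of common words. All names and the data layout are hypothetical.

```python
def rel_prime(l3_key_equivalents, other_key_words):
    """REL' between an L3 article and an L1 or L2 article.

    l3_key_equivalents: dict mapping each L3 key word to the set of L1/L2
        words stored for it (equivalents, related and similar words).
    other_key_words: set of key words of the L1 or L2 article.
    """
    common = sum(
        1 for stored in l3_key_equivalents.values() if stored & other_key_words
    )
    if common == 0:
        return float("inf")  # assumption: no shared words means unrelated
    return (len(l3_key_equivalents) + len(other_key_words)) / common


def is_related(l3_key_equivalents, other_key_words, max_rel=75.0):
    """Keep only pairs that score below the threshold of 75."""
    return rel_prime(l3_key_equivalents, other_key_words) < max_rel
```

Because the comparison depends on how many L3 key words have stored L1 or L2 words at all, gaps in the annotations directly reduce the number of common words that can be found.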
The empirical study that followed found that the reliability of the automatic matching was lower between L3-L1 and L3-L2 than within L3. However, we decided to use the results because we appreciated that, even though some of the matched articles did not have in fact directly related content, the overall learning environment could compensate for this drawback, as users would still be exposed to authentic language and they would still be able to perform complex multilingual concordances, see tokens in context and improve their command of the lexis and morphology of L3, L2 and even L1.
In generating satisfactorily accurate clusters of related articles, our experiment challenged the view that materials used in CALL applications are too small to provide a basis for statistical analysis (Barrière and Duquette, 2002).
The hyperlinks to related articles enable users to read several similar texts and use the background knowledge they gradually accumulate to assist them in the process of checking their own hypotheses about the Romanian lexis, morphology and syntax. This way, we acknowledge and implement the findings of many studies regarding the important role of background knowledge (Hammadou, 2000; Barrière and Duquette, 2002; Sun, 2003; Ariew and Ercetin, 2004) in the design of an environment that our students have so far found user-friendly and intuitive. We improve on current research and CALL applications by suggesting related articles in all three languages, thus enabling users to familiarize themselves with a particular topic in their mother tongue and then read French and Romanian articles in order to find information and improve their command of these languages.
In order to cover a wider range of preferences, we wrote scripts that grouped our Romanian articles into clusters according to four criteria: shared key words (and thus very often a common theme), article length, average sentence length and degree of internal repetition at the article level.
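Three of these criteria are simple surface measures and can be sketched as below. The exact definitions used in the project are not given in the text, so this sketch makes its own assumptions: sentences are split naively on final punctuation, length is counted in word tokens, and internal repetition is taken to be one minus the type/token ratio. The function name is hypothetical.

```python
import re


def article_features(text):
    """Return length, average sentence length and a repetition measure for a text.

    All three definitions are illustrative assumptions, not the paper's own.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"\w+", text.lower())  # \w also matches letters with diacritics
    n_tokens = len(tokens)
    return {
        "length": n_tokens,
        "avg_sentence_length": n_tokens / len(sentences) if sentences else 0.0,
        "repetition": 1 - len(set(tokens)) / n_tokens if n_tokens else 0.0,
    }


if __name__ == "__main__":
    print(article_features("Ana are mere. Ana are pere."))
```

Articles could then be grouped by binning each feature, for example into short, medium and long texts.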
We are currently at the testing stage of our methodology. We have built an environment according to our specifications and we are investigating the strategies that adult learners employ in order to start reading in an unknown L3 in general and the extent to which learners find the comparable L2 corpora helpful in particular.
2.1.3 The learning environment
We have developed a two-window, web-based interface that presents users with extensive authentic primary texts, meaningful lexical and grammatical information, dynamic concordances and related texts in all the project languages. NLP techniques retrieve this information automatically and dynamically from the corpora.
Figure 4 represents the main window, in which the Romanian article to be read occupies the largest part of the screen. In the opposite corner, there is a concordance window that supports trilingual concordance searches and prints a confirmation message after each search.
Figure 4: Main course window
However, since researchers often state that guessing words from contexts is not the best way of acquiring vocabulary (Grabe and Stoller, 2002), our environment does much more than simply display the sentences in which the sought word appears, in the language selected by the user. We are capitalising on our comprehensive annotations in order to present the user with much more meaningful input than other currently available CALL applications.
First of all, directly above the concordance request frame, there is the Romanian information frame, where the user is able to view important information in Romanian about the target word. By scrolling down this frame, the user sees the token that was entered in the search box, its part of speech, lemma, available Romanian synonyms, definition and related words, as well as whether the token is part of the lists of the most frequent 500 and 2,000 words in the L3 corpus, and can decide whether it is worth remembering. Furthermore, given that Romanian is an inflected language, just like most other Romance languages, the user can view the other words in the corpus that share the same lemma with the target token, look at their POS tags and start assembling the puzzle represented by Romanian morphology.
The two frames under the article frame host Romanian and French concordance lines. All concordances represent sentences extracted dynamically from the corpus. Hovering with the mouse over each word in the concordance lines gives information regarding its POS. This way we are providing users with sufficient information and authentic language excerpts to start identifying language patterns in all three languages. Each concordance line ends with a hyperlink to the article the respective sentence was taken from, in case the reader requires even more context.
We also took into account the fact that users may not wish to read many concordance lines in order to check their hypotheses about L3. Consequently, the Romanian information frame ends with a list of the most frequent collocates to the right and left of the target word sorted according to the number of times they occur in our corpus. We then use this information to arrange and display the concordance lines according to these frequent collocates.
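The collocate list and the reordering of concordance lines can be sketched as follows. The window size is not stated in the text; this sketch assumes immediate left and right neighbours only, works on pre-tokenised sentences, and orders lines by the frequency of the right-hand collocate. All names are hypothetical.

```python
from collections import Counter


def collocates(sentences, target):
    """Count the immediate left and right neighbours of `target`.

    sentences: list of token lists. Returns (left_counts, right_counts).
    """
    left, right = Counter(), Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token == target:
                if i > 0:
                    left[tokens[i - 1]] += 1
                if i + 1 < len(tokens):
                    right[tokens[i + 1]] += 1
    return left, right


def sort_concordance(sentences, target):
    """Order the sentences containing `target` by how frequent their right collocate is."""
    _, right = collocates(sentences, target)

    def sort_key(tokens):
        i = tokens.index(target)  # first occurrence in the sentence
        following = tokens[i + 1] if i + 1 < len(tokens) else ""
        return -right[following]  # a missing collocate counts as frequency 0

    hits = [tokens for tokens in sentences if target in tokens]
    return sorted(hits, key=sort_key)  # stable: ties keep corpus order
```

Grouping lines by their most frequent collocates in this way puts recurring patterns next to each other, so a reader can spot them without scanning every line.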
If the target word has several POS tags in our corpus, we display the same number of records, each with all the useful information we can find for that instance. Moreover, if our annotations indicate the presence of English and French equivalents, related and structurally similar words, we print these in the secondary window (Figure 5) alongside the list of related articles, and then print concordances for the first word belonging to these groups that we can find in our corresponding corpora.
Figure 5: Secondary course window
Our concordancing scripts support a multilingual, multidirectional approach to language learning in that users can search for English and French words as well. In those cases, apart from extracting the sentences in which the target word appears in the relevant corpus, our scripts search the Romanian corpus in order to see whether there are any tokens that include the sought word among the foreign language equivalents, related or structurally similar words. If that is the case, a concordance is performed for that Romanian word, which means that readers are likely to receive comprehensible input in all three project languages regardless of the language the target word belongs to.
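The reverse lookup described here can be sketched as a search over the token annotations. The data layout below (a dictionary from Romanian token to sets of foreign-language words under the hypothetical keys `equivalents`, `related` and `similar`) is an assumption made for illustration; the project's actual annotation format is shown only in Figure 3.

```python
FIELDS = ("equivalents", "related", "similar")  # hypothetical annotation keys


def romanian_tokens_for(query, annotations):
    """Return the Romanian tokens whose annotations list `query`.

    annotations: dict mapping a Romanian token to a dict of sets of
        English or French words, keyed by the names in FIELDS.
    """
    query = query.lower()
    return sorted(
        token
        for token, info in annotations.items()
        if any(query in info.get(field, set()) for field in FIELDS)
    )


if __name__ == "__main__":
    sample = {
        "guvern": {"equivalents": {"government", "gouvernement"}},
        "lege": {"equivalents": {"law", "loi"}, "related": {"legal"}},
    }
    print(romanian_tokens_for("loi", sample))
```

Each Romanian token returned in this way would then be passed to the ordinary concordance search, giving the reader lines in all three languages.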
2.1.4 The practical study
Once our interface was created and debugged, we initiated the testing phase with the help of 10 volunteers enrolled in the MA in Applied Translation Studies course offered by the Leeds University Centre for Translation Studies. All our subjects are English native speakers who have some knowledge of French – ranging from intermediate to advanced levels.
So far the students have attended one 90-minute session in which they became familiar with the functionalities of the learning environment and noticed where they could find useful information in the two windows (Figure 4 and Figure 5). One of our aims is to produce a user-friendly interface. The fact that students were able to complete the exercise with the benefit of only a 10-minute introduction to the environment lends informal support to the hope that the interface is indeed intuitive.
The students had a three-page handout which asked them to carry out various tasks, ranging from the translation of an article’s title and subtitle, the summarisation of its main ideas, finding items of information in the text in answer to specific questions, to the identification of verbs and patterns for the formation of the Romanian past tense and the comparison of these patterns with those met while learning other languages. All the tasks were performed with the help of the concordancing tool, although this was not always an explicit requirement.
Moreover, they also needed to translate families of words that required them to have identified how Romanian nouns acquire the marks of plural and definiteness – this latter task was even more challenging as, unlike French or English, in which the definite article is a separate word that precedes the noun, in Romanian it is a morpheme which cannot form a word on its own and is generally found at the end of the noun (unless other clitics are required, in which case they are placed after the definite article). Finally, the students had to browse through the Romanian related articles and translate three of their titles, as well as go back to the welcome page, select a Romanian article from the clusters that we had automatically identified according to the four criteria mentioned above, read it and summarise its content.
Although we are now at a very early stage in the evaluation phase, several very encouraging signals were evident. First of all, the intuitiveness of the interface was appreciated. Secondly, the concordance tool was thought very useful. Our analysis of the user logs indicates that all of our subjects made frequent use of the concordance tool. This fact, coupled with the observation that everyone gave over 60% correct answers and that the majority scored over 90% correct answers, with several maximum scores, too, leads us to believe that our corpus annotations are comprehensive enough to enable users to start reading accurately in an unknown L3 immediately.
Regarding the errors in the students’ replies, most of them occurred in the case of the questions on morphology. The task was to identify the Romanian equivalents of three clusters of English words derived from three lemmas, and the target tokens were the singular and plural forms, with and without definite article. Apart from that, the sentence translation, text summarisation and text scanning tasks were generally performed correctly. The structural similarity between L2 and L3 or L1 and L3 tokens seldom caused errors: when required by a translation task, the majority of the students were able to correctly identify the Romanian conjunction “şi” as the equivalent of the English conjunction “and” and not of the French “si”.
Feedback from students indicated that they had found the information in the Romanian information frame useful; in several cases, even if the target token did not trigger any links with either English or French, there were synonyms of that word which were cognate with known words in L1 or L2 and they had a significant contribution towards the correct understanding of the article.
However, the user logs also indicated that it was mainly Romanian words that were entered in the concordance search box. It is our intention to diversify the tasks for the next user trials in order to make users even more comfortable with the learning environment and to encourage multilingual exploration.
2.2 Conclusions
Our study gives a first indication that building a motivating, intuitive and user-friendly learning environment based on an innovative methodology that combines the state of the art in natural language processing and second and third language acquisition research does not necessarily have to be an exhausting enterprise. We demonstrate that time-consuming multimedia annotations and content built by hand can be safely replaced, in the case of adult learners, with automatically computed linguistic information while maximising at the same time the accuracy of the reader’s comprehension of the foreign text.
We intend to follow and report on the influence of our methodology and learning environment on the subjects’ command of L2, as well as identify the types of tasks that are likely to have the best influence on it. We also aim to implement a simple, user-friendly feedback mechanism that could potentially replace the language tutor in self-learning scenarios.
3 References
Ariew, R. and G. Ercetin (2004). "Exploring the Potential of Hypermedia Annotations for Second Language Reading." Computer Assisted Language Learning 17(2): 237-259.
Babych, B. and A. Hartley (2004). Extending the BLEU MT Evaluation Method with Frequency Weightings. ACL.
Barrière, C. and L. Duquette (2002). "Cognitive-Based Model for the Development of a Reading Tool in FSL." Computer Assisted Language Learning 15(5): 469-481.
Borin, L. (2002). What have you done for me lately? The fickle alignment of NLP and CALL. EuroCALL, Finland.
Cenoz, J. (2003). "The additive effect of bilingualism on third language acquisition: A review." International Journal of Bilingualism 7(1): 71-87.
Chapelle, C. A. (2004). "Technology and second language learning: expanding methods and agendas." System 32(4): 593-601.
French-English Cognates - Vrais Amis. http://french.about.com/library/vocab/bl-vraisamis-a.htm
Gamper, J. and J. Knapp (2001). Adaptation in a Language Learning System. ABIS.
Grabe, W. and F. L. Stoller (2002). Teaching and Researching Reading, Longman.
Hammadou, J. (2000). "The Impact of Analogy and Content Knowledge on reading Comprehension: What Helps, What Hurts." The Modern Language Journal IV(84): 38-50.
Jones, S. (1972). "A statistical interpretation of term specificity and its application in retrieval." Journal of Documentation 28(1): 11-21.
Klein, G., F.-J. Meißner, et al., Eds. (2002). EuroComRom - The Seven Sieves (How to Read All the Romance Languages Right Away). Aachen, Editiones EuroCom.
Kol, S. and M. Schcolnik (2000). "Enhancing Screen reading Strategies." CALICO Journal 18(1): 67-81.
Plass, J. L., D. M. Chun, et al. (2003). "Cognitive load in reading a foreign language text with multimedia aids and the influence of verbal and spatial abilities." Computers in Human Behavior 19: 221-243.
Sun, Y. C. (2003). "Extensive reading online: an overview and evaluation." Journal of Computer Assisted Learning 19: 438-446.
Tomoda, T. (2005). "Developing Sakura - an interactive website for Japanese language learners." CALL-EJ Online 6(2).
Ward, J. M. (2004). "Blog Assisted Language Learning (BALL): Push button publishing for the pupils." TEFL Web Journal 3(1).
source: http://www.leeds.ac.uk/cts/research/publications/leeds-cts-2005-03-ciobanu-hartley-sharoff.pdf