
Research Projects

Current projects

The effect of bilingualism on the linguistic and cognitive development of children with Autism Spectrum Disorders

Drasco Kascelan (Dept of Theoretical and Applied Linguistics), under the supervision of Dr Napoleon Katsos (DTAL) and Dr Jenny Gibson (Faculty of Education).

My research investigates aspects of cognition and pragmatics in bilingual children with autism spectrum conditions (ASC). While neurotypical bilinguals tend to show advantages in pragmatic competence, executive functions, and Theory of Mind, monolingual individuals with autism appear to be impaired in these areas. Families living in bilingual communities are generally discouraged from raising their children with ASC bilingually, mainly because of the perceived detrimental effects of bilingualism. However, the current literature lacks research on bilingual children with ASC, which makes it hard to establish what the real effects of bilingualism are on cognition and language development in this population. This study will therefore examine these aspects of language and cognitive development in order to give a clearer picture of bilingualism in relation to ASC.

The Linguistic and Cognitive Development of Bilingual Children with Attention Deficit Disorders

Curtis Sharma (Dept of Theoretical and Applied Linguistics), under the supervision of Dr Napoleon Katsos (DTAL) and Dr Jenny Gibson (Faculty of Education).

Broadly, my research examines the interaction between bilingualism and traits of Attention Deficit/Hyperactivity Disorder (ADHD) in primary school-aged children. On the one hand, I want to find out how this interaction affects aspects of language learning and use, such as pragmatics. On the other hand, ADHD has been linked to deficits in high-level executive functions (for example, attention, working memory, and inhibition), which in turn appear to be enhanced in bilinguals. I aim to examine whether enhanced executive functions have any impact on the traits of ADHD evident in bilingual children. The vast majority of the literature focuses on ADHD in monolinguals, with very little research investigating the intersection of bilingualism and ADHD. This investigation will not only add to our understanding of the area, but may also yield some benefit to individuals with ADHD.

Project suggestions

ALTA Institute


Automated error detection for EFL texts

Proposer: Ted Briscoe

Supervisors: Ted Briscoe (with Marek Rei and Zheng Yuan)

There has been a great deal of academic and commercial interest in detecting, and often also correcting, errors in texts produced by speakers of English as a further language (EFL). Most research has focussed on learning classifiers for article errors (*a information is good) and preposition errors (*We sat at the sunshine) from well-formed English text, because there is plenty of the latter and these are two common types of error (see Leacock et al. for a recent overview). More recently, there has been work on content-word errors (*big conversation → long?/important? conversation), agreement errors (*people definitely is angry), and sequential interacting errors (*the people is helping made the revolution → The people helped make the revolution / The people are helping to make the revolution?).

Two substantial error-annotated datasets of learner texts have been made publicly available and used as the basis for three shared-task competitions evaluating the performance of automated error detection and correction (Ng et al.).

All these datasets have been automatically tokenised, part-of-speech tagged and parsed, and include error coding and often other metadata such as the native language of the writer -- all represented in XML format.

Most approaches to error detection rely on contiguous contextual words to detect errors so tend to miss longer distance errors, e.g. those involving word order or agreement. Several researchers have tried to use parsers or parse probabilities to identify and correct errors (e.g. Ivanova and van Noord).

Another related way to tackle long-distance errors might be to use a dependency language model (DLM) built from dependency/grammatical relations to predict when a specific dependency has low probability and might be an error. The DLM could be count-based (Gubbins and Vlachos, 2013) or neural (e.g. an RNN or LSTM; Mirowski and Vlachos, 2015; Rei, 2015).
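As a rough illustration (the class name, smoothing scheme, and toy triples below are invented for the sketch), a count-based DLM might estimate P(dependent | head, relation) from parsed training text and flag dependencies whose probability falls below a threshold:

```python
from collections import Counter

class DependencyLM:
    """Count-based dependency language model: estimates
    P(dependent | head, relation) from (head, rel, dep) triples."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha          # add-alpha smoothing constant
        self.triple = Counter()     # (head, rel, dep) counts
        self.context = Counter()    # (head, rel) counts
        self.vocab = set()

    def train(self, triples):
        for head, rel, dep in triples:
            self.triple[(head, rel, dep)] += 1
            self.context[(head, rel)] += 1
            self.vocab.add(dep)

    def prob(self, head, rel, dep):
        v = max(len(self.vocab), 1)
        num = self.triple[(head, rel, dep)] + self.alpha
        den = self.context[(head, rel)] + self.alpha * v
        return num / den

    def flag_errors(self, triples, threshold=0.01):
        """Return dependencies whose probability falls below threshold."""
        return [t for t in triples if self.prob(*t) < threshold]

# Toy corpus of (head, relation, dependent) triples from parsed text.
train = [("sit", "prep", "in"), ("sit", "prep", "in"),
         ("sit", "prep", "on"), ("sit", "prep", "in")]
dlm = DependencyLM()
dlm.train(train)
# "sit --prep--> at" was never seen, so it gets a low probability.
flagged = dlm.flag_errors([("sit", "prep", "at")], threshold=0.05)
```

A real system would back off to POS tags or lemma classes when the lexical context is unseen, as in Gubbins and Vlachos (2013).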

The project can be undertaken in most programming languages and will utilise machine learning and language modelling toolkits. It would suit students taking modules L95 and L101.

Determining word difficulty in context

Proposer: Marek Rei

Supervisors: Marek Rei, Ronan Cummins and Ted Briscoe

Second language (L2) learners of English acquire a wider and more advanced vocabulary as their knowledge of and proficiency in the language increase. The Cambridge Advanced Learner's Dictionary (CALD) contains definitions of word senses that should be known at different levels of proficiency (as measured by the levels of the Common European Framework of Reference for Languages: Learning, Teaching, Assessment - CEFR). However, the coverage of the CALD dictionary is incomplete and many word senses are not classified (labelled) with any CEFR level. For example, given the following excerpt from the dictionary, the word austerity (in the following sense) might need to be classified at an appropriate CEFR level:

austerity (noun)
  a difficult economic situation caused by a government reducing the amount of money it spends: 
  People protested in the streets against austerity.
  The government today announced new austerity measures.

This project aims to build a classifier that uses the word definition and other information in the dictionary entry to correctly label its difficulty level. Some simple techniques might involve using the frequency of a word in a background corpus as a measure of difficulty. More sophisticated approaches to assigning a difficulty level to an unlabelled word might involve matching the context of the word to contexts of words whose CEFR level is known. Regardless, there is a lot of scope in this project to apply distributional semantics and neural embeddings in order to measure the relatedness between word senses and contexts. Given that the labels are graded on an ordinal scale, research could extend into applying ordinal regression techniques (learning to rank). The project can explore using existing machine learning libraries (support vector machines, decision trees, K-nearest neighbour algorithms), or implement a custom neural network model for this task. Appropriate evaluation will also be a key component of this project. The CALD dictionary is available within the ALTA group, so this is a project with many potential applications in the area of language learning.
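The simple frequency baseline can be sketched in a few lines. The band thresholds below are invented for illustration; real cut-offs would be tuned on the labelled portion of CALD:

```python
# Hypothetical frequency thresholds (occurrences per million tokens)
# for each CEFR band: more frequent words are assumed easier.
CEFR_BANDS = [(1000, "A1"), (300, "A2"), (100, "B1"),
              (30, "B2"), (10, "C1"), (0, "C2")]

def cefr_from_frequency(per_million):
    """Map a word's background-corpus frequency to a CEFR difficulty band."""
    for threshold, level in CEFR_BANDS:
        if per_million >= threshold:
            return level
    return "C2"

# A very common word vs. a rarer one (illustrative frequency figures).
easy = cefr_from_frequency(50000)   # e.g. a function word like "the"
hard = cefr_from_frequency(15)      # e.g. a rarer word like "austerity"
```

A context-matching or embedding-based classifier would replace the raw frequency with features derived from the definition text and example sentences, but could be evaluated against this baseline.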

Deep representation learning for automated text scoring

Proposer (and supervisor): Helen Yannakoudakis

Automatic text scoring (ATS) focuses on automatically analysing, assessing and scoring the quality of writing. The challenge of ATS is to identify textual features that correspond to different aspects of writing (e.g., grammar, style, vocabulary, coherence, and so on) and that are indicative of someone's writing competence under specific marking criteria.

Deep Artificial Neural Networks (ANNs) have recently attracted widespread attention as they have been shown to outperform alternative machine learning methods, such as kernel machines (Schölkopf et al., 1998 and Vapnik, 1995), in numerous applications (Schmidhuber, 2015). Deep network architectures can automatically learn features at multiple levels of abstraction, potentially capturing higher-level information that humans may not know how to encode in terms of textual features extracted from the input (Bengio, 2009).

This project aims to 1) investigate the use of Recurrent Neural Networks (RNNs) and 2) Long Short-Term Memory networks (LSTMs) (Le, 2015) for automated text scoring, with an extension to bidirectional architectures, time permitting. We will use Torch, a simple, easy-to-use and efficient Matlab-like environment for developing machine learning algorithms (Collobert et al.), to develop the networks. We will also use the pre-trained word2vec skip-gram model to initialise the networks (Mikolov et al., 2013). Finally, we will 3) compare the ANNs with an existing baseline approach to ATS that does not use representation learning (Yannakoudakis et al., 2011) (baseline source code will be provided).

Artificial generation of word choice errors

Proposer: Ekaterina Kochmar

Supervisors: Ekaterina Kochmar, Mariano Felice, Ted Briscoe

Error detection systems using machine learning techniques rely on the availability of a considerable amount of data. It has been shown that training classifiers on texts produced by non-native speakers with the real errors present in text is beneficial because the classifier is able to learn the error patterns and the corresponding probabilities from the data. However, learner error-annotated data is not always available and building annotated learner corpora is expensive. Moreover, the error class is underrepresented in learner data with over 90% of word usage being correct. A viable solution to this problem is to use artificial corpora which are cheaper to produce and can be tailored to the needs of the research. The key to successful artificial error generation is to produce data that mimics real learner errors as much as possible.

Most previous research has focused on function words and grammatical errors (Felice and Yuan, 2014; Rozovskaya and Roth, 2010a, 2010b; Foster and Andersen, 2009), whereas this project will look into artificial generation of errors in content words focusing on adjectives, nouns and verbs. Our preliminary experiments suggest that random error injection without taking the probabilities of the confusion patterns into account is not likely to improve error detection results. Instead, an error generation algorithm should make use of (a) the confusion patterns, and (b) their probabilities, both of which can be learned from the learner data available. The reliability of error patterns statistics can further be improved with the error inflation method (see Rozovskaya and Roth, 2012). Native language of a learner is another important factor influencing the (incorrect) word choice, and it can also be taken into account.
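As a hedged sketch of point (a) and (b) above (the confusion table and probabilities here are invented, not learned from a corpus), error injection might sample replacements for content words according to learned confusion probabilities, leaving most tokens correct:

```python
import random

# Confusion patterns with probabilities, as might be estimated from an
# error-annotated learner corpus (figures here are illustrative only).
CONFUSIONS = {
    "big": [("large", 0.05), ("great", 0.03)],
    "make": [("do", 0.08)],
}

def inject_errors(tokens, confusions, rng=None):
    """Replace content words with plausible confusions according to
    the pattern probabilities; most tokens remain correct, mirroring
    the >90% correct word usage seen in learner data."""
    rng = rng or random.Random(42)
    out = []
    for tok in tokens:
        for wrong, p in confusions.get(tok, []):
            if rng.random() < p:
                out.append(wrong)   # inject this confusion
                break
        else:
            out.append(tok)         # keep the original token
    return out

sentence = "they make a big conversation about politics".split()
corrupted = inject_errors(sentence, CONFUSIONS)
```

Error inflation (Rozovskaya and Roth, 2012) would simply scale the probabilities upwards, and L1-specific tables would condition the confusion sets on the writer's native language.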

This project can be undertaken in any programming language and will utilise machine learning, language modelling and statistical machine translation toolkits (such as Mallet, MegaM, SVMlight, Weka, Moses, SRILM, etc.). It will suit students taking modules L90, L95 and L101.

Reverse dictionary search using neural network embeddings for vocabulary acquisition

Proposer: Ekaterina Kochmar

Supervisors: Ekaterina Kochmar, Felix Hill, Ted Briscoe

This project will aim to combine the recent research in reverse dictionary search using neural network embeddings (Hill et al., 2015; Zock and Bilac, 2004) with learner studies aimed at vocabulary acquisition.

Distributional semantics models have been successful in capturing the meaning of individual words, but representation of phrases and sentences has proved to be much harder. Recent advances in neural network embedding models helped alleviate this problem and represent the meaning of word sequences of arbitrary length, linking word level semantics to that of longer phrases and sentences.

One interesting and useful application of these models is reverse dictionary search (Zock and Bilac, 2004), which aims to help language producers (writers/speakers, of which language learners are a perfect example) to find appropriate words to express the idea or concept they have in mind. Hill et al. (2015) have recently shown that neural network embeddings can be used for this purpose and can help link dictionary definitions to words.

English vocabulary represents a challenge for language learners both during language synthesis (generation) and language analysis (comprehension). Certain words may not be familiar to language learners at certain levels of proficiency (e.g., beginner, intermediate) even if they correspond to other words the learners already know at that point: for example, learners at the beginner level may be unfamiliar with the word notion, but know the word idea, which is quite close in meaning to the former. In that case, a vocabulary acquisition tool can link the words to each other and help learners expand their vocabulary through the words they already know. The study by Bergsma and Yarowsky (2012) also suggests that words which may seem to be of similar readability/difficulty to a native English speaker will in fact be of different levels of readability for non-native speakers. This, among other factors, also depends on their first language (L1): for example, Bergsma and Yarowsky (2012) show that the words propose and terminology are more familiar to Chinese speakers than the words claim and notation.

This project will look into how tools aimed at vocabulary acquisition facilitation can be implemented. We will use the Cambridge Advanced Learner's Dictionary, which assigns readability levels to word entries, and map words with similar meaning and of different readability levels via their definitions using reverse dictionary search. A vocabulary acquisition tool can then be used to facilitate reading (e.g., by highlighting the words deemed to be difficult for learners at a particular level and providing a list of similar words at lower difficulty levels) as well as writing (e.g., by suggesting words that express writers' ideas). 
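A minimal form of the definition-to-word mapping can be sketched with a bag-of-words baseline: average the embeddings of the definition's words and rank candidate headwords by cosine similarity. The tiny three-dimensional embedding table below is invented for illustration; a real system would use pre-trained word2vec vectors and an RNN composition of the definition, as in Hill et al. (2015):

```python
import math

# Toy embedding table (invented); real systems use word2vec vectors.
EMB = {
    "idea":    [0.9, 0.1, 0.0],
    "notion":  [0.85, 0.15, 0.05],
    "concept": [0.8, 0.2, 0.1],
    "banana":  [0.0, 0.1, 0.9],
    "a":       [0.1, 0.1, 0.1],
    "belief":  [0.7, 0.3, 0.0],
    "or":      [0.1, 0.1, 0.1],
}

def avg(vectors):
    """Elementwise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def reverse_lookup(definition, candidates):
    """Rank candidate headwords by similarity of their embedding to
    the averaged embedding of the definition's words."""
    query = avg([EMB[w] for w in definition.split() if w in EMB])
    return sorted(candidates, key=lambda w: cosine(query, EMB[w]),
                  reverse=True)

# "a belief or idea" should retrieve notion/concept ahead of banana.
ranking = reverse_lookup("a belief or idea", ["banana", "notion", "concept"])
```

Attaching CALD readability levels to the candidate list would then let the tool propose only words at or below the learner's level.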

This project can be undertaken in any programming language and will utilise machine learning and language modelling toolkits (e.g., word2vec). It will suit students taking modules L101, R222 and L42 (Machine Learning and Algorithms for Data Mining, though not directly on NLP, might be useful for understanding ML algorithms).



Cambridge Language Sciences is a virtual centre for language researchers at the University of Cambridge. 

Our mission is

  • to connect a diverse research community
  • to create increased opportunities for interdisciplinary collaboration
  • to advance knowledge through the cross-fertilisation of ideas
  • to develop external partnerships
  • to equip the next generation of researchers with the knowledge, experience and skills for interdisciplinary research.

