Resources

Resources and tools for language researchers

The following language sciences-related tools and resources are recommended or developed by Cambridge Language Sciences network members. All resources are freely accessible and available online by following the links. The resources are currently listed in alphabetical order. To search the list you can use ‘Ctrl F’ on your keyboard and type in the relevant keywords.

This list is designed to help researchers in the language sciences share resources across disciplines. We very much welcome further contributions. If you would like to suggest additional tools and resources, please use the link below.

RECOMMEND ADDITIONAL RESOURCES

Pico Small Language Model

Contributor(s): Paula Buttery, Richard Diehl Martinez, Guy Emerson

Pico is a multi-purpose language modelling framework designed to support AI specialists in developing small, well-founded, performant language models. These models require fewer computing resources and can be trained on small, confidential, or bespoke datasets.

CEPOC: The Cambridge Exams Publishing Open Cloze dataset

Contributor(s): Paula Buttery

CEPOC is the first dataset of open cloze tests for learners of English at different CEFR levels. The tests in CEPOC have been designed and calibrated following strict procedures and are part of preparation materials for well-known English proficiency examinations. CEPOC is free to use for research purposes. For a description of the dataset see: LREC 2022 conference proceedings (Felice, Taslimipoor, Andersen & Buttery, 2022).

DeliData (Deliberation Enhancing Data)

Contributor(s): Andreas Vlachos

DeliData, a dataset for deliberation in multi-party problem solving, is the first publicly available corpus of small-group problem-solving dialogues. The aim of the corpus is to facilitate the development of dialogue systems interacting with a group of humans solving a task. The corpus was developed by Andreas Vlachos (PI) & Georgi Karadzhov (Dept. of Computer Science & Technology) with Tom Stafford (Dept. of Psychology, University of Sheffield), funded by the Cambridge Language Sciences Incubator Fund.

DELPH-IN web demo

Contributor(s): Ann Copestake, Guy Emerson

An online demo for DELPH-IN grammars, where you can input a sentence and view syntactic and semantic analyses. Grammars are available for English (ERG), Japanese (Jacy), Mandarin Chinese (Zhong), and Indonesian (Indra). Syntactic analyses are displayed as parse trees. Semantic analyses can be displayed either as logical forms (MRS) or as dependency graphs (DMRS). DELPH-IN grammars and DELPH-IN software are all open source.

DIALLS Dialogue Progression Tool

Contributor(s): Fiona Maine

DIALLS is a three-year European project which has focused on teaching children in primary and secondary schools the dialogue skills needed to engage together with tolerance, empathy and inclusion. The dialogue progression tool was developed as part of this project to support assessment and planning for teachers.

DIALLS multi-lingual corpus of classroom data

Contributor(s): Fiona Maine

This is an open-access multilingual corpus of classroom data from seven countries where classes were recorded engaging in a programme of lessons for cultural literacy learning in pre-primary, primary and secondary classrooms. The corpus was developed as part of DIALLS, a three-year European project which has focused on teaching children in primary and secondary schools the dialogue skills needed to engage together with tolerance, empathy and inclusion.

EFCAMDAT Cleaned Subcorpus

Contributor(s): Itamar Shatz

The EFCAMDAT Cleaned Subcorpus contains texts written by English learners as part of an online English program (originally available in the EFCAMDAT). The two samples in the subcorpus have been post-processed to remove many text formatting artifacts, non-English texts, duplicate scripts, and scripts of minimal length from the original data. Additionally, all scripts produced under different task prompts within certain single units have been automatically recategorized according to their specific topics. Overall, the subcorpus contains ~723,000 texts written by learners from 11 nationalities, at varying levels of proficiency (A1-C1 on the CEFR scale, 1-15 on the EF scale). The data is available on the EFCAMDAT site under the 'Resources' tab.

English Language Online Resource

Contributor(s): Laura Wright

Written by Dr Laura Wright. Developed by the Language Centre for the Faculty of English.

English Vocabulary Profile and English Grammar Profile

Contributor(s): Ben Knight

These are two resources, both accessible through this site. They are useful for people researching the learning of English as an additional language because they give an indication of the difficulty of vocabulary items or grammar points. 'Difficulty' is related to the Common European Framework of Reference for Languages, an international standard across all languages. These resources can be vital for investigating the impact of different learning conditions or interventions on language learning success.

Interactive Atlas of Romance Intonation

Contributor(s): Brechtje Post

Presents audio and video materials for the study of intonation in different Romance languages. The materials comprise utterances representing different sentence types, as well as conversations and interviews. They are accessible by means of interactive maps of Europe and the Americas. In addition, the Atlas offers a selection of resources available online about the intonation of Romance languages.

Cambridge Global Humanities: multilingualism

Contributor(s): Ioanna Sitaridou

The Global Humanities Initiative aims to transform the way we teach, research and think about the Humanities by bringing different cultures and perspectives into dialogue with each other to create a more diverse and globally aware research agenda and curriculum. The website includes information about related projects and a link to the Global Humanities Network site.

MRC Cognition & Brain Sciences Unit (CBU) Methods Days

Contributor(s): Olaf Hauk

Recordings of the MRC CBU Methods Days are available at the following links:

Methods Day 2020
Methods Day 2019
Methods Day 2018
Methods Day 2017

Speak and Improve

Contributor(s): Cambridge Assessment English

Speak & Improve is a free tool for learners of English that marks speaking in seconds, giving an accurate grade on the internationally-recognised CEFR scale.

Surayt-Aramaic Online

Contributor(s): Naures Atto

This is a unique online language course in Surayt-Aramaic, which is listed as a 'severely endangered' language by UNESCO. The course is provided at beginner (A1-A2) and intermediate (B) levels. A reader/digital corpus has been prepared to support C-level learners. Two mobile applications have also been developed to support the online course. The course is open source. Its methodology can serve as a model for other endangered languages, which it aims to empower with modern pedagogical learning technologies and methods.

The Cambridge Oracy Skills Framework

Contributor(s): Neil Mercer

The Oracy Skills Framework (OSF) specifies the various skills people need to develop to deal with a range of different talk situations. The framework has been developed by drawing on available existing resources and research, and in consultation with a range of experts. The OSF is designed to help researchers, policy makers, school leaders, teachers and pupils understand the physical, linguistic, cognitive and social/emotional skills that enable successful and effective spoken communication. It is based on research carried out in the Cambridge Faculty of Education, funded by the Education Endowment Foundation (EEF) and carried out in collaboration with the charity Voice 21. It has already been used extensively in the UK and abroad.

The CrowdED Corpus

Contributor(s): Andrew Caines

Crowdsourced speech corpus of English by native speakers and German/English by bilinguals answering business-topic questions of the type found in language learning oral exams. Contains sound files and annotated transcriptions. Reported in the proceedings of the Language Resources & Evaluation Conference 2016. Funded by Crowdee and CrowdFlower. In 2020, corrected transcriptions and grammatical error annotations were added for a subset of 383 of the English recordings. This work was reported in the proceedings of COLING 2020. Supported by Cambridge Language Sciences Incubator Fund, the Isaac Newton Trust, and Cambridge Assessment, University of Cambridge.

The ERRor ANnotation Toolkit (ERRANT)

Contributor(s): Christopher Bryant, Mariano Felice

The main aim of ERRANT is to automatically annotate parallel English sentences with error type information. Specifically, given a source and target sentence pair, ERRANT will extract the edits that transform the former to the latter and classify them. It was developed for Grammatical Error Correction, but can be applied to any sequence of parallel text.
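
As a rough illustration of the idea of extracting and labelling edits between a source and a target sentence, the sketch below uses only Python's standard difflib module. It is not ERRANT's code or API: the function name extract_edits is hypothetical, the alignment is a plain token diff rather than ERRANT's own alignment, and only the coarse operation labels M (missing), U (unnecessary) and R (replacement) are assigned, with no linguistic error types.

```python
from difflib import SequenceMatcher


def extract_edits(source, target):
    """Return token-level edits that turn `source` into `target`.

    Each edit is a tuple (label, start, end, source_tokens, target_tokens),
    where start/end are token offsets into the whitespace-tokenised source.
    Hypothetical helper for illustration; not part of ERRANT.
    """
    src, tgt = source.split(), target.split()
    # Map difflib opcode names onto coarse operation labels:
    # R = replacement, M = missing (inserted in target), U = unnecessary (deleted).
    labels = {"replace": "R", "insert": "M", "delete": "U"}
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged tokens are not edits
        edits.append((labels[tag], i1, i2, " ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return edits


for edit in extract_edits("This are a sentence .", "This is a sentence ."):
    print(edit)
```

For real annotation work, the toolkit itself should be used, since its alignment and classification rules are considerably more detailed than this token diff.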

The Gersum Project: Database

Contributor(s): Richard Dance

A fully annotated and searchable database of more than 900 words for which Old Norse etymological input has been claimed, found in a corpus of major late Middle English alliterative poetry.

The Teacher-Student Chatroom Corpus

Contributor(s): Andrew Caines

A collection of one-to-one written English lessons between qualified teachers and learners of English in an online chatroom. The work is described in a paper published at NLP4CALL 2020 and the data is available by application here. Supported by Cambridge Language Sciences Incubator Fund, the Isaac Newton Trust, Cambridge Assessment and University of Cambridge.

We Speak Multi

Contributor(s): Elspeth Wilson

This resource is for multilingual families and expectant parents, and for practitioners working with them, especially antenatal teachers. It helps them to think through expectations and practices about speaking more than one language in the home and community, with activity suggestions and links to further evidence-based resources. It was developed by the Cambridge Bilingualism Network.

Webinar on conducting psychology experiments remotely

Contributor(s): Vicky Leong

A webinar on remotely conducting psychology experiments that would usually be lab-based.

Write and Improve