Suchir Salhan is a PhD Candidate in the Department of Computer Science & Technology at the University of Cambridge (Gonville & Caius College) researching Small Language Models and Cognitively-Inspired AI. He previously completed a BA and MEng in Computer Science & Linguistics at Gonville & Caius College, obtaining a “starred First” (Class I with Distinction) and a Distinction respectively.

Biography

 My interdisciplinary background in Computer Science, Cognitive Science, and Linguistics drives my interest in leveraging insights from human cognition to develop AI systems that are interpretable, fair, and equitable.

I have long been fascinated by the intersection of language and computation: how humans acquire natural language to communicate, learn, and reason despite the diversity of linguistic systems, and how we might build machines that can do the same. I arrived in Cambridge in 2020 to pursue a BA and MEng in Computer Science & Linguistics at Gonville & Caius College, where I earned a “starred First” and a Distinction.

During my time as an undergraduate, I explored code-switching with Dr Li Nguyen, worked on multimodal vision-language models with Prof Nigel Collier and Fangyu Liu (now at Google DeepMind), and completed a funded internship at the ALTA Institute. I probed models like CLIP to understand their semantic representations, experimented with Nearest Neighbour algorithms for Offline Imitation Learning, and investigated Explainable AI, Argumentation Mining, and Shortcut Learning in NLP. At the same time, my linguistic interests, mainly in typology and theoretical linguistics (syntactic theory, morphology, and phonology), gave me an appreciation of the deep diversity and structure of human language and inspired me to think about how AI might better reflect this complexity.

These experiences have shaped my current PhD work, where I aim to build AI systems that are both powerful and cognitively inspired, bridging insights from human language and computation. My Master's thesis focused on the BabyLM Shared Task, training Small Language Models with acquisition-inspired strategies on “cognitively plausible” corpora (e.g., child-directed speech) across several languages.

My PhD work now connects the BabyLM paradigm with the fast-moving Language Modelling ecosystem. While Large Language Models (LLMs) are increasingly used in high-stakes applications—such as assessing human performance—they often lack steerability, alignment, and interpretability. I work to address this by developing Cognitively-Inspired Small Language Models (SLMs). These SLMs can guide and calibrate LLM behavior in multi-agent environments, aligning AI outputs with user preferences and domain-specific tasks. By explicitly modeling underrepresented populations of speakers and learners, these models help make AI systems more equitable, robust, and human-aligned.

Research

Small Language Models: The viability of 'Small LMs' as a coherent research programme depends on addressing efficiency, acceleration, and architectural questions in pretraining.

  • Our group released PicoLM, the Cambridge Small Language Model & Learning Dynamics Framework, in March 2025 to investigate these research questions. Check out the YouTube video put together by Zeb Goriely: Introducing PicoLM | YouTube.
  • I have worked on dynamic tokenization and supported similar projects in the NLIP group and the L65 (Geometric Deep Learning) course on the MPhil ACS.

Cognitively-Inspired AI: The emergent capabilities of Transformers are the subject of a great deal of interpretability work; however, there is a clear mismatch between human language acquisition (which is data-efficient in many regards) and the data-hungriness of Transformers. I am particularly invested in research questions that draw on insights from language acquisition in the context of the BabyLM Shared Task, leading and contributing to teams on the Multimodal, Multilingual, and Interaction Tracks of the Shared Task.

Publications

Key publications: 

Key Themes: (i) Cognitively-Inspired Design, Interpretability and Evaluation (♣), (ii) Language Model Pretraining (✰), (iii) Multilinguality (✦), (iv) Tokenization (✿), (v) Alignment and Interaction (✒︎) and (vi) Cognitive Science and Linguistics (♦️).


Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies. 2024. Suchir Salhan, Richard Diehl-Martinez, Zebulon Goriely, Paula Buttery. BabyLM Shared Task (Paper Track), Conference on Computational Natural Language Learning (CoNLL). Poster presentation at EMNLP (Miami, FL, USA, November 2024). ♣ ✰

The Distribution of Phonemes across Languages: Chance, costs, and integration across linguistic tiers. 2025. Fermin Moscoso del Prado Martin, Suchir Salhan. 13th Conference on the Mental Lexicon. Invited keynote delivered by Fermin Moscoso del Prado Martin at McGill University, Montreal (June 2025). Slides. ✦ ♦️

ByteSpan: Information-Driven Subword Tokenisation. Zebulon Goriely, Suchir Salhan, Pietro Lesci, Julius Cheng, Paula Buttery. ICML 2025 Tokenization Workshop (TokShop). Poster presentation in Vancouver, Canada (August 2025).

Measuring Grammatical Diversity from Small Corpora: Derivational Entropy Rates, Mean Length of Utterances, and Annotation Invariance. Fermin Moscoso del Prado Martin, Suchir Salhan. ACL Main Conference (Poster). Presentation in Vienna, Austria (August 2025). Poster | Slides. ♦️

Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research. Richard Diehl-Martinez, David Demitri Africa, Yuval Weiss, Suchir Salhan, Ryan Daniels, Paula Buttery. EMNLP Systems Demonstration 2025. Presentation in Suzhou, China. Pico Website | Demo Video (YouTube) | HuggingFace. ✰

Teacher Demonstrations in a BabyLM’s Zone of Proximal Development for Contingent Multi-Turn Interaction. Suchir Salhan, Hongyi Gu, Donya Rooein, Diana Galvan-Sosa, Gabrielle Gaudeau, Andrew Caines, Zheng Yuan, Paula Buttery. BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ✒︎ ♣

What's the Best Sequence Length for BabyLM? Suchir Salhan, Richard Diehl-Martinez, Zebulon Goriely, Paula Buttery. BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ♣ ✰

BLiSS: Evaluating Bilingual Learner Competence in Second Language Small Language Models. Yuan Gao, Suchir Salhan, Andrew Caines, Paula Buttery, Weiwei Sun. BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ♣ ✦

Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling. Bianca-Mihaela Ganescu, Suchir Salhan, Andrew Caines, Paula Buttery (Supervised MPhil Advanced Computer Science Thesis). BabyLM Workshop, EMNLP 2025. Presentation in Suzhou, China. ♣ ✰

Theoretical Linguistics Constrains Hypothesis-Driven Causal Abstraction in Mechanistic Interpretability. 2025. Suchir Salhan, Konstantinos Voudouris. NeurIPS First Workshop on CogInterp: Interpreting Cognition in Deep Learning Models. Presentation in San Diego, California, USA. ♣

The Distribution of Phonemes across Languages: Chance, costs, and integration across linguistic tiers. 2026. Fermin Moscoso del Prado Martin, Suchir Salhan. 23rd Old-World Conference in Phonology (OCP23), Gonville & Caius College. (Accepted Oral). ♦️

Convergent Equilibria in Cross-Lingual Phoneme Surprisal Distributions: Statistical and Simulation-Based Analysis. 2026. Suchir Salhan, Fermin Moscoso del Prado Martin. 23rd Old-World Conference in Phonology (OCP23), Gonville & Caius College. (Accepted Oral). Abstract. ♦️


Other publications: 

On the Potential for Maximising Minimal Means in Transformer Language Models: A Dynamical Systems Perspective. Suchir Salhan. In Cambridge Occasional Papers in Linguistics, Department of Theoretical & Applied Linguistics, 2023. Paper | Slides. (Undergraduate Dissertation; presentation at SyntaxLab, February 2023, St John's College, Cambridge, organised by Dr Theresa Biberauer.)

Linguistics in the Age of Language Models: What Can Cognitively-Inspired Language Models Offer to Linguistic Theory? Suchir Salhan. In Cambridge Occasional Papers in Linguistics, Department of Theoretical & Applied Linguistics, 2025. Paper.


Invited Talks, Presentations and Posters: 

Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies. Suchir Salhan. Presentation at the Cambridge Language Sciences Symposium (November 2024); poster at the HumanCLAIM Workshop organised by Prof Lisa Beinborn in Göttingen, Germany (March 2025); accepted poster and demonstration at the Cambridge CHIA (Centre for Human-Inspired AI) Annual Conference (June 2025).

Human-Validated Grammar Profiles for Language Models. Tübingen, Germany (March 2025), at a workshop organised by Prof Detmar Meurers.

LLMs “off-the-shelf” or Pretrain-from-Scratch? Recalibrating Biases and Improving Transparency using Small-Scale Language Models. Suchir Salhan, Richard Diehl-Martinez, Zebulon Goriely, Andrew Caines, Paula Buttery. Learning & Human Intelligence Group, Department of Computer Science & Technology, 2024.

Bilingual Small Language Models as Cognitive Proxies for LLM Interaction and Calibration. Suchir Salhan. Learning & Human Intelligence Group, Department of Computer Science & Technology, 2025.

Engineering Small Language Models as Learner Models for LLM Interaction and Calibration. Suchir Salhan. ALTA Annual Review 2025. 

Teaching and Supervisions

Teaching: 

Guest Lecturer and Teaching Assistant

Supervisions

Machine Learning and Bayesian Inference (Part II, Computer Science Tripos)

Formal Models of Language (Part IB, Computer Science Tripos)

Artificial Intelligence (Part IB, Computer Science Tripos)

Probability (Part IA, Computer Science Tripos)

Li18 Computational Linguistics (Part IIA/IIB Linguistics Tripos)

College Supervisor for the Linguistics Tripos (Gonville & Caius College): Linguistic Theory (Part IIB) and Part I Linguistics.

College Examiner for Computer Science Tripos Mock Examinations (Gonville & Caius College)

 

Research supervision: 


MPhil ACS Project Supervisor for Bianca Ganescu with Dr Andrew Caines and Prof Paula Buttery.

Undergraduate Research Opportunity Programme (UROP) Supervisor. Shivan Arora and Ellie Polyakova Reed (Summer 2025). 

PicoLM Research Mentor for Google DeepMind Research Ready Programme, Summer 2025. Ali Kheirkhah.

Co-advised two MPhil module projects for Geometric Deep Learning (L65): (1) Dynamic Tokenisation with Dr Dobrik Georgiev, Dr Petar Veličković & Prof Pietro Liò, and (2) Attention Graph Interpretability with Chaitanya Joshi, Dr Petar Veličković & Prof Pietro Liò.

Co-advising and mentoring several independent Cambridge research projects (Jacy To, Andrzej Szablewski).

Other Professional Activities

Departmental Activity

Organiser and Host of the Natural Language & Information Processing Seminars (2024–present), Natural Language & Information Processing Group (CST). Organised 30+ departmental seminars with leading academics and industry researchers on Language Models, Computational Linguistics and Natural Language Processing. List of Organised Seminars.

University-Wide/Interdisciplinary Initiatives 

Language Sciences Annual Symposium 2025: Ambitions for language science in 2050. Poster Session Organiser for the 2025 Cambridge Language Sciences Symposium, with Sammy Weiss (MRC Cognition and Brain Sciences Unit) and Shanshan Hu (TAL). CLS 2025 Website.

23rd Old-World Conference in Phonology (OCP23). Member of Organising Committee. Gonville & Caius College (January 2026). OCP23 Website (Phonetic Laboratory, Department of Theoretical & Applied Linguistics). 

Reviewing and Service

Reviewer for BabyLM 2024.

ACL 2025 Emergency Reviewer.

Reviewer for the First Workshop on Large Language Model Memorization (L2M2) @ ACL 2025. L2M2 Proceedings.
