Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization

Research output: Chapter in Book/Report/Conference proceeding > Article in proceedings > Research > peer-review

Documents

  • Fulltext: Final published version, 828 KB, PDF document

Multilingual language models exhibit better performance for some languages than for others (Singh et al., 2019), and many languages do not seem to benefit from multilingual sharing at all, presumably as a result of poor multilingual segmentation (Pyysalo et al., 2020). This work explores the idea of learning multilingual language models based on clustering of monolingual segments. We show significant improvements over standard multilingual segmentation and training across nine languages on a question answering task, both in a small model regime and for a model the size of BERT-base.
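The abstract describes building a shared multilingual vocabulary by clustering monolingual segments. A minimal sketch of one way such clustering could work is shown below: subword segments from several monolingual vocabularies are embedded, grouped with k-means, and segments in the same cluster share a single vocabulary entry. All names, embeddings, and the choice of k-means here are illustrative assumptions, not the authors' actual method.

```python
# Hypothetical sketch: cluster subword segments from monolingual vocabularies
# so that similar segments across languages share one vocabulary entry.
# The toy embeddings and segment names below are invented for illustration.
import numpy as np

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means; returns one cluster id per input vector."""
    rng = np.random.default_rng(seed)
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        # Distance of every vector to every center, then nearest-center labels.
        dists = np.linalg.norm(vectors[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = vectors[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return labels

# Toy segments from English, German, and Danish monolingual vocabularies,
# with 2-d stand-in embeddings (a real system would use learned embeddings).
segments = ["##ing_en", "##ung_de", "hus_da", "house_en", "haus_de"]
emb = np.array([[0.90, 0.10], [0.85, 0.15],
                [0.10, 0.90], [0.12, 0.88], [0.11, 0.92]])

labels = kmeans(emb, k=2)

# Segments in the same cluster map to one shared multilingual vocabulary entry.
clusters = {}
for seg, lab in zip(segments, labels):
    clusters.setdefault(int(lab), []).append(seg)
print(clusters)
```

In this sketch the suffix-like segments (`##ing_en`, `##ung_de`) and the "house" word forms end up in separate clusters, so cross-lingually similar segments are merged while dissimilar ones stay apart.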
Original language: English
Title of host publication: Proceedings of the 1st Workshop on Multilingual Representation Learning
Publisher: Association for Computational Linguistics
Publication date: 2021
Pages: 32–40
DOIs
Publication status: Published - 2021
Event: 1st Workshop on Multilingual Representation Learning - Online
Duration: 11 Nov 2021 – 11 Nov 2021

Conference

Conference: 1st Workshop on Multilingual Representation Learning
City: Online
Period: 11/11/2021 – 11/11/2021

