Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization

Research output: Chapter in Book/Report/Conference proceeding > Article in proceedings > Research > peer-review

Documents

  • Fulltext: Final published version, 828 KB, PDF document

Multilingual language models exhibit better performance for some languages than for others (Singh et al., 2019), and many languages do not seem to benefit from multilingual sharing at all, presumably as a result of poor multilingual segmentation (Pyysalo et al., 2020). This work explores the idea of learning multilingual language models based on clustering of monolingual segments. We show significant improvements over standard multilingual segmentation and training across nine languages on a question answering task, both in a small model regime and for a model the size of BERT-base.
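The abstract describes building a shared multilingual vocabulary by clustering monolingual segments. A minimal sketch of one way such clustering could work is shown below: subword segments from several monolingual vocabularies are embedded, grouped with k-means, and segments in the same cluster share a single vocabulary entry. All names, embeddings, and the choice of k-means here are illustrative assumptions, not the authors' actual method.

```python
# Hypothetical sketch: cluster subword segments from monolingual vocabularies
# so that similar segments across languages share one vocabulary entry.
# The toy embeddings and segment names below are invented for illustration.
import numpy as np

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means; returns one cluster id per input vector."""
    rng = np.random.default_rng(seed)
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        # Distance of every vector to every center, then nearest-center labels.
        dists = np.linalg.norm(vectors[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = vectors[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return labels

# Toy segments from English, German, and Danish monolingual vocabularies,
# with 2-d stand-in embeddings (a real system would use learned embeddings).
segments = ["##ing_en", "##ung_de", "hus_da", "house_en", "haus_de"]
emb = np.array([[0.90, 0.10], [0.85, 0.15],
                [0.10, 0.90], [0.12, 0.88], [0.11, 0.92]])

labels = kmeans(emb, k=2)

# Segments in the same cluster map to one shared multilingual vocabulary entry.
clusters = {}
for seg, lab in zip(segments, labels):
    clusters.setdefault(int(lab), []).append(seg)
print(clusters)
```

In this sketch the suffix-like segments (`##ing_en`, `##ung_de`) and the "house" word forms end up in separate clusters, so cross-lingually similar segments are merged while dissimilar ones stay apart.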
Original language: English
Title of host publication: Proceedings of the 1st Workshop on Multilingual Representation Learning
Publisher: Association for Computational Linguistics
Publication date: 2021
Pages: 32–40
DOIs
Publication status: Published - 2021
Event: 1st Workshop on Multilingual Representation Learning - Online
Duration: 11 Nov 2021 – 11 Nov 2021

Conference

Conference: 1st Workshop on Multilingual Representation Learning
City: Online
Period: 11/11/2021 – 11/11/2021

