Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization. / Bassani, Riccardo; Søgaard, Anders; Deoskar, Tejaswini.

Proceedings of the 1st Workshop on Multilingual Representation Learning. Association for Computational Linguistics, 2021. p. 32–40.

Harvard

Bassani, R, Søgaard, A & Deoskar, T 2021, Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization. in Proceedings of the 1st Workshop on Multilingual Representation Learning. Association for Computational Linguistics, pp. 32–40, 1st Workshop on Multilingual Representation Learning, Online, 11/11/2021. https://doi.org/10.18653/v1/2021.mrl-1.3

APA

Bassani, R., Søgaard, A., & Deoskar, T. (2021). Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization. In Proceedings of the 1st Workshop on Multilingual Representation Learning (pp. 32–40). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.mrl-1.3

Vancouver

Bassani R, Søgaard A, Deoskar T. Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization. In Proceedings of the 1st Workshop on Multilingual Representation Learning. Association for Computational Linguistics. 2021. p. 32–40. https://doi.org/10.18653/v1/2021.mrl-1.3

Author

Bassani, Riccardo; Søgaard, Anders; Deoskar, Tejaswini. / Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization. Proceedings of the 1st Workshop on Multilingual Representation Learning. Association for Computational Linguistics, 2021. pp. 32–40.

Bibtex

@inproceedings{6287b9099adf4d0aab3f61cf6ac94aee,
title = "Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization",
abstract = "Multilingual language models exhibit better performance for some languages than for others (Singh et al., 2019), and many languages do not seem to benefit from multilingual sharing at all, presumably as a result of poor multilingual segmentation (Pyysal o et al., 2020). This work explores the idea of learning multilingual language models based on clustering of monolingual segments. We show significant improvements over standard multilingual segmentation and training across nine languages on a question answering task, both in a small model regime and for a model of the size of BERT-base.",
author = "Riccardo Bassani and Anders S{\o}gaard and Tejaswini Deoskar",
year = "2021",
doi = "10.18653/v1/2021.mrl-1.3",
language = "English",
pages = "32–40",
booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
publisher = "Association for Computational Linguistics",
note = "1st Workshop on Multilingual Representation Learning ; Conference date: 11-11-2021 Through 11-11-2021",

}
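
The abstract describes the core idea: build vocabularies monolingually, then cluster the monolingual segments into one shared multilingual vocabulary. A minimal, purely illustrative sketch of such a clustering step in Python (toy random embeddings, scikit-learn k-means, and a toy cluster count; none of this reflects the authors' actual pipeline):

import numpy as np
from sklearn.cluster import KMeans

# Toy embeddings only; real ones would come from monolingual subword models.
rng = np.random.default_rng(0)

# Hypothetical monolingual vocabularies: language -> {segment: embedding}
mono_vocabs = {
    "en": {"walk": rng.standard_normal(300), "##ing": rng.standard_normal(300)},
    "de": {"geh": rng.standard_normal(300), "##en": rng.standard_normal(300)},
}

# Flatten all (language, segment) pairs and stack their embeddings in the same order.
segments = [(lang, seg) for lang, vocab in mono_vocabs.items() for seg in vocab]
X = np.stack([mono_vocabs[lang][seg] for lang, seg in segments])

# Cluster all monolingual segments; each cluster becomes one shared vocabulary entry.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Map each monolingual segment to its shared multilingual vocabulary ID.
shared_vocab = {(lang, seg): int(c) for (lang, seg), c in zip(segments, labels)}
print(shared_vocab)

In a real setting the cluster count would presumably match the target shared vocabulary size, so that segments from different languages landing in the same cluster share a single entry (and embedding) in the multilingual model.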

RIS

TY - GEN

T1 - Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization

AU - Bassani, Riccardo

AU - Søgaard, Anders

AU - Deoskar, Tejaswini

PY - 2021

Y1 - 2021

N2 - Multilingual language models exhibit better performance for some languages than for others (Singh et al., 2019), and many languages do not seem to benefit from multilingual sharing at all, presumably as a result of poor multilingual segmentation (Pyysalo et al., 2020). This work explores the idea of learning multilingual language models based on clustering of monolingual segments. We show significant improvements over standard multilingual segmentation and training across nine languages on a question answering task, both in a small model regime and for a model of the size of BERT-base.

AB - Multilingual language models exhibit better performance for some languages than for others (Singh et al., 2019), and many languages do not seem to benefit from multilingual sharing at all, presumably as a result of poor multilingual segmentation (Pyysalo et al., 2020). This work explores the idea of learning multilingual language models based on clustering of monolingual segments. We show significant improvements over standard multilingual segmentation and training across nine languages on a question answering task, both in a small model regime and for a model of the size of BERT-base.

U2 - 10.18653/v1/2021.mrl-1.3

DO - 10.18653/v1/2021.mrl-1.3

M3 - Article in proceedings

SP - 32

EP - 40

BT - Proceedings of the 1st Workshop on Multilingual Representation Learning

PB - Association for Computational Linguistics

T2 - 1st Workshop on Multilingual Representation Learning

Y2 - 11 November 2021 through 11 November 2021

ER -