The Impact of Positional Encodings on Multilingual Compression

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

The Impact of Positional Encodings on Multilingual Compression. / Ravishankar, Vinit; Søgaard, Anders.

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2021. p. 763-777.


Harvard

Ravishankar, V & Søgaard, A 2021, The Impact of Positional Encodings on Multilingual Compression. in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 763-777, 2021 Conference on Empirical Methods in Natural Language Processing, 07/11/2021. https://doi.org/10.18653/v1/2021.emnlp-main.59

APA

Ravishankar, V., & Søgaard, A. (2021). The Impact of Positional Encodings on Multilingual Compression. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 763-777). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.59

Vancouver

Ravishankar V, Søgaard A. The Impact of Positional Encodings on Multilingual Compression. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. 2021. p. 763-777. https://doi.org/10.18653/v1/2021.emnlp-main.59

Author

Ravishankar, Vinit ; Søgaard, Anders. / The Impact of Positional Encodings on Multilingual Compression. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2021. pp. 763-777

Bibtex

@inproceedings{3ff520f1e214433cacd91694422323c1,
title = "The Impact of Positional Encodings on Multilingual Compression",
abstract = "In order to preserve word-order information in a non-autoregressive setting, transformer architectures tend to include positional knowledge, by (for instance) adding positional encodings to token embeddings. Several modifications have been proposed over the sinusoidal positional encodings used in the original transformer architecture; these include, for instance, separating position encodings and token embeddings, or directly modifying attention weights based on the distance between word pairs. We first show that, surprisingly, while these modifications tend to improve monolingual language models, none of them results in better multilingual language models. We then answer why that is: Sinusoidal encodings were explicitly designed to facilitate compositionality by allowing linear projections over arbitrary time steps. Higher variance in multilingual training distributions requires higher compression, in which case compositionality becomes indispensable. Learned absolute positional encodings (e.g., in mBERT) tend to approximate sinusoidal embeddings in multilingual settings, but more complex positional encoding architectures lack the inductive bias to effectively learn compositionality and cross-lingual alignment. In other words, while sinusoidal positional encodings were originally designed for monolingual applications, they are particularly useful in multilingual language models.",
author = "Vinit Ravishankar and Anders S{\o}gaard",
year = "2021",
doi = "10.18653/v1/2021.emnlp-main.59",
language = "English",
pages = "763--777",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
publisher = "Association for Computational Linguistics",
note = "2021 Conference on Empirical Methods in Natural Language Processing ; Conference date: 07-11-2021 Through 11-11-2021",

}
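
The property the abstract refers to, that sinusoidal encodings allow linear projections over arbitrary time steps, can be illustrated with a short sketch. The NumPy example below is illustrative only and not taken from the paper; the function names, dimensions, and base are my own choices. It builds the standard sinusoidal encoding of Vaswani et al. (2017) and the position-independent block rotation that maps PE(pos) to PE(pos + k):

import numpy as np

def sinusoidal_pe(pos, d_model=16, base=10000.0):
    # Standard sinusoidal positional encoding:
    # PE[2i] = sin(pos * base^(-2i/d)), PE[2i+1] = cos(pos * base^(-2i/d)).
    i = np.arange(d_model // 2)
    freqs = base ** (-2.0 * i / d_model)
    angles = pos * freqs
    pe = np.empty(d_model)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

def shift_matrix(k, d_model=16, base=10000.0):
    # Block-diagonal rotation M with M @ PE(pos) == PE(pos + k) for every pos:
    # each sin/cos pair is rotated by k times its own frequency.
    i = np.arange(d_model // 2)
    freqs = base ** (-2.0 * i / d_model)
    M = np.zeros((d_model, d_model))
    for j, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        M[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return M

pos, k = 7, 5
assert np.allclose(shift_matrix(k) @ sinusoidal_pe(pos), sinusoidal_pe(pos + k))

Because the same rotation works for every position, relative offsets can be recovered by a single fixed linear map; this is the compositionality property the abstract credits sinusoidal encodings with.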

RIS

TY - GEN

T1 - The Impact of Positional Encodings on Multilingual Compression

AU - Ravishankar, Vinit

AU - Søgaard, Anders

PY - 2021

Y1 - 2021

N2 - In order to preserve word-order information in a non-autoregressive setting, transformer architectures tend to include positional knowledge, by (for instance) adding positional encodings to token embeddings. Several modifications have been proposed over the sinusoidal positional encodings used in the original transformer architecture; these include, for instance, separating position encodings and token embeddings, or directly modifying attention weights based on the distance between word pairs. We first show that, surprisingly, while these modifications tend to improve monolingual language models, none of them results in better multilingual language models. We then answer why that is: Sinusoidal encodings were explicitly designed to facilitate compositionality by allowing linear projections over arbitrary time steps. Higher variance in multilingual training distributions requires higher compression, in which case compositionality becomes indispensable. Learned absolute positional encodings (e.g., in mBERT) tend to approximate sinusoidal embeddings in multilingual settings, but more complex positional encoding architectures lack the inductive bias to effectively learn compositionality and cross-lingual alignment. In other words, while sinusoidal positional encodings were originally designed for monolingual applications, they are particularly useful in multilingual language models.

AB - In order to preserve word-order information in a non-autoregressive setting, transformer architectures tend to include positional knowledge, by (for instance) adding positional encodings to token embeddings. Several modifications have been proposed over the sinusoidal positional encodings used in the original transformer architecture; these include, for instance, separating position encodings and token embeddings, or directly modifying attention weights based on the distance between word pairs. We first show that, surprisingly, while these modifications tend to improve monolingual language models, none of them results in better multilingual language models. We then answer why that is: Sinusoidal encodings were explicitly designed to facilitate compositionality by allowing linear projections over arbitrary time steps. Higher variance in multilingual training distributions requires higher compression, in which case compositionality becomes indispensable. Learned absolute positional encodings (e.g., in mBERT) tend to approximate sinusoidal embeddings in multilingual settings, but more complex positional encoding architectures lack the inductive bias to effectively learn compositionality and cross-lingual alignment. In other words, while sinusoidal positional encodings were originally designed for monolingual applications, they are particularly useful in multilingual language models.

U2 - 10.18653/v1/2021.emnlp-main.59

DO - 10.18653/v1/2021.emnlp-main.59

M3 - Article in proceedings

SP - 763

EP - 777

BT - Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

PB - Association for Computational Linguistics

T2 - 2021 Conference on Empirical Methods in Natural Language Processing

Y2 - 7 November 2021 through 11 November 2021

ER -
