Sociolectal Analysis of Pretrained Language Models

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Documents

  • Fulltext: Final published version, 380 KB, PDF document

Using data from English cloze tests, in which subjects also self-reported their gender, age, education, and race, we examine performance differences of pretrained language models across demographic groups, defined by these (protected) attributes. We demonstrate wide performance gaps across demographic groups and show that pretrained language models systematically disfavor young non-white male speakers; i.e., not only do pretrained language models learn social biases (stereotypical associations) – pretrained language models also learn sociolectal biases, learning to speak more like some than like others. We show, however, that, with the exception of BERT models, larger pretrained language models reduce some of the performance gaps between majority and minority groups.
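The evaluation the abstract describes amounts to scoring a model's cloze predictions separately for each self-reported demographic group and comparing the resulting accuracies. A minimal sketch of that grouping step, assuming cloze predictions and gold answers have already been collected (the record format and group labels here are hypothetical, not from the paper):

```python
from collections import defaultdict

def per_group_accuracy(records):
    """Compute cloze accuracy for each demographic group.

    `records` is an iterable of (group, predicted_word, gold_word)
    tuples; group labels and predictions are illustrative assumptions.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, pred, gold in records:
        total[group] += 1
        if pred == gold:
            correct[group] += 1
    # Accuracy per group; gaps between groups indicate sociolectal bias.
    return {g: correct[g] / total[g] for g in total}

records = [
    ("group_a", "dog", "dog"),
    ("group_a", "cat", "dog"),
    ("group_b", "run", "run"),
]
print(per_group_accuracy(records))  # → {'group_a': 0.5, 'group_b': 1.0}
```

In practice the predictions would come from a masked language model filling in the cloze blank; the paper's finding is that these per-group accuracies differ systematically.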
Original language: English
Title of host publication: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Publisher: Association for Computational Linguistics
Publication date: 2021
Pages: 4581–4588
Publication status: Published - 2021
Event: 2021 Conference on Empirical Methods in Natural Language Processing
Duration: 7 Nov 2021 – 11 Nov 2021

Conference

Conference: 2021 Conference on Empirical Methods in Natural Language Processing
Period: 07/11/2021 – 11/11/2021

ID: 299822479