LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
Standard
LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development. / Chalkidis, Ilias; Garneau, Nicolas; Søgaard, Anders; Goantă, Cătălină; Katz, Daniel Martin.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics (ACL), 2023. p. 15513-15535.
RIS
TY - GEN
T1 - LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development
T2 - 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023
AU - Chalkidis, Ilias
AU - Garneau, Nicolas
AU - Søgaard, Anders
AU - Goantă, Cătălină
AU - Katz, Daniel Martin
N1 - Publisher Copyright: © 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
AB - In this work, we conduct a detailed analysis of the performance of legal-oriented pre-trained language models (PLMs). We examine the interplay between their original objective, acquired knowledge, and legal language understanding capacities, which we define as the upstream, probing, and downstream performance, respectively. We consider not only the models' size but also the pre-training corpora used as important dimensions in our study. To this end, we release a multinational English legal corpus (LeXFiles) and a legal knowledge probing benchmark (LegalLAMA) to facilitate training and detailed analysis of legal-oriented PLMs. We release two new legal PLMs trained on LeXFiles and evaluate them alongside others on LegalLAMA and LexGLUE. We find that probing performance strongly correlates with upstream performance in related legal topics. On the other hand, downstream performance is mainly driven by the model's size and prior legal knowledge, which can be estimated from upstream and probing performance. Based on these findings, we conclude that both dimensions are important for those seeking to develop domain-specific PLMs.
U2 - 10.18653/v1/2023.acl-long.865
DO - 10.18653/v1/2023.acl-long.865
M3 - Article in proceedings
AN - SCOPUS:85173828437
SP - 15513
EP - 15535
BT - Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
PB - Association for Computational Linguistics (ACL)
Y2 - 9 July 2023 through 14 July 2023
ER -