Replicating and Extending "Because Their Treebanks Leak": Graph Isomorphism, Covariants, and Parser Performance

Research output: Chapter in Book/Report/Conference proceedingArticle in proceedingsResearchpeer-review

Documents

  • Fulltext

    Final published version, 916 KB, PDF document

Søgaard (2020) obtained results suggesting the fraction of trees occurring in the test data isomorphic to trees in the training set accounts for a non-trivial variation in parser performance. Similar to other statistical analyses in NLP, the results were based on evaluating linear regressions. However, the study had methodological issues and was undertaken using a small sample size leading to unreliable results. We present a replication study in which we also bin sentences by length and find that only a small subset of sentences vary in performance with respect to graph isomorphism. Further, the correlation observed between parser performance and graph isomorphism in the wild disappears when controlling for covariants. However, in a controlled experiment, where covariants are kept fixed, we do observe a strong correlation. We suggest that conclusions drawn from statistical analyses like this need to be tempered and that controlled experiments can complement them by more readily teasing factors apart.

Original languageEnglish
Title of host publicationProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
PublisherAssociation for Computational Linguistics
Publication date2021
Pages1090-1098
ISBN (Electronic)9781954085527
Publication statusPublished - 2021
EventJoint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021 - Virtual, Online
Duration: 1 Aug 20216 Aug 2021

Conference

ConferenceJoint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021
ByVirtual, Online
Periode01/08/202106/08/2021
SponsorAmazon Science, Apple, Bloomberg Engineering, et al., Facebook AI, Google Research

Bibliographical note

Publisher Copyright:
© 2021 Association for Computational Linguistics.

ID: 291681189