Replicating and Extending "Because Their Treebanks Leak": Graph Isomorphism, Covariants, and Parser Performance

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Standard

Replicating and Extending "Because Their Treebanks Leak": Graph Isomorphism, Covariants, and Parser Performance. / Anderson, Mark; Søgaard, Anders; Gómez-Rodriguez, Carlos.

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Association for Computational Linguistics, 2021. p. 1090-1098.

Harvard

Anderson, M, Søgaard, A & Gómez-Rodriguez, C 2021, Replicating and Extending "Because Their Treebanks Leak": Graph Isomorphism, Covariants, and Parser Performance. in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Association for Computational Linguistics, pp. 1090-1098, Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021, Virtual, Online, 01/08/2021.

APA

Anderson, M., Søgaard, A., & Gómez-Rodriguez, C. (2021). Replicating and Extending "Because Their Treebanks Leak": Graph Isomorphism, Covariants, and Parser Performance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) (pp. 1090-1098). Association for Computational Linguistics.

Vancouver

Anderson M, Søgaard A, Gómez-Rodriguez C. Replicating and Extending "Because Their Treebanks Leak": Graph Isomorphism, Covariants, and Parser Performance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Association for Computational Linguistics. 2021. p. 1090-1098.

Author

Anderson, Mark; Søgaard, Anders; Gómez-Rodriguez, Carlos. / Replicating and Extending "Because Their Treebanks Leak": Graph Isomorphism, Covariants, and Parser Performance. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Association for Computational Linguistics, 2021. pp. 1090-1098.

Bibtex

@inproceedings{87647dac51c9437db6cc533df2fa8dd5,

title = "Replicating and Extending {"}Because Their Treebanks Leak{"}: Graph Isomorphism, Covariants, and Parser Performance",

abstract = "S{\o}gaard (2020) obtained results suggesting the fraction of trees occurring in the test data isomorphic to trees in the training set accounts for a non-trivial variation in parser performance. Similar to other statistical analyses in NLP, the results were based on evaluating linear regressions. However, the study had methodological issues and was undertaken using a small sample size leading to unreliable results. We present a replication study in which we also bin sentences by length and find that only a small subset of sentences vary in performance with respect to graph isomorphism. Further, the correlation observed between parser performance and graph isomorphism in the wild disappears when controlling for covariants. However, in a controlled experiment, where covariants are kept fixed, we do observe a strong correlation. We suggest that conclusions drawn from statistical analyses like this need to be tempered and that controlled experiments can complement them by more readily teasing factors apart.",

author = "Mark Anderson and Anders S{\o}gaard and Carlos G{\'o}mez-Rodriguez",

note = "Publisher Copyright: {\textcopyright} 2021 Association for Computational Linguistics.; Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021 ; Conference date: 01-08-2021 Through 06-08-2021",

year = "2021",

language = "English",

pages = "1090--1098",

booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",

publisher = "Association for Computational Linguistics",

}

RIS

TY - GEN

T1 - Replicating and Extending "Because Their Treebanks Leak"

T2 - Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021

AU - Anderson, Mark

AU - Søgaard, Anders

AU - Gómez-Rodriguez, Carlos

PY - 2021

Y1 - 2021

N2 - Søgaard (2020) obtained results suggesting the fraction of trees occurring in the test data isomorphic to trees in the training set accounts for a non-trivial variation in parser performance. Similar to other statistical analyses in NLP, the results were based on evaluating linear regressions. However, the study had methodological issues and was undertaken using a small sample size leading to unreliable results. We present a replication study in which we also bin sentences by length and find that only a small subset of sentences vary in performance with respect to graph isomorphism. Further, the correlation observed between parser performance and graph isomorphism in the wild disappears when controlling for covariants. However, in a controlled experiment, where covariants are kept fixed, we do observe a strong correlation. We suggest that conclusions drawn from statistical analyses like this need to be tempered and that controlled experiments can complement them by more readily teasing factors apart.

UR - http://www.scopus.com/inward/record.url?scp=85121214829&partnerID=8YFLogxK

M3 - Article in proceedings

AN - SCOPUS:85121214829

SP - 1090

EP - 1098

BT - Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

PB - Association for Computational Linguistics

Y2 - 1 August 2021 through 6 August 2021

ER -

ID: 291681189

Department of Computer Science