Crowdsourcing and annotating NER for Twitter #drift
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
We present two new NER datasets for Twitter: a manually annotated set of 1,467 tweets (kappa=0.942) and a set of 2,975 expert-corrected, crowdsourced NER-annotated tweets from the dataset described in Finin et al. (2010). In our experiments with these datasets, we observe two important points: (a) language drift on Twitter is significant, and while off-the-shelf systems have been reported to perform well on in-sample data, they often perform poorly on new samples of tweets; and (b) state-of-the-art performance across various datasets can be obtained from crowdsourced annotations, making it more feasible to “catch up” with language drift.
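The reported agreement figure is presumably Cohen's kappa, which corrects raw annotator agreement for chance. As a minimal sketch (the label sets and annotator data here are hypothetical, not from the paper), it can be computed over token-level tags from two annotators like this:

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(ann_a) == len(ann_b) and ann_a
    n = len(ann_a)
    # Observed agreement: fraction of items labelled identically.
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    counts_a, counts_b = Counter(ann_a), Counter(ann_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy BIO-style NER tags from two hypothetical annotators.
a = ["O", "B-PER", "O", "B-LOC", "O", "O", "B-ORG", "O"]
b = ["O", "B-PER", "O", "B-LOC", "O", "B-PER", "B-ORG", "O"]
print(round(cohens_kappa(a, b), 3))  # → 0.8
```

A kappa of 0.942, as reported for the manually annotated set, indicates near-perfect agreement on common interpretation scales.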
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 9th International Conference on Language Resources and Evaluation: LREC2014 |
| Publisher | European Language Resources Association |
| Publication date | 2014 |
| Publication status | Published - 2014 |
ID: 105105333