Crowdsourcing and annotating NER for Twitter #drift
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
We present two new NER datasets for Twitter: a manually annotated set of 1,467 tweets (kappa=0.942) and a set of 2,975 expert-corrected, crowdsourced NER-annotated tweets from the dataset described in Finin et al. (2010). In our experiments with these datasets, we observe two important points: (a) language drift on Twitter is significant, and while off-the-shelf systems have been reported to perform well on in-sample data, they often perform poorly on new samples of tweets; and (b) state-of-the-art performance across various datasets can be obtained from crowdsourced annotations, making it more feasible to “catch up” with language drift.
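The reported agreement figure is presumably Cohen's kappa, which corrects raw annotator agreement for chance. As a minimal sketch (the label sets and annotator data here are hypothetical, not from the paper), it can be computed over token-level tags from two annotators like this:

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(ann_a) == len(ann_b) and ann_a
    n = len(ann_a)
    # Observed agreement: fraction of items labelled identically.
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    counts_a, counts_b = Counter(ann_a), Counter(ann_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy BIO-style NER tags from two hypothetical annotators.
a = ["O", "B-PER", "O", "B-LOC", "O", "O", "B-ORG", "O"]
b = ["O", "B-PER", "O", "B-LOC", "O", "B-PER", "B-ORG", "O"]
print(round(cohens_kappa(a, b), 3))  # → 0.8
```

A kappa of 0.942, as reported for the manually annotated set, indicates near-perfect agreement on common interpretation scales.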
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 9th International Conference on Language Resources and Evaluation: LREC2014 |
| Publisher | European Language Resources Association |
| Publication date | 2014 |
| Publication status | Published - 2014 |
ID: 105105333