Cleaned ACL ARC

The ACL Anthology Reference Corpus (ACL ARC) is a collection of 10,920 academic papers from the ACL Anthology. We manually cleaned ACL ARC to remove:

(a) files that do not look like full papers, paper fragments, foreign-language papers (e.g., French), and pure junk.
(b) headers (title and author information; NOT abstract).
(c) footers ("References" line and the actual references).
(d) some bad characters (spurious characters).
(e) some page numbers (i.e., a single number appearing on a line, with nothing else attached to it).
(f) significant foreign-language (e.g., French) content in an otherwise English paper.
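
The cleaning described above was done manually. As a rough illustration only, two of the more mechanical steps, (c) and (e), could be approximated as in the sketch below. The function names and regular expressions are hypothetical and are not the procedure actually used to build this corpus. The sketch assumes a paper is already loaded as a list of text lines, that a page number is a line containing only digits, and that the reference section starts at the last line consisting solely of the word "References".

```python
import re

# Assumption: a page number is a line holding only digits (plus whitespace).
PAGE_NUMBER_LINE = re.compile(r"^\s*\d+\s*$")
# Assumption: the footer starts at a line holding only the word "References".
REFERENCES_LINE = re.compile(r"^\s*references\s*$", re.IGNORECASE)


def strip_page_numbers(lines):
    """Drop lines that contain a single number and nothing else (step e)."""
    return [ln for ln in lines if not PAGE_NUMBER_LINE.match(ln)]


def strip_references(lines):
    """Cut the text at the last bare "References" line (step c).

    Returns the lines unchanged when no such line is found.
    """
    for i in range(len(lines) - 1, -1, -1):
        if REFERENCES_LINE.match(lines[i]):
            return lines[:i]
    return lines
```

A heuristic like this would still need manual review: a bare number on a line can be legitimate content (for example, a table cell), which is consistent with the list above saying that only some page numbers were removed.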

The cleaned corpus has 10,628 documents. It is available from the link below. If you use this corpus, please cite the following work:

@inproceedings{bird2008acl,
  title={{The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics}},
  author={Bird, Steven and Dale, Robert and Dorr, Bonnie J. and Gibson, Bryan and Joseph, Mark T. and Kan, Min-Yen and Lee, Dongwon and Powley, Brett and Radev, Dragomir R. and Tan, Yee Fan},
  booktitle={Proc. of the 6th International Conference on Language Resources and Evaluation (LREC’08)},
  pages={1755--1759},
  year={2008}}

@unpublished{LahiriACLARCStyleBrowser,
      author = {Lahiri, Shibamouli},
      title = {{ACL ARC Style Browser}},
      note = "\url{http://ec2-54-186-204-149.us-west-2.compute.amazonaws.com/acl_arc_style_browser/}",
      year = 2014}

Link to Dataset

We also prepared an ASCII version of the dataset by removing all characters outside the range [1, 254]. This version is here.
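
A minimal sketch of the character filter described above follows. The function name is hypothetical, and the sketch assumes that the range [1, 254] refers to Unicode code points of already-decoded text; the original filtering may instead have operated on raw byte values.

```python
def keep_code_points_1_to_254(text):
    """Remove every character whose code point lies outside [1, 254].

    Assumption: the range is applied to Unicode code points, not raw bytes.
    """
    return "".join(ch for ch in text if 1 <= ord(ch) <= 254)
```

Under this reading, NUL characters and anything above code point 254 (such as curly quotes) are dropped, while accented Latin-1 letters with code points between 128 and 254 are kept.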

This is the only authentic version of the datasets as of its creation. Any other versions, even if public, are to be deemed non-authentic. - April 27, 2021.