Gutenberg Dataset


This is a collection of 3,036 English books written by 142 authors, a small subset of the Project Gutenberg corpus. All books have been manually cleaned to remove, as far as possible, metadata, license information, and transcribers' notes.

The cleaned corpus is available from the link below. If you use this corpus, please cite the following work:



@InProceedings{lahiri:2014:SRW,
  author    = {Lahiri, Shibamouli},
  title     = {{Complexity of Word Collocation Networks: A Preliminary Structural Analysis}},
  booktitle = {Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics},
  month     = {April},
  year      = {2014},
  address   = {Gothenburg, Sweden},
  publisher = {Association for Computational Linguistics},
  pages     = {96--105},
  url       = {http://www.aclweb.org/anthology/E14-3011}
}

Link to Dataset

This is the only authentic version of the above dataset, as of its creation. Any other version of the dataset (except the one noted below), even if public, should be considered non-authentic. - April 27, 2021.



UPDATE: A new version of the dataset has been prepared by Matthew D. Scholefield, which addresses some issues with the original dataset (link). - August 17, 2018


