Gutenberg Dataset

This is a collection of 3,036 English books written by 142 authors, a small subset of the Project Gutenberg corpus. All books have been manually cleaned to remove, as far as possible, metadata, license information, and transcribers' notes.

The cleaned corpus is available from the link below. If you use this corpus, please cite the following work:

Link to Dataset

This is the only authentic version of the above dataset as of its creation. Any other versions (except the one noted below) should be considered non-authentic, even if they are publicly available. - April 27, 2021.

UPDATE: Matthew D. Scholefield has prepared a new version of the dataset that addresses some issues with the original (link). - August 17, 2018