We created two authorship attribution datasets in Bengali: Literary and Blogs.
The Literary dataset is a collection of 3,000 passages selected from the complete works of Rabindranath Tagore, Sarat Chandra Chattopadhyay, and Bankim Chandra Chattopadhyay.
Each author is represented by 1,000 passages.
The passages are selected randomly and scrambled, so that they reflect a realistic corpus of fragments.
The passages are further divided into balanced train, test, and development sets.
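As a rough illustration of what a balanced split means here, the sketch below divides a labeled corpus so that every author contributes the same number of passages to each set. The function name `balanced_split`, the 80/10/10 proportions, and the `(author, text)` input layout are assumptions made for this example; the actual proportions and file format of the released dataset are not stated above.

```python
import random
from collections import defaultdict


def balanced_split(passages, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split (author, text) pairs into train/dev/test, stratified by author.

    `fractions` is a hypothetical train/dev/test ratio, not the one used
    for the released dataset.
    """
    # Group passages by author so each author is split independently.
    by_author = defaultdict(list)
    for author, text in passages:
        by_author[author].append(text)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    splits = {"train": [], "dev": [], "test": []}

    for author in sorted(by_author):
        texts = list(by_author[author])
        rng.shuffle(texts)
        n_train = int(len(texts) * fractions[0])
        n_dev = int(len(texts) * fractions[1])
        # Whatever remains after train and dev goes to test.
        splits["train"] += [(author, t) for t in texts[:n_train]]
        splits["dev"] += [(author, t) for t in texts[n_train:n_train + n_dev]]
        splits["test"] += [(author, t) for t in texts[n_train + n_dev:]]

    return splits
```

With 1,000 passages per author and these assumed proportions, each author would contribute 800 training, 100 development, and 100 test passages.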
The corpus is described in this paper.
The dataset is available from this link.
If you use this dataset, please cite the following work:
@inproceedings{PhaniLahiriBiswas2015,
  author    = {Phani, Shanta and Lahiri, Shibamouli and Biswas, Arindam},
  title     = {{Authorship Attribution in Bengali Language}},
  booktitle = {Proceedings of the 12th International Conference on Natural Language Processing (ICON)},
  year      = 2015,
  month     = {December},
  address   = {Thiruvananthapuram, India},
  url       = {http://ltrc.iiit.ac.in/icon2015/icon2015_proceedings/PDF/37_rp.pdf}
}