
This repo contains a scraper for the Project Gutenberg website, which hosts 56,019 books that are free to read and download. The repo also includes a text file with all the book data up to April 2018, containing only the 'id', 'title' and 'authors' of every book in the dataset.


Anwarvic/GutenbergScrapper


GutenbergScrapper

This repo contains a multi-threaded scraper for the Project Gutenberg website, which hosts 56,920 books that are free to read and download. It can scrape the whole website in just 5 hours. The repo also includes a text file with all the data collected up to April 2018. The file contains only the EBook-No., title, authors and language of every book, because these are the only attributes I cared about.

You can add these attributes as well:

  • Subject
  • LoC Class
  • Category
  • Release Date
  • Copyright Status
  • Downloads
  • Price

To add any or all of these attributes, modify the member variable INCLUDE in the script, like so:

self.INCLUDE = set(['Title', 'Author', 'EBook-No.', 'Language'])

Then, run the script.
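
As a rough illustration of how an INCLUDE set like the one above might act as a filter, here is a minimal sketch. The function name filter_attributes, the (name, value) row layout and the placeholder values are assumptions made for this example and are not taken from the repo's script.

```python
# Hypothetical sketch: the names and row layout are illustrative, not the repo's code.
INCLUDE = {"Title", "Author", "EBook-No.", "Language", "Release Date"}


def filter_attributes(rows, include=INCLUDE):
    """Keep only the (name, value) pairs whose name appears in `include`."""
    return {name: value for name, value in rows if name in include}


# Rows as they might be read from one book's bibliographic table.
# "placeholder" marks values that are made up for the example.
rows = [
    ("Author", "Jefferson, Thomas, 1743-1826"),
    ("Title", "The Declaration of Independence of the United States of America"),
    ("Language", "English"),
    ("Release Date", "placeholder"),
    ("Price", "placeholder"),
]

book = filter_attributes(rows)
print(sorted(book))
```

Here "Release Date" is kept because it was added to the set, while "Price" is dropped because it was not.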

The collected data will look like this:

ID: 1
Author: Jefferson, Thomas, 1743-1826
Title: The Declaration of Independence of the United States of America
Language: English

ID: 2
Author: United States
Title: The United States Bill of Rights
The Ten Original Amendments to the Constitution of the United States
Language: English

ID: 3
Author: Kennedy, John F. (John Fitzgerald), 1917-1963
Title: John F. Kennedy's Inaugural Address
Language: English

...
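
A file in this layout can be read back into records with a few lines of Python. The sketch below assumes that records are separated by blank lines, that each field starts with one of the four keys shown above followed by ": ", and that any other line (such as the second title line of book 2) continues the previous field. The function name parse_records is hypothetical.

```python
# Hypothetical sketch: parses the "Key: value" layout shown in the sample above.
KNOWN_KEYS = ("ID", "Author", "Title", "Language")


def parse_records(text):
    """Split the text into one dict per book, joining continuation lines."""
    records = []
    current, last_key = {}, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record, if there is one.
            if current:
                records.append(current)
            current, last_key = {}, None
            continue
        key, sep, value = line.partition(": ")
        if sep and key in KNOWN_KEYS:
            current[key] = value
            last_key = key
        elif last_key is not None:
            # No known key: treat the line as a continuation of the last field.
            current[last_key] += "\n" + line
    if current:
        records.append(current)
    return records


sample = """ID: 2
Author: United States
Title: The United States Bill of Rights
The Ten Original Amendments to the Constitution of the United States
Language: English

ID: 3
Author: Kennedy, John F. (John Fitzgerald), 1917-1963
Title: John F. Kennedy's Inaugural Address
Language: English
"""

records = parse_records(sample)
print(len(records), records[0]["Title"].splitlines())
```

Lines that appear before any known key in a record, such as a trailing "...", are ignored.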

Prerequisites

You need to install:

  • Python 3
  • Beautiful Soup 4.0
  • requests
  • multiprocessing (part of the Python standard library, so it needs no separate install)
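
For readers unfamiliar with the multiprocessing entry in the list above, here is a minimal sketch of how a thread pool from that module could fan work out over book IDs. It is not the repo's implementation: scrape_all and fake_fetch are hypothetical names, and fake_fetch is a stand-in that performs no network requests.

```python
from multiprocessing.pool import ThreadPool


def scrape_all(book_ids, fetch_book, workers=8):
    """Apply fetch_book to every ID concurrently; results keep the input order."""
    with ThreadPool(workers) as pool:
        return pool.map(fetch_book, book_ids)


def fake_fetch(book_id):
    # Stand-in for a real download-and-parse step (assumption: no network I/O here).
    return {"ID": book_id}


results = scrape_all(range(1, 6), fake_fetch, workers=3)
print([r["ID"] for r in results])
```

In a real run, fake_fetch would be replaced by a function that downloads and parses one book page. Threads suit this kind of work because most of the time is spent waiting on the network.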
