Archive all webpages on a website that are not already archived by archive.org


apurvmishra99/archiver


Archive

A simple Python script to generate a sitemap of a given website and archive all the pages not already stored in the Wayback Machine. This is now available as an API as well!

Check out the documentation here
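The core idea the script automates can be sketched against the Wayback Machine's public availability API, which reports the closest archived snapshot for a URL. This is an illustrative sketch, not the script's actual code; the helper names are made up, but the archive.org endpoints are the documented public ones.

```python
import urllib.parse


def availability_url(page_url):
    """Build the archive.org availability-API query for a page.

    The API returns JSON describing the closest archived snapshot,
    e.g. {"archived_snapshots": {"closest": {"available": true, ...}}}.
    """
    return ("https://archive.org/wayback/available?url="
            + urllib.parse.quote(page_url, safe=""))


def latest_snapshot(api_response):
    """Return the closest snapshot URL from an availability response,
    or None if the page has never been archived."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None

# Pages with no snapshot (latest_snapshot(...) is None) can then be
# submitted to https://web.archive.org/save/<url> to be archived.
```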

Setup

$ git clone https://github.com/apurvmishra99/archiver.git
$ cd archiver
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

Usage

Usage: archive.py [OPTIONS] URL

Options:
  -m, --max_urls INTEGER  The max number of urls to collect. The default value
                          is 50. Use 0 to set it as infinite.
  -d, --days INTEGER      The minimum age (in days) an existing copy of a page
                          must reach before the page is archived again. The
                          default value is 7 days. Use 0 to archive all pages
                          again.
  --help                  Show this message and exit.
  

Example

$ python archive.py --days=7 --max_urls=50 https://apurvmishra.xyz
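The `--days` check boils down to comparing a Wayback snapshot timestamp (a `YYYYMMDDhhmmss` string in the availability response) against the current time. A minimal sketch of that comparison, with illustrative function names that are not the script's own:

```python
from datetime import datetime, timezone


def snapshot_age_days(timestamp, now):
    """Age in days of a Wayback timestamp like '20200101000000' (UTC)."""
    taken = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return (now - taken).total_seconds() / 86400


def should_rearchive(timestamp, days, now):
    """Re-archive when the existing copy is older than `days`.
    days == 0 means archive every page again, matching the CLI option."""
    return days == 0 or snapshot_age_days(timestamp, now) > days
```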

Alternative use

If you just want to scrape all the internal links on the website and write them to a txt file, you can use scrape_all_internal_links.py

Usage

Usage: scrape_all_internal_links.py [OPTIONS] URL

Options:
  --max_urls INTEGER  The max number of urls to collect. Use 0 to set it as
                      infinite.
  --help              Show this message and exit.

Example

$ python scrape_all_internal_links.py --max_urls=50 https://apurvmishra.xyz
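Collecting internal links like this can be done with the standard library alone: parse each page's anchor tags, resolve relative URLs, and keep only those on the same domain. A rough sketch (the class name is illustrative, not the script's actual implementation):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class InternalLinkParser(HTMLParser):
    """Collect <a href> targets that stay on the start page's domain."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.domain = urlparse(base_url).netloc
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links, then keep only same-domain URLs.
        absolute = urljoin(self.base_url, href)
        if urlparse(absolute).netloc == self.domain:
            self.links.add(absolute)
```

Each collected link can then be written out one per line to the txt file.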

TODO

  • Package the script
  • Convert to async
  • Add command line option to just generate the sitemap

Tested On

Pop!_OS 20.04 LTS
Python v3.7.6
