not-quite-daily-scraper

A simple scraper for simple webcomics.

Download comic images and author comments with ease, utilizing a simple config file.

Output either just images or HTML pages with built-in navigation.

Several example configurations are included.

Currently under development. Developed for Windows.

Setup

Download repo

Install the following dependencies:

selenium (pip3 install selenium)

Download the following dependencies and add them to your PATH (Windows):

geckoDriver (https://github.com/mozilla/geckodriver/releases)
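A quick way to confirm that geckodriver is reachable is to look it up on PATH from Python. This is only a sketch: the helper name `find_on_path` is hypothetical and is not part of the scraper.

```python
import shutil


def find_on_path(executable="geckodriver"):
    """Return the full path of `executable` if it is on PATH, else None."""
    return shutil.which(executable)


if __name__ == "__main__":
    location = find_on_path()
    if location:
        print(f"geckodriver found at {location}")
    else:
        print("geckodriver is not on PATH; Selenium may fail to start Firefox")
```

If the lookup returns None, re-check the folder you added to PATH and open a new terminal so the change takes effect.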

Usage

Create and fill out a config file for each comic you want to scrape, using config.txt as a template. (If you use the HTML option, it is advisable to start at the most recent page and iterate backwards. This allows the script to link each page to the next in the correct order.)
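To illustrate why iterating backwards helps the HTML option: when a page is written, the page that follows it has already been saved, so its file name is known and can be linked. The sketch below is an assumption about how such a page could be built, not the script's actual template; `build_page` and its parameters are hypothetical names.

```python
from html import escape


def build_page(title, image_file, comment="", next_file=None):
    """Return a minimal HTML page for one comic, with a forward link if known."""
    parts = [
        "<!DOCTYPE html>",
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title></head>',
        "<body>",
        f"<h1>{escape(title)}</h1>",
        f'<img src="{escape(image_file)}" alt="{escape(title)}">',
    ]
    if comment:
        # Author comments are escaped so stray markup cannot break the page.
        parts.append(f"<p>{escape(comment)}</p>")
    if next_file:
        # When scraping newest-to-oldest, the following page already exists.
        parts.append(f'<a href="{escape(next_file)}">Next</a>')
    parts.append("</body></html>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_page("Page 2", "002.png", "Author note", next_file="003.html"))
```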

Settings/Configurables include:

Output Path (leave blank to use project directory)
Create Subfolder? (if enabled, creates subdir based on comic name)
Download Comments?
Download Image?
Image Naming (choose to base on the comic title, the original image name, or both)
Run Headless? (run without showing the browser window; disable this for troubleshooting)
Comic Name
comicStartPage (XPath†, relative or absolute)
imageTitlePath (XPath†, relative or absolute)
nextButtonPath (XPath†, relative or absolute)
nextButtonType (accepts either link or javaClick, representing whether the button has an href or must be clicked)
imagePath (XPath†, relative or absolute)
commentPath (XPath†, relative or absolute)
initialClick (XPath†, relative or absolute; runs once at start, and can be used to start at the homepage and then click the link to the latest page)

† A good guide to getting the XPath of page elements: https://stackoverflow.com/a/42194160
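The settings above could be held in a small validated structure once they are read from the config file. This sketch does not parse config.txt (its exact format is not shown here). The snake_case field names, the defaults, and the `image_naming` values are assumptions made for illustration. Only the two `nextButtonType` values come from the list above.

```python
from dataclasses import dataclass

# The two button types named in the settings list.
NEXT_BUTTON_TYPES = {"link", "javaClick"}


@dataclass
class ComicConfig:
    """Hypothetical in-memory form of one comic's settings."""
    comic_name: str
    comic_start_page: str
    next_button_path: str
    next_button_type: str = "link"
    image_path: str = ""
    image_title_path: str = ""
    comment_path: str = ""
    initial_click: str = ""
    output_path: str = ""          # blank means the project directory
    create_subfolder: bool = True
    download_comments: bool = False
    download_images: bool = True
    image_naming: str = "both"     # assumed values: "title", "original", "both"
    run_headless: bool = True

    def __post_init__(self):
        # Fail early on a mistyped button type instead of mid-scrape.
        if self.next_button_type not in NEXT_BUTTON_TYPES:
            raise ValueError(
                f"nextButtonType must be one of {sorted(NEXT_BUTTON_TYPES)}, "
                f"got {self.next_button_type!r}"
            )


if __name__ == "__main__":
    cfg = ComicConfig(
        comic_name="Example Comic",
        comic_start_page="https://example.com/",
        next_button_path="//a[@rel='prev']",
    )
    print(cfg)
```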

Then run:

python scrape.py My-Config.txt

Notes

This tool is intended to be flexible, such that it can be used in a variety of ways.
My primary usage is as follows:
- Set comicStartPage to the homepage of the comic
- Use initialClick to navigate to the newest page, via the link on the homepage
- Set nextButtonPath to the 'previous page' button so the script iterates backwards over the comic
- Choose comments and images+pages to generate nice HTML pages with author comments and navigation
Alternatively, you could set comicStartPage to page one and nextButtonPath to the next button, then iterate forward and save only images.
If you are having trouble with page element XPaths, try both relative and absolute.
If the comic pages have a date block, it can be used in conjunction with the image filename for nice naming.
Some sites (such as Ava's Demon) have JavaScript buttons; that's what nextButtonType is for.
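The difference between the two nextButtonType values can be sketched as follows. This is an illustration under assumptions, not the script's actual code: `advance` is a hypothetical helper, and `driver` is expected to behave like a Selenium WebDriver (`find_elements`, `get`), with elements offering `get_attribute` and `click`.

```python
def advance(driver, next_button_path, next_button_type):
    """Move the browser to the next page; return False if that is not possible."""
    # The string "xpath" is the value of Selenium's By.XPATH constant.
    buttons = driver.find_elements("xpath", next_button_path)
    if not buttons:
        return False
    button = buttons[0]
    if next_button_type == "link":
        # The button carries an href, so load that address directly.
        href = button.get_attribute("href")
        if not href:
            return False
        driver.get(href)
    elif next_button_type == "javaClick":
        # No href to follow; the page's own script handles the click.
        button.click()
    else:
        raise ValueError(f"unknown nextButtonType: {next_button_type!r}")
    return True
```

Using `find_elements` rather than `find_element` means a missing button yields an empty list instead of an exception, which gives a natural stopping condition at the last page.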

Pseudocode

Check if config file exists
Gather variables from config file
Build output directory path, and create it if it does not exist
Configure and open (headless) Firefox instance with Selenium
Check if starting page is valid
If user requested it, click the provided initialClick element
Loop through pages, setting each new page based on the 'next' link
- Gather relevant page elements
- Extract data from page elements
- Compose output paths and names
- Sanitise paths and names
- Save image (if requested)
- Save comment (if requested)
- Build HTML page with navigation links (if requested)
- Record the current/previous page
- Click the next button to proceed to new page
- Check against list of visited pages to confirm we aren't stuck in a loop
....Profit!
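Two of the steps above, sanitising names and guarding against loops, can be sketched without a browser. The function names, the replacement character, and the `max_pages` cap are assumptions for illustration. `get_next_url` stands in for whatever the script does to find the following page.

```python
import re


def sanitise_name(name, replacement="_"):
    """Replace characters that Windows does not allow in file names."""
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', replacement, name)
    # Windows also rejects names ending in a space or a dot.
    cleaned = cleaned.strip(" .")
    return cleaned or "untitled"


def walk_pages(start_url, get_next_url, max_pages=10000):
    """Yield page URLs from start_url until there is no next page or a repeat."""
    visited = set()
    url = start_url
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        yield url
        url = get_next_url(url)


if __name__ == "__main__":
    # A tiny fake site where the last page links back to the first.
    links = {"page3": "page2", "page2": "page1", "page1": "page3"}
    for page in walk_pages("page3", links.get):
        print(page, "->", sanitise_name(page + ": title?") + ".png")
```

Keeping visited URLs in a set makes the repeat check cheap, and the walk stops as soon as a page points back to one already seen.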
