Pixiv Crawler

This document describes the design of the Pixiv crawler. It may be outdated.

Design

  • Notations

  • Pipeline design

    Each stage collects one kind of URL (first artwork URLs, then image URLs) and passes its results to the next stage.

  • High modularity and low coupling

    For example, if you already have the image URLs (e.g., exported with Pxer), you can pass them directly to the downloader and skip the earlier stages.

  • Modules

    graph LR;
        F[Start]-->A;
        A<==Run parallel==>A;
        A[Crawler]--Send artwork_url-->B[Collector];
        B<==Run parallel==>B;
        B--Send image url-->D[Downloader];
        D<==Run parallel==>D;
        D-->E[End];
    
    pixiv_crawler
    │   config.py
    │   utils.py
    │
    ├───collector
    │   │   collector.py
    │   │   collector_unit.py
    │   └───selectors.py
    │
    ├───crawlers
    │   │   bookmark_crawler.py
    │   │   keyword_crawler.py
    │   │   ranking_crawler.py
    │   └───users_crawler.py
    │
    └───downloader
        │   downloader.py
        └───download_image.py
    
    • collector/collector_unit.py: Collects artwork_url and image_url.

      Pass a different selector to extract different data.

    • collector/selectors.py: Functions that select different data from JSON or HTML.

    • crawlers/*: Implement different crawlers for different purposes.

    • downloader/download_image.py: Download images from image_url.
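
The stages and modules above can be sketched as follows. This is a minimal, offline illustration rather than the project's actual code: `FAKE_RESPONSES`, the JSON shapes, and the function names `select_artwork_urls`, `select_image_urls`, `collect`, `run_stage`, `download`, and `run_pipeline` are all assumptions made up for the example, and a thread pool stands in for whatever parallelism each stage really uses.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical stand-in for the network: maps a URL to a response body.
# The JSON layout is invented for this sketch and is NOT Pixiv's real API.
FAKE_RESPONSES: Dict[str, str] = {
    "ranking?page=1": json.dumps({"contents": [{"illust_id": 1}, {"illust_id": 2}]}),
    "artworks/1": json.dumps({"body": [{"urls": {"original": "img/1_p0.png"}}]}),
    "artworks/2": json.dumps({"body": [{"urls": {"original": "img/2_p0.png"}},
                                       {"urls": {"original": "img/2_p1.png"}}]}),
}

# A selector turns a response body into a set of extracted strings.
Selector = Callable[[str], Set[str]]


def select_artwork_urls(text: str) -> Set[str]:
    """Selector (hypothetical): extract artwork URLs from a listing page."""
    return {f"artworks/{item['illust_id']}" for item in json.loads(text)["contents"]}


def select_image_urls(text: str) -> Set[str]:
    """Selector (hypothetical): extract image URLs from an artwork page."""
    return {page["urls"]["original"] for page in json.loads(text)["body"]}


def collect(url: str, selector: Selector) -> Set[str]:
    """Collector unit: fetch one URL and apply the selector passed in."""
    return selector(FAKE_RESPONSES[url])


def run_stage(urls: Iterable[str], selector: Selector, workers: int = 4) -> Set[str]:
    """Run the collector unit over many URLs in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda u: collect(u, selector), urls)
        return set().union(*results)


def download(image_urls: Iterable[str], workers: int = 4) -> List[str]:
    """Downloader: this sketch only reports what it would save."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sorted(pool.map(lambda u: f"saved {u}", image_urls))


def run_pipeline(start_urls: Iterable[str]) -> List[str]:
    """Crawler -> Collector -> Downloader, each stage feeding the next."""
    artwork_urls = run_stage(start_urls, select_artwork_urls)
    image_urls = run_stage(artwork_urls, select_image_urls)
    return download(image_urls)


if __name__ == "__main__":
    for line in run_pipeline(["ranking?page=1"]):
        print(line)
```

Because each stage only consumes a collection of URLs, a stage can be skipped: if the image URLs are already known, calling `download(image_urls)` directly bypasses the crawler and collector.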

Appendix

  • pixiv.net/robots.txt

    User-agent: *
    Disallow: /cdn-cgi/
    Disallow: /rpc/index.php?mode=profile_module_illusts&user_id=*&illust_id=*
    Disallow: /ajax/illust/*/recommend/init
    Disallow: *return_to*
    Disallow: /?return_to=
    Disallow: /login.php?return_to=
    Disallow: /index.php?return_to=
    
    Disallow: /artworks/unlisted/*
    
    Disallow: /users/*/followers
    Disallow: /users/*/mypixiv
    Disallow: /users/*/bookmarks
    Disallow: /novel/comments.php?id=
    Disallow: /novels/unlisted/*
    
    Disallow: /en/group
    
    Disallow: /en/search/
    
    Disallow: /en/users/*/followers
    Disallow: /en/users/*/mypixiv
    Disallow: /en/users/*/bookmarks
    Disallow: /en/novel/comments.php?id=
    
    Disallow: /fanbox/search
    Disallow: /fanbox/tag
    
    Allow: /comic-indies/$
    Allow: /comic-indies/about
    Disallow: /comic-indies/
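
A crawler may want to check a path against these rules before requesting it. The following is a minimal sketch using Python's standard `urllib.robotparser`, assuming a small subset of the listing above. To my understanding the standard parser matches rules by plain path prefix, so the wildcard rules (those containing `*` or `$`) are left out here. `RULES`, `build_parser`, and `is_allowed` are hypothetical names for illustration.

```python
from urllib.robotparser import RobotFileParser

# Subset of the rules above, limited to plain prefix rules (an assumption
# of this sketch; wildcard rules are deliberately omitted).
RULES = """\
User-agent: *
Disallow: /cdn-cgi/
Disallow: /en/search/
Disallow: /fanbox/search
Allow: /comic-indies/about
Disallow: /comic-indies/
"""


def build_parser(rules: str) -> RobotFileParser:
    """Parse robots.txt text held in memory, without any network access."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, path: str, agent: str = "*") -> bool:
    """Return True if the given path may be fetched by the given user agent."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    parser = build_parser(RULES)
    for path in ["/artworks/123", "/en/search/cats", "/comic-indies/about"]:
        print(path, is_allowed(parser, path))
```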