
Collect and process data from various sources: websites, JS websites, and APIs. Run the scraping pipelines via Celery and a Travis cron task. Dump the scraped data to Slack.



Collection of scraper pipelines built for different purposes



  • Architecture idea
  • Asynchronous tasks
    • Celery client : Flask <---> Celery client <---> Celery worker. Connects Flask to the Celery tasks and issues the commands that run them.
    • Celery worker : A process that runs tasks in the background, either scheduled (periodic) tasks or asynchronous tasks triggered by an API call.
    • Message broker : Celery client <-- Message broker --> Celery worker. The Celery client communicates with the Celery worker through the message broker. Redis is used as the message broker here.
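
The client / broker / worker split described above can be illustrated without Celery or Redis. The following is only an in-process analogy built on the standard library: `queue.Queue` stands in for the Redis message broker, a thread stands in for the Celery worker, and the names `TASKS`, `send_task`, `worker`, and `results` are hypothetical, not taken from this project.

```python
import queue
import threading

# Hypothetical task registry, standing in for functions registered as Celery tasks.
TASKS = {
    "add": lambda x, y: x + y,
    "multiply": lambda x, y: x * y,
}

broker = queue.Queue()  # stands in for the Redis message broker
results = {}            # stands in for a result backend


def send_task(task_id, name, args):
    """Client side: put a message on the broker and return immediately."""
    broker.put((task_id, name, args))


def worker():
    """Worker side: pull messages off the broker and execute them."""
    while True:
        message = broker.get()
        if message is None:  # sentinel value: stop the worker
            broker.task_done()
            break
        task_id, name, args = message
        results[task_id] = TASKS[name](*args)
        broker.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

send_task("t1", "add", [1, 2])
send_task("t2", "multiply", [3, 5])
broker.put(None)  # ask the worker to stop after the queued tasks
broker.join()     # block until every message has been processed
thread.join()

print(results)
```

The point of the sketch is that the client never calls the task function directly: it only hands a message to the broker, and the worker decides when to run it. In the real setup, Celery and Redis play these roles across separate processes.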

Quick Start

Quick start via docker
# Run via docker 
$ cd ~ && git clone
$ cd ~ && cd web_scraping &&  docker-compose -f  docker-compose.yml up 
Quick start manually
# Run manually 

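# STEP 1) Open a terminal and run the Celery worker locally
$ cd ~ && cd web_scraping/celery_queue
# worker: runs tasks triggered by API calls
$ celery -A tasks worker --loglevel=info
# beat: schedules the cron (periodic) tasks; run it in a separate terminal, because the worker stays in the foreground
$ celery -A tasks beat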

# STEP 2) Run the Redis server locally (in another terminal)
# make sure Redis is already installed
$ redis-server

# STEP 3) Run Flower (in another terminal)
$ cd ~ && cd web_scraping/celery_queue
$ celery flower -A tasks --address= --port=5555

# STEP 4) Submit sample tasks through the Flower API
# "add" task
$ curl -X POST -d '{"args":[1,2]}' http://localhost:5555/api/task/async-apply/tasks.add

# "multiply" task
$ curl -X POST -d '{"args":[3,5]}' http://localhost:5555/api/task/async-apply/tasks.multiply

# "scrape_task" task
$ curl -X POST   http://localhost:5555/api/task/async-apply/tasks.scrape_task

# "scrape_task_api" task
$ curl -X POST -d '{"args":["mlflow","mlflow"]}' http://localhost:5555/api/task/async-apply/tasks.scrape_task_api

# "indeed_scrap_task" task
$ curl -X POST  http://localhost:5555/api/task/async-apply/tasks.indeed_scrap_task

# "indeed_scrap_api_V1" task
$ curl -X POST -d '{"args":["New+York"]}' http://localhost:5555/api/task/async-apply/tasks.indeed_scrap_api_V1
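
The curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library, assuming Flower is reachable at `http://localhost:5555` as in the commands above. The helper names `build_async_apply_request` and `submit` are hypothetical and not part of this project; tasks that take no arguments are sent with an empty `args` list.

```python
import json
import urllib.request

FLOWER_URL = "http://localhost:5555"  # Flower address used in the Quick Start


def build_async_apply_request(task_name, args=(), base_url=FLOWER_URL):
    """Build (but do not send) a POST request for the async-apply endpoint."""
    body = json.dumps({"args": list(args)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/task/async-apply/{task_name}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(task_name, args=(), base_url=FLOWER_URL):
    """Send the request; needs Flower, Redis, and a Celery worker running."""
    request = build_async_apply_request(task_name, args, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds and prints the request, so it runs without any server.
    request = build_async_apply_request("tasks.add", [1, 2])
    print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

With the services from steps 1 to 3 running, `submit("tasks.scrape_task_api", ["mlflow", "mlflow"])` would correspond to the "scrape_task_api" curl call.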

File structure

├── Dockerfile
├── api                   : Celery API (broker, Flask job receiver)
│   ├── Dockerfile        : Dockerfile that builds the Celery API
│   ├──            : Flask server that accepts job requests (API)
│   ├── requirements.txt
│   └──         : Celery broker and Celery backend (Redis)
├── celery-queue          : Runs the main web scraping jobs (via Celery)
│   ├── Dockerfile        : Dockerfile that builds celery-queue
│   ├── IndeedScrapper    : Scrapper scrape 
│   ├── requirements.txt
│   └──          : Celery tasks that run the scraping jobs
├── docker-compose.yml    : docker-compose file that builds the whole system: api, celery-queue, Redis, and Flower (Celery job monitor)
├── legacy_project        
├── logs                  : Saved run logs
├── output                : Saved scraped data
├── requirements.txt
└── : Script that auto-pushes output to GitHub via Travis


# Run unit tests, option 1: pytest
$ pytest -v tests/
# ================================== test session starts ==================================
# platform darwin -- Python 3.6.4, pytest-5.0.1, py-1.5.2, pluggy-0.12.0 -- /Users/jerryliu/anaconda3/envs/yen_dev/bin/python
# cachedir: .pytest_cache
# rootdir: /Users/jerryliu/web_scraping
# plugins: cov-2.7.1, celery-4.3.0
# collected 10 items                                                                      
# tests/ PASSED                                          [ 10%]
# tests/ PASSED                                   [ 20%]
# tests/ PASSED                                    [ 30%]
# tests/ PASSED                                  [ 40%]
# tests/ PASSED                                 [ 50%]
# tests/ PASSED                                   [ 60%]
# tests/ PASSED                                      [ 70%]
# tests/ PASSED                                      [ 80%]
# tests/ PASSED                                  [ 90%]
# tests/ PASSED                                [100%]

# Run unit tests, option 2: run a test file directly
$ python tests/  -v
# test_addition (__main__.TestAddTask) ... ok
# test_task_state (__main__.TestAddTask) ... ok
# test_multiplication (__main__.TestMultiplyTask) ... ok
# test_task_state (__main__.TestMultiplyTask) ... ok
# ----------------------------------------------------------------------
# Ran 4 tests in 0.131s
# OK


  • Celery : Python task queue for running tasks in parallel or in a single thread (Celery broker/worker)
  • Redis : key-value store that holds task data (used here as the message broker)
  • Flower : web UI for monitoring Celery tasks
  • Flask : lightweight Python web framework, used here as the project's backend server
  • Docker : builds the app environment


### Project level

0. Deploy to Heroku and expose the scraper as an API service 
1. Dockerize the project 
2. Run the scraping (cron/parallel) jobs via Celery 
3. Add tests (unit/integration tests) 
4. Design a DB model that stores scraped data systematically 

### Programming level 

1. Add utility scripts that can get the XPath of all objects in an HTML page
2. Workflow that automates the whole process
3. Job management 
	- Multiprocessing
	- Asynchronous
	- Queue 
4. Scraping tutorial 
5. Scrapy, PhantomJS 

### Others 

1. Web scraping 101 tutorial 



