
ALERTWildfire Scraper and Tweet Monitor

A multi-pronged service built to collect training data for the USC research project "Early Fire Detection". It includes:

  • an ArangoDB instance that stores the URLs of ALERTWildfire's cameras (as collected by scripts/enumerator.py) and Tweets of interest as collected by the Tweet monitor
  • a distributed, asynchronous scraper that collects classic cam images from http://www.AlertWildfire.org and uploads a zip-compressed archive of the images to Google Drive after each full execution
  • a Tweet monitor that saves Tweets mentioning @AlertWildfire's Twitter account (potentially in regards to a wildfire) to the database
  • an asynchronous scraper that retrieves infrared cam images from http://beta.alertwildfire.org/infrared-cameras/ and uploads the images to Google Drive

ALERTWildfire

"ALERTWildfire is a network of over 900 specialized camera installations in California, Nevada, Idaho and Oregon used by first responders and volunteers to detect and monitor wildfires." - Nevada Today

Contents

  1. Prerequisites
  2. Run It
  3. ArangoDB
  4. Redis
  5. RabbitMQ
  6. Classic Scraper
  7. Infrared Scraper

Prerequisites

  1. Create a Twitter Developer account, start a new project, and set the SEARCHTWEETS_ENDPOINT, SEARCHTWEETS_BEARER_TOKEN, SEARCHTWEETS_CONSUMER_KEY, and SEARCHTWEETS_CONSUMER_SECRET environment variables in docker-compose.yml accordingly. See Twitter's step-by-step guide to making your first request to the new Twitter API v2.
  2. Create a Google Developer account, create a new project with the Google Drive API enabled (ensure that the scopes include read access to file metadata and write/file-upload access to Drive), authenticate a user outside of Docker (I used Google's quickstart; a modified version of it exists at scripts/gdrive-token-helper.py, and a minimal sketch follows this list), and set the PROJECT_ID, TOKEN, REFRESH_TOKEN, and GDRIVE_PARENT_DIR environment variables accordingly.
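
A minimal sketch of what such a helper might look like, based on Google's Drive API quickstart (the client-secrets file name and exact scopes are assumptions):

from google_auth_oauthlib.flow import InstalledAppFlow

# Read access to file metadata and write/file-upload access, per step 2 above
SCOPES = [
    "https://www.googleapis.com/auth/drive.metadata.readonly",
    "https://www.googleapis.com/auth/drive.file",
]

# credentials.json is the OAuth client-secrets file downloaded from the
# Google Cloud console (assumed name)
flow = InstalledAppFlow.from_client_secrets_file("credentials.json", SCOPES)
creds = flow.run_local_server(port=0)

# These values feed the TOKEN and REFRESH_TOKEN environment variables
print("TOKEN:", creds.token)
print("REFRESH_TOKEN:", creds.refresh_token)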

Run It

docker-compose build --parallel && docker-compose up -d

ArangoDB

ArangoDB database instance that stores all classic camera URLs (as collected by scripts/enumerator.py), infrared camera URLs, and Tweets from the Tweet Alerts monitor.

Technologies:

  • Docker
  • ArangoDB (latest)

Collections

cameras example:

{
  "url": "http://www.alertwildfire.org/orangecoca/index.html?camera=Axis-DeerCanyon1",
  "timestamp": "2021-08-24T20:51:37.433870",
  "axis": "orangecoca.Axis-DeerCanyon1"
}

tweets example:

{
  "id": "1430287078156234757",
  "text": "RT @CphilpottCraig: Evening timelapse 5:25-6:25pm #CaldorFire Armstrong Lookout camera. @AlertWildfire viewing North from South side of fir…",
  "scrape_timestamp": "2021-08-24T22:55:25.862109"
}

ir-cameras example:

{
  "axis": "Danaher_606Z_Thermal",
  "epoch": 1631050791,
  "url": "https://weathernode.net/img/flir/Danaher_606Z_Thermal_1631050791.jpg",
  "timestamp": "2021-09-09T18:54:53.195532"
}
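
For reference, a hedged sketch of writing one of the documents above with the python-arango driver (database name, credentials, and host are assumptions):

from datetime import datetime
from arango import ArangoClient

# Connect to the ArangoDB service (host and credentials are assumptions)
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("alertwildfire", username="root", password="password")

# Insert a camera document shaped like the cameras example above;
# utcnow().isoformat() matches the naive timestamp format shown
db.collection("cameras").insert({
    "url": "http://www.alertwildfire.org/orangecoca/index.html?camera=Axis-DeerCanyon1",
    "timestamp": datetime.utcnow().isoformat(),
    "axis": "orangecoca.Axis-DeerCanyon1",
})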

Redis

Celery result backend for the scraping app.

Technologies:

  • Docker
  • Redis (latest)

RabbitMQ

Celery broker for the scraping app.

Technologies:

  • Docker
  • RabbitMQ (latest)

rabbitmq.conf

RabbitMQ config file located at rabbitmq/myrabbit.conf. consumer_timeout is set to 1 hour (in milliseconds), 10 minutes longer than the per-task timeout (in seconds) explicitly set for each scraping task by the scraper's producer.

## Consumer timeout
## If a message delivered to a consumer has not been acknowledged before this timer
## triggers, the channel will be force closed by the broker. This ensures that
## faulty consumers that never ack will not hold on to messages indefinitely.
##
## Set to 1 hour in milliseconds
consumer_timeout = 3600000
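
The other half of that relationship, as a sketch: each Celery scraping task would carry an explicit 50-minute (3000 s) time limit, 600 seconds under the broker's consumer_timeout, so a stuck task is killed before RabbitMQ force-closes the channel (the app, broker/backend URLs, and task name are assumptions):

from celery import Celery

# Broker/backend URLs are assumptions matching the docker-compose service names
app = Celery("scraper",
             broker="amqp://guest:guest@rabbitmq:5672//",
             backend="redis://redis:6379/0")

# 3000 s hard limit = 50 minutes, 10 minutes under the 1-hour consumer_timeout
@app.task(time_limit=3000, soft_time_limit=2940)
def scrape_chunk(urls):
    ...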

Classic Scraper

Producer

Classic cameras image scraping queue producer. This process is invoked when a new Tweet mentioning @AlertWildfire's Twitter account is recognized; Tweets are queried every minute. If a camera is mentioned by name or axis in a Tweet's text, that camera is prioritized when scraping (see the sketch below).
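
A hedged sketch of that prioritization (function and task names, the queue argument, and the priority values are assumptions; per-message priorities are a RabbitMQ feature):

import re

def enqueue_cameras(tweet_text, known_axes, app, queue):
    # Cameras whose short name (the part after the dot in e.g.
    # "orangecoca.Axis-DeerCanyon1") appears in the tweet text
    mentioned = {
        axis for axis in known_axes
        if re.search(re.escape(axis.split(".")[-1]), tweet_text, re.IGNORECASE)
    }
    for axis in known_axes:
        # Higher priority for cameras mentioned in the tweet
        app.send_task(
            "tasks.scrape_camera",
            args=[axis],
            queue=queue,
            priority=9 if axis in mentioned else 0,
        )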

Technologies:

  • Docker
  • ArangoDB (latest)
  • Python 3.9
  • Redis (latest)
  • RabbitMQ (latest)
  • Twitter API v2

Environment Variables

  • RABBITMQ_HOST: RabbitMQ host
  • RABBITMQ_PORT: RabbitMQ port
  • RABBITMQ_DEFAULT_USER: RabbitMQ user
  • RABBITMQ_DEFAULT_PASS: RabbitMQ password
  • REDIS_HOST: Redis host
  • REDIS_PORT: Redis port
  • CONCURRENCY: integer number of concurrent celery tasks
  • DB_HOST: (arangodb) database host
  • DB_PORT: (arangodb) database port
  • DB_NAME: (arangodb) database name
  • DB_USER: (arangodb) database user
  • DB_PASS: (arangodb) database password
  • SEARCHTWEETS_ENDPOINT: Twitter Developer API endpoint
  • SEARCHTWEETS_BEARER_TOKEN: Twitter Developer API bearer token
  • SEARCHTWEETS_CONSUMER_KEY: Twitter Developer API key
  • SEARCHTWEETS_CONSUMER_SECRET: Twitter Developer API secret
  • CHUNK_SIZE: integer number of camera URLs to be retrieved by asynchronous HTTP requests per celery task
  • QUEUE: name of the queue to push tasks to

Logs

Logs are sent to stdout and stderr. This can be changed in classic-producer/conf/supervise-producer.conf.

Scraper (aka Consumer)

Distributed, asynchronous service that scrapes classic camera images from ALERTWildfire cameras (a sketch of the fetch step follows the technologies list below).

Technologies:

  • Docker
  • Python 3.9
  • Redis (latest)
  • RabbitMQ (latest)
  • Google Drive API
  • Free Proxyscrape API
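
A minimal sketch of the asynchronous fetch at the core of a consumer task, assuming aiohttp (the project may use a different HTTP client; proxy rotation via the Proxyscrape API is omitted):

import asyncio
import aiohttp

async def fetch(session, url):
    # Fetch one camera page; the 60 s per-request timeout is an assumption
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=60)) as resp:
        return url, await resp.read()

async def fetch_chunk(urls):
    # Fetch a CHUNK_SIZE-sized batch of camera URLs concurrently
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

# Inside a Celery task: results = asyncio.run(fetch_chunk(chunk))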

Environment Variables

  • RABBITMQ_HOST: RabbitMQ host
  • RABBITMQ_PORT: RabbitMQ port
  • RABBITMQ_DEFAULT_USER: RabbitMQ user
  • RABBITMQ_DEFAULT_PASS: RabbitMQ password
  • REDIS_HOST: Redis host
  • REDIS_PORT: Redis port
  • CONCURRENCY: integer number of concurrent celery tasks
  • LOGLEVEL: logging level (e.g. info)
  • QUEUE: name of the queue to retrieve tasks from
  • DB_HOST: (arangodb) database host
  • DB_PORT: (arangodb) database port
  • DB_NAME: (arangodb) database name
  • DB_USER: (arangodb) database user
  • DB_PASS: (arangodb) database password
  • CLIENT_ID: Twitter API client ID
  • CLIENT_SECRET: Twitter API client secret
  • PROJECT_ID: Google Drive API project ID
  • TOKEN: Google Drive API token
  • REFRESH_TOKEN: Google Drive API refresh token
  • GDRIVE_PARENT_DIR: ID of the Google Drive directory in which to save zip archives of the scraped images (see the sketch below)
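
A hedged sketch of the upload step implied by TOKEN, REFRESH_TOKEN, and GDRIVE_PARENT_DIR (the helper name and client-config wiring are assumptions):

from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

def upload_zip(path, token, refresh_token, client_id, client_secret, parent_dir):
    # Rebuild OAuth credentials from the environment-variable values
    creds = Credentials(
        token=token,
        refresh_token=refresh_token,
        token_uri="https://oauth2.googleapis.com/token",
        client_id=client_id,
        client_secret=client_secret,
    )
    service = build("drive", "v3", credentials=creds)
    # Place the zip archive inside the GDRIVE_PARENT_DIR folder
    metadata = {"name": path.split("/")[-1], "parents": [parent_dir]}
    media = MediaFileUpload(path, mimetype="application/zip")
    return service.files().create(body=metadata, media_body=media, fields="id").execute()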

Logs

Logs are sent to stdout and stderr. This can be changed in classic-scraper/conf/supervise-celery.conf.

Infrared Scraper

Producer

Infrared cameras image scraping queue producer.
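
A hedged sketch of how this producer could discover current infrared image URLs of the form stored in the ir-cameras collection above (the regex, and the assumption that image URLs appear verbatim in the page source, are both hypothetical):

import re
import requests

def list_ir_image_urls():
    # Fetch the infrared-cameras page and extract FLIR image URLs
    html = requests.get(
        "http://beta.alertwildfire.org/infrared-cameras/", timeout=30
    ).text
    return sorted(set(re.findall(r"https://weathernode\.net/img/flir/\S+?\.jpg", html)))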

Technologies:

  • Docker
  • ArangoDB (latest)
  • Python 3.9
  • Redis (latest)
  • RabbitMQ (latest)

Environment Variables

  • RABBITMQ_HOST: RabbitMQ host
  • RABBITMQ_PORT: RabbitMQ port
  • RABBITMQ_DEFAULT_USER: RabbitMQ user
  • RABBITMQ_DEFAULT_PASS: RabbitMQ password
  • REDIS_HOST: Redis host
  • REDIS_PORT: Redis port
  • CONCURRENCY: integer number of concurrent celery tasks
  • DB_HOST: (arangodb) database host
  • DB_PORT: (arangodb) database port
  • DB_NAME: (arangodb) database name
  • DB_USER: (arangodb) database user
  • DB_PASS: (arangodb) database password
  • QUEUE: name of the queue to push tasks to

Logs

Logs are sent to stdout and stderr. This can be changed in infrared-producer/conf/supervise-producer.conf.

Scraper (aka Consumer)

Distributed, asynchronous service that scrapes infrared camera images from ALERTWildfire cameras.

Technologies:

  • Docker
  • ArangoDB (latest)
  • Python 3.9
  • Redis (latest)
  • RabbitMQ (latest)
  • Google Drive API
  • Free Proxyscrape API

Environment Variables

  • RABBITMQ_HOST: RabbitMQ host
  • RABBITMQ_PORT: RabbitMQ port
  • RABBITMQ_DEFAULT_USER: RabbitMQ user
  • RABBITMQ_DEFAULT_PASS: RabbitMQ password
  • REDIS_HOST: Redis host
  • REDIS_PORT: Redis port
  • CONCURRENCY: integer number of concurrent celery tasks
  • LOGLEVEL: logging level (e.g. info)
  • QUEUE: name of the queue to retrieve tasks from
  • DB_HOST: (arangodb) database host
  • DB_PORT: (arangodb) database port
  • DB_NAME: (arangodb) database name
  • DB_USER: (arangodb) database user
  • DB_PASS: (arangodb) database password
  • CLIENT_ID: Twitter API client ID
  • CLIENT_SECRET: Twitter API client secret
  • PROJECT_ID: Google Drive API project ID
  • TOKEN: Google Drive API token
  • REFRESH_TOKEN: Google Drive API refresh token
  • GDRIVE_PARENT_DIR: ID of the Google Drive directory in which to save zip archives of the scraped images

Logs

Logs are sent to stdout and stderr. This can be changed in infrared-scraper/conf/supervise-celery.conf.
