Crawl config used to test URL Frontier on a large scale and produce WARCs for CommonCrawl.

DigitalPebble/crawlurlfrontier

Crawl with URLFrontier

Developed in the context of the Fed4Fire and NLNet funding of URL Frontier.

First, set the AWS credentials and the FQDN of the master node in a test.properties file, then build the project:

mvn clean package

Inject the seeds

java -cp ./target/crawlurlfrontier-1.0-SNAPSHOT.jar crawlercommons.urlfrontier.client.Client PutURLs -f top1M.hosts.commoncrawl

before submitting the topology using the storm command:

storm jar target/crawlurlfrontier-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux crawler.flux --filter test.properties
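
The three steps above (build, inject seeds, submit) can be chained in a small wrapper so that a failure in one step stops the rest. This is only a sketch: the `run` helper, the `crawl_pipeline` function and the `DRY_RUN` variable are hypothetical names introduced here for illustration and are not part of the repository. The commands themselves are copied from the steps above.

```shell
#!/usr/bin/env bash
# Sketch of a wrapper around the build, seed injection and topology submission steps.
# JAR and SEEDS mirror the paths used in the commands above.
JAR="target/crawlurlfrontier-1.0-SNAPSHOT.jar"
SEEDS="top1M.hosts.commoncrawl"

# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical function: run the steps in order, stopping at the first failure.
crawl_pipeline() {
  # The topology is submitted with --filter test.properties, so check for it first.
  if [ ! -f test.properties ]; then
    echo "test.properties not found" >&2
    return 1
  fi
  run mvn clean package &&
  run java -cp "./$JAR" crawlercommons.urlfrontier.client.Client PutURLs -f "$SEEDS" &&
  run storm jar "$JAR" org.apache.storm.flux.Flux crawler.flux --filter test.properties
}
```

Calling `DRY_RUN=1 crawl_pipeline` from the project directory prints the three commands without executing them, which is a cheap way to check paths before a real run.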

If the cluster is running on Docker, submit the topology from inside the nimbus container:

docker exec -it nimbus bash
cd crawler
storm jar target/crawlurlfrontier-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux crawler.flux
