wish: WARCHdfsBolt with CDX index #567

dportabella · 2018-05-05T18:45:46Z

StormCrawler lets you filter web pages and archive them into WARC files, as follows:

WARCHdfsBolt warcbolt = (WARCHdfsBolt) new WARCHdfsBolt().withFileNameFormat(fileNameFormat);

TopologyBuilder builder = new TopologyBuilder();

builder.setBolt("warc", warcbolt, numWorkers)
  .localOrShuffleGrouping("parse", WarcStreamName)
  .localOrShuffleGrouping("tika",  WarcStreamName);

Would it be possible to create a CDX index (or CDXJ index) for the WARC archives at the same time?


sebastian-nagel · 2018-05-08T08:39:49Z

Should be possible:

cf. WarcCdxWriter for Nutch
we should probably exclude most of the dependencies of webarchive-commons, which is required to get the SURT key
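
For readers unfamiliar with the SURT key: it is the sort-friendly form of a URL used as the lookup key in a CDX index, with the host labels reversed and comma-separated. The following is only a rough sketch using the JDK, not the webarchive-commons implementation. The class and method names are hypothetical, and it skips the canonicalization a real SURT implementation performs (for example stripping a leading "www" or normalizing the query string).

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, NOT the webarchive-commons API: a simplified SURT-style key.
class SurtSketch {

    /** Reverses the host labels, then appends ")" followed by the raw path and query. */
    static String surtKey(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]);
            if (i > 0) {
                key.append(',');
            }
        }
        key.append(')');
        String path = uri.getRawPath();
        // An empty path is treated as the root path.
        key.append(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(surtKey("http://www.example.com/a/b?x=1"));
    }
}
```

Note that this sketch keeps the "www" label, so its keys will not match those produced by webarchive-commons for such hosts.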

jnioche · 2018-05-08T10:34:04Z

Doable indeed.

@dportabella the warc bolt is usually connected to the fetcher, not the parsers

builder.setBolt("warc", warcbolt).localOrShuffleGrouping("fetch");

sebastian-nagel · 2019-09-23T14:27:34Z

Alternatively, the WARC bolt could add the WARC file name, record offset, and record length to the metadata. An indexer (CDX or anything else) could then store them directly, which removes the need to index the CDX files in a separate step.
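
To illustrate the idea, here is a rough sketch of how an indexer might turn those three metadata values into a CDXJ-style line. Everything in it is an assumption: the metadata keys (`warc.file.name`, `warc.record.offset`, `warc.record.length`) are hypothetical and not defined by StormCrawler, a plain `Map` stands in for the tuple metadata, and the sort key is passed in precomputed.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Map;

// Hypothetical sketch: formats one CDXJ-style index line from per-record metadata.
class CdxjLineSketch {

    // Assumed metadata keys; the WARC bolt does not necessarily use these names.
    static final String KEY_FILE = "warc.file.name";
    static final String KEY_OFFSET = "warc.record.offset";
    static final String KEY_LENGTH = "warc.record.length";

    // 14-digit UTC timestamp as commonly used in CDX indexes.
    private static final DateTimeFormatter TIMESTAMP =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss").withZone(ZoneOffset.UTC);

    /** Builds "<sort key> <timestamp> <JSON block>" for a single capture. */
    static String cdxjLine(String sortKey, String url, Instant fetched, Map<String, String> md) {
        return sortKey + " " + TIMESTAMP.format(fetched)
                + " {\"url\": \"" + escape(url) + "\""
                + ", \"filename\": \"" + escape(md.get(KEY_FILE)) + "\""
                + ", \"offset\": \"" + escape(md.get(KEY_OFFSET)) + "\""
                + ", \"length\": \"" + escape(md.get(KEY_LENGTH)) + "\"}";
    }

    /** Minimal JSON string escaping (backslash and double quote only). */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        Map<String, String> md = Map.of(
                KEY_FILE, "crawl-0001.warc.gz",
                KEY_OFFSET, "1024",
                KEY_LENGTH, "512");
        System.out.println(cdxjLine("com,example)/", "http://example.com/",
                Instant.parse("2019-09-23T14:27:34Z"), md));
    }
}
```

With the file name, offset, and length available, a reader can seek straight to the record in the WARC file, which is all a replay tool needs from the index.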

jnioche added the wish label May 8, 2018

jnioche added the warc label Aug 23, 2018

sebastian-nagel mentioned this issue Sep 23, 2019

Implement WARC spout #755

Closed

sebastian-nagel mentioned this issue Jun 21, 2020

WARC spout to emit captures into topology (implements #755) #799

Merged

sebastian-nagel mentioned this issue Feb 26, 2023

WARCHdfsBolt forwarding WARC file path to StatusUpdaterBolt #1044

Open
