wish: WARCHdfsBolt with CDX index #567

Open
dportabella opened this issue May 5, 2018 · 3 comments


@dportabella

StormCrawler can filter web pages and archive them into WARC files, as follows:

WARCHdfsBolt warcbolt = (WARCHdfsBolt) new WARCHdfsBolt().withFileNameFormat(fileNameFormat);

TopologyBuilder builder = new TopologyBuilder();

builder.setBolt("warc", warcbolt, numWorkers)
  .localOrShuffleGrouping("parse", WarcStreamName)
  .localOrShuffleGrouping("tika",  WarcStreamName);

Would it be possible to create a CDX index (or JCDX index) for the WARC archives at the same time?

@sebastian-nagel
Contributor

Should be possible:

@jnioche jnioche added the wish label May 8, 2018
@jnioche
Contributor

jnioche commented May 8, 2018

Doable indeed.

@dportabella the WARC bolt is usually connected to the fetcher, not the parsers:

builder.setBolt("warc", warcbolt).localOrShuffleGrouping("fetch");

@sebastian-nagel
Contributor

Alternatively, the WARC bolt could add the WARC file name, record offset and record length to the metadata. An indexer (CDX or anything else) could then store these values directly, which removes the need to index the CDX files in a separate step.
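
To illustrate the idea, here is a minimal sketch of how an indexer might turn such metadata into one line of the common 11-field CDX format (`N b a m s k r M S V g`). This is not existing StormCrawler code: the class `CdxLineSketch`, the method `toCdxLine` and the metadata keys `warc.file.name`, `warc.record.offset` and `warc.record.length` are hypothetical names chosen for the example, and the URL key is assumed to be computed elsewhere.

```java
import java.util.HashMap;
import java.util.Map;

public class CdxLineSketch {

    // Hypothetical metadata keys; the WARC bolt would have to set them.
    static final String FILE_KEY = "warc.file.name";
    static final String OFFSET_KEY = "warc.record.offset";
    static final String LENGTH_KEY = "warc.record.length";

    /**
     * Builds one space-separated CDX line in the field order
     * urlkey, timestamp, url, mime, status, digest, redirect,
     * meta tags, record length, record offset, WARC file name.
     * Missing metadata values are written as "-".
     */
    public static String toCdxLine(String urlKey, String timestamp, String url,
                                   String mime, int status, String digest,
                                   Map<String, String> metadata) {
        String file = metadata.getOrDefault(FILE_KEY, "-");
        String offset = metadata.getOrDefault(OFFSET_KEY, "-");
        String length = metadata.getOrDefault(LENGTH_KEY, "-");
        // Redirect and meta-tag fields are not tracked in this sketch.
        return String.join(" ", urlKey, timestamp, url, mime,
                Integer.toString(status), digest, "-", "-",
                length, offset, file);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(FILE_KEY, "crawl-0001.warc.gz");
        metadata.put(OFFSET_KEY, "5678");
        metadata.put(LENGTH_KEY, "1234");

        System.out.println(toCdxLine("com,example)/", "20180508120000",
                "http://example.com/", "text/html", 200, "SHA1DIGEST", metadata));
    }
}
```

With the three values available in the metadata, any downstream bolt could write them to a CDX file, a search index or a database without re-reading the WARC files.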
