wish: WARCHdfsBolt with CDX index #567
Comments
Should be possible:
Doable indeed. @dportabella, the WARC bolt is usually connected to the fetcher, not the parsers.
Closed
Alternatively, the WARC bolt could add the WARC file name, record offset, and record length to the metadata. An indexer (CDX or anything else) could then store them directly, which removes the need to index the CDX files in a separate step.
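To illustrate that last suggestion, here is a minimal, language-neutral sketch in Python of the bookkeeping involved. It is not StormCrawler code: the metadata keys, the file name, and the helper functions are all hypothetical, and the payloads are placeholder bytes rather than real WARC records. It assumes each record is written as its own gzip member, so that a stored offset and length are enough to read a single record back.

```python
import gzip
import io


def append_record(out, payload: bytes) -> tuple[int, int]:
    """Append payload as its own gzip member and return (offset, length)."""
    offset = out.tell()                  # where this record starts in the file
    compressed = gzip.compress(payload)  # one gzip member per record
    out.write(compressed)
    return offset, len(compressed)       # length of the compressed member


def location_metadata(filename: str, offset: int, length: int) -> dict[str, str]:
    # Hypothetical metadata keys, chosen for this sketch only.
    return {
        "warc.file.name": filename,
        "warc.record.offset": str(offset),
        "warc.record.length": str(length),
    }


def read_record(data: bytes, metadata: dict[str, str]) -> bytes:
    """Fetch one record using only its stored offset and length."""
    start = int(metadata["warc.record.offset"])
    end = start + int(metadata["warc.record.length"])
    return gzip.decompress(data[start:end])


# Placeholder payloads standing in for WARC records.
pages = {
    "http://example.com/a": b"first page body",
    "http://example.com/b": b"second page body",
}

warc = io.BytesIO()  # stands in for the archive file being written
index = {}           # stands in for whatever the indexer stores
for url, payload in pages.items():
    offset, length = append_record(warc, payload)
    index[url] = location_metadata("example-00000.warc.gz", offset, length)
```

With the location stored next to each URL, a reader can slice `length` bytes starting at `offset` from the named file and decompress just that record, which is the lookup a CDX index exists to serve.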
StormCrawler allows filtering web pages and archiving them into WARC archives, as follows:
Would it be possible to create a CDX index (or JCDX index) for the WARC archives at the same time?