Consider an option to generate WACZ files after a crawl is done for better replay with ReplayWeb.page #179

Open
ikreymer opened this issue Feb 21, 2021 · 1 comment

Comments


ikreymer commented Feb 21, 2021

It would be helpful for folks using grab-site and then replaying via replayweb.page to have grab-site generate a WACZ file after the crawl is done. (This workflow is mentioned in webrecorder/replayweb.page#6)

WACZ (https://github.com/webrecorder/wacz-format) provides a way to package the WARC, CDX and an optional page list into a single file (a zip file) such that it can be loaded quickly for replay.
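Since a WACZ is a plain zip file, its contents can be checked with standard tooling. Here is a minimal sketch using only the Python standard library; the helper name `list_wacz_members` is hypothetical and not part of any WACZ library. It groups the entries by top-level directory, which should make it easy to see whether the WARC data, the index, and the page list all ended up in the package.

```python
import zipfile
from collections import defaultdict


def list_wacz_members(path):
    """Group the file entries of a zip (e.g. a WACZ) by top-level directory.

    Hypothetical helper for illustration only. Entries stored at the root
    of the archive are grouped under the empty string.
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip explicit directory entries
            top, sep, _rest = name.partition("/")
            groups[top if sep else ""].append(name)
    return dict(groups)
```

Calling `list_wacz_members("crawl.wacz")` returns a dict mapping each top-level directory to the entries under it; `crawl.wacz` is a placeholder filename.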

The Python wacz library (https://pypi.org/project/wacz) can be used to create the WACZ package (https://github.com/webrecorder/wacz-format/tree/main/py-wacz)

I think grab-site should just be able to call the create command from:
https://github.com/webrecorder/wacz-format/blob/main/py-wacz/wacz/main.py#L19

It might make sense to pass in a page list, and there is an experimental option to do full-text extraction on pages as well.
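To make the idea concrete, here is a rough sketch of a post-crawl step that shells out to the `wacz` command-line tool instead of calling the library function directly. The helper names (`build_wacz_command`, `package_crawl`) are hypothetical, and the flags used (`-o` for the output file, `-p` for a page list, `-t` for text extraction) are assumptions about the installed version of the tool, so check them against `wacz create --help` first. It also assumes the crawl directory holds its WARCs as `*.warc.gz` files.

```python
import shutil
import subprocess
from pathlib import Path


def build_wacz_command(warc_paths, output, pages=None, text=False):
    """Build the argv for `wacz create` (flags are assumed, not verified)."""
    cmd = ["wacz", "create", "-o", str(output)]
    if pages is not None:
        cmd += ["-p", str(pages)]  # assumed flag: optional page list
    if text:
        cmd.append("-t")  # assumed flag: experimental full-text extraction
    cmd += [str(p) for p in warc_paths]
    return cmd


def package_crawl(crawl_dir, pages=None, text=False):
    """Package every *.warc.gz in crawl_dir into <crawl_dir>/<name>.wacz."""
    crawl_dir = Path(crawl_dir).resolve()
    warcs = sorted(crawl_dir.glob("*.warc.gz"))
    if not warcs:
        raise FileNotFoundError(f"no WARC files found in {crawl_dir}")
    if shutil.which("wacz") is None:
        raise RuntimeError("the wacz command is not installed")
    output = crawl_dir / f"{crawl_dir.name}.wacz"
    subprocess.run(build_wacz_command(warcs, output, pages, text), check=True)
    return output
```

Shelling out keeps grab-site decoupled from the library's internals while the API is still changing; calling the create function in-process would avoid the extra process but ties grab-site to that function's signature.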

The library is still new, so we can definitely make any changes needed to support integration!

Contributor

ivan commented Feb 23, 2021

grab-site currently doesn't really have anyone developing it (I just try to keep the install steps working), but I have no objections to the addition of WACZ support.
