is it possible to output regular files instead of warc? #228

Open
ftc2 opened this issue Apr 7, 2023 · 6 comments

Comments

@ftc2

ftc2 commented Apr 7, 2023

i only want files, not warc.

can grab-site output regular files (like html and images) for me like wget can? (links must be converted to relative links)
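
for reference, the wget behaviour i mean is roughly this (a sketch, with https://example.com/ as a placeholder target):

  • wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://example.com/

--convert-links rewrites links in the saved pages to point at the local copies, and --adjust-extension adds .html to files that need it.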

side question: has anyone here actually had good results getting files back out of warc? this wouldn't be such a big deal if that were reliable. i've never seen a util that can extract files from warcs with a 100% success rate (and it's usually insanely slow). a rough warcio sketch of what i mean follows the list below.

i've tried:

  • jwat-tools: seemed the best coded of the bunch but gave me nonsensical filenames like extracted.001, and idk how to get past that
  • warcat: slow and fails on many warcs
  • warc-extractor: the easiest of these to use (it can process a whole dir of warcs at once), but it's insanely slow, and it also fails on many warcs
  • the unarchiver: fails on some warcs
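
for the curious, here's a minimal sketch of pulling response payloads out of a warc with the warcio python library (not one of the tools above; the url-to-filename mapping is deliberately naive, which is exactly where real extractors run into trouble):

```python
from pathlib import Path
from urllib.parse import urlsplit

from warcio.archiveiterator import ArchiveIterator  # pip install warcio


def extract(warc_path: str, out_dir: str = "extracted") -> None:
    """Write the payload of every HTTP response record to a file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            # naive url -> path mapping: drops query strings and
            # silently overwrites duplicate urls
            parts = urlsplit(url)
            rel = parts.path.lstrip("/")
            if not rel or rel.endswith("/"):
                rel += "index.html"
            dest = Path(out_dir) / parts.netloc / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_bytes(record.content_stream().read())


extract("example.warc.gz")  # hypothetical input file
```
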
@TheTechRobo
Contributor

May be of interest to you: https://replayweb.page/ can load WARCs and allow you to browse them. It works best on websites that don't heavily rely on JavaScript.

I'd suggest using wpull on its own (grab-site is basically wpull tuned for easier crawling), but the current state of wpull outside of wrappers like this is awful. :/

@ftc2
Author

ftc2 commented Apr 7, 2023

thanks. i'm familiar with replayweb, but warc is really not for me.

i want the option to be able to do things like:

  • host the archive as static content on nginx
  • iterate over files to scrape content with certain tools

it's just easier for me to work with files.
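
for example, once it's plain files, even a quick local sanity check is trivial with python's built-in server (./mirror here is a stand-in for wherever the crawl output lands; nginx would be the real deployment):

  • python3 -m http.server 8080 --directory ./mirror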

tbh, i would just use wget, but i'm having problems with it staying logged in even when using the various cookie options. sigh

i've tried:

  • --load-cookies exported_from_firefox.txt --keep-session-cookies
  • --load-cookies exported_from_firefox.txt --keep-session-cookies --save-cookies exported_from_firefox.txt

neither works. any tips?

it's very frustrating because i've had luck using curl with the same cookie file like this:

  • --cookie exported_from_firefox.txt --cookie-jar exported_from_firefox.txt

but curl has no crawling functionality.

@TheTechRobo
Contributor

Does grab-site work despite the cookie issue?

Go into the exported_from_firefox.txt file and check for any #HttpOnly lines. Those are a common problem for cookies.txt parsers, since the prefix isn't part of any official specification. I've occasionally had luck removing the #HttpOnly from the beginning of the line (but don't remove the leading dot, I don't think), though your mileage may vary.
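
For reference, in the Netscape cookies.txt format those lines carry a #HttpOnly_ prefix (underscore included, as curl writes and reads it) glued onto the domain field, so stripping only the prefix leaves the domain's leading dot intact (the cookie name and value below are made up):

before: #HttpOnly_.example.com	TRUE	/	FALSE	0	session	abc123
after:  .example.com	TRUE	/	FALSE	0	session	abc123

A quick GNU sed one-liner for that (back up the file first): sed -i 's/^#HttpOnly_//' exported_from_firefox.txt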

@ftc2
Author

ftc2 commented Apr 7, 2023

i was so super frustrated with trying to extract files from old WARCs from another project that i didn't even bother trying grab-site without first determining that it could save plain files, haha. that's kind of a prerequisite for me now.

httrack is starting to look like one of the few remaining candidates at this point.

i'll look into your cookie tips and see if i can get wget working first though since i'm already pretty familiar with wget.

@ftc2
Author

ftc2 commented Apr 7, 2023

at first glance, i think your #HttpOnly tip fixed it for me. i'll stick with wget for now until i need something more complex. many thanks.

@TomLucidor

@TheTechRobo Seconding the request for plain HTML files, but for a different reason: plugging them into AI document parsers like Khoj or GPT4All. Summarizing blogs and making personal assistants out of them is kinda lit.
