
Question: can it continue a suspended job? #55

Open
User670 opened this issue Jul 18, 2020 · 3 comments

User670 commented Jul 18, 2020

I'm trying to clone a webpage, but it froze after a while, probably due to some network hiccups. I had to kill the process and start over (only to get stuck again, to be honest). Is it possible for this module to continue a suspended job, skipping files that have already been saved?

(Also, what are the timeout thresholds and retry limits for requests? Can I specify these values?)

(Also, can I make it print some logs when a request fails or times out and is being retried?)

Environment: Windows 10, Python 3.8.1. The module was installed via pip install pywebcopy and invoked from the command line as python -m pywebcopy save_webpage http://y.tuwan.com/chatroom/3701 ./ --bypass_robots.
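
For reference, a script form of that command would look roughly like the sketch below; the keyword names project_folder and bypass_robots are assumptions taken from common pywebcopy examples and may differ between versions.

```python
# Rough script equivalent of the command-line call above (a sketch, not the
# definitive API): keyword names are assumptions and may vary between
# pywebcopy versions.
from pywebcopy import save_webpage

save_webpage(
    url="http://y.tuwan.com/chatroom/3701",
    project_folder="./",    # same target directory as in the command above
    bypass_robots=True,     # mirrors the --bypass_robots flag
)
```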

rajatomar788 (Owner) commented Jul 19, 2020

@User670

> Is it possible for this module to continue a suspended job, skipping files that have already been saved?

Yes. Pywebcopy skips files that already exist, so you can treat a rerun as resuming the job.

> (Also, what are the timeout thresholds and retry limits for requests? Can I specify these values?)

No. You have to rerun the script/command manually, i.e. with overwrite=False in a script or without the --overwrite flag on the command line.

> (Also, can I make it print some logs when a request fails or times out and is being retried?)

Yes. Set debug=True (or pass the --debug flag) and it will print logs that you can inspect manually.
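
Putting those answers together, a resumed run from a script might look like the sketch below; overwrite and debug follow the maintainer's wording, while the other keyword names are assumptions that may vary between pywebcopy versions. On the command line, the equivalent is to rerun the same python -m pywebcopy command without --overwrite and with --debug.

```python
# Sketch of a "resume" rerun based on the answers above: with overwrite off,
# files that already exist are skipped, and debug=True prints logs so failed
# or retried requests become visible. Keyword names other than overwrite and
# debug are assumptions about the save_webpage signature.
from pywebcopy import save_webpage

save_webpage(
    url="http://y.tuwan.com/chatroom/3701",
    project_folder="./",
    bypass_robots=True,
    overwrite=False,   # skip files that have already been saved
    debug=True,        # print logs for failed/timed-out/retried requests
)
```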

dibarpyth commented

> (Also, what are the timeout thresholds and retry limits for requests? Can I specify these values?)

> No. You have to rerun the script/command manually, i.e. with overwrite=False in a script or without the --overwrite flag on the command line.

I think he was talking about crawl delays between requests (i.e. timeouts / pauses / waits) to keep the load down and avoid being banned by the source site.

Is it possible to set such a delay between requests, like "--wait" in wget?

It would be great for both sides (the source website wouldn't be flooded with requests and the crawler wouldn't be banned in the middle of the process).

User670 (Author) commented Jul 17, 2021

> I think he was talking about crawl delays between requests (i.e. timeouts / pauses / waits) to keep the load down and avoid being banned by the source site.
>
> Is it possible to set such a delay between requests, like "--wait" in wget?
>
> It would be great for both sides (the source website wouldn't be flooded with requests and the crawler wouldn't be banned in the middle of the process).

I don't think I got banned, and I wasn't talking about a delay between requests.

What I was experiencing was that, after a while, the crawl simply froze: no messages were printed to the console for minutes, and I had to kill the process and start over (otherwise it wouldn't move).
