Web scraping with splashr fails with curl error after many successes #24

ixodid198 opened this issue Jun 30, 2021
I am scraping a few dozen URLs with splashr.
The script runs to completion when launched directly from RStudio Server on my DigitalOcean Droplet. However, when the same script runs from a cron job, it always fails while reading the 24th URL with this error:

Error in curl::curl_fetch_memory(url, handle = handle) : Recv failure: Connection reset by peer

Even when the direct run from RStudio succeeds, I see this error during each of the first 14 scrapes:

QNetworkReplyImplPrivate::error: Internal problem, this method must only be called once.

But it completes OK.

Is there some memory management or garbage collection I'm supposed to be doing between scrapes? What would explain the direct run succeeding while the same script fails under cron? In short, how do I avoid the curl error above?

library("tidyverse")
library("splashr")
library("rvest")

# Launch SplashR
# system2("docker", args = c("pull scrapinghub/splash:latest"))
# system2("docker", args = c("run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash:latest"), wait = FALSE)
# splash_active()

pause_after_html_read <- 5
pause_after_html_text <- 3

for(idx in 1:28){  
  
  splash(host = "localhost", port = 8050L) |> 
    splash_response_body(FALSE) %>%
    splash_go(url = url_df$web_page[idx]) %>%
    splash_wait(pause_after_html_read) %>%
    splash_html() |> 
    html_text() -> pg
  
    Sys.sleep(pause_after_html_text)
}
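
One thing I could try as a workaround is wrapping each scrape in a retry loop so that a single "Connection reset by peer" doesn't kill the whole cron run. A minimal sketch, using the same splashr calls as above; the helper name scrape_with_retry, the retry count, and the back-off delay are all illustrative, not part of my actual script:

# Sketch of a retry helper (illustrative, not from the script above):
# attempt a scrape up to max_tries times, backing off between failures
scrape_with_retry <- function(page_url, wait = 5, max_tries = 3, backoff = 10) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      splash(host = "localhost", port = 8050L) %>%
        splash_response_body(FALSE) %>%
        splash_go(url = page_url) %>%
        splash_wait(wait) %>%
        splash_html() %>%
        html_text(),
      error = function(e) e  # capture the error instead of aborting
    )
    if (!inherits(result, "error")) return(result)  # success: return page text
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    Sys.sleep(backoff)  # back off before retrying
  }
  stop("All ", max_tries, " attempts failed for ", page_url)
}

# Used inside the loop above:
# pg <- scrape_with_retry(url_df$web_page[idx], wait = pause_after_html_read)

But I'd still like to understand why the failure only happens under cron, rather than just papering over it with retries.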