Automatically split downloads in chunks for queries with >4000 records #29
Labels: enhancement (New feature or request)

Just a small possible enhancement, but would it be possible to have the download function automatically split queries into chunks when the length of list_of_accession_ids is >4000? For now I do this myself with a manual chunking loop, e.g. to fetch the most recently uploaded records.

Even better would be to also have this parallelized (if GISAID would allow that), as manual chunking is still relatively slow: it currently takes about 1.5 hours to download these 103K records from the last 5 days. When I tried a chunk size of 5000 I received a server error, so I reduced it to 4000 and that seemed to work.
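For illustration, here is a minimal sketch of what the requested automatic splitting could look like, reusing the download() call as it is invoked later in this thread. The wrapper name download_auto and the 4000-record default (the largest chunk size reported to work here) are assumptions, not part of GISAIDR:

    # Hypothetical wrapper (not part of GISAIDR): splits a long accession
    # list into chunks of at most `chunk_size` and downloads sequentially.
    download_auto <- function(credentials, list_of_accession_ids,
                              get_sequence = FALSE, chunk_size = 4000) {
      n <- length(list_of_accession_ids)
      starts <- seq(1, n, by = chunk_size)
      chunks <- lapply(starts, function(start) {
        end <- min(start + chunk_size - 1, n)
        download(credentials = credentials,
                 list_of_accession_ids[start:end],
                 get_sequence = get_sequence)
      })
      do.call(rbind, chunks)  # combine per-chunk data frames into one
    }

Keeping the split inside a wrapper would mean callers pass the full accession list and never have to think about the chunking.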
Thanks — I adapted your snippet into the following:

    chunk_size <- 1000
    accessions <- query(credentials = credentials,
                        location = "Africa / ...", fast = TRUE)
    positions <- seq(1, nrow(accessions), by = chunk_size)
    is_error <- function(err) inherits(err, "try-error")
    chunks <- vector("list", length(positions))

    # This loop can be re-run multiple times to resume an interrupted
    # download: chunks that are already filled in are skipped.
    for (index in seq_along(positions)) {
      position <- positions[index]
      if (is.null(chunks[[index]])) {
        start <- position
        # end is start + chunk_size - 1 so consecutive chunks don't overlap
        end <- min(position + chunk_size - 1, nrow(accessions))
        message(paste("downloading rows", start, "to", end))
        chunk <- try(download(credentials = credentials,
                              accessions$accession_id[start:end],
                              get_sequence = FALSE))
        if (is_error(chunk)) {
          # refresh credentials and try one more time
          credentials <- login(username = username, password = password)
          chunk <- download(credentials = credentials,
                            accessions$accession_id[start:end],
                            get_sequence = FALSE)
        }
        chunk$position <- position  # record which chunk each row came from
        chunks[[index]] <- chunk
        Sys.sleep(3)  # pause between requests to be gentle on the server
      }
    }

    if (sum(sapply(chunks, is.null)) == 0) {
      # all chunks downloaded successfully
      message("download complete")
      african_entries <- do.call(rbind, chunks)
    }
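On the parallelization idea: a speculative sketch of how the chunked loop above could run concurrently using base R's parallel package. Whether GISAID tolerates concurrent sessions is not confirmed anywhere in this thread, and mclapply relies on forking, so it will not parallelize on Windows:

    library(parallel)

    # Speculative parallel variant of the loop above. Assumes GISAID accepts
    # a few concurrent requests (unverified) — keep mc.cores low.
    download_chunk <- function(position) {
      start <- position
      end <- min(position + chunk_size - 1, nrow(accessions))
      try(download(credentials = credentials,
                   accessions$accession_id[start:end],
                   get_sequence = FALSE))
    }
    chunks <- mclapply(positions, download_chunk, mc.cores = 2)
    failed <- sapply(chunks, is_error)  # TRUE for chunks that errored

Any chunks flagged in failed could then be retried sequentially with the loop above before combining everything with do.call(rbind, chunks).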