
Automatically split downloads in chunks for queries with >4000 records #29

Open
tomwenseleers opened this issue Jul 30, 2022 · 1 comment

Comments


tomwenseleers commented Jul 30, 2022

Just a small possible enhancement: would it be possible to have the download function automatically split queries into chunks when the length of list_of_accession_ids is >4000?

At the moment I do this myself, e.g. to fetch the most recently uploaded records, using:

df <- query(
  credentials = credentials,
  from_subm = as.character(GISAID_max_submdate),
  to_subm = as.character(today),
  fast = TRUE
)
dim(df) # 103356      1

# function to split a vector into chunks of at most chunk_length elements
chunk <- function(x, chunk_length = 4000) split(x, ceiling(seq_along(x) / chunk_length))

chunks <- chunk(df$accession_id)
downloads <- do.call(rbind, lapply(seq_along(chunks),
                     function(i) {
                       message(paste0("Downloading batch ", i, " out of ", length(chunks)))
                       Sys.sleep(3)
                       download(credentials = credentials,
                                list_of_accession_ids = chunks[[i]])
                     }))
dim(downloads) # 103356     29
names(downloads)
# [1] "strain"                "virus"                 "accession_id"         
# [4] "genbank_accession"     "date"                  "region"               
# [7] "country"               "division"              "location"             
# [10] "region_exposure"       "country_exposure"      "division_exposure"    
# [13] "segment"               "length"                "host"                 
# [16] "age"                   "sex"                   "Nextstrain_clade"     
# [19] "pangolin_lineage"      "GISAID_clade"          "originating_lab"      
# [22] "submitting_lab"        "authors"               "url"                  
# [25] "title"                 "paper_url"             "date_submitted"       
# [28] "purpose_of_sequencing" "sequence"  

Even better would be to also have this parallelized (if GISAID allows that), as the above is still relatively slow: it currently takes about 1.5 hours to download these 103K records from the last 5 days. When I tried a chunk size of 5000 I received a server error, so I reduced it to 4000 and that seemed to work...
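
For reference, a minimal sketch of what such built-in chunking might look like (download_chunked is a hypothetical wrapper name; it only relies on the download(credentials, list_of_accession_ids, ...) call already used above):

download_chunked <- function(credentials, list_of_accession_ids,
                             chunk_size = 4000, pause = 3, ...) {
  # split the accession IDs into chunks of at most chunk_size elements
  chunks <- split(list_of_accession_ids,
                  ceiling(seq_along(list_of_accession_ids) / chunk_size))
  results <- vector("list", length(chunks))
  for (i in seq_along(chunks)) {
    message(sprintf("Downloading batch %d of %d", i, length(chunks)))
    results[[i]] <- download(credentials = credentials,
                             list_of_accession_ids = chunks[[i]], ...)
    if (i < length(chunks)) Sys.sleep(pause) # small pause between requests
  }
  do.call(rbind, results)
}

Any extra arguments (e.g. get_sequence) would simply be passed through to download().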

@Wytamma Wytamma added the enhancement New feature or request label Jul 30, 2022
pvanheus commented

Thanks, I adapted this into the following:

chunk_size <- 1000

accessions <- query(credentials = credentials, location = "Africa / ...", fast = TRUE)
positions <- seq(1, nrow(accessions), by = chunk_size)
is_error <- function(err) inherits(err, 'try-error')

chunks <- vector("list", length(positions))

# this can be run multiple times to continue downloads
for (index in seq_along(positions)) {
  position <- positions[index]
  if (is.null(chunks[[index]])) {
    start <- position
    end <- min(position + chunk_size - 1, nrow(accessions)) # -1 so consecutive chunks do not overlap
    message(paste("downloading", start, "to", end))
    chunk <-
      try(download(credentials = credentials,
                   accessions$accession_id[start:end],
                   get_sequence = FALSE))
    if (is_error(chunk)) {
      # refresh credentials and try one more time
      credentials <- login(username = username, password = password)
      chunk <-
        download(credentials = credentials,
                 accessions$accession_id[start:end],
                 get_sequence = FALSE)
    }
    chunk$position <- position
    chunks[[index]] <- chunk
    Sys.sleep(3)
  }
}

if (sum(sapply(chunks, is.null)) == 0) {
  # we have downloaded all the chunks
  message("download complete")
  african_entries = do.call(rbind, chunks)
}
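
One small extension to the pattern above (not in the original snippet): because the chunks list only lives in memory, "run multiple times to continue" only works within a single R session. Persisting the partial results to disk after each iteration lets a fresh session resume; the file name below is just an example.

# Inside the for loop, right after chunks[[index]] <- chunk, save progress
# ("chunks_progress.rds" is an arbitrary example file name):
saveRDS(chunks, "chunks_progress.rds")

# In a new R session, restore the partial download before re-running the loop:
if (file.exists("chunks_progress.rds")) chunks <- readRDS("chunks_progress.rds")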
