Difference in metadata columns #27

Open
Wytamma opened this issue Jul 30, 2022 · 4 comments

Comments

@Wytamma
Owner

Wytamma commented Jul 30, 2022

When I download a metadata file there is no AA Substitutions column... I'm not sure why we have different columns. It seems maybe different users are getting different results from GISAID? #26

@tomwenseleers

tomwenseleers commented Jul 30, 2022

Yes, I think that's because you use the official interface that GISAID provides. The AA Substitutions column only seems to be present when one selects records directly on the GISAID website and then presses Download at the bottom, where one can choose between downloading the metadata or the FASTA sequences.

With manually selected & downloaded GISAID records I get a tsv back with these columns:

 [1] "Virus name"                      "Accession ID"                   
 [3] "Collection date"                 "Location"                       
 [5] "Host"                            "Additional location information"
 [7] "Sampling strategy"               "Gender"                         
 [9] "Patient age"                     "Patient status"                 
[11] "Last vaccinated"                 "Passage"                        
[13] "Specimen"                        "Additional host information"    
[15] "Lineage"                         "Clade"                          
[17] "AA Substitutions"

It's only when I use the GISAIDR download function that I get the columns

 [1] "id"                      "virus_name"             
 [3] "passage_details_history" "accession_id"           
 [5] "collection_date"         "submission_date"        
 [7] "information"             "length"                 
 [9] "host"                    "location"               
[11] "originating_lab"         "submitting_lab"

This set lacks the AA Substitutions field (confirmed by directly inspecting the gisaidr_data_tmp.tar file).
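
To make the gap between the two listings concrete, here is a minimal Python sketch that normalizes both sets of headers and compares them. The `normalize` helper and its snake_case rule are assumptions made for illustration, not part of GISAIDR. Matching is by name only, so "Passage" and "passage_details_history" are treated as different columns even though they may carry related information.

```python
import re

# Column names copied from the two listings above.
website_columns = [
    "Virus name", "Accession ID", "Collection date", "Location", "Host",
    "Additional location information", "Sampling strategy", "Gender",
    "Patient age", "Patient status", "Last vaccinated", "Passage",
    "Specimen", "Additional host information", "Lineage", "Clade",
    "AA Substitutions",
]
gisaidr_columns = [
    "id", "virus_name", "passage_details_history", "accession_id",
    "collection_date", "submission_date", "information", "length",
    "host", "location", "originating_lab", "submitting_lab",
]


def normalize(name):
    """Lower-case a header and replace runs of non-alphanumerics with '_'.

    Hypothetical helper: the snake_case rule is an assumption that happens
    to line up the two naming styles seen in this thread.
    """
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


website = {normalize(c) for c in website_columns}
gisaidr = {normalize(c) for c in gisaidr_columns}

shared = sorted(website & gisaidr)        # present in both downloads
only_website = sorted(website - gisaidr)  # includes aa_substitutions
only_gisaidr = sorted(gisaidr - website)

print("shared:", shared)
print("only in website tsv:", only_website)
print("only in GISAIDR download:", only_gisaidr)
```

Only five headers match by name under this rule, which suggests the two downloads come from different export paths rather than one being a trimmed copy of the other.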

I think getting a tsv back with all the columns included could be supported if the download were driven via RSelenium, similar to how I download the GISAID batch download packages that are available: https://stackoverflow.com/questions/72632118/download-covid-patient-metadata-from-gisaid-website-in-r-using-rselenium

This would involve:
(1) enter username & password at https://www.epicov.org/epi3/frontend and press the Login button
(2) press the Search tab
(3) press the Select tab at the bottom
(4) paste the GISAID accession numbers (no more than 10,000 at a time), or point to a csv file with the desired accession numbers
(5) press the OK button
(6) press the Download button at the bottom
(7) select Patient status metadata or Nucleotide sequences (FASTA)
(8) press Download
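
Because step (4) accepts no more than 10,000 accession numbers at a time, a longer list would have to be split into batches and the steps repeated per batch. A minimal Python sketch of that batching follows; the `chunk` helper and the generated IDs are hypothetical placeholders, and the browser automation itself is not shown because the page's element selectors are not given in this thread.

```python
def chunk(ids, size=10_000):
    """Split ids into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be positive")
    return [ids[i:i + size] for i in range(0, len(ids), size)]


# Hypothetical accession numbers, made up purely for illustration.
accession_ids = [f"EPI_ISL_{n}" for n in range(1, 25_001)]

batches = chunk(accession_ids)
print([len(b) for b in batches])  # [10000, 10000, 5000]
```

Each batch would then be pasted in step (4), and the resulting tsv files concatenated afterwards.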

Aside from downloading particular records in this way (which should also get the AA Substitutions field), I think supporting the download of the batch download packages via RSelenium could be cool too. You would probably have to put it in a separate function, though, as one can then only download the whole database (downloading it and reading it into R takes just 2 mins), and not a particular subset.

@Wytamma
Owner Author

Wytamma commented Jul 30, 2022

Hi @tomwenseleers, I’m not using the official GISAID interface (none exists as far as I can tell). GISAIDR just sends the equivalent HTTP requests that you send when using the website. I think the problem here is that we have different versions of GISAID? This is what my download panel looks like. There is no Patient status metadata or Nucleotide sequences (FASTA) option, only Augur or acknowledgements. When I press download I get a zip that combines the metadata and the sequences. Can you please double check the URLs for the steps above? My URL is https://www.epicov.org/epi3/frontend, i.e. with /frontend. If I use https://www.epicov.org/epi3 without /frontend I get a 404 error.
[Screenshot: download panel]

@tomwenseleers

tomwenseleers commented Jul 31, 2022

Ha sorry. What a shame then - it seems GISAID somehow decided to give different users different tiers of access or what? How is one supposed to write reproducible code to drive this?

The URL I get to start with is
https://www.epicov.org/epi3/start
which then gets me to
https://www.epicov.org/epi3/frontend#174123
but the #XXXXXX number at the end is different each time I log in.
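
Since only the part after `#` changes between logins, the two URLs in this thread can be compared after dropping that fragment. A minimal Python sketch using the standard library's `urllib.parse.urldefrag`, with the URL quoted above as input:

```python
from urllib.parse import urldefrag

# URL reached after login, as quoted above; the fragment varies per session.
url_after_login = "https://www.epicov.org/epi3/frontend#174123"

base, fragment = urldefrag(url_after_login)
print(base)      # https://www.epicov.org/epi3/frontend
print(fragment)  # 174123

# With the fragment removed, this is the same address as the /frontend URL.
print(base == "https://www.epicov.org/epi3/frontend")  # True
```

So the address itself matches the /frontend URL; what the fragment encodes is not stated in this thread.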

If I use GISAIDR I also get back a .tar file with sequences & metadata combined, and with the metadata lacking that AA Substitutions field. This is what confused me, because if I manually log in to the GISAID website, select some records and press Download at the bottom, I get this
[Screenshot: GISAID record download]
and I can download the metadata & sequences separately.

Aside from that I also have batch package download options available when I press on the Downloads button at the top of the page which for me looks like
[Screenshot: GISAID package download 1]
[Screenshot: GISAID package download 2]
I know that the Genomic epidemiology tab is missing for most, but I thought everyone at least would have access to the tab Download packages with metadata? Or is that not the case? And would the fields & columns you get back also differ per user?

@tomwenseleers

tomwenseleers commented Jul 31, 2022

For the record, with my login & credentials, this is how I managed to download a separate metadata file with all the columns I was given access to. The code can also be modified a bit to allow download of the FASTA. It uses RSelenium (so it is a bit different from your httr approach). It also shows how to get the most recently uploaded records that are absent from the download package: https://stackoverflow.com/questions/72632118/download-covid-patient-metadata-from-gisaid-website-in-r-using-rselenium
