make_archive_urls test for valid URL fails #11

Open
lecy opened this issue Jan 4, 2024 · 3 comments

Collaborator

lecy commented Jan 4, 2024

In the make_archive_urls() function within build-catalog-functions.R, the test for a valid URL is failing.

For example,

x <- "https://urbaninstitute.github.io/nccs-legacy/dictionary/soi/soi_archive_html/SOI-MICRODATA-2002-501C3-CHARITIES-PC"
(RCurl::url.exists(x))
[1] FALSE

The URL works fine:

https://urbaninstitute.github.io/nccs-legacy/dictionary/soi/soi_archive_html/SOI-MICRODATA-2002-501C3-CHARITIES-PC

Any ideas?

Collaborator Author

lecy commented Jan 4, 2024

Here's some reproducible code to test with the SOI dataset. Every path currently resolves to https://urbaninstitute.github.io/nccs/catalogs/dd_unavailable.html:

library( dplyr )
library( knitr )
library( kableExtra )
library( stringr )
library( flextable )
library( pander )

GH.RAW <- "https://raw.githubusercontent.com/UrbanInstitute/nccs/main/catalogs/"
d <- read.csv( paste0( GH.RAW, "AWS-NCCSDATA.csv" ) )
source( paste0( GH.RAW, "build-catalog-functions.R" ) )


series <- "soi"

paths <- get_file_paths( series = series,
                         paths = d$Key,
                         tscope = "CHARITIES",
                         fscope = "PC" )

profile_urls <- make_archive_urls( series = series, paths = paths )

# make_archive_urls() as defined in build-catalog-functions.R:
make_archive_urls <- function(series,
                              paths){
  
  base_url = sprintf("https://urbaninstitute.github.io/nccs-legacy/dictionary/%s/%s_archive_html/",
                     series,
                     series)
  
  expr_dic = list("core" = "legacy/core/",
                  "bmf" = "legacy/bmf/",
                  "misc" = "legacy/misc/",
                  "soi" = "legacy/soi-micro/[0-9]{4}/")
  
  unavail_url <- "https://urbaninstitute.github.io/nccs/catalogs/dd_unavailable.html"
  
  matches <- gsub(expr_dic[[series]], "", paths)
  matches <- gsub("\\.csv", "", matches)
  
  archive_urls <- paste0(base_url, matches)
  archive_urls <- lapply(archive_urls, 
                         function(x) if (RCurl::url.exists(x)) x else unavail_url)
  
  return(archive_urls)  
}
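
For reference, the URL-building step can be checked on its own without any network access. This is a minimal sketch that assumes a hypothetical key of the form shown below (the actual values of d$Key in AWS-NCCSDATA.csv may differ); it only replays the two gsub() calls and the paste0() from the function above.

```r
series <- "soi"

# Hypothetical key, assumed for illustration; not taken from AWS-NCCSDATA.csv
path <- "legacy/soi-micro/2002/SOI-MICRODATA-2002-501C3-CHARITIES-PC.csv"

base_url <- sprintf(
  "https://urbaninstitute.github.io/nccs-legacy/dictionary/%s/%s_archive_html/",
  series, series
)

# Same two substitutions as in make_archive_urls()
stem <- gsub("legacy/soi-micro/[0-9]{4}/", "", path)
stem <- gsub("\\.csv", "", stem)

archive_url <- paste0(base_url, stem)
print(archive_url)
```

If this prints the URL from the first comment, the string handling is fine and the problem is isolated to the RCurl::url.exists() check.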

Collaborator Author

lecy commented Jan 17, 2024

For the time being I have just commented out the validation line:

  # archive_urls <- lapply(archive_urls, 
  #                        function(x) if (RCurl::url.exists(x)) x else unavail_url)

Worst case, the user gets a 404 instead of a "dictionary unavailable" message. I will look into an alternative URL validation function.
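
One possible alternative, sketched here rather than proposed as the fix: ask for the HTTP status code with base R's curlGetHeaders() and retry a few times before giving up, so that a connection error is reported as "unknown" (NA) instead of "does not exist". The function name url_status and its arguments tries, pause, and get_headers are made up for this sketch; get_headers is only there so the retry logic can be exercised with a stub.

```r
# Hypothetical helper (not part of build-catalog-functions.R).
# Returns the HTTP status code reported for `url`, or NA_integer_ if every
# attempt fails before a status is obtained (timeout, DNS error, etc.).
url_status <- function(url, tries = 3, pause = 1,
                       get_headers = curlGetHeaders) {
  for (i in seq_len(tries)) {
    status <- tryCatch(
      attr(get_headers(url), "status"),   # status code of the final response
      error = function(e) NA_integer_     # connection-level failure
    )
    if (is.null(status)) status <- NA_integer_
    if (!is.na(status)) return(as.integer(status))
    if (i < tries) Sys.sleep(pause)       # wait before the next attempt
  }
  NA_integer_
}
```

Calling url_status(x) on a single archive URL would then distinguish a 200, a 404, and "could not reach the server", which a bare TRUE/FALSE cannot.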

Collaborator Author

lecy commented Jan 22, 2024

I saw your note that you could not replicate the behavior. Same here when I retry the same example:

> x <- "https://urbaninstitute.github.io/nccs-legacy/dictionary/soi/soi_archive_html/SOI-MICRODATA-2002-501C3-CHARITIES-PC"
> (RCurl::url.exists(x))
[1] TRUE

It could have just been a slow server, or perhaps those pages are generated dynamically when requested, so there is a delay. Whatever the case, there are many instances where the RCurl check fails even though the URLs are actually valid.

Unless we have a validation function we can trust, it's probably better not to remove the links when the test fails. Removing them produces the kind of file the user mentioned: none of the data dictionary buttons on the download page for the SOI Microdata files had associated URLs (all of the valid ones were dropped when the file was rendered).

If the URL is added and does not actually exist, then the user just gets a 404 message. That seems like the lesser of the two problems.
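
If validation is ever turned back on, the same preference could be expressed as a "fail open" rule: keep the link unless the check gives a definitive 404. A minimal sketch under that assumption; resolve_archive_urls and status_fn are hypothetical names, and status_fn stands for any function that returns an integer HTTP status or NA when the check itself fails. The stub below replaces real network calls so the behavior can be seen offline.

```r
# Hypothetical replacement for the lapply() validation step.
# A URL is swapped for `unavail_url` only on a definitive 404; an NA status
# (timeout, slow server) or any other code leaves the original link in place.
resolve_archive_urls <- function(archive_urls, unavail_url, status_fn) {
  vapply(archive_urls, function(x) {
    if (isTRUE(status_fn(x) == 404L)) unavail_url else x
  }, character(1), USE.NAMES = FALSE)
}

# Stub status function standing in for a real HTTP check
stub_status <- function(u) {
  if (grepl("missing", u)) 404L
  else if (grepl("slow", u)) NA_integer_
  else 200L
}

urls <- c("https://example.org/ok",
          "https://example.org/missing",
          "https://example.org/slow")

out <- resolve_archive_urls(
  urls,
  unavail_url = "https://urbaninstitute.github.io/nccs/catalogs/dd_unavailable.html",
  status_fn   = stub_status
)
print(out)
```

Note that this returns a character vector, whereas the lapply() version in make_archive_urls() returns a list, so downstream code may need a small adjustment.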
