Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify overwrite behaviour in documentation of write_package #144

Open
hansvancalster opened this issue Jun 26, 2023 · 8 comments
Open
Labels
enhancement New feature or request function:write_package Function write_package()
Milestone

Comments

@hansvancalster
Copy link
Contributor

The documentation of ?write_package() states in the description section: "Writes a Data Package and its related Data Resources to disk as a datapackage.json and CSV files. Already existing CSV files of the same name will not be overwritten."

But I think it is the other way around: "Already existing CSV files will be overwritten."

Internally, write_package() calls write_resource() which in turn uses readr::write_csv(resource$data, file.path(directory, file_name), na = ""). The latter has append = FALSE as default, which occurding to the doc will overwrite existing csv: "If FALSE, will overwrite existing file. If TRUE, will append to existing file. In both cases, if the file does not exist a new file is created."

N.B. I am in favour of keeping the current behaviour, but just want to note that the documentation should probably be fixed. If on the other hand, you actually wanted the behaviour of the documentation, I suggest adding an extra boolean argument overwrite.

@peterdesmet
Copy link
Member

Hmm, the intend is not to overwrite existing data files by default. The common use case is having a number of csv files and wanting to add a datapackage.json to the directory to describe those. There you don't want to alter the original data.

  1. You are right that the behaviour of readr::write_csv() is to overwrite the existing file.
cars <- mtcars
write_csv(cars, "cars.csv")
cars <- cars %>% filter(mpg == 21)
write_csv(cars, "cars.csv")
  1. In frictionless, oddly, I don't think that is the case. There is a test for it. Note that I will remove observations_1 and observations_2 from that test, because those are remote files that are never written.

Trying locally, I can replicate files not being overwritten. @hansvancalster have you tested this? Any idea why the behaviour differs from readr?

In any case:

  1. We should probably not rely on undocumented/edge behaviour.
  2. Any use case for wanting files to be overwritten? If so, we could use an overwrite=TRUE (with default FALSE)

@peterdesmet peterdesmet added enhancement New feature or request function:write_package Function write_package() labels Jun 26, 2023
@hansvancalster
Copy link
Contributor Author

Below is a reprex where cars.csv is overwritten (and where I also expect it to be overwritten). This differs however from

"The common use case is having a number of csv files and wanting to add a datapackage.json to the directory to describe those."

which is indeed covered by the unit test. The difference between this reprex and the unit test is that the resource is modified in memory? There is no .$resources[[1]]$path variable, but instead a .$resources[[1]]$data variable before the write step.

This is why I was confused about the documentation.

library(frictionless)
#> Warning: package 'frictionless' was built under R version 4.3.1
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
data(cars)
data(iris)

p1 <- create_package() %>%
  add_resource("cars", iris) %>%
  write_package()

p2 <- p1 %>%
  remove_resource("cars") %>%
  add_resource("cars", cars) %>%
  write_package()

head(read_resource(p1, "cars"))
#> # A tibble: 6 × 2
#>   Sepal.Length Sepal.Width
#>          <dbl>       <dbl>
#> 1            4           2
#> 2            4          10
#> 3            7           4
#> 4            7          22
#> 5            8          16
#> 6            9          10
head(read_resource(p2, "cars"))
#> # A tibble: 6 × 2
#>   speed  dist
#>   <dbl> <dbl>
#> 1     4     2
#> 2     4    10
#> 3     7     4
#> 4     7    22
#> 5     8    16
#> 6     9    10

Created on 2023-06-26 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.0 (2023-04-21 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Dutch_Belgium.utf8
#>  ctype    Dutch_Belgium.utf8
#>  tz       Europe/Brussels
#>  date     2023-06-26
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version    date (UTC) lib source
#>  assertthat     0.2.1      2019-03-21 [1] CRAN (R 4.3.0)
#>  bit            4.0.5      2022-11-15 [1] CRAN (R 4.3.0)
#>  bit64          4.0.5      2020-08-30 [1] CRAN (R 4.3.0)
#>  cli            3.6.1      2023-03-23 [1] CRAN (R 4.3.0)
#>  crayon         1.5.2      2022-09-29 [1] CRAN (R 4.3.0)
#>  digest         0.6.31     2022-12-11 [1] CRAN (R 4.3.0)
#>  dplyr        * 1.1.2      2023-04-20 [1] CRAN (R 4.3.0)
#>  evaluate       0.21       2023-05-05 [1] CRAN (R 4.3.0)
#>  fansi          1.0.4      2023-01-22 [1] CRAN (R 4.3.0)
#>  fastmap        1.1.1      2023-02-24 [1] CRAN (R 4.3.0)
#>  frictionless * 1.0.2.9000 2023-06-19 [1] https://inbo.r-universe.dev (R 4.3.1)
#>  fs             1.6.2      2023-04-25 [1] CRAN (R 4.3.0)
#>  generics       0.1.3      2022-07-05 [1] CRAN (R 4.3.0)
#>  glue           1.6.2      2022-02-24 [1] CRAN (R 4.3.0)
#>  hms            1.1.3      2023-03-21 [1] CRAN (R 4.3.0)
#>  htmltools      0.5.5      2023-03-23 [1] CRAN (R 4.3.0)
#>  jsonlite       1.8.4      2022-12-06 [1] CRAN (R 4.3.0)
#>  knitr          1.42       2023-01-25 [1] CRAN (R 4.3.0)
#>  lifecycle      1.0.3      2022-10-07 [1] CRAN (R 4.3.0)
#>  magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.3.0)
#>  pillar         1.9.0      2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.3.0)
#>  purrr          1.0.1      2023-01-10 [1] CRAN (R 4.3.0)
#>  R.cache        0.16.0     2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3    1.8.2      2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo           1.25.0     2022-06-12 [1] CRAN (R 4.3.0)
#>  R.utils        2.12.2     2022-11-11 [1] CRAN (R 4.3.0)
#>  R6             2.5.1      2021-08-19 [1] CRAN (R 4.3.0)
#>  readr          2.1.4      2023-02-10 [1] CRAN (R 4.3.0)
#>  reprex         2.0.2      2022-08-17 [1] CRAN (R 4.3.0)
#>  rlang          1.1.1      2023-04-28 [1] CRAN (R 4.3.0)
#>  rmarkdown      2.21       2023-03-26 [1] CRAN (R 4.3.0)
#>  rstudioapi     0.14       2022-08-22 [1] CRAN (R 4.3.0)
#>  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.3.0)
#>  styler         1.9.1      2023-03-04 [1] CRAN (R 4.3.0)
#>  tibble         3.2.1      2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyselect     1.2.0      2022-10-10 [1] CRAN (R 4.3.0)
#>  tzdb           0.3.0      2022-03-28 [1] CRAN (R 4.3.0)
#>  utf8           1.2.3      2023-01-31 [1] CRAN (R 4.3.0)
#>  vctrs          0.6.2      2023-04-19 [1] CRAN (R 4.3.0)
#>  vroom          1.6.3      2023-04-28 [1] CRAN (R 4.3.0)
#>  withr          2.5.0      2022-03-03 [1] CRAN (R 4.3.0)
#>  xfun           0.39       2023-04-20 [1] CRAN (R 4.3.0)
#>  yaml           2.3.7      2023-01-23 [1] CRAN (R 4.3.0)
#> 
#>  [1] C:/R/library
#>  [2] C:/R/R-4.3.0/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

@peterdesmet
Copy link
Member

Hmm, you are correct, the local csv is being overwritten by the second write_package(). What we want is:

  • path is URL: leave as is, don't write
  • path is local and different from write directory: copy (!) file to new directory
  • path is local and same as write directory: don't alter file (use case describe above)
  • data was originally provided when package was loaded (i.e. not a newly added resource): leave as is, don't write
  • data was added with add_resource(): write to directory
  • data was added with add_resource() and written to directory again: overwrite (supports the use case where user made mistake and wants to write again)

I believe the above description is the current implementation, but this should be more explicitly covered by tests and documented.

@peterdesmet peterdesmet added this to the 1.2.0 milestone Mar 27, 2024
@peterdesmet peterdesmet changed the title error in documentation of write_package? Clarify overwrite behaviour in documentation of write_package Mar 27, 2024
@ElsLommelen
Copy link

ElsLommelen commented Apr 12, 2024

As my suggestion for a feature for write_package() is related to this issue, I add it here.

The fact that CSV files are not changed or removed when the metadata are changed, seems a bit counterintuitive to me: if I change or remove tables in a datapackage, I would like the tables to match the metadata in each version of the datapackage (and visa versa), preferably without having to worry about this myself (or save each version of the datapackage in a separate folder and copy it to the git repo where I store the versions). I understand not everyone prefers to have his/her files overwritten, but would it be possible to add an argument 'overwrite_csv' that defaults to FALSE and could be set to TRUE to allow CSV files to be deleted or overwritten to stay matched with the metadata?

library(frictionless)
#> Warning: package 'frictionless' was built under R version 4.3.3

# create and write a package with 2 tables
create_package() |>
  add_resource(
    resource_name = "iris",
    data = iris
  ) |>
  add_resource(
    resource_name = "cars",
    data = cars
  ) |>
  write_package("testdir")
# together with datapackage.json, 2 csv files are written

# when removing one of tables
read_package("testdir/datapackage.json") |>
  remove_resource("cars") |>
  write_package("testdir")
# cars.csv stays in testdir

# suggestion would be that write_package("testdir", overwrite_CSV = TRUE)
# would remove cars.csv from testdir

# when changing one of the tables
read_package("testdir/datapackage.json") |>
  remove_resource("iris") |>
  add_resource(
    resource_name = "iris",
    data = iris[, c("Sepal.Length", "Sepal.Width", "Species")]
  ) |>
  write_package("testdir")
# iris.csv stays unchanged while the metadata are changed

# suggestion would be that write_package("testdir", overwrite_CSV = TRUE)
# would replace iris.csv in testdir by the new version

Created on 2024-04-12 with reprex v2.0.2

(This issue seems also a bit related to issue #150 as that issue might be caused by the fact that tables and metadata stay not necessariily the same after writing a new version of the datapackage.)

@peterdesmet
Copy link
Member

Thanks for the suggestion. Reading it quickly, I think this should be a feature of remove_resource():

  • remove_resource() could check if the data are coming from a csv and if that csv is writable (i.e. not a URL). It could then propose to the user to also remove the csv.
  • Altering a resource is mainly done through add_resource(), which forbids to add a resource that already exists, so you have to remove it first with remove_resource() if you want to replace it.

It also has the advantage that users can make decisions on a resource rather than a package level. @ElsLommelen does this make sense?

@ElsLommelen
Copy link

Eh, it depends on how you will implement it. If this feature removes both the csv and metadata of the specific table from the written package, it seems indeed fine to do it in remove_resource().
Just make sure to explain this clearly, as users will not expect that a function as remove_resource() can make changes to the package (because there is a separate function write_package() for this).

@ElsLommelen
Copy link

A related remark on this proposed solution: when removing a resource and adding it again afterwards, I notice this table always moves to the end in datapackage.json (which gives an ugly change if the file is under version control). Is there any way to keep the initial order, to save the tables default in alphabetical order, or to control the table order in this datapackage file without removing all tables and add them again in the preferred order? (I didn't find this in the documentation, so I suppose there is no feature for this yet?)

@peterdesmet
Copy link
Member

I didn't find this in the documentation, so I suppose there is no feature for this yet?

Indeed, frictionless does not support a way to control the order of the resources. purrr (or other) packages likely provide a way to reorganize the elements in $resources.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request function:write_package Function write_package()
Projects
None yet
Development

No branches or pull requests

3 participants