Clarify overwrite behaviour in documentation of `write_package` #144

hansvancalster · 2023-06-26T08:21:12Z

The documentation of ?write_package() states in the description section: "Writes a Data Package and its related Data Resources to disk as a datapackage.json and CSV files. Already existing CSV files of the same name will not be overwritten."

But I think it is the other way around: "Already existing CSV files will be overwritten."

Internally, write_package() calls write_resource() which in turn uses readr::write_csv(resource$data, file.path(directory, file_name), na = ""). The latter has append = FALSE as default, which occurding to the doc will overwrite existing csv: "If FALSE, will overwrite existing file. If TRUE, will append to existing file. In both cases, if the file does not exist a new file is created."

N.B. I am in favour of keeping the current behaviour, but just want to note that the documentation should probably be fixed. If on the other hand, you actually wanted the behaviour of the documentation, I suggest adding an extra boolean argument overwrite.

The text was updated successfully, but these errors were encountered:

peterdesmet · 2023-06-26T08:52:19Z

Hmm, the intend is not to overwrite existing data files by default. The common use case is having a number of csv files and wanting to add a datapackage.json to the directory to describe those. There you don't want to alter the original data.

You are right that the behaviour of readr::write_csv() is to overwrite the existing file.

cars <- mtcars
write_csv(cars, "cars.csv")
cars <- cars %>% filter(mpg == 21)
write_csv(cars, "cars.csv")

In frictionless, oddly, I don't think that is the case. There is a test for it. Note that I will remove observations_1 and observations_2 from that test, because those are remote files that are never written.

Trying locally, I can replicate files not being overwritten. @hansvancalster have you tested this? Any idea why the behaviour differs from readr?

In any case:

We should probably not rely on undocumented/edge behaviour.
Any use case for wanting files to be overwritten? If so, we could use an overwrite=TRUE (with default FALSE)

hansvancalster · 2023-06-26T10:29:54Z

Below is a reprex where cars.csv is overwritten (and where I also expect it to be overwritten). This differs however from

"The common use case is having a number of csv files and wanting to add a datapackage.json to the directory to describe those."

which is indeed covered by the unit test. The difference between this reprex and the unit test is that the resource is modified in memory? There is no .$resources[[1]]$path variable, but instead a .$resources[[1]]$data variable before the write step.

This is why I was confused about the documentation.

library(frictionless)
#> Warning: package 'frictionless' was built under R version 4.3.1
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
data(cars)
data(iris)

p1 <- create_package() %>%
  add_resource("cars", iris) %>%
  write_package()

p2 <- p1 %>%
  remove_resource("cars") %>%
  add_resource("cars", cars) %>%
  write_package()

head(read_resource(p1, "cars"))
#> # A tibble: 6 × 2
#>   Sepal.Length Sepal.Width
#>          <dbl>       <dbl>
#> 1            4           2
#> 2            4          10
#> 3            7           4
#> 4            7          22
#> 5            8          16
#> 6            9          10
head(read_resource(p2, "cars"))
#> # A tibble: 6 × 2
#>   speed  dist
#>   <dbl> <dbl>
#> 1     4     2
#> 2     4    10
#> 3     7     4
#> 4     7    22
#> 5     8    16
#> 6     9    10

^{Created on 2023-06-26 with reprex v2.0.2}

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.0 (2023-04-21 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Dutch_Belgium.utf8
#>  ctype    Dutch_Belgium.utf8
#>  tz       Europe/Brussels
#>  date     2023-06-26
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version    date (UTC) lib source
#>  assertthat     0.2.1      2019-03-21 [1] CRAN (R 4.3.0)
#>  bit            4.0.5      2022-11-15 [1] CRAN (R 4.3.0)
#>  bit64          4.0.5      2020-08-30 [1] CRAN (R 4.3.0)
#>  cli            3.6.1      2023-03-23 [1] CRAN (R 4.3.0)
#>  crayon         1.5.2      2022-09-29 [1] CRAN (R 4.3.0)
#>  digest         0.6.31     2022-12-11 [1] CRAN (R 4.3.0)
#>  dplyr        * 1.1.2      2023-04-20 [1] CRAN (R 4.3.0)
#>  evaluate       0.21       2023-05-05 [1] CRAN (R 4.3.0)
#>  fansi          1.0.4      2023-01-22 [1] CRAN (R 4.3.0)
#>  fastmap        1.1.1      2023-02-24 [1] CRAN (R 4.3.0)
#>  frictionless * 1.0.2.9000 2023-06-19 [1] https://inbo.r-universe.dev (R 4.3.1)
#>  fs             1.6.2      2023-04-25 [1] CRAN (R 4.3.0)
#>  generics       0.1.3      2022-07-05 [1] CRAN (R 4.3.0)
#>  glue           1.6.2      2022-02-24 [1] CRAN (R 4.3.0)
#>  hms            1.1.3      2023-03-21 [1] CRAN (R 4.3.0)
#>  htmltools      0.5.5      2023-03-23 [1] CRAN (R 4.3.0)
#>  jsonlite       1.8.4      2022-12-06 [1] CRAN (R 4.3.0)
#>  knitr          1.42       2023-01-25 [1] CRAN (R 4.3.0)
#>  lifecycle      1.0.3      2022-10-07 [1] CRAN (R 4.3.0)
#>  magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.3.0)
#>  pillar         1.9.0      2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.3.0)
#>  purrr          1.0.1      2023-01-10 [1] CRAN (R 4.3.0)
#>  R.cache        0.16.0     2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3    1.8.2      2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo           1.25.0     2022-06-12 [1] CRAN (R 4.3.0)
#>  R.utils        2.12.2     2022-11-11 [1] CRAN (R 4.3.0)
#>  R6             2.5.1      2021-08-19 [1] CRAN (R 4.3.0)
#>  readr          2.1.4      2023-02-10 [1] CRAN (R 4.3.0)
#>  reprex         2.0.2      2022-08-17 [1] CRAN (R 4.3.0)
#>  rlang          1.1.1      2023-04-28 [1] CRAN (R 4.3.0)
#>  rmarkdown      2.21       2023-03-26 [1] CRAN (R 4.3.0)
#>  rstudioapi     0.14       2022-08-22 [1] CRAN (R 4.3.0)
#>  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.3.0)
#>  styler         1.9.1      2023-03-04 [1] CRAN (R 4.3.0)
#>  tibble         3.2.1      2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyselect     1.2.0      2022-10-10 [1] CRAN (R 4.3.0)
#>  tzdb           0.3.0      2022-03-28 [1] CRAN (R 4.3.0)
#>  utf8           1.2.3      2023-01-31 [1] CRAN (R 4.3.0)
#>  vctrs          0.6.2      2023-04-19 [1] CRAN (R 4.3.0)
#>  vroom          1.6.3      2023-04-28 [1] CRAN (R 4.3.0)
#>  withr          2.5.0      2022-03-03 [1] CRAN (R 4.3.0)
#>  xfun           0.39       2023-04-20 [1] CRAN (R 4.3.0)
#>  yaml           2.3.7      2023-01-23 [1] CRAN (R 4.3.0)
#> 
#>  [1] C:/R/library
#>  [2] C:/R/R-4.3.0/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

peterdesmet · 2023-06-26T11:24:57Z

Hmm, you are correct, the local csv is being overwritten by the second write_package(). What we want is:

path is URL: leave as is, don't write
path is local and different from write directory: copy (!) file to new directory
path is local and same as write directory: don't alter file (use case describe above)
data was originally provided when package was loaded (i.e. not a newly added resource): leave as is, don't write
data was added with add_resource(): write to directory
data was added with add_resource() and written to directory again: overwrite (supports the use case where user made mistake and wants to write again)

I believe the above description is the current implementation, but this should be more explicitly covered by tests and documented.

ElsLommelen · 2024-04-12T10:39:08Z

As my suggestion for a feature for write_package() is related to this issue, I add it here.

The fact that CSV files are not changed or removed when the metadata are changed, seems a bit counterintuitive to me: if I change or remove tables in a datapackage, I would like the tables to match the metadata in each version of the datapackage (and visa versa), preferably without having to worry about this myself (or save each version of the datapackage in a separate folder and copy it to the git repo where I store the versions). I understand not everyone prefers to have his/her files overwritten, but would it be possible to add an argument 'overwrite_csv' that defaults to FALSE and could be set to TRUE to allow CSV files to be deleted or overwritten to stay matched with the metadata?

library(frictionless)
#> Warning: package 'frictionless' was built under R version 4.3.3

# create and write a package with 2 tables
create_package() |>
  add_resource(
    resource_name = "iris",
    data = iris
  ) |>
  add_resource(
    resource_name = "cars",
    data = cars
  ) |>
  write_package("testdir")
# together with datapackage.json, 2 csv files are written

# when removing one of tables
read_package("testdir/datapackage.json") |>
  remove_resource("cars") |>
  write_package("testdir")
# cars.csv stays in testdir

# suggestion would be that write_package("testdir", overwrite_CSV = TRUE)
# would remove cars.csv from testdir

# when changing one of the tables
read_package("testdir/datapackage.json") |>
  remove_resource("iris") |>
  add_resource(
    resource_name = "iris",
    data = iris[, c("Sepal.Length", "Sepal.Width", "Species")]
  ) |>
  write_package("testdir")
# iris.csv stays unchanged while the metadata are changed

# suggestion would be that write_package("testdir", overwrite_CSV = TRUE)
# would replace iris.csv in testdir by the new version

^{Created on 2024-04-12 with reprex v2.0.2}

(This issue seems also a bit related to issue #150 as that issue might be caused by the fact that tables and metadata stay not necessariily the same after writing a new version of the datapackage.)

frictionlessdata/frictionless-r#144) (issue #68)

peterdesmet · 2024-04-15T08:13:32Z

Thanks for the suggestion. Reading it quickly, I think this should be a feature of remove_resource():

remove_resource() could check if the data are coming from a csv and if that csv is writable (i.e. not a URL). It could then propose to the user to also remove the csv.
Altering a resource is mainly done through add_resource(), which forbids to add a resource that already exists, so you have to remove it first with remove_resource() if you want to replace it.

It also has the advantage that users can make decisions on a resource rather than a package level. @ElsLommelen does this make sense?

ElsLommelen · 2024-04-15T12:39:07Z

Eh, it depends on how you will implement it. If this feature removes both the csv and metadata of the specific table from the written package, it seems indeed fine to do it in remove_resource().
Just make sure to explain this clearly, as users will not expect that a function as remove_resource() can make changes to the package (because there is a separate function write_package() for this).

ElsLommelen · 2024-04-19T09:31:28Z

A related remark on this proposed solution: when removing a resource and adding it again afterwards, I notice this table always moves to the end in datapackage.json (which gives an ugly change if the file is under version control). Is there any way to keep the initial order, to save the tables default in alphabetical order, or to control the table order in this datapackage file without removing all tables and add them again in the preferred order? (I didn't find this in the documentation, so I suppose there is no feature for this yet?)

peterdesmet · 2024-04-22T14:08:23Z

I didn't find this in the documentation, so I suppose there is no feature for this yet?

Indeed, frictionless does not support a way to control the order of the resources. purrr (or other) packages likely provide a way to reorganize the elements in $resources.

peterdesmet added enhancement New feature or request function:write_package Function write_package() labels Jun 26, 2023

peterdesmet added this to the 1.2.0 milestone Mar 27, 2024

peterdesmet changed the title ~~error in documentation of write_package?~~ Clarify overwrite behaviour in documentation of write_package Mar 27, 2024

ElsLommelen added a commit to inbo/forrescalc that referenced this issue Apr 12, 2024

make sure csv files of datapackage are changed as well when needed (see

93fd546

frictionlessdata/frictionless-r#144) (issue #68)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify overwrite behaviour in documentation of `write_package` #144

Clarify overwrite behaviour in documentation of `write_package` #144

hansvancalster commented Jun 26, 2023

peterdesmet commented Jun 26, 2023

hansvancalster commented Jun 26, 2023

peterdesmet commented Jun 26, 2023

ElsLommelen commented Apr 12, 2024 •

edited

Loading

peterdesmet commented Apr 15, 2024

ElsLommelen commented Apr 15, 2024

ElsLommelen commented Apr 19, 2024

peterdesmet commented Apr 22, 2024

Clarify overwrite behaviour in documentation of write_package #144

Clarify overwrite behaviour in documentation of write_package #144

Comments

hansvancalster commented Jun 26, 2023

peterdesmet commented Jun 26, 2023

hansvancalster commented Jun 26, 2023

peterdesmet commented Jun 26, 2023

ElsLommelen commented Apr 12, 2024 • edited Loading

peterdesmet commented Apr 15, 2024

ElsLommelen commented Apr 15, 2024

ElsLommelen commented Apr 19, 2024

peterdesmet commented Apr 22, 2024

Clarify overwrite behaviour in documentation of `write_package` #144

Clarify overwrite behaviour in documentation of `write_package` #144

ElsLommelen commented Apr 12, 2024 •

edited

Loading