Clarify overwrite behaviour in documentation of write_package
#144
Hmm, the intent is not to overwrite existing data files by default. The common use case is having a number of CSV files and wanting to add a
library(readr)
library(dplyr)
cars <- mtcars
write_csv(cars, "cars.csv")
cars <- cars %>% filter(mpg == 21)
# second write to the same path (append = FALSE by default)
write_csv(cars, "cars.csv")
Trying locally, I can replicate files not being overwritten. @hansvancalster have you tested this? Any idea why the behaviour differs from readr? In any case:
Below is a reprex where
which is indeed covered by the unit test. The difference between this reprex and the unit test is that the resource is modified in memory? There is no This is why I was confused about the documentation.
library(frictionless)
#> Warning: package 'frictionless' was built under R version 4.3.1
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data(cars)
data(iris)
p1 <- create_package() %>%
add_resource("cars", iris) %>%
write_package()
p2 <- p1 %>%
remove_resource("cars") %>%
add_resource("cars", cars) %>%
write_package()
head(read_resource(p1, "cars"))
#> # A tibble: 6 × 2
#> Sepal.Length Sepal.Width
#> <dbl> <dbl>
#> 1 4 2
#> 2 4 10
#> 3 7 4
#> 4 7 22
#> 5 8 16
#> 6 9 10
head(read_resource(p2, "cars"))
#> # A tibble: 6 × 2
#> speed dist
#> <dbl> <dbl>
#> 1 4 2
#> 2 4 10
#> 3 7 4
#> 4 7 22
#> 5 8 16
#> 6 9 10
Created on 2023-06-26 with reprex v2.0.2
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.3.0 (2023-04-21 ucrt)
#> os Windows 10 x64 (build 19044)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate Dutch_Belgium.utf8
#> ctype Dutch_Belgium.utf8
#> tz Europe/Brussels
#> date 2023-06-26
#> pandoc 2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.3.0)
#> bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)
#> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.0)
#> cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)
#> crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)
#> digest 0.6.31 2022-12-11 [1] CRAN (R 4.3.0)
#> dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)
#> evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)
#> fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)
#> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
#> frictionless * 1.0.2.9000 2023-06-19 [1] https://inbo.r-universe.dev (R 4.3.1)
#> fs 1.6.2 2023-04-25 [1] CRAN (R 4.3.0)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)
#> hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)
#> htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.3.0)
#> jsonlite 1.8.4 2022-12-06 [1] CRAN (R 4.3.0)
#> knitr 1.42 2023-01-25 [1] CRAN (R 4.3.0)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
#> purrr 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.0)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.3.0)
#> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.3.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
#> readr 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.3.0)
#> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)
#> rmarkdown 2.21 2023-03-26 [1] CRAN (R 4.3.0)
#> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.3.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
#> styler 1.9.1 2023-03-04 [1] CRAN (R 4.3.0)
#> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
#> tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.3.0)
#> utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)
#> vctrs 0.6.2 2023-04-19 [1] CRAN (R 4.3.0)
#> vroom 1.6.3 2023-04-28 [1] CRAN (R 4.3.0)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)
#> xfun 0.39 2023-04-20 [1] CRAN (R 4.3.0)
#> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)
#>
#> [1] C:/R/library
#> [2] C:/R/R-4.3.0/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
Hmm, you are correct, the local csv is being overwritten by the second
I believe the above description is the current implementation, but this should be more explicitly covered by tests and documented.
write_package
?write_package
As my suggestion for a feature for The fact that CSV files are not changed or removed when the metadata are changed seems a bit counterintuitive to me: if I change or remove tables in a datapackage, I would like the tables to match the metadata in each version of the datapackage (and vice versa), preferably without having to worry about this myself (or save each version of the datapackage in a separate folder and copy it to the git repo where I store the versions). I understand that not everyone wants their files overwritten, but would it be possible to add an argument 'overwrite_csv' that defaults to FALSE and could be set to TRUE to allow CSV files to be deleted or overwritten so that they stay matched with the metadata?
library(frictionless)
#> Warning: package 'frictionless' was built under R version 4.3.3
# create and write a package with 2 tables
create_package() |>
add_resource(
resource_name = "iris",
data = iris
) |>
add_resource(
resource_name = "cars",
data = cars
) |>
write_package("testdir")
# together with datapackage.json, 2 csv files are written
# when removing one of tables
read_package("testdir/datapackage.json") |>
remove_resource("cars") |>
write_package("testdir")
# cars.csv stays in testdir
# suggestion would be that write_package("testdir", overwrite_CSV = TRUE)
# would remove cars.csv from testdir
# when changing one of the tables
read_package("testdir/datapackage.json") |>
remove_resource("iris") |>
add_resource(
resource_name = "iris",
data = iris[, c("Sepal.Length", "Sepal.Width", "Species")]
) |>
write_package("testdir")
# iris.csv stays unchanged while the metadata are changed
# suggestion would be that write_package("testdir", overwrite_CSV = TRUE)
# would replace iris.csv in testdir by the new version
Created on 2024-04-12 with reprex v2.0.2
(This issue also seems somewhat related to issue #150, as that issue might be caused by the fact that tables and metadata do not necessarily stay the same after writing a new version of the datapackage.)
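Until something like this exists in frictionless, the leftover files could be found outside the package. What follows is only a rough sketch: find_orphaned_csv() is a hypothetical helper, not a frictionless function, and it assumes that every local resource stores its path relative to the directory that holds datapackage.json.

```r
library(jsonlite)

# Hypothetical helper (not part of frictionless): list CSV files under
# `directory` that no resource in datapackage.json refers to.
find_orphaned_csv <- function(directory) {
  descriptor <- fromJSON(
    file.path(directory, "datapackage.json"),
    simplifyVector = FALSE
  )
  # A resource `path` can be a single string or an array of strings;
  # as.character() keeps this a character vector when there are no resources.
  referenced <- as.character(
    unlist(lapply(descriptor$resources, function(resource) resource$path))
  )
  # Paths are returned relative to `directory`, like the paths in the metadata.
  on_disk <- list.files(directory, pattern = "\\.csv$", recursive = TRUE)
  setdiff(on_disk, referenced)
}

# Assumed usage: inspect the result first, then delete if it looks right.
# orphans <- find_orphaned_csv("testdir")
# file.remove(file.path("testdir", orphans))
```

Returning the file names instead of deleting them right away leaves the decision with the user, which fits the concern that not everyone wants files removed automatically.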
Thanks for the suggestion. Reading it quickly, I think this should be a feature of
It also has the advantage that users can make decisions on a resource rather than a package level. @ElsLommelen does this make sense?
Eh, it depends on how you will implement it. If this feature removes both the CSV and metadata of the specific table from the written package, it seems indeed fine to do it in
A related remark on this proposed solution: when removing a resource and adding it again afterwards, I notice this table always moves to the end in
Indeed, frictionless does not support a way to control the order of the resources. purrr (or other) packages likely provide a way to reorganize the elements in
The documentation of ?write_package() states in the description section: "Writes a Data Package and its related Data Resources to disk as a datapackage.json and CSV files. Already existing CSV files of the same name will not be overwritten." But I think it is the other way around: "Already existing CSV files will be overwritten."
Internally, write_package() calls write_resource(), which in turn uses readr::write_csv(resource$data, file.path(directory, file_name), na = ""). The latter has append = FALSE as default, which according to the doc will overwrite an existing CSV: "If FALSE, will overwrite existing file. If TRUE, will append to existing file. In both cases, if the file does not exist a new file is created."
N.B. I am in favour of keeping the current behaviour, but just want to note that the documentation should probably be fixed. If, on the other hand, you actually wanted the behaviour of the documentation, I suggest adding an extra boolean argument overwrite.
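The readr behaviour quoted above can be checked in isolation. This is a minimal sketch that only exercises readr::write_csv() with its default append = FALSE; it writes to a temporary file, and the variable names are illustrative only.

```r
library(readr)

# Use a temporary file so nothing in the working directory is touched.
path <- tempfile(fileext = ".csv")

# First write: all rows of mtcars.
write_csv(mtcars, path)

# Second write to the same path, relying on the default append = FALSE.
subset_rows <- mtcars[mtcars$mpg == 21, ]
write_csv(subset_rows, path)

# Read the file back: it holds only the rows of the second write.
n_rows <- nrow(read_csv(path, show_col_types = FALSE))
n_rows == nrow(subset_rows)
```

This says nothing about what write_package() does around that call; it only confirms what the underlying readr function does when the target file already exists.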