Frictionless silently adds columns not defined in the schema (add `schema_sync`) #150

peterdesmet · 2023-09-13T15:16:41Z

example.zip contains data.csv with 3 columns. Only the first two are defined in datapackage.json.

- name (string)
- identifier (integer)
- include (boolean)

Frictionless will silently include the extra column `include` and name it `X3`:

> p <- read_package("example/datapackage.json")
Please make sure you have the right to access data from this Data Package for your intended use.
Follow applicable norms or requirements to credit the dataset and its authors.
> d <- read_resource(p, "data")                                                                                                                      
> d
# A tibble: 4 × 3
  name     identifier X3   
  <chr>         <dbl> <lgl>
1 oconnell          1 TRUE 
2 rovero            2 TRUE 
3 cadman            3 FALSE
4 burton            4 FALSE

However, if the extra column is not the last one but sits in the middle (name, include, identifier; see example2.zip), frictionless will report a parsing issue for the second column and still add the last column as `X3`:

> p <- read_package("example/datapackage.json")
Please make sure you have the right to access data from this Data Package for your intended use.
Follow applicable norms or requirements to credit the dataset and its authors.
> d <- read_resource(p, "data")
Warning message:                                                                                                                      
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat) 
> d
# A tibble: 4 × 3
  name     identifier    X3
  <chr>         <dbl> <dbl>
1 oconnell         NA     1
2 rovero           NA     2
3 cadman           NA     3
4 burton           NA     4

A col_select does not circumvent this issue:

> d <- read_resource(p, "data", col_select = c("name", "identifier"))
Warning message:                                                                                                                      
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat) 
> d
# A tibble: 4 × 2
  name     identifier
  <chr>         <dbl>
1 oconnell         NA
2 rovero           NA
3 cadman           NA
4 burton           NA

I think this behaviour should be documented better. Ideally, we have a `schema_sync` parameter in `read_resource()` that compares the column headers with the schema (#127) and behaves as follows:

If false (default):

- Warn for name mismatch
- Error for order mismatch
- Error for extra columns (different from current behaviour)

If true:

- Return all columns, in the order of the header in the data (cf. Frictionless Framework)
- Apply type, enum, etc. if a matching column is found in the schema
- Guess the type if the column is not defined in the schema

`schema_sync = true` then allows looser reading of data, which would be beneficial to e.g. the bioRad package.
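
To make the proposal concrete, here is a minimal base R sketch of the header check. It is not frictionless code: `check_header()`, its arguments and its messages are hypothetical names chosen for illustration, and the case of fewer columns than the schema (not covered above) is simply treated as a name mismatch.

```r
# Hypothetical helper illustrating the proposed `schema_sync` rules.
# `header`: column names found in the data file.
# `schema_names`: field names defined in the Table Schema.
check_header <- function(header, schema_names, schema_sync = FALSE) {
  if (schema_sync) {
    # Keep every column in header order and flag which ones the schema
    # defines; undefined columns would have their type guessed.
    return(data.frame(
      name = header,
      in_schema = header %in% schema_names,
      stringsAsFactors = FALSE
    ))
  }
  if (length(header) > length(schema_names)) {
    stop("Data has more columns than the schema defines.", call. = FALSE)
  }
  if (setequal(header, schema_names) && !identical(header, schema_names)) {
    stop("Column order in the data differs from the schema.", call. = FALSE)
  }
  if (!identical(header, schema_names)) {
    warning("Column names in the data differ from the schema.", call. = FALSE)
  }
  # Columns are read positionally, so the schema names are used.
  invisible(schema_names)
}

# example2.zip layout: the undefined column sits in the middle.
check_header(
  header = c("name", "include", "identifier"),
  schema_names = c("name", "identifier"),
  schema_sync = TRUE
)
```

With `schema_sync = FALSE`, the same call would stop with the "more columns" error instead of silently returning an `X3` column.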

peterdesmet added the enhancement and function:read_resource labels Sep 13, 2023

peterdesmet added this to the 1.2.0 milestone Mar 27, 2024

ElsLommelen mentioned this issue Apr 12, 2024

Clarify overwrite behaviour in documentation of write_package #144

Open
