Frictionless silently adds columns not defined in the schema (add schema_sync) #150

Open
peterdesmet opened this issue Sep 13, 2023 · 0 comments
Labels
enhancement New feature or request function:read_resource Function read_resource()
peterdesmet commented Sep 13, 2023

example.zip contains data.csv with 3 columns. Only the first two are defined in datapackage.json.

  • name (string)
  • identifier (integer)
  • include (boolean)

Frictionless silently includes the extra column (include) and names it X3:

> p <- read_package("example/datapackage.json")
Please make sure you have the right to access data from this Data Package for your intended use.
Follow applicable norms or requirements to credit the dataset and its authors.
> d <- read_resource(p, "data")                                                                                                                      
> d
# A tibble: 4 × 3
  name     identifier X3   
  <chr>         <dbl> <lgl>
1 oconnell          1 TRUE 
2 rovero            2 TRUE 
3 cadman            3 FALSE
4 burton            4 FALSE

However, if the extra column is not the last one (e.g. name, include, identifier; see example2.zip), frictionless reads the second column (include) as identifier, which raises a parsing issue and yields NA values, and it still adds the last column as X3:

> p <- read_package("example/datapackage.json")
Please make sure you have the right to access data from this Data Package for your intended use.
Follow applicable norms or requirements to credit the dataset and its authors.
> d <- read_resource(p, "data")
Warning message:                                                                                                                      
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat) 
> d
# A tibble: 4 × 3
  name     identifier    X3
  <chr>         <dbl> <dbl>
1 oconnell         NA     1
2 rovero           NA     2
3 cadman           NA     3
4 burton           NA     4

Using col_select does not circumvent this issue:

> d <- read_resource(p, "data", col_select = c("name", "identifier"))
Warning message:                                                                                                                      
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat) 
> d
# A tibble: 4 × 2
  name     identifier
  <chr>         <dbl>
1 oconnell         NA
2 rovero           NA
3 cadman           NA
4 burton           NA

I think this behaviour should be documented better. Ideally, read_resource() would have a schema_sync parameter that compares the column headers in the data with the schema (#127) and behaves as follows:

If false (default):

  • Warn for name mismatch
  • Error for order mismatch
  • Error for extra columns (differs from current behavior)

If true:

  • Return all columns, in the order of the header in the data (cf. Frictionless Framework)
  • Apply type, enum, etc. if a matching column is found in the schema
  • Guess the type if a column is not defined in the schema

schema_sync = true would then allow looser reading of data, which would benefit e.g. the bioRad package.
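
To make the proposal concrete, here is a rough sketch of the header check in base R. It is not existing frictionless code: check_header() and its arguments are hypothetical names, and it only compares column names (it does not read data or apply types). The issue does not say what should happen when the data has fewer columns than the schema, so treating that as an error is an assumption.

```r
# Hypothetical helper, not part of frictionless: compares the header of a
# data file with the field names defined in the schema.
check_header <- function(header, schema_names, schema_sync = FALSE) {
  if (schema_sync) {
    # Loose mode: keep all columns in header order and flag which ones
    # have a schema definition (others would get a guessed type).
    return(data.frame(
      name = header,
      in_schema = header %in% schema_names,
      stringsAsFactors = FALSE
    ))
  }
  if (length(header) > length(schema_names)) {
    stop("Data contains columns not defined in the schema.", call. = FALSE)
  }
  if (length(header) < length(schema_names)) {
    # Assumption: not covered by the proposal, treated as an error here.
    stop("Data is missing columns defined in the schema.", call. = FALSE)
  }
  if (!identical(header, schema_names)) {
    if (setequal(header, schema_names)) {
      # Same names, different positions
      stop("Column order in data does not match the schema.", call. = FALSE)
    }
    warning("Column names in data do not match the schema.", call. = FALSE)
  }
  # Strict mode: the schema names are used as column names
  invisible(data.frame(
    name = schema_names,
    in_schema = TRUE,
    stringsAsFactors = FALSE
  ))
}
```

With the second example above, check_header(c("name", "include", "identifier"), c("name", "identifier")) would stop with the extra-columns error, while adding schema_sync = TRUE would return all three names and mark include as not defined in the schema.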
