Feature Request: Reduce Boilerplate to Silence Many-to-Many Join Warnings in dplyr #6993

Open
ryandward opened this issue Feb 23, 2024 · 4 comments

Comments

@ryandward

Brief description of the problem:

Exploratory (i.e. non-production) data analysis often requires multiple many-to-many joins. dplyr currently emits a warning for each such join unless relationship = "many-to-many" is specified explicitly, and these warnings significantly clutter the console output.

This behavior adds verbosity to exploratory analysis, where many-to-many relationships are expected, anticipated, and managed. Having to specify the relationship parameter on every join to avoid these warnings is cumbersome and detracts from dplyr's efficiency as an exploration tool.

An option to globally silence these warnings would streamline exploratory data analyses, allowing for a focus on substantive inquiry and result interpretation.

Desired output:

A global option in dplyr to silence warnings for many-to-many joins, improving the user experience by reducing repetitive boilerplate and keeping the focus on relevant output.

Reprex:

    library(dplyr)
 
    # Example datasets demonstrating the necessity of chained inner_joins
    df1 <- data.frame(key = c(1, 2, 2, 3), value = c("A", "B", "C", "D"))
    df2 <- data.frame(key = c(2, 2, 3, 3, 4), value2 = c("W", "X", "Y", "Z", "P"))
    df3 <- data.frame(key = c(2, 3, 3, 4, 5), value3 = c("K", "L", "M", "N", "O"))
 
    # Chained inner_join operations common in exploratory analysis
    df1 %>%
      inner_join(df2, by = "key") %>%
      inner_join(df3, by = "key")

  key value value2 value3
1   2     B      W      K
2   2     B      X      K
3   2     C      W      K
4   2     C      X      K
5   3     D      Y      L
6   3     D      Y      M
7   3     D      Z      L
8   3     D      Z      M
Warning messages:
1: In inner_join(., df2, by = "key") :
  Detected an unexpected many-to-many relationship between `x` and `y`.
  Row 2 of `x` matches multiple rows in `y`.
  Row 1 of `y` matches multiple rows in `x`.
  If a many-to-many relationship is expected, set `relationship = "many-to-many"` to silence this warning.
2: In inner_join(., df3, by = "key") :
  Detected an unexpected many-to-many relationship between `x` and `y`.
  Row 5 of `x` matches multiple rows in `y`.
  Row 1 of `y` matches multiple rows in `x`.
  If a many-to-many relationship is expected, set `relationship = "many-to-many"` to silence this warning.

In the above reprex, joining objects that have duplicate keys results in a many-to-many join, which triggers a warning that is sometimes longer than the output itself. The proposed feature would remove the need to suppress the warning explicitly on each operation, assuming a global setting had been enabled.

Hypothetical implementation 1

Introduce shorter aliases that are faster to type, e.g. "rel" for "relationship" and "mm" for "many-to-many", with corresponding short forms for the other relationships.

    df1 %>%
      inner_join(df2, by = "key", rel = "mm") %>%
      inner_join(df3, by = "key", rel = "mm")

Hypothetical implementation 2

Introduce an option to silence these warnings for a session.

    options(dplyr.silence_many_to_many_warnings = TRUE)
    df1 %>%
      inner_join(df2, by = "key") %>%
      inner_join(df3, by = "key")

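In the meantime, something close to this can be approximated on the user side with base R condition handling. The following is only a sketch: `quiet_many_to_many()` is a hypothetical helper name (not a dplyr function), and it assumes the warning text keeps the phrase "many-to-many". The option shown above is likewise hypothetical and does not exist in dplyr.

```r
# Hypothetical helper (not part of dplyr): evaluate `expr` and muffle only
# those warnings whose message mentions a many-to-many relationship.
# All other warnings are passed through untouched.
quiet_many_to_many <- function(expr) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      # Assumption: the dplyr warning text contains "many-to-many".
      if (grepl("many-to-many", conditionMessage(w), fixed = TRUE)) {
        invokeRestart("muffleWarning")
      }
    }
  )
}

# Demo on the reprex data; skipped if dplyr is not installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  df1 <- data.frame(key = c(1, 2, 2, 3), value = c("A", "B", "C", "D"))
  df2 <- data.frame(key = c(2, 2, 3, 3, 4), value2 = c("W", "X", "Y", "Z", "P"))
  df3 <- data.frame(key = c(2, 3, 3, 4, 5), value3 = c("K", "L", "M", "N", "O"))

  result <- quiet_many_to_many(
    dplyr::inner_join(
      dplyr::inner_join(df1, df2, by = "key"),
      df3,
      by = "key"
    )
  )
  print(result)
}
```

Wrapping a whole pipeline in one call keeps the joins themselves unchanged. The trade-off is that matching on message text is brittle, so this is probably best kept to interactive sessions.
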
@ryandward
Author

We gotta do something about this. I must reiterate how incredibly frustrating and hampering this is to exploratory research.

@philibe

philibe commented Mar 21, 2024

I'm a user of dplyr. IMHO you're right. It's discussed in this closed issue:

@hadley
Member

hadley commented Apr 1, 2024

We realise that this behaviour might be annoying for folks who really depend a lot on many to many joins, but our experience is that these warnings are useful for the overwhelming majority of dplyr users, and we don't have any intention to change the default behaviour at this time.

If you find this really frustrating, I'd suggest making a couple of your own little helpers like this:

inner_mm_join <- function(...) inner_join(..., relationship = "many-to-many")
left_mm_join <- function(...) left_join(..., relationship = "many-to-many")

That will reduce the amount of typing you need to do while still keeping your code compact and easy to understand.
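
The same idea can be generalised so that the wrapper is written only once. This is a sketch, not dplyr API: `as_mm_join()` is a hypothetical name, and it assumes a dplyr version whose join functions accept the `relationship` argument.

```r
# Hypothetical factory (not part of dplyr): wrap any join function so that
# relationship = "many-to-many" is always supplied.
as_mm_join <- function(join_fn) {
  force(join_fn)
  function(...) join_fn(..., relationship = "many-to-many")
}

# Build the helpers; skipped if dplyr is not installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  inner_mm_join <- as_mm_join(dplyr::inner_join)
  left_mm_join  <- as_mm_join(dplyr::left_join)
  right_mm_join <- as_mm_join(dplyr::right_join)
  full_mm_join  <- as_mm_join(dplyr::full_join)
}
```

Note that passing `relationship` again to one of these wrappers will fail with a duplicate-argument error, since the wrapper already supplies it.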

@ryandward
Author

Thanks @hadley, this is an acceptable solution, but I still do not love it. I am just curious whether you have any references to point to for the overwhelming majority you mention here.
