Support configurable summary() #624

Open
vorpalvorpal opened this issue Nov 2, 2020 · 8 comments

@vorpalvorpal

Because I am fairly incompetent, I seem to keep introducing duplicate rows into my data frames. I was wondering if, in the initial data summary part of the output of skim(), "duplicate rows" might be a useful additional metric.

@elinw
Collaborator

elinw commented Nov 4, 2020

Skimr is pretty column-oriented, and you're asking for something row-oriented. That said, I think that sum(duplicated(x)) would give that number. Of course, in many data sets it is expected that there will be repeats.
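For reference, here is a minimal base R sketch of the sum(duplicated(x)) idea on a toy data frame. The data and variable names are made up for illustration. Note that duplicated() flags only the second and later copies of a row, so the count is the number of extra copies rather than the number of rows involved in a repeat.

```r
# Toy data frame in which rows 1 and 3 are identical.
df <- data.frame(
  id = c(1, 2, 1, 3),
  value = c("a", "b", "a", "c"),
  stringsAsFactors = FALSE
)

# duplicated() marks each row that repeats an earlier row,
# so the sum counts the extra copies only.
n_duplicate_rows <- sum(duplicated(df))

# All rows that occur more than once, including the first occurrence.
repeated <- df[duplicated(df) | duplicated(df, fromLast = TRUE), ]
```

Inspecting `repeated` is one way to see which rows are involved before deciding whether the repeats are expected.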

@michaelquinn32
Collaborator

I think we can go a bit further. The most useful place for this would be in the summary, i.e.

skimr/R/summary.R

Lines 12 to 14 in 22dfec2

summary.skim_df <- function(object, ...) {
  if (is.null(object)) {
    stop("dataframe is null.")

I think the implementation depends on how far we should push this.

  • Should the summary function be customizable with an sfl?
  • How would that impact printing the summary?
  • Are we confident that only a minimal number of functions would ever be needed there?

@elinw
Collaborator

elinw commented Nov 5, 2020

I was thinking the same thing, i.e. should we make it customizable because this might be the first of many requests to add things. I do think that for our user scenario of "someone gives you a data set and you're trying to understand it" it might be very useful. If there are a lot of duplicates it might be smart to store it in a way that reflects that.

@elinw
Collaborator

elinw commented Jan 1, 2022

@michaelquinn32 if we are fixing issues on summary we could think about this one.

@michaelquinn32
Collaborator

michaelquinn32 commented Jan 1, 2022

This is a little more than the current updates to the summary(), since we'll need to modify the skim object to store this information. I can get to it soon.

@elinw
Collaborator

elinw commented Jan 4, 2022

What I was thinking is that eventually, once we have a more flexible summary, it would really allow a user to do this themselves.

@michaelquinn32
Collaborator

Could put this on the roadmap too.

Right now, the issue is that we generate all of the summary components as skimr attributes, which we then extract in the summary function.
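To make the attribute-based pattern concrete, here is a rough base R sketch. It is not skimr's actual code: make_result(), summarize_result(), and the attribute names are hypothetical stand-ins. It only illustrates why a new statistic such as a duplicate-row count has to be stored when the result object is built, before a summary function can report it.

```r
# Hypothetical stand-in for a skim result: a data frame that carries
# summary components as attributes. Attribute names are illustrative.
make_result <- function(data) {
  result <- data.frame(variable = names(data), stringsAsFactors = FALSE)
  attr(result, "n_rows") <- nrow(data)
  attr(result, "n_cols") <- ncol(data)
  # A new statistic must be computed here, while the raw data is available.
  attr(result, "n_duplicate_rows") <- sum(duplicated(data))
  result
}

# The summary-like function only reads back what was stored earlier.
summarize_result <- function(object) {
  list(
    n_rows = attr(object, "n_rows"),
    n_cols = attr(object, "n_cols"),
    n_duplicate_rows = attr(object, "n_duplicate_rows")
  )
}

res <- make_result(data.frame(x = c(1, 1, 2), y = c("a", "a", "b")))
out <- summarize_result(res)
```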

For a 3.0, we could extend skim_with() to provide a custom summary function. We could store the result of this as a single attribute in the skim_df, and we might consider a custom print handling function (like in #667) or maybe we can simplify the output.

gt() handles grouping variables.
http://www.danieldsjoberg.com/gt-and-gtsummary-presentation/#11

So we could require a summary function to produce

[stat group type] [stat name] [value]

This should give a value that is pretty similar to what we currently generate.

You could even think of a summary interface that is similar to skimr's, basically using sfls.

my_skim <- skim_with(
  .summary = skimr_summary_fun(
    metadata = sfl(
      name = get_data_name,
      group_variables = dplyr::groups
    ),
    counts = sfl(
      number_of_rows = nrow,
      number_of_columns = length,
      number_of_duplicate_rows = ~ sum(duplicated(.))
    ),
    .include_column_types = TRUE
  )
)

The last part is set as a function argument, since counting column types is something we currently do on the skim_df result. The other option there would be to support name = function() values, where the function returns something that can be coerced into stat name-value pairs, and the name becomes the name of the group. That's a lot more flexible, and could probably support summary functions that tell you which columns are most similar, or something like that.
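As a rough sketch of that second option, the base R helper below takes a named list of functions, calls each one on the data, and stacks the results into [stat group] [stat name] [value] rows. Everything here is an assumption for illustration: build_summary() and the column names are hypothetical, and none of this exists in skimr.

```r
# Hypothetical helper: each entry of `funs` is a function of the data that
# returns a named vector or list; the entry name becomes the group name.
build_summary <- function(data, funs) {
  pieces <- lapply(names(funs), function(group) {
    stats <- unlist(funs[[group]](data))
    data.frame(
      stat_group = group,
      stat_name = names(stats),
      # Values are stored as text so that groups of mixed types can be stacked.
      value = as.character(stats),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

out <- build_summary(
  data.frame(x = c(1, 1, 2), y = c("a", "a", "b")),
  list(
    counts = function(d) c(rows = nrow(d), duplicate_rows = sum(duplicated(d))),
    ranges = function(d) list(min_x = min(d$x), max_x = max(d$x))
  )
)
```

Storing every value as character is a simplification. A real design would need to decide how to keep types, or leave formatting to the print method.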

What do you think?

@michaelquinn32 changed the title from "[feature request] Duplicate rows" to "Support configurable summary()" on Jan 18, 2022

@elinw
Collaborator

elinw commented Jan 2, 2023

I just reread this and yes I really think that an sfl for summary would be the way to go.
