Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor filters in spawn_senders #829

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

Asha20
Copy link
Contributor

@Asha20 Asha20 commented Aug 12, 2021

Relates to #382

I refactored all of the filters that don't require metadata, introducing each new filter in a separate commit. I'm planning to tackle the metadata filters later, but this might be a good stopping point for feedback before proceeding.

@Asha20
Copy link
Contributor Author

Asha20 commented Aug 14, 2021

@sharkdp Should I keep working on this PR?

src/walk.rs Outdated Show resolved Hide resolved
@sharkdp
Copy link
Owner

sharkdp commented Nov 14, 2021

How do we continue with this? I still think this could be a very valuable change to the code base. I would like to be very careful though and properly check with benchmarks that this doesn't introduce any performance regressions.

Sorry for leaving this hanging for so long, requiring (you) to fix a few merge conflicts.

@Asha20
Copy link
Contributor Author

Asha20 commented Nov 14, 2021

It's okay, you did mention you were pretty busy in the Reddit post asking for contributors.

I fixed the merge conflicts, refactored the code a tiny bit and then ran regression.sh from the fd-benchmarks repo. I built the master binary from the latest commit on master (this one at the time of writing) and the feature binary from this commit.

Benchmark report file

fd regression benchmark

No pattern

Command Mean [ms] Min [ms] Max [ms] Relative
./fd-master --hidden --no-ignore '' '/home/user' 343.1 ± 12.1 327.2 363.3 1.00
./fd-feature --hidden --no-ignore '' '/home/user' 367.1 ± 6.9 358.4 378.6 1.07 ± 0.04

Simple pattern

Command Mean [ms] Min [ms] Max [ms] Relative
./fd-master '.*[0-9]\.jpg$' '/home/user' 22.6 ± 2.2 18.6 28.4 1.00
./fd-feature '.*[0-9]\.jpg$' '/home/user' 22.7 ± 2.4 17.9 31.2 1.00 ± 0.15

Simple pattern (-HI)

Command Mean [ms] Min [ms] Max [ms] Relative
./fd-master -HI '.*[0-9]\.jpg$' '/home/user' 264.9 ± 9.4 254.4 285.5 1.00
./fd-feature -HI '.*[0-9]\.jpg$' '/home/user' 287.3 ± 9.0 274.4 303.0 1.08 ± 0.05

File extension

Command Mean [ms] Min [ms] Max [ms] Relative
./fd-master -HI --extension jpg '' '/home/user' 282.5 ± 8.8 273.7 295.5 1.00
./fd-feature -HI --extension jpg '' '/home/user' 475.7 ± 16.3 458.7 501.9 1.68 ± 0.08

File type

Command Mean [ms] Min [ms] Max [ms] Relative
./fd-master -HI --type l '' '/home/user' 261.3 ± 7.0 252.6 275.5 1.00
./fd-feature -HI --type l '' '/home/user' 291.1 ± 9.0 280.2 310.6 1.11 ± 0.05

Cold cache

Command Mean [s] Min [s] Max [s] Relative
./fd-master -HI '.*[0-9]\.jpg$' '/home/user' 79.630 ± 1.315 78.491 81.069 1.00
./fd-feature -HI '.*[0-9]\.jpg$' '/home/user' 81.093 ± 1.952 79.208 83.105 1.02 ± 0.03

My guess is that the massive performance regression in the File Extension test probably comes from this line, where I clone an Option<RegexSet>:

Box::new(Extensions::new(config.extensions.clone()))

I assume the Extensions filter could be easily modified to use Option<&RegexSet> instead. Let me know if the benchmarks look okay to you, assuming this outlier is fixed.

@sharkdp
Copy link
Owner

sharkdp commented Nov 14, 2021

Thank you very much for bringing this up to date!

Unfortunately, the benchmark results do not look okay, even excluding the --extension one. There is a 7-8% performance hit for the "No pattern" and "Simple pattern (-HI)" benchmarks, which I would not deem acceptable for a refactoring-only change. Let's try to figure out what is causing the regression. I would hope that this can be fixed somehow.

src/walk.rs Outdated Show resolved Hide resolved
src/filter/filetypes.rs Outdated Show resolved Hide resolved
@Asha20
Copy link
Contributor Author

Asha20 commented Nov 14, 2021

It's a lot better now! Here are the new benchmark results:

Benchmark report file

fd regression benchmark

No pattern

Command Mean [ms] Min [ms] Max [ms] Relative
./fd-master --hidden --no-ignore '' '/home/asha' 303.4 ± 10.6 286.0 315.4 1.05 ± 0.05
./fd-feature --hidden --no-ignore '' '/home/asha' 288.3 ± 8.7 272.4 299.9 1.00

Simple pattern

Command Mean [ms] Min [ms] Max [ms] Relative
./fd-master '.*[0-9]\.jpg$' '/home/asha' 19.0 ± 1.3 16.5 23.1 1.00 ± 0.11
./fd-feature '.*[0-9]\.jpg$' '/home/asha' 19.0 ± 1.6 14.2 23.6 1.00

Simple pattern (-HI)

Command Mean [ms] Min [ms] Max [ms] Relative
./fd-master -HI '.*[0-9]\.jpg$' '/home/asha' 210.2 ± 2.2 206.1 216.1 1.00
./fd-feature -HI '.*[0-9]\.jpg$' '/home/asha' 211.5 ± 4.8 206.4 223.2 1.01 ± 0.03

File extension

Command Mean [ms] Min [ms] Max [ms] Relative
./fd-master -HI --extension jpg '' '/home/asha' 222.5 ± 1.7 220.5 225.7 1.00
./fd-feature -HI --extension jpg '' '/home/asha' 222.8 ± 5.0 216.6 233.7 1.00 ± 0.02

File type

Command Mean [ms] Min [ms] Max [ms] Relative
./fd-master -HI --type l '' '/home/asha' 208.0 ± 6.1 200.7 218.9 1.00
./fd-feature -HI --type l '' '/home/asha' 208.7 ± 3.7 203.8 215.9 1.00 ± 0.03

Cold cache

Command Mean [s] Min [s] Max [s] Relative
./fd-master -HI '.*[0-9]\.jpg$' '/home/asha' 86.769 ± 1.437 85.202 88.022 1.02 ± 0.02
./fd-feature -HI '.*[0-9]\.jpg$' '/home/asha' 85.200 ± 1.499 83.473 86.163 1.00

@sharkdp
Copy link
Owner

sharkdp commented Nov 15, 2021

It's a lot better now!

Cool! Thank you for the update. Is the "Switch file filter dynamic dispatch to enum" commit essential to get these improved runtimes? I liked the previous implementation better. And I can't imagine that the dynamic dispatch is that expensive?

@Asha20
Copy link
Contributor Author

Asha20 commented Nov 15, 2021

I like the previous implementation more too honestly, though it does seem to have a noticeable impact; you can be the judge of whether it's acceptable or not. Here's a benchmark of the commit right before replacing the dynamic dispatch with the enum:

Benchmark report file

fd regression benchmark

No pattern

Command Mean [ms] Min [ms] Max [ms] Relative
./fd-master --hidden --no-ignore '' '/home/asha' 290.3 ± 8.5 282.2 306.3 1.00
./fd-feature --hidden --no-ignore '' '/home/asha' 301.6 ± 8.8 284.0 317.6 1.04 ± 0.04

Simple pattern

Command Mean [ms] Min [ms] Max [ms] Relative
./fd-master '.*[0-9]\.jpg$' '/home/asha' 18.7 ± 1.6 14.7 22.8 1.00
./fd-feature '.*[0-9]\.jpg$' '/home/asha' 19.1 ± 1.4 16.9 24.9 1.03 ± 0.12

Simple pattern (-HI)

Command Mean [ms] Min [ms] Max [ms] Relative
./fd-master -HI '.*[0-9]\.jpg$' '/home/asha' 208.1 ± 3.6 204.2 218.5 1.00
./fd-feature -HI '.*[0-9]\.jpg$' '/home/asha' 208.6 ± 4.4 203.1 218.2 1.00 ± 0.03

File extension

Command Mean [ms] Min [ms] Max [ms] Relative
./fd-master -HI --extension jpg '' '/home/asha' 222.3 ± 4.5 216.4 235.2 1.00
./fd-feature -HI --extension jpg '' '/home/asha' 223.8 ± 6.7 217.9 238.5 1.01 ± 0.04

File type

Command Mean [ms] Min [ms] Max [ms] Relative
./fd-master -HI --type l '' '/home/asha' 207.5 ± 4.6 202.7 218.7 1.00
./fd-feature -HI --type l '' '/home/asha' 209.3 ± 5.9 202.3 222.8 1.01 ± 0.04

Cold cache

Command Mean [s] Min [s] Max [s] Relative
./fd-master -HI '.*[0-9]\.jpg$' '/home/asha' 85.896 ± 1.179 84.573 86.836 1.00 ± 0.02
./fd-feature -HI '.*[0-9]\.jpg$' '/home/asha' 85.639 ± 0.604 84.953 86.091 1.00

Regarding the enum option, I'm guessing you mostly have a problem with the enum-name-struct-name repetition, right? I'm talking about this:

let mut filters: Vec<FilterKind> = vec![
    FilterKind::SkipRoot(SkipRoot),
    FilterKind::MinDepth(MinDepth::new(config.min_depth)),
    FilterKind::RegexMatch(RegexMatch::new(pattern, config.search_full_path)),
    FilterKind::Extensions(Extensions::new(config.extensions.as_ref())),
];

If that's the case, one potential improvement could be to have the structs return the relevant FilterKind variant themselves, something like this:

let mut filters = vec![
    SkipRoot::make_filter(),
    MinDepth::make_filter(config.min_depth),
    RegexMatch::make_filter(pattern, config.search_full_path),
    Extensions::make_filter(config.extensions.as_ref()),
];

@sharkdp
Copy link
Owner

sharkdp commented Nov 16, 2021

Regarding the enum option, I'm guessing you mostly have a problem with the enum-name-struct-name repetition, right? I'm talking about this:

Hm... not this duplication per se, just the general amount of additional code.

@sharkdp
Copy link
Owner

sharkdp commented Nov 16, 2021

I like the previous implementation more too honestly, though it does seem to have a noticeable impact; you can be the judge of whether it's acceptable or not. Here's a benchmark of the commit right before replacing the dynamic dispatch with the enum:

Very interesting. I wouldn't have expected this. This might have to do with inlining actually. If we use static dispatch, the compiler might be able to inline to filter implementations?

@tavianator
Copy link
Collaborator

Can we try something like this?

struct AllFilters {
    skip_root: SkipRoot,
    min_depth: MinDepth,
    ...
}

impl Filter for AllFilters {
    fn should_skip(&self, entry: &DirEntry) -> bool {
        self.skip_root.should_skip(entry)
            || self.min_depth.should_skip(entry)
            || ...
    }
}

For the FileTypes filter, either use Option<FileTypes> or just make a no-op FileTypes.

@sharkdp
Copy link
Owner

sharkdp commented Nov 16, 2021

Can we try something like this?

Doesn't that defeat the whole purpose? I mean, it would still move code out to multiple functions. But we couldn't simply call:

let should_skip = filters.iter().any(|filter| filter.should_skip(&entry));

and would have to list every filter manually. The Filter trait would be pretty useless. We could also just directly call sth. like:

let should_skip = min_depth_should_skip() || file_type_should_skip() || …;

from the main code path.

The enum version isn't super pretty, but it's pretty much what one would do with std::variant and std::visit in C++ as well to avoid the dynamic dispatch, I believe.

@Asha20
Copy link
Contributor Author

Asha20 commented Nov 16, 2021

Should the code stay as it is then? The metadata filters are next up for similar refactoring as per the first comment of the issue.

@yyogo
Copy link
Contributor

yyogo commented Nov 17, 2021

How about something like:

pub trait Filter: Send + Sync + Sized {
    /// Whether the entry should be skipped or not.
    fn should_skip(&self, entry: &DirEntry) -> bool;
    
    fn chain<F: Filter>(self, other: F) -> ChainedFilter<Self, F> {
        ChainedFilter(self, other)
    }
}

pub struct ChainedFilter<F1: Filter, F2: Filter>(F1, F2);

impl<F1: Filter, F2: Filter> Filter for ChainedFilter<F1, F2> {
    
    fn should_skip(&self, entry: &DirEntry) -> bool {
        self.0.should_filter(entry) || self.1.should_filter(entry)
}

// then in walk:

let combined_filter = SkipRoot
    .chain(MinDepth::new(config.min_depth))
    .chain(RegexMatch::new(pattern, config.search_full_path))
   //...

This keeps the filters static but removes the need for any dispatch

@tavianator
Copy link
Collaborator

@yyogo Iterator style, I like it

@Asha20
Copy link
Contributor Author

Asha20 commented Nov 22, 2021

How would you handle optional filters with that approach?

let filters = {
    let filters = SkipRoot
        .chain(MinDepth::new(config.min_depth))
        .chain(RegexMatch::new(pattern, config.search_full_path))
        .chain(Extensions::new(config.extensions.as_ref()));

    if let Some(file_types) = config.file_types.clone() {
        filters.chain(file_types)
    } else {
        filters
    }
};

This is a no-go since you're returning different types inside if and else. If you go for this:

let filters = SkipRoot
    .chain(MinDepth::new(config.min_depth))
    .chain(RegexMatch::new(pattern, config.search_full_path))
    .chain(Extensions::new(config.extensions.as_ref()))
    .chain_opt(config.file_types.clone());

And try to implement chain_opt:

fn chain_opt<F: Filter>(self, other: Option<F>) -> ChainedFilter<Self, F> {
    match other {
        Some(other) => self.chain(other),
        None => ???,
    }
}

What should the None case return? You can't return self since it's not a ChainedFilter, and you can't chain some kind of NeverSkip filter since it's a different type from F.

@yyogo
Copy link
Contributor

yyogo commented Nov 22, 2021

@Asha20 maybe this?

impl<F> Filter for Option<F> where F: Filter { 
    fn should_skip(&self, entry: &DirEntry) -> bool {
        self.as_ref().map_or(false, |f| f.should_skip(entry)) 
    } 
}

alternatively config.file_types could take a default instead of being an Option, but that sounds bad for performance when it's not needed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants