tighten rule pre-selection #2080

Draft
wants to merge 21 commits into
base: feat/1755


williballenthin
Collaborator

@williballenthin williballenthin commented May 14, 2024

closes #2074
ref #2063, particularly "tighten rule pre-selection" and "lots of time spent in instancecheck"

Stacked on #1950, so I've marked this as a PR onto that branch so the diff is sensible. I think we can probably rebase onto master, though, if necessary.


This PR implements the "tighten rule pre-selection" algorithm described here: #2063 (comment). In summary:

Rather than indexing all features from all rules, we should pick and index the minimal set (ideally, one) of features from each rule that must be present for the rule to match. When we have multiple candidates, pick the feature that is probably most uncommon and therefore "selective".
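
A minimal sketch of the idea summarized above, not the code in this PR. It assumes a simplified rule model where every listed feature is required (real rules also have or/not/optional/count nodes), and a hypothetical `commonness` score per feature where a lower score means rarer. The names `build_index`, `candidate_rules`, and `match` are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set


def build_index(rules: Dict[str, FrozenSet[str]], commonness: Dict[str, int]) -> Dict[str, List[str]]:
    """Index each rule under its single most selective required feature.

    Assumes every rule has at least one required feature.
    """
    index: Dict[str, List[str]] = defaultdict(list)
    for name, required in rules.items():
        # lowest score = rarest; features without a score are treated as rarest,
        # and ties are broken by feature name so the result is deterministic
        best = min(required, key=lambda f: (commonness.get(f, 0), f))
        index[best].append(name)
    return dict(index)


def candidate_rules(index: Dict[str, List[str]], features: Set[str]) -> List[str]:
    """Return rules whose indexed feature is present; only these need full evaluation."""
    return sorted(name for f in features if f in index for name in index[f])


def match(rules: Dict[str, FrozenSet[str]], index: Dict[str, List[str]], features: Set[str]) -> List[str]:
    """Fully evaluate only the pre-selected candidates."""
    return [name for name in candidate_rules(index, features) if rules[name] <= features]


# hypothetical rules and commonness estimates, for illustration only
rules = {
    "create mutex": frozenset({"api(CreateMutexA)", "mnemonic(call)"}),
    "rc4 ksa": frozenset({"number(0x100)", "mnemonic(xor)"}),
}
commonness = {
    "mnemonic(call)": 1000,
    "mnemonic(xor)": 800,
    "number(0x100)": 50,
    "api(CreateMutexA)": 5,
}

index = build_index(rules, commonness)
features = {"mnemonic(call)", "mnemonic(xor)", "api(CreateMutexA)"}
print(index)
print(match(rules, index, features))
```

Because the common `mnemonic(...)` features are never used as index keys in this sketch, seeing them in a function does not trigger any rule evaluation, which is where the reduction in evaluations would come from.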

This seems to work pretty well. Total evaluations when running against mimikatz drop from 19M to 815K (wow!) and capa seems to match around 3x more functions per second (wow wow). I did not expect such a good result - in fact, although the capa matches seem to be the same, I still wonder if something is broken 🤔. More tests needed.

| label | count(evaluations) | min(time) | avg(time) | max(time) |
|---|---|---|---|---|
| 8858537 pep8 | 19,939,632 | 25.74s | 25.80s | 25.84s |
| 9c0c662 rules: optimize rule pre-filtering, first revision | 815,892 | 9.45s | 9.46s | 9.48s |
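
For reference, the ratios implied by the benchmark figures above can be worked out directly. This only restates the reported numbers; it adds no new measurements.

```python
# Figures copied from the benchmark rows above.
evals_before, evals_after = 19_939_632, 815_892
avg_before, avg_after = 25.80, 9.46  # average run time, in seconds

eval_reduction = evals_before / evals_after
speedup = avg_before / avg_after

print(f"evaluations reduced {eval_reduction:.1f}x")  # prints 24.4x
print(f"average run time improved {speedup:.2f}x")   # prints 2.73x
```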

TODO:

  • namespace matching
  • prove that it matches exactly the same as before, just faster
  • add some tests for the feature indexer, if only to show a human how it works
  • add matcher tests for namespace matching
  • xfail the tests and document the unsupported constructs
  • inline documentation explaining the algorithm better
  • wall clock performance numbers

@williballenthin williballenthin added the enhancement New feature or request label May 14, 2024

@github-actions github-actions bot left a comment


Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed

Comment on lines +1389 to +1415
@property
def file_rules(self):
    return self.rules_by_scope[Scope.FILE]

@property
def process_rules(self):
    return self.rules_by_scope[Scope.PROCESS]

@property
def thread_rules(self):
    return self.rules_by_scope[Scope.THREAD]

@property
def call_rules(self):
    return self.rules_by_scope[Scope.CALL]

@property
def function_rules(self):
    return self.rules_by_scope[Scope.FUNCTION]

@property
def basic_block_rules(self):
    return self.rules_by_scope[Scope.BASIC_BLOCK]

@property
def instruction_rules(self):
    return self.rules_by_scope[Scope.INSTRUCTION]
Collaborator Author


these are for backwards compatibility. in a future major version, we can probably remove them in favor of rules_by_scope.

@williballenthin
Collaborator Author

Opened the PR here so the code is no longer sitting on my laptop and at risk of getting lost due to hardware failure.

Collaborator

@mr-tz mr-tz left a comment


nice work, we should do extensive tests comparing the results before and after to ensure everything works as expected. the speedup looks promising!

@williballenthin
Collaborator Author

williballenthin commented May 16, 2024

we should do extensive tests comparing the results before and after to ensure everything works as expected.

I plan to run this implementation side by side with the ceng.match implementation and assert the results are precisely the same across a wide range of samples. There should be no leaks of abstraction or details in the new one; it should just be faster.
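
A rough sketch of what such a side-by-side check could look like. Everything here is hypothetical: the toy `RULES`, both matcher functions, and `assert_same_matches` are stand-ins for illustration and do not use capa's actual API.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Set

Matcher = Callable[[Set[str]], Set[str]]

# hypothetical toy rules: every listed feature must be present
RULES: Dict[str, FrozenSet[str]] = {
    "rule a": frozenset({"api(X)", "mnemonic(call)"}),
    "rule b": frozenset({"number(1)"}),
}


def reference_match(features: Set[str]) -> Set[str]:
    """Baseline: evaluate every rule against the features."""
    return {name for name, required in RULES.items() if required <= features}


# index each rule under one of its required features (here: the alphabetically first)
INDEX: Dict[str, List[str]] = {}
for _name, _required in RULES.items():
    INDEX.setdefault(min(_required), []).append(_name)


def optimized_match(features: Set[str]) -> Set[str]:
    """Pre-select candidates via the index, then fully evaluate only those."""
    candidates = {name for f in features for name in INDEX.get(f, [])}
    return {name for name in candidates if RULES[name] <= features}


def assert_same_matches(reference: Matcher, optimized: Matcher, samples: Iterable[Set[str]]) -> int:
    """Run both matchers on every sample and fail loudly on the first difference."""
    checked = 0
    for features in samples:
        expected = reference(features)
        actual = optimized(features)
        assert actual == expected, (
            f"mismatch for {sorted(features)}: "
            f"missing={sorted(expected - actual)} extra={sorted(actual - expected)}"
        )
        checked += 1
    return checked


SAMPLES: List[Set[str]] = [
    set(),
    {"api(X)"},
    {"api(X)", "mnemonic(call)"},
    {"number(1)", "mnemonic(call)"},
]

checked = assert_same_matches(reference_match, optimized_match, SAMPLES)
print(f"{checked} samples produced identical matches")
```

Reporting both the missing and the extra matches in the assertion message makes it easier to tell whether a discrepancy comes from over-aggressive pre-selection or from a difference in evaluation.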

@github-actions github-actions bot dismissed their stale review May 22, 2024 13:23

CHANGELOG updated or no update needed, thanks! 😄

Labels
enhancement New feature or request
2 participants