Apply functools.lru_cache to RegexFSM to Improve CFGFSM performance #621
Improves the performance of `lark_lark_self_grammar.lark.test-True` (`tests-True` meaning the cache is already populated, simulating a second run): 16 tokens/second -> 39 tokens/second on the second run. (Profiled with #587.)

Addresses #620
Problem

The vast majority of the time in the second run of a `CFGFSM` is spent retrieving cached `RegexFSM` instances. `RegexFSM.__init__` (called 2772 times) takes 144 seconds; specifically, hashing `tokenizer.vocabulary` for cache lookups is slow.

Solution
To alleviate this, `RegexFSM` is wrapped with `lru_cache()`. This operation is safe because we never actually mutate `RegexFSM`.
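As a minimal sketch of the pattern (the class body and arguments below are stand-ins, not the real `outlines` implementation): wrapping the class itself with `lru_cache` means that calling `RegexFSM(...)` with previously seen arguments returns the cached instance instead of re-running the expensive `__init__`.

```python
from functools import lru_cache

# Hypothetical stand-in for the real RegexFSM; the real __init__ compiles
# a regex into an FSM, which is the expensive part being cached.
init_calls = 0

class RegexFSM:
    def __init__(self, regex_string, vocabulary):
        global init_calls
        init_calls += 1  # count constructions to show caching works
        self.regex_string = regex_string
        self.vocabulary = vocabulary

# Wrap the class itself: the wrapper memoizes calls keyed on the
# (hashable) constructor arguments.
RegexFSM = lru_cache(maxsize=None)(RegexFSM)

a = RegexFSM("[0-9]+", ("0", "1", "2"))
b = RegexFSM("[0-9]+", ("0", "1", "2"))  # cache hit: same object, no re-init
```

Sharing one instance across callers like this is only safe because the instances are never mutated after construction, which is exactly the property the PR relies on.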
This simple change nearly triples throughput, decreasing the runtime of `lark_self_grammar.lark.test`'s second run from 179 seconds to 71 seconds.

Initial performance:
Profile:
This PR's performance:
Profile:
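For context on the Problem section above, here is a hypothetical sketch (illustrative names and sizes, not measurements from the real codebase) of why cache retrieval keyed on the vocabulary is costly: `lru_cache` hashes every argument on each lookup, so even a cache hit pays the cost of hashing a large vocabulary tuple.

```python
from functools import lru_cache

# Illustrative large vocabulary; tuples do not cache their hash, so every
# lookup re-hashes all entries to build the cache key.
vocabulary = tuple(f"token_{i}" for i in range(50_000))

@lru_cache(maxsize=None)
def build_fsm(regex_string, vocabulary):
    # Stand-in for expensive FSM construction.
    return (regex_string, len(vocabulary))

build_fsm("[0-9]+", vocabulary)  # miss: runs the body and caches the result
build_fsm("[0-9]+", vocabulary)  # hit: skips the body, but still hashes
                                 # all 50,000 vocabulary entries for the key
info = build_fsm.cache_info()
```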