Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update caching and add tokenizer to create_states_mapping #911

Merged

Conversation

brandonwillard
Copy link
Contributor

@brandonwillard brandonwillard commented May 21, 2024

This PR removes the use of hash values as the keys to the underlying function call cache. It also adds the tokenizer to the arguments of create_states_mapping.

The main change here is that hash_arguments has been removed and replaced with a Disk subclass, CloudpickleDisk, that uses cloudpickle. This allows all the same objects to be cacheable as before, while also removing the use of hash values as keys. The latter is especially important, since hash collisions at the key level can return the wrong results from the cache.

Likewise, the key_function argument to cache has been removed. If one needs this sort of functionality, they can wrap the corresponding arguments with objects that properly implement serialization.

Closes #872

@brandonwillard brandonwillard marked this pull request as draft May 21, 2024 23:41
@brandonwillard brandonwillard force-pushed the use-persistent-tokenizer-hash branch 2 times, most recently from 8136133 to 5d795cf Compare May 22, 2024 00:03
@brandonwillard brandonwillard marked this pull request as ready for review May 22, 2024 00:04
@brandonwillard brandonwillard requested a review from rlouf May 22, 2024 00:04
@brandonwillard brandonwillard added transformers Linked to the `transformers` integration structured generation Linked to structured generation correctness Everything related to the generation correctness labels May 22, 2024
@brandonwillard brandonwillard marked this pull request as draft May 22, 2024 00:12
@brandonwillard brandonwillard changed the title Use a persistent Tokenizer hash and add tokenizer to cache signature Use a persistent tokenizer hash for create_states_mapping caching May 22, 2024
@brandonwillard brandonwillard changed the title Use a persistent tokenizer hash for create_states_mapping caching Update caching and add tokenizer to create_states_mapping May 22, 2024
@brandonwillard brandonwillard force-pushed the use-persistent-tokenizer-hash branch 3 times, most recently from 917e2c9 to 3401637 Compare May 22, 2024 21:50
@brandonwillard brandonwillard marked this pull request as ready for review May 22, 2024 21:51
@brandonwillard brandonwillard requested a review from rlouf May 22, 2024 22:00
These changes make the `cache` decorator operate more like `diskcache`'s
existing `memoize` method.  They also remove the use of hash value as store
keys.
@brandonwillard brandonwillard merged commit ba7affd into outlines-dev:main May 23, 2024
5 checks passed
@brandonwillard brandonwillard deleted the use-persistent-tokenizer-hash branch May 23, 2024 18:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug correctness Everything related to the generation correctness structured generation Linked to structured generation transformers Linked to the `transformers` integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

State mapping cache ignores the tokenizer used to build the state machine
2 participants