after tokenizing with trained tokenizer, the "tokens" array contains original tokens #166

Open
theglassofwater opened this issue May 9, 2024 · 2 comments

Comments

@theglassofwater

After tokenizing a song with a trained tokenizer, the "tokens" array contains only the base tokens, while the "ids" array is fine and contains the newly generated vocab. I was wondering whether this is a design choice or a bug.

@Natooz
Owner

Natooz commented May 9, 2024

Hi,
This is a design choice (i.e. only the ids are altered), as the main purpose of encoding the sequence is to feed the ids to a model.
If you really need to explore what the encoded ids are made of, you can always use the vocabulary dictionaries to convert them:
https://github.com/Natooz/MidiTok/blob/main/miditok/midi_tokenizer.py#L111
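
To illustrate the idea of converting encoded ids back to base tokens through vocabulary dictionaries, here is a minimal, self-contained sketch. It does not use MidiTok's actual attributes or methods: `base_vocab`, `learned_merges`, `decode_id` and `decode_ids` are hypothetical names, and the token strings and merges are made-up toy data. It only assumes that each learned id stands for a sequence of earlier ids, which may differ from how the library stores its trained vocabulary.

```python
# Toy data, NOT MidiTok's real vocabulary or API (all names are hypothetical).
# Base vocabulary: token string -> id.
base_vocab = {"Pitch_60": 0, "Velocity_100": 1, "Duration_1.0": 2, "Pitch_64": 3}
# Reverse mapping: id -> token string.
id_to_base = {idx: tok for tok, idx in base_vocab.items()}

# Assumed layout of the learned vocabulary: each new id maps to the
# sequence of ids it was merged from (a merge may reference another merge).
learned_merges = {4: [0, 1], 5: [4, 2]}


def decode_id(token_id):
    """Recursively expand one id into the base tokens it is made of."""
    if token_id in learned_merges:
        tokens = []
        for part in learned_merges[token_id]:
            tokens.extend(decode_id(part))
        return tokens
    return [id_to_base[token_id]]


def decode_ids(ids):
    """Expand a whole sequence of encoded ids into base tokens."""
    tokens = []
    for token_id in ids:
        tokens.extend(decode_id(token_id))
    return tokens


encoded_ids = [5, 3]
decoded = decode_ids(encoded_ids)
print(decoded)
```

With real data, the two dictionaries would be replaced by whatever mappings the trained tokenizer exposes; the expansion logic stays the same as long as learned ids resolve to sequences of base tokens.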


This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added stale Inactive since 30 days or more and removed stale Inactive since 30 days or more labels May 31, 2024