after tokenizing with trained tokenizer, the "tokens" array contains original tokens #166

theglassofwater · 2024-05-09T10:10:51Z

After tokenizing a song with a trained tokenizer, the "tokens" array contains only the base tokens, while the "ids" array is fine and contains the newly learned vocabulary. I was wondering whether this is a design choice or a bug.

Natooz · 2024-05-09T10:38:52Z

Hi,
This is a design choice (i.e. only the ids are altered), as the main purpose of encoding the sequence is to feed the ids to a model.
If you really need to explore what the encoded ids are made of, you can always use the vocabulary dictionaries to convert the encoded ids back:
https://github.com/Natooz/MidiTok/blob/main/miditok/midi_tokenizer.py#L111
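To illustrate the idea of mapping encoded ids back to base tokens, here is a minimal sketch with a toy dictionary. It does not use MidiTok's actual attributes or methods: the `LEARNED_ID_TO_BASE_TOKENS` mapping, the `ids_to_base_tokens` helper, and the token names are all hypothetical, so check the linked source for the real vocabulary dictionaries.

```python
# Toy vocabulary, NOT MidiTok's real data structures (assumption for illustration).
# Each id maps to the sequence of base tokens it stands for; ids 3 and 4
# play the role of tokens learned during tokenizer training.
LEARNED_ID_TO_BASE_TOKENS = {
    0: ["Pitch_60"],
    1: ["Velocity_100"],
    2: ["Duration_1.0.8"],
    3: ["Pitch_60", "Velocity_100"],
    4: ["Pitch_60", "Velocity_100", "Duration_1.0.8"],
}


def ids_to_base_tokens(ids, id_to_tokens):
    """Expand each encoded id into the base tokens it represents."""
    tokens = []
    for token_id in ids:
        tokens.extend(id_to_tokens[token_id])
    return tokens


encoded_ids = [4, 3, 2]
base_tokens = ids_to_base_tokens(encoded_ids, LEARNED_ID_TO_BASE_TOKENS)
print(base_tokens)
```

With a real tokenizer the lookup table would come from its own vocabulary dictionaries rather than being written by hand.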

github-actions · 2024-05-31T02:09:37Z

This issue is stale because it has been open for 30 days with no activity.

