Releases: Natooz/MidiTok
v3.0.3 Training with WordPiece and Unigram + abc files support
Highlights
- Support for abc files, which can be loaded and dumped with symusic similarly to MIDI files;
- The tokenizers can now also be trained with the WordPiece and Unigram algorithms!
- Tokenizer training and token ids encoding can now be performed "bar-wise" or "beat-wise", meaning the tokenizer can learn new tokens from successions of base tokens strictly within bars or beats. This is set by the `encode_ids_split` attribute of the tokenizer config;
- symusic v0.4.3 or higher is now required to comply with the usage of the `clip` method;
- Better handling of file loading errors in `DatasetMIDI` and `DataCollator`;
- Introducing a new `filter_dataset` method to clean a dataset of MIDI/abc files before using it;
- The `MMM` tokenizer has been cleaned up and is now fully modular: it works on top of other tokenizations (`REMI`, `TSD` and `MIDILike`) to allow more flexibility and interoperability;
- `TokSequence` objects can now be sliced and concatenated (e.g. `seq3 = seq1[:50] + seq2[50:]`);
- `TokSequence` objects tokenized from a tokenizer can now be split into per-bar or per-beat subsequences;
- Minor fixes, code improvements and cleaning.
Methods renaming
A few methods and properties were previously named after "bpe" and "midi". To align with the more general usage of these methods (support for several file formats and training algorithms), they have been renamed with more idiomatic and accurate names.
Methods renamed with deprecation warning:
- `midi_to_tokens` --> `encode`;
- `tokens_to_midi` --> `decode`;
- `learn_bpe` --> `train`;
- `apply_bpe` --> `encode_token_ids`;
- `decode_bpe` --> `decode_token_ids`;
- `ids_bpe_encoded` --> `are_ids_encoded`;
- `vocab_bpe` --> `vocab_model`;
- `tokenize_midi_dataset` --> `tokenize_dataset`.
Methods renamed without deprecation warning (fewer usages, reduces code messiness):
- `MIDITokenizer` --> `MusicTokenizer`;
- `augment_midi` --> `augment_score`;
- `augment_midi_dataset` --> `augment_dataset`;
- `augment_midi_multiple_offsets` --> `augment_score_multiple_offsets`;
- `split_midis_for_training` --> `split_files_for_training`;
- `split_midi_per_note_density` --> `split_score_per_note_density`;
- `get_midi_programs` --> `get_score_programs`;
- `merge_midis` --> `merge_scores`;
- `get_midi_ticks_per_beat` --> `get_score_ticks_per_beat`;
- `split_midi_per_ticks` --> `split_score_per_ticks`;
- `split_midi_per_beats` --> `split_score_per_beats`;
- `split_midi_per_tracks` --> `split_score_per_tracks`;
- `concat_midis` --> `concat_scores`.
Protected internal methods (no deprecation warning, advanced usages):
- `MIDITokenizer._tokens_to_midi` --> `MusicTokenizer._tokens_to_score`;
- `MIDITokenizer._midi_to_tokens` --> `MusicTokenizer._score_to_tokens`;
- `MIDITokenizer._create_midi_events` --> `MusicTokenizer._create_global_events`.

There are no other compatibility issues besides these renamings.
Full Changelog: v3.0.2...v3.0.3
v3.0.2 New data loading and preprocessing methods
TL;DR
This new version introduces a new `DatasetMIDI` class to use when training PyTorch models. It supersedes the previously named `DatasetTok` class, with a pre-tokenizing option and better handling of BOS and EOS tokens.
A new `miditok.pytorch_data.split_midis_for_training` method allows you to dynamically chunk MIDIs into smaller parts that yield approximately the desired token sequence length, based on the note densities of their bars. These chunks can be used to train a model while maximizing the overall amount of data used.
A few new utility methods have been created for these features, e.g. to split, concatenate or merge `symusic.Score` objects.
Thanks @Kinyugo for the discussions and tests that guided the development of these features! (#147)
The update also brings a few minor fixes, and the docs have a new theme!
What's Changed
- Fix token_paths to files_paths, and config to model_config by @sunsetsobserver in #145
- Fix issues in Octuple with multiple different-beat time signatures by @ilya16 in #146
- Pitch interval decoding: discarding notes outside the tokenizer pitch range by @Natooz in #149
- Fixing `save_pretrained` to comply with huggingface_hub v0.21 by @Natooz in #150
- Ability to overwrite `_create_durations_tuples` in init by @JLenzy in #153
- Refactor of PyTorch data loading classes and methods by @Natooz and @Kinyugo in #148
- The docs have a new theme, furo!
New Contributors
- @sunsetsobserver made their first contribution in #145
- @JLenzy made their first contribution in #153
Full Changelog: v3.0.1...v3.0.2
v3.0.1 PitchDrum and minor fixes
What's Changed
- New `use_pitchdrum_tokens` option to use dedicated `PitchDrum` tokens for drum tracks;
- Fixing time signature preprocessing (time division mismatch) in #132 (#131 @EterDelta);
- Fixing the data augmentation example and considering all MIDI extensions in #136 (#135 @oiabtt);
- Decoding: automatically making sure to decode BPE then complete `tokens` in #138 (#137 @oiabtt);
- `load_tokens` now returning a `TokSequence` in #139 (#137 @oiabtt);
- Converting chord maps back to tuples from lists when loading a tokenizer from a saved configuration by @shenranwang in #141;
- Can now use `MIDITokenizer.from_pretrained` similarly to `AutoTokenizer` in the Hugging Face transformers library in #142 (discussed in #127 @oiabtt).
New Contributors
- @shenranwang made their first contribution in #141
Full Changelog: v3.0.0...v3.0.1
v3.0.0 Switch to Symusic - performance boost
Switch to symusic
This major version marks the switch from the miditoolkit MIDI reading/writing library to symusic, and a large optimisation of the MIDI preprocessing steps.
Symusic is a MIDI reading/writing library written in C++ with Python bindings, offering unmatched speeds, up to 500 times faster than native Python libraries. It is based on minimidi. The two libraries are created and maintained by @Yikai-Liao and @lzqlzzq, who did amazing work, which is still ongoing as many useful features are on the roadmap!
Tokenizers from previous versions are compatible with this new version, but there might be some timing variations if you compare how MIDIs are tokenized and tokens are decoded.
Performance boost
These changes result in much faster MIDI loading/writing and tokenization times! The overall tokenization (loading a MIDI and tokenizing it) is between 5 and 12 times faster depending on the tokenizer and data. You can find other benchmarks here.
This huge speed gain makes it possible to discard the previously recommended step of pre-tokenizing MIDI files as JSON tokens, and to directly tokenize MIDIs on the fly while training/using a model! We updated the usage examples in the docs accordingly; the code is now simplified.
Other major changes
- When using time signatures, time tokens are now computed in ticks per beat, as opposed to ticks per quarter note as done previously. This change is in line with the definition of time and duration tokens, which until now were not handled following the MIDI norm for note values other than the quarter note (#124);
- Adding new ruff rules and their fixes to comply, increasing the code quality in #115;
- MidiTok still supports `miditoolkit.MidiFile` objects, but they will be converted on the fly to `symusic.Score` objects and a deprecation warning will be thrown;
- The token-level data augmentation methods have been removed, in favour of better data augmentation operating directly on MIDIs, which is much faster, simplifies the process and now handles durations;
- The docs are fixed;
- The tokenization test workflows have been unified and considerably simplified, leading to more robust test assertions. We also increased the number of test cases and configurations, while decreasing the test time.
Other minor changes
- Setting special tokens values in TokenizerConf in #114
- Update README.md by @kalyani2003 in #120
- Readthedocs preview action for PRs in #125
New Contributors
- @kalyani2003 made their first contribution in #120
Full Changelog: v2.1.8...v3.0.0
v2.1.8 Pitch Intervals & minor fixes
This new version brings a new additional token type: pitch intervals. It allows representing pitch intervals for simultaneous and successive notes. You can read more details about how it works in the docs.
We greatly improved the tests and CI workflow, and fixed a few minor bugs and made improvements along the way.
This new version also drops support for Python 3.7, and now requires Python 3.8 or newer. You can read more about the decision and how to keep retro-compatibility in the docs.
We encourage you to update to the latest miditoolkit version, which also features some fixes and improvements. The most notable ones are a clean-up of the dependencies and compatibility with recent numpy versions!
What's Changed
- Typos fixes in docs by @eltociear (#89), @gfggithubleet (#91 and #93), @shresthasurav (#94), @THEFZNKHAN (#98 and #99)
- Fixing a bug when learning bpe without special tokens by @Natooz in #92
- Switch lint/isort/format to Ruff by @akx in #105
- Adding pitch interval option by @Natooz in #103
- Switching to pyproject.toml and hatch packaging by @Natooz in #106
- Fix data augment by @parneyw in #109
- dealing with empty midi file by @feiyuehchen in #110
- Better tests + minor improvements by @Natooz in #108
New Contributors
- @eltociear made their first contribution in #89
- @gfggithubleet made their first contribution in #91
- @shresthasurav made their first contribution in #94
- @THEFZNKHAN made their first contribution in #98
- @akx made their first contribution in #105
- @parneyw made their first contribution in #109
- @feiyuehchen made their first contribution in #110
Full Changelog: v2.1.7...v2.1.8
v2.1.7 Hugging Face Hub integration
This release brings the integration of the Hugging Face Hub, along with a few important fixes and improvements!
What's Changed
- #87 Hugging Face Hub integration! You can now push and load MidiTok tokenizers from the Hugging Face Hub, using the `.from_pretrained` and `push_to_hub` methods as you would do for your models! Special thanks to @Wauplin and @julien-c for the help and support!
- #80 (#78 @leleogere) Adding a `func_to_get_labels` argument to `DatasetTok`, allowing to use it to retrieve labels when loading data;
- #81 (#74 @Chunyuan-Li) Fixing multi-stream decoding with several identical programs, plus fixes to the encoding / decoding of time signatures for Bar-based tokenizers;
- #84 (#77 @VDT5702) Fix in `detect_chords` when checking whether to use unknown chords;
- #82 (#79 @leleogere) `tokenize_midi_dataset` now reproduces the file tree of the source files. This change fixes issues where files with the same name were overwritten with the previous method. You can also specify whether to overwrite files in the destination directory or not.
Full Changelog: v2.1.6...v2.1.7
v2.1.6 Program Changes and fixes
Changelog
- #72 (#71) Adding a `program_change` config option, which will insert `Program` tokens whenever an event is from a different track than the previous one. They mimic the MIDI `ProgramChange` messages. If this parameter is disabled (the default), a `Program` token will prepend each track's tokens (as done in previous versions);
- #72 `MIDILike` decoding optimized;
- #72 Deduplicating overlapping pitch bends during preprocessing;
- #72 `tokenize_check_equals` test method and more test cases;
- #75 and #76 (#73 and #74 by @Chunyuan-Li) Fixing time signature encoding / decoding workflows for `Bar`/`Position`-based tokenizers (`REMI`, `CPWord`, `Octuple`, `MMM`);
- #76 `Octuple` is now tested with time signatures disabled: as `TimeSig` tokens are only carried with notes, `Octuple` cannot accurately represent time signatures. As a result, if a time signature change occurs and the following bars do not contain any notes, the time will be shifted by one or several bars depending on the previous time signature numerator and the time gap between the last and current notes. We do not recommend using `Octuple` with MIDIs containing several time signature changes (at least numerator changes);
- #76 `MMM` tokenization workflow speedup.
v2.1.5 Successive TimeShifts / Rests
Changelog
- #69 bacea19 Sorting notes in all cases when tokenizing, as MIDIs can contain unsorted notes;
- #70 (#68) New `one_token_stream_for_programs` parameter allowing to treat all tracks of a MIDI as a single stream of tokens (adding `Program` tokens before `Pitch`/`NoteOn`...). This option is enabled by default, and corresponds to the default behaviour of previous versions. Disabling it allows having `Program` tokens in the vocabulary (`config.use_programs` enabled) while converting each track independently;
- #70 (#68) `TimeShift` and `Rest` tokens can now be created successively during tokenization, which happens when the largest `TimeShift`/`Rest` value of the tokenizer isn't sufficient;
- #70 (#68) Rests are now represented using the same format as `TimeShift`s, and the `config.rest_range` parameter has been renamed `beat_res_rest` for simplicity and flexibility. The default value is `{(0, 1): 8, (1, 2): 4, (2, 12): 2}`.
Full Changelog: v2.1.4...v2.1.5
Thanks to @caenopy for reporting the bugs fixed here.
Compatibility
- Tokenizers of previous versions with the `rest_range` parameter will be converted to the new `beat_res_rest` format.
v2.1.4 Sustain pedal and pitch bend support
Changelog
- @ilya16 2e1978f Fix in the `save_tokens` method, reading `kwargs` in the saved json file;
- #67 Adding sustain pedal and pitch bend tokens for the `REMI`, `TSD` and `MIDILike` tokenizers.
Compatibility
`MMM` now adds additional tokens in the same order as other tokenizers, meaning previously saved `MMM` tokenizers with these tokens may need to be converted.
v2.1.3 New tokenization workflow, speedups, time signature and PyTorch data loading module
This big update brings a few important changes and improvements.
A new common tokenization workflow for all tokenizers.
We distinguish now three types of tokens:
- Global MIDI tokens, which represent attributes and events affecting the music globally, such as the tempo or time signature;
- Track tokens, representing values of distinct tracks such as the notes, chords or effects;
- Time tokens, which serve to structure and place the previous categories of tokens in time.
All tokenizations now follow this pattern:
- Preprocess the MIDI;
- Gather global MIDI events (tempo...);
- Gather track events (notes, chords);
- If "one token stream", concatenate all global and track events and sort them by time of occurrence. Else, concatenate the global events to each sequence of track events;
- Deduce the time events for all the sequences of events (only one if "one token stream");
- Return the tokens, as a combination of lists of strings and lists of integers (token ids).
This considerably cleans up the code (DRY, fewer redundant methods), while bringing speedups as the number of calls to sorting methods has been reduced.
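The steps above can be sketched with simplified pseudocode. The helper names and the `(time, name)` event tuples are illustrative, not MidiTok's actual internals; only the merge-and-sort logic of steps 4 and 5 is shown.

```python
def tokenize(global_events, track_events, one_token_stream):
    """global_events: [(time, name)] pairs; track_events: one such list per track."""
    if one_token_stream:
        # One sequence: merge global and track events, sorted by time of occurrence
        sequences = [sorted(global_events + [e for t in track_events for e in t])]
    else:
        # One sequence per track, each carrying a copy of the global events
        sequences = [sorted(global_events + track) for track in track_events]
    # Time events (Bar, Position, TimeShift...) would then be deduced per sequence
    return sequences

seqs = tokenize(
    global_events=[(0, "Tempo_120")],
    track_events=[[(0, "Pitch_60"), (480, "Pitch_62")], [(240, "Pitch_40")]],
    one_token_stream=True,
)
# seqs[0] is a single time-sorted stream of all the events
```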
TL;DR: other changes
- New submodule `pytorch_data` offering PyTorch `Dataset` objects and a data collator, to be used when training a PyTorch model. Learn more in the documentation of the module;
- `MIDILike`, `CPWord` and `Structured` now natively handle `Program` tokens in a multitrack / `one_token_stream` way;
- Time signature changes are now handled by `TSD`, `MIDILike` and `CPWord`;
- The `time_signature_range` config option is now more flexible / convenient.
Changelog
- #61 New `pytorch_data` submodule, with `DatasetTok` and `DatasetJsonIO` classes. This module is only loaded if `torch` is installed in the Python environment;
- #61 The `tokenize_midi_dataset()` method now has a `tokenizer_config_file_name` argument, allowing to save the tokenizer config with a custom file name;
- #61 "All-in-one" `DataCollator` object to be used with PyTorch `DataLoader`s;
- #62 `Structured` and `MIDILike` now natively handle `Program` tokens. When `config.use_programs` is set to True, a `Program` token will be added before each `Pitch`/`NoteOn`/`NoteOff` token to associate its instrument. MIDIs will also be treated as a single stream of tokens in this case, whereas otherwise each track is converted into an independent token sequence;
- #62 The `miditok.utils.remove_duplicated_notes` method can now remove notes with the same pitch and onset time, regardless of their offset time / duration;
- #62 `miditok.utils.merge_same_program_tracks` is now called in `preprocess_midi` when `config.use_programs` is True;
- #62 Big refactor of the `REMI` codebase, which now has all the features of `REMIPlus`, with code cleaning and speedups (fewer calls to sorting). The `REMIPlus` class is now basically just a wrapped `REMI` with programs and time signatures enabled;
- #62 `TSD` and `MIDILike` now encode and decode time signature changes;
- #63 @ilya16 `Tempo`s can now be created with a logarithmic scale, instead of the default linear scale;
- c53a008 and 5d1c12e The `track_to_tokens` and `tokens_to_track` methods are now partially removed. They are now protected, for classes that still rely on them, and removed from the others. These methods were made for internal calls and are not recommended to use; the `midi_to_tokens` method is recommended instead;
- #65 @ilya16 Changes `time_signature_range` into a dictionary `{denom_i: [num_i1, ..., num_in] / (min_num_i, max_num_i)}`;
- #65 @ilya16 Fix in the formula computing the number of ticks per bar;
- #66 Adds an option to `TokenizerConfig` to delete successive tempo / time signature changes carrying the same value during MIDI preprocessing;
- #66 Now using xdist for tests, big speedup on GitHub Actions (ty @ilya16!);
- #66 `CPWord` and `Octuple` now follow the common tokenization workflow;
- #66 As a consequence of the previous point, `OctupleMono` is removed as there was no record of its use. It is now equivalent to `Octuple` without `config.use_programs`;
- #66 `CPWord` now handles time signature changes;
- #66 Tests for tempo and time signature changes are now more robust; exceptions were removed and fixed;
- 5a6378b `save_tokens` now by default doesn't save programs if `config.use_programs` is False.
Compatibility
- Calls to the `track_to_tokens` and `tokens_to_track` methods are no longer supported. If you used these methods, you may replace them with `midi_to_tokens` and `tokens_to_midi` (or just call the tokenizer) while selecting the appropriate token sequences / tracks;
- `time_signature_range` now needs to be given as a dictionary;
- Due to changes in the order of the vocabularies of `Octuple` (as programs are now optional), tokenizers and tokens made with previous versions will not be compatible unless the vocabulary order is swapped, with index 3 moved to 5.