
Event representation with multi-track support #27

Open · cifkao wants to merge 23 commits into main from multitrack-event

Conversation

@cifkao (Contributor) commented Dec 25, 2020

I implemented multi-track support for the event-based representation. Here is a summary of the changes:

  • I added an explicit vocab dictionary, mapping human-readable event tuples (e.g. ('time_shift', 12)) to integers. This removes the need for index arithmetic to work out the ranges for different events, which would be especially complicated in the multi-track case (see the sketch after this list). Moreover:
    • It will be easier to add other kinds of events, such as control changes (add "control_change" message events to music representation #17).
    • We can easily exclude events we are not going to use (such as velocity events) to reduce the vocabulary size.
    • We can find out the vocab size simply using len(vocab).
    • The user could even provide their own vocab, e.g. if they want to exclude some events they know will never appear in their data.
  • Keeping the input and output functions in separate places would result in duplicate code and a need to re-compute the vocab every time, so I moved all of the code to the processor class.
  • Multi-track support is implemented by having a separate set of events for each track, except for time shift events, which are common to all tracks.
  • The default behavior is still single-track and should be backward-compatible (I didn't touch the existing tests and they still pass).
  • Instruments (program + is_drum values) are (optionally) represented by special events, which always appear at the very beginning of the sequence (the instrument setting applies to the whole track). By default, we don't encode the drum program (drum kit).
  • (not really related) Zero-duration notes are extended to 1 tick. Zero durations are something we most likely don't want, as they can lead to some ugly behavior (see Inconsistent behaviour for short (zero-duration) notes craffel/pretty-midi#161 and Reading duplicate note-off events results in zero-duration notes craffel/pretty-midi#131). But I'm not sure whether this should be handled here or rather when writing to MIDI.
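For concreteness, here is a minimal sketch of such an explicit vocab (the event kinds, ranges, and the build_vocab helper are illustrative, not the exact PR code):

```python
def build_vocab(num_tracks=1, encode_velocity=False, max_time_shift=100):
    """Map human-readable event tuples to integer ids."""
    events = []
    for track in range(num_tracks):
        events += [("note_on", track, pitch) for pitch in range(128)]
        events += [("note_off", track, pitch) for pitch in range(128)]
        if encode_velocity:
            events += [("velocity", track, v) for v in range(32)]
    # Time-shift events are shared across all tracks:
    events += [("time_shift", t) for t in range(1, max_time_shift + 1)]
    return {event: i for i, event in enumerate(events)}

vocab = build_vocab(num_tracks=2)
print(len(vocab))                 # vocab size, no index arithmetic needed
print(vocab[("time_shift", 12)])  # integer id of a human-readable event
```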

To consider:

  • We can discuss different ways multi-track support could be implemented, but this was the only sensible one I could come up with, and I've seen similar things being used in the literature. It's also pretty flexible, as the user is free to organize the music into tracks however they want. For example, they can create separate tracks for accompaniment and melody, group their instruments by family, or have a separate track for each MIDI program.
  • There are now several boolean parameters that enable different features, and more will probably be added in the future. Maybe there is a more compact or readable way to specify these, e.g. as a list of feature names or an enum.Flag.

Sorry for the big PR; I'm happy to discuss everything.

@cifkao marked this pull request as ready for review on December 25, 2020.

@salu133445 (Owner) commented:

Haven't reviewed the code yet, but this is awesome! Thank you!

Another way to support multitracks in event-like representation is to use a text-based representation. For example,

  • LakhNES uses NO_NOTEON_13, DT_732, NO_NOTEOFF, DT_2197, P1_NOTEON_76, DT_2, P1_NOTEON_70, DT_1463, P2_NOTEON_87, DT_1, P2_NOTEOFF, DT_148, …, P1_NOTEON_61, DT_41, P2_NOTEON_85, DT_26, TR_NOTEON_39, DT_2, TR_NOTEON_43, DT_23, NO_NOTEON_9, DT_646, NO_NOTEON_13, DT_733
  • MuseNet uses bach piano_strings start tempo90 piano:v72:G1 piano:v72:G2 piano:v72:B4 piano:v72:D4 violin:v80:G4 piano:v72:G4 piano:v72:B5 piano:v72:D5 wait:12 piano:v0:B5 wait:5 piano:v72:D5 wait:12 piano:v0:D5 wait:4 piano:v0:G1 piano:v0:G2 piano:v0:B4 piano:v0:D4 violin:v0:G4 piano:v0:G4 wait:1 piano:v72:G5 wait:12 piano:v0:G5 wait:5 piano:v72:D5 wait:12 piano:v0:D5 wait:5 piano:v72:B5 wait:12

@cifkao (Contributor, Author) commented Dec 25, 2020

Yeah, I was thinking exactly of these two examples. What I'm proposing should be equivalent to LakhNES, only I use Python tuples instead of strings, and I index tracks with numbers, e.g. ('note_on', 1, 87) instead of P2_NOTEON_87. So to create the LakhNES representation using my implementation, one just needs to sort the tracks (and perhaps add some empty tracks as padding) to make sure each instrument always has the same index.

MuseNet is similar, except that it encodes velocity differently.

A different option would be to use track names instead of indices, e.g. ('note_on', 'P2', 87), but then the track names would have to be known ahead of time in order to build the vocab.
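For reference, a rough sketch of the correspondence with LakhNES tokens, assuming a fixed track order (P1, P2, TR, NO) and glossing over details such as LakhNES's pitchless note-offs:

```python
TRACK_NAMES = ["P1", "P2", "TR", "NO"]  # fixed track order (assumed)

def to_lakhnes_token(event):
    # ('time_shift', 12) -> 'DT_12'
    if event[0] == "time_shift":
        return f"DT_{event[1]}"
    # ('note_on', 1, 87) -> 'P2_NOTEON_87'
    kind, track, value = event
    return f"{TRACK_NAMES[track]}_{kind.replace('_', '').upper()}_{value}"

print(to_lakhnes_token(("note_on", 1, 87)))  # P2_NOTEON_87
```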

@cifkao force-pushed the multitrack-event branch 3 times, most recently from effa63b to 7c47efd on December 28, 2020.
@salu133445 (Owner) left a comment:

There are several issues in the implementation. Let me start with the biggest one.

I was expecting these events to be more compact and decoupled. For example, we currently represent a note by a series of velocity, note-on, time-shift and note-off events. We should represent a note in a track in a similar way: as a series of track, velocity, note-on, time-shift and note-off events, where the track and velocity events are omitted if unchanged and work like switches (more like set_program and set_velocity).
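For concreteness, a sketch of the event stream this describes (event names are illustrative, not from the PR):

```python
# Two notes on two tracks; set_track / set_velocity act like switches
# and are omitted when the value is unchanged:
events = [
    ("set_track", 0), ("set_velocity", 64),
    ("note_on", 60), ("time_shift", 24), ("note_off", 60),
    ("set_track", 1),  # velocity unchanged, so no set_velocity event
    ("note_on", 67), ("time_shift", 24), ("note_off", 67),
]
```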

There are several benefits to doing so.

  • The vocabulary size would be independent of the number of tracks.
  • We don't need num_tracks in encode anymore.
  • The ignore_empty_tracks argument can be removed. This argument is quite confusing, as it has different effects in encoding and decoding.

Some other comments:

  • Is the current implementation used in any literature? We could add the current implementation as a new representation if you like.
  • We can base the text-based representation on the current implementation.
  • I would still prefer to keep the main implementation in muspy.inputs and muspy.outputs. The processor interface is just a way to store the configurations for the functional interface.

Thank you for working on this. Let me know what you think.

muspy/processors/event.py (three review threads, outdated and resolved)
From muspy/processors/event.py:

```python
    tracks = [[n for track in music.tracks for n in track.notes]]
else:
    # Keep only num_tracks non-empty tracks
    # TODO: Maybe warn or throw exception if too many tracks
```
salu133445 (Owner):

Maybe warn in the processor interface, but pass in the functional interface.

cifkao (Contributor, Author):

I added a warning. If unwanted, it can be suppressed using warnings.filterwarnings.
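For example (the exact warning message is an assumption):

```python
import warnings

# Suppress the "too many tracks" warning emitted during encoding;
# matching on a substring of the assumed message keeps the filter narrow:
warnings.filterwarnings("ignore", message=".*tracks.*")
```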

muspy/processors/event.py (review thread, outdated and resolved)
@cifkao (Contributor, Author) commented Dec 31, 2020

> I was expecting these events to be more compact and decoupled. For example, we currently represent a note by a series of velocity, note-on, time-shift and note-off events. We should represent a note in a track in a similar way: as a series of track, velocity, note-on, time-shift and note-off events, where the track and velocity events are omitted if unchanged and work like switches (more like set_program and set_velocity).

> There are several benefits to doing so.

>   • The vocabulary size would be independent of the number of tracks.
>   • We don't need num_tracks in encode anymore.

I completely agree with these benefits, but a big disadvantage is longer sequences. There is a trade-off (small vocabulary vs. short sequences), and in some scenarios (typically when the number of tracks is small and fixed), the latter is preferable.
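To make the trade-off concrete, some illustrative arithmetic (note and time-shift ranges assumed, velocity events ignored):

```python
T = 4                  # number of tracks
per_track = 128 + 128  # note-on + note-off pitches
shared = 100           # time-shift events, common to all tracks

track_indexed = T * per_track + shared  # 1124 tokens, shorter sequences
switch_based = per_track + T + shared   # 360 tokens, plus set_track events
print(track_indexed, switch_based)
```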

Other benefits of this representation are that:

  • it allows separating tracks conceptually even when they use the same program (e.g. piano left and right hands)
  • it allows distinguishing the note-offs of different instruments

> Some other comments:
>
>   • Is the current implementation used in any literature?

The LakhNES representation is equivalent to this representation (or rather a special case – without the program and velocity tokens). And MuseNet is not that different either. That's why I wanted to contribute it here.

> We could add the current implementation as a new representation if you like.

That would make sense, though keep in mind that my implementation still supports the single-track version as a special case. The latter doesn't encode programs, but that could easily be implemented exactly the way you wrote (by adding a program token in front of every note-on). So I think the two ideas are not incompatible and could be part of the same class (or perhaps extensions of the same base class).

>   • We can base the text-based representation on the current implementation.

To me, there's no real difference between the tuple-based and string-based representations, since we immediately convert to integers anyway. But sure, we could instead format the tuples as strings.

>   • I would still prefer to keep the main implementation in muspy.inputs and muspy.outputs. The processor interface is just a way to store the configurations for the functional interface.

I chose to move it to the class because I need to store the vocab. Of course you can re-generate the vocab every time, but then it's not accessible from outside, so anyone using the representation will have to figure out the vocab size on their own. Also, we will either have duplicate code (mainly for generating the vocab) or have to find a common place to put it.

@cifkao (Contributor, Author) commented Dec 31, 2020

Also, if you plan to add support for control change events, it really makes sense to have a track-based representation. You don't want the piano's sustain pedal to apply to other instruments.

@cifkao requested a review from salu133445 on January 3, 2021.
@cifkao (Contributor, Author) commented Jan 7, 2021

I just changed encode() to use the default integer type (usually int64) instead of uint16. Two reasons:

  • 2^16 is not such a big number; I can imagine having a bigger vocabulary if the number of tracks is very high.
  • PyTorch doesn't support uint16.

Also, I'm not sure about the usefulness of the reshaping. If I load the dataset with the PyTorch DataLoader, I would have to undo it later, because the loader is going to stack the items. I think it will be the same with TensorFlow.
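A minimal sketch of why the per-item reshape gets undone by batching (fixed-length sequences assumed for simplicity):

```python
import torch
from torch.utils.data import DataLoader

# Items shaped (length, 1), as the reshaping would produce:
data = [torch.zeros(100, 1, dtype=torch.long) for _ in range(8)]
loader = DataLoader(data, batch_size=4)

batch = next(iter(loader))  # default collation stacks the items
print(batch.shape)          # torch.Size([4, 100, 1])
batch = batch.squeeze(-1)   # the singleton axis has to be undone anyway
```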

See also #45.

@salu133445 (Owner) commented:

Sorry for the delay. I have been thinking about how we should handle this for days. My main concern lies in the necessity of the vocabulary dictionary. For such a large vocabulary, I think it's more flexible to return the raw tokens (like a text-based representation) rather than a series of integers that cannot be decoded without some vocabulary dictionary. With the raw tokens, users can later implement their own indexer to index the vocabulary. Also, such a large vocabulary makes it inefficient to run using the functional API (i.e., to_event_representation and from_event_representation).

Here are my suggestions.

  • Keep the vocabulary of to_event_representation and from_event_representation simple. To do so, a MIDI-like set-program event is preferable to a track prefix on every note-on. We should also consider making the vocabulary independent of the parameters, e.g. use 388 for EOS even when encode_velocity=False. This way, we don't even need any arguments to decode the data.
  • Make the proposed representation into to_text_representation, which outputs raw strings. We already have a good starting point from this pull request.
  • This way, the event representation outputs integers, each with a fixed meaning, and works pretty much like a simplified MIDI notation, while the text representation outputs strings that support a more advanced syntax.

What do you think? Is there any benefit to having this implemented in the event representation?
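For reference, a sketch of a fixed, parameter-independent layout along these lines (the offsets below are assumptions; only 388-as-EOS comes from the comment above):

```python
def decode_event(code: int) -> tuple:
    # Assumed layout: 0-127 note-on, 128-255 note-off,
    # 256-355 time-shift (1-100), 356-387 set-velocity, 388 EOS.
    if code < 128:
        return ("note_on", code)
    if code < 256:
        return ("note_off", code - 128)
    if code < 356:
        return ("time_shift", code - 256 + 1)
    if code < 388:
        return ("set_velocity", code - 356)
    return ("eos",)

# With encode_velocity=False, velocity codes are simply never emitted,
# so decoding needs no arguments:
print(decode_event(388))  # ('eos',)
```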

@cifkao (Contributor, Author) commented Jan 13, 2021

I'm not completely against a text representation and it has its advantages, but some things to consider:

  • Such a representation cannot be directly used with functions like to_pytorch_dataset (technically it can, but the user would then have to write a wrapper around the dataset to convert the tokens to ints). But that's maybe not such a problem (I'm actually already using a custom factory function anyway, because I need to add the start symbol to each sequence).
  • It's really convenient to already have the vocabulary available in any case. Having to build it from the dataset is an extra (possibly time-consuming) step, and it's less robust (let's say you pre-train on dataset A and then decide to continue training on dataset B, but realize it contains some tokens not contained in A). So I would like at least some way to get the complete list of possible tokens.
  • By having the vocab available, we also get some kind of sanity check for free (invalid tokens automatically cause an error).

So what we could do is merge this as a separate AdvancedEventRepresentationProcessor, possibly without providing the functional API for it (since it brings some complications), and provide a parameter that allows choosing whether to return ints, tuples, or (maybe in the future) strings.

@salu133445 (Owner) commented:

>   • Such a representation cannot be directly used with functions like to_pytorch_dataset (technically it can, but the user would then have to write a wrapper around the dataset to convert the tokens to ints). But that's maybe not such a problem (I'm actually already using a custom factory function anyway, because I need to add the start symbol to each sequence).

That's true. People who want to use a more advanced representation would probably want to use other fancy stuff anyway, I guess. I think that would be fine. Most machine learning frameworks should support an indexer (e.g., Hugging Face).

>   • It's really convenient to already have the vocabulary available in any case. Having to build it from the dataset is an extra (possibly time-consuming) step, and it's less robust (let's say you pre-train on dataset A and then decide to continue training on dataset B, but realize it contains some tokens not contained in A). So I would like at least some way to get the complete list of possible tokens.

That's a trade-off, in fact: having a large vocabulary with some words that never get used, or a smaller vocabulary that depends on the data. It actually depends on how people implement the indexer, but we could certainly have a function that returns the complete list of possible tokens.

>   • By having the vocab available, we also get some kind of sanity check for free (invalid tokens automatically cause an error).

We still have the vocabulary implicitly. The key is that the output itself does not depend on the vocabulary. We could have another function that returns the complete list (e.g., to check a user's inputs).

> So what we could do is merge this as a separate AdvancedEventRepresentationProcessor, possibly without providing the functional API for it (since it brings some complications), and provide a parameter that allows choosing whether to return ints, tuples, or (maybe in the future) strings.

Sounds like a plan! We might also need some documentation.

@cifkao requested a review from salu133445 on March 2, 2021.
@cifkao (Contributor, Author) commented Mar 2, 2021

@salu133445 OK, I moved this to a separate class and also added some tests and docs.

@cifkao (Contributor, Author) commented Mar 2, 2021

I also added the encode_as_tuples and encode_as_strings methods. At first I wanted to have the format as an additional parameter to encode (and/or __init__), but that would not work well with type hints.
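Hypothetical usage, with the class name taken from the earlier proposal and the exact signatures assumed:

```python
import muspy

music = muspy.read("song.mid")  # any input file

processor = AdvancedEventRepresentationProcessor(num_tracks=2)  # assumed name/args
ids = processor.encode(music)                 # integer ids
tuples = processor.encode_as_tuples(music)    # e.g. ('note_on', 1, 87)
strings = processor.encode_as_strings(music)  # e.g. 'note_on_1_87'
```

Separate methods keep the return types precise; a single encode(format=...) would need a Union return type or typing overloads.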
