
Event representation with multi-track support #27

Open · cifkao wants to merge 23 commits into main from multitrack-event

Conversation

@cifkao (Contributor) commented Dec 25, 2020

I implemented multi-track support for the event-based representation. Here is a summary of the changes:

  • I added an explicit vocab dictionary, mapping human-readable event tuples (e.g. ('time_shift', 12)) to integers. This removes the need for index arithmetic to work out the ranges for different events, which would be especially complicated in the multi-track case (see the sketch after this list). Moreover:
    • It will be easier to add other kinds of events, such as control changes (add "control_change" message events to music representation #17).
    • We can easily exclude events we are not going to use (such as velocity events) to reduce the vocabulary size.
    • We can find out the vocab size simply using len(vocab).
    • The user could even provide their own vocab, e.g. if they want to exclude some events they know will never appear in their data.
  • Keeping the input and output functions in separate places would result in duplicate code and a need to re-compute the vocab every time, so I moved all of the code to the processor class.
  • Multi-track support is implemented by having a separate set of events for each track, except for time shift events, which are common to all tracks.
  • The default behavior is still single-track and should be backward-compatible (I didn't touch the existing tests and they still pass).
  • Instruments (program + is_drum values) are (optionally) represented by special events, which always appear at the very beginning of the sequence (the instrument setting applies to the whole track). By default, we don't encode the drum program (drum kit).
  • (not really related) Zero-duration notes are extended to 1 tick. Zero durations are something we most likely don't want, as they can lead to some ugly behavior (see Inconsistent behaviour for short (zero-duration) notes craffel/pretty-midi#161 and Reading duplicate note-off events results in zero-duration notes craffel/pretty-midi#131). But I'm not sure whether this should be handled here or rather when writing to MIDI.
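For concreteness, here is a minimal sketch of such an explicit vocab (the event kinds, ranges, and the build_vocab helper are illustrative, not the exact PR code):

```python
def build_vocab(num_tracks=1, encode_velocity=False, max_time_shift=100):
    """Map human-readable event tuples to integer ids."""
    events = []
    for track in range(num_tracks):
        events += [("note_on", track, pitch) for pitch in range(128)]
        events += [("note_off", track, pitch) for pitch in range(128)]
        if encode_velocity:
            events += [("velocity", track, v) for v in range(32)]
    # Time-shift events are shared across all tracks:
    events += [("time_shift", t) for t in range(1, max_time_shift + 1)]
    return {event: i for i, event in enumerate(events)}

vocab = build_vocab(num_tracks=2)
print(len(vocab))                 # vocab size, no index arithmetic needed
print(vocab[("time_shift", 12)])  # integer id of a human-readable event
```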

To consider:

  • We can discuss different ways multi-track support could be implemented, but this was the only sensible one I could come up with, and I've seen similar things being used in the literature. It's also pretty flexible, as the user is free to organize the music into tracks however they want. For example, they can create separate tracks for accompaniment and melody, group their instruments by family, or have a separate track for each MIDI program.
  • There are now several boolean parameters that enable different features, and more will probably be added in the future. Maybe there is a more compact or readable way to specify these, e.g. as a list of feature names or an enum.Flag.

Sorry for the big PR; I'm happy to discuss everything.

@cifkao marked this pull request as ready for review on December 25, 2020.

@salu133445 (Owner) commented:

Haven't reviewed the code yet, but this is awesome! Thank you!

Another way to support multitracks in event-like representation is to use a text-based representation. For example,

  • LakhNES uses NO_NOTEON_13, DT_732, NO_NOTEOFF, DT_2197, P1_NOTEON_76, DT_2, P1_NOTEON_70, DT_1463, P2_NOTEON_87, DT_1, P2_NOTEOFF, DT_148, …, P1_NOTEON_61, DT_41, P2_NOTEON_85, DT_26, TR_NOTEON_39, DT_2, TR_NOTEON_43, DT_23, NO_NOTEON_9, DT_646, NO_NOTEON_13, DT_733
  • MuseNet uses bach piano_strings start tempo90 piano:v72:G1 piano:v72:G2 piano:v72:B4 piano:v72:D4 violin:v80:G4 piano:v72:G4 piano:v72:B5 piano:v72:D5 wait:12 piano:v0:B5 wait:5 piano:v72:D5 wait:12 piano:v0:D5 wait:4 piano:v0:G1 piano:v0:G2 piano:v0:B4 piano:v0:D4 violin:v0:G4 piano:v0:G4 wait:1 piano:v72:G5 wait:12 piano:v0:G5 wait:5 piano:v72:D5 wait:12 piano:v0:D5 wait:5 piano:v72:B5 wait:12

@cifkao (Contributor, Author) commented Dec 25, 2020

Yeah, I was thinking exactly of these two examples. What I'm proposing should be equivalent to LakhNES, only I use Python tuples instead of strings, and I index tracks with numbers, e.g. ('note_on', 1, 87) instead of P2_NOTEON_87. So to create the LakhNES representation using my implementation, one just needs to sort the tracks (and perhaps add some empty tracks as padding) to make sure each instrument always has the same index.

MuseNet is similar, except that it encodes velocity differently.

A different option would be to use track names instead of indices, e.g. ('note_on', 'P2', 87), but then the track names would have to be known ahead of time in order to build the vocab.
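For reference, a rough sketch of the correspondence with LakhNES tokens, assuming a fixed track order (P1, P2, TR, NO) and glossing over details such as LakhNES's pitchless note-offs:

```python
TRACK_NAMES = ["P1", "P2", "TR", "NO"]  # fixed track order (assumed)

def to_lakhnes_token(event):
    # ('time_shift', 12) -> 'DT_12'
    if event[0] == "time_shift":
        return f"DT_{event[1]}"
    # ('note_on', 1, 87) -> 'P2_NOTEON_87'
    kind, track, value = event
    return f"{TRACK_NAMES[track]}_{kind.replace('_', '').upper()}_{value}"

print(to_lakhnes_token(("note_on", 1, 87)))  # P2_NOTEON_87
```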

@cifkao force-pushed the multitrack-event branch 3 times, most recently from effa63b to 7c47efd on December 28, 2020.
@salu133445 (Owner) left a comment:

There are several issues in the implementation. Let me start with the biggest one.

I was expecting these events to be more compact and decoupled. For example, we currently represent a note by a series of velocity, note-on, time-shift and note-off events. We should represent a note in a track in a similar way: as a series of track, velocity, note-on, time-shift and note-off events, where the track and velocity events are omitted if unchanged and work like switches (more like set_program and set_velocity).
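For concreteness, a sketch of the event stream this describes (event names are illustrative, not from the PR):

```python
# Two notes on two tracks; set_track / set_velocity act like switches
# and are omitted when the value is unchanged:
events = [
    ("set_track", 0), ("set_velocity", 64),
    ("note_on", 60), ("time_shift", 24), ("note_off", 60),
    ("set_track", 1),  # velocity unchanged, so no set_velocity event
    ("note_on", 67), ("time_shift", 24), ("note_off", 67),
]
```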

There are several benefits to doing so.

  • The vocabulary size would be independent of the number of tracks.
  • We don't need num_tracks in encode anymore.
  • The ignore_empty_tracks argument can be removed. This argument is quite confusing, as it has different effects in encoding and decoding.

Some other comments:

  • Is the current implementation used in any literature? We could add the current implementation as a new representation if you like.
  • We can base the text-based representation on the current implementation.
  • I would still prefer to keep the main implementation in muspy.inputs and muspy.outputs. The processor interface is just a way to store the configurations for the functional interface.

Thank you for working on this. Let me know what you think.

muspy/processors/event.py (three review threads, outdated and resolved)
From muspy/processors/event.py:

```python
    tracks = [[n for track in music.tracks for n in track.notes]]
else:
    # Keep only num_tracks non-empty tracks
    # TODO: Maybe warn or throw exception if too many tracks
```
salu133445 (Owner):

Maybe warn in the processor interface, but pass in the functional interface.

cifkao (Contributor, Author):

I added a warning. If unwanted, it can be suppressed using warnings.filterwarnings.
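For example (the exact warning message is an assumption):

```python
import warnings

# Suppress the "too many tracks" warning emitted during encoding;
# matching on a substring of the assumed message keeps the filter narrow:
warnings.filterwarnings("ignore", message=".*tracks.*")
```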

muspy/processors/event.py (review thread, outdated and resolved)
@cifkao (Contributor, Author) commented Dec 31, 2020

> I was expecting these events to be more compact and decoupled. For example, we currently represent a note by a series of velocity, note-on, time-shift and note-off events. We should represent a note in a track in a similar way: as a series of track, velocity, note-on, time-shift and note-off events, where the track and velocity events are omitted if unchanged and work like switches (more like set_program and set_velocity).

> There are several benefits to doing so.

>   • The vocabulary size would be independent of the number of tracks.
>   • We don't need num_tracks in encode anymore.

I completely agree with these benefits, but a big disadvantage is longer sequences. There is a trade-off (small vocabulary vs. short sequences), and in some scenarios (typically when the number of tracks is small and fixed), the latter is preferable.
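To make the trade-off concrete, some illustrative arithmetic (note and time-shift ranges assumed, velocity events ignored):

```python
T = 4                  # number of tracks
per_track = 128 + 128  # note-on + note-off pitches
shared = 100           # time-shift events, common to all tracks

track_indexed = T * per_track + shared  # 1124 tokens, shorter sequences
switch_based = per_track + T + shared   # 360 tokens, plus set_track events
print(track_indexed, switch_based)
```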

Other benefits of this representation are that:

  • it allows separating tracks conceptually even when they use the same program (e.g. piano left and right hands)
  • it allows distinguishing the note-offs of different instruments

> Some other comments:
>
>   • Is the current implementation used in any literature?

The LakhNES representation is equivalent to this representation (or rather a special case – without the program and velocity tokens). And MuseNet is not that different either. That's why I wanted to contribute it here.

> We could add the current implementation as a new representation if you like.

That would make sense, though keep in mind that my implementation still supports the single-track version as a special case. The latter doesn't encode programs, but that could easily be implemented exactly the way you wrote (by adding a program token in front of every note-on). So I think the two ideas are not incompatible and could be part of the same class (or perhaps extensions of the same base class).

>   • We can base the text-based representation on the current implementation.

To me, there's no real difference between the tuple-based and string-based representations, since we immediately convert to integers anyway. But sure, we could instead format the tuples as strings.

>   • I would still prefer to keep the main implementation in muspy.inputs and muspy.outputs. The processor interface is just a way to store the configurations for the functional interface.

I chose to move it to the class because I need to store the vocab. Of course you can re-generate the vocab every time, but then it's not accessible from outside, so anyone using the representation will have to figure out the vocab size on their own. Also, we will either have duplicate code (mainly for generating the vocab) or have to find a common place to put it.

@cifkao (Contributor, Author) commented Dec 31, 2020

Also, if you plan to add support for control change events, it really makes sense to have a track-based representation. You don't want the piano's sustain pedal to apply to other instruments.

@cifkao requested a review from salu133445 on January 3, 2021.
@cifkao (Contributor, Author) commented Jan 7, 2021

I just changed encode() to use the default integer type (usually int64) instead of uint16. Two reasons:

  • 2^16 is not such a big number; I can imagine having a bigger vocabulary if the number of tracks is very high.
  • PyTorch doesn't support uint16.

Also, I'm not sure about the usefulness of the reshaping. If I load the dataset with the PyTorch DataLoader, I would have to undo it later, because the loader is going to stack the items. I think it will be the same with TensorFlow.
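A minimal sketch of why the per-item reshape gets undone by batching (fixed-length sequences assumed for simplicity):

```python
import torch
from torch.utils.data import DataLoader

# Items shaped (length, 1), as the reshaping would produce:
data = [torch.zeros(100, 1, dtype=torch.long) for _ in range(8)]
loader = DataLoader(data, batch_size=4)

batch = next(iter(loader))  # default collation stacks the items
print(batch.shape)          # torch.Size([4, 100, 1])
batch = batch.squeeze(-1)   # the singleton axis has to be undone anyway
```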

See also #45.

@salu133445 (Owner) commented:

Sorry for the delay. I have been thinking about how we should handle this for days. My main concern lies in the necessity of the vocabulary dictionary. For such a large vocabulary, I think it's more flexible to return the raw tokens (like a text-based representation) rather than a series of integers that cannot be decoded without some vocabulary dictionary. With the raw tokens, users can later implement their own indexer to index the vocabulary. Also, such a large vocabulary makes it inefficient to run using the functional API (i.e., to_event_representation and from_event_representation).

Here are my suggestions.

  • Keep the vocabulary of to_event_representation and from_event_representation simple. To do so, a MIDI-like set-program event is preferable to a track prefix on every note-on. We should also consider making the vocabulary independent of the parameters, e.g. use 388 for EOS even when encode_velocity=False. This way, we don't even need any arguments to decode the data.
  • Make the proposed representation into to_text_representation, which outputs raw strings. We already have a good starting point from this pull request.
  • This way, the event representation outputs integers, each with a fixed meaning, and works pretty much like a simplified MIDI notation, while the text representation outputs strings that support a more advanced syntax.

What do you think? Is there any benefit to having this implemented in the event representation?
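For reference, a sketch of a fixed, parameter-independent layout along these lines (the offsets below are assumptions; only 388-as-EOS comes from the comment above):

```python
def decode_event(code: int) -> tuple:
    # Assumed layout: 0-127 note-on, 128-255 note-off,
    # 256-355 time-shift (1-100), 356-387 set-velocity, 388 EOS.
    if code < 128:
        return ("note_on", code)
    if code < 256:
        return ("note_off", code - 128)
    if code < 356:
        return ("time_shift", code - 256 + 1)
    if code < 388:
        return ("set_velocity", code - 356)
    return ("eos",)

# With encode_velocity=False, velocity codes are simply never emitted,
# so decoding needs no arguments:
print(decode_event(388))  # ('eos',)
```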

@cifkao (Contributor, Author) commented Jan 13, 2021

I'm not completely against a text representation and it has its advantages, but some things to consider:

  • Such a representation cannot be directly used with functions like to_pytorch_dataset (technically it can, but the user would then have to write a wrapper around the dataset to convert the tokens to ints). But that's maybe not such a problem (I'm actually already using a custom factory function anyway, because I need to add the start symbol to each sequence).
  • It's really convenient to already have the vocabulary available in any case. Having to build it from the dataset is an extra (possibly time-consuming) step, and it's less robust (let's say you pre-train on dataset A and then decide to continue training on dataset B, but realize it contains some tokens not contained in A). So I would like at least some way to get the complete list of possible tokens.
  • By having the vocab available, we also get some kind of sanity check for free (invalid tokens automatically cause an error).

So what we could do is merge this as a separate AdvancedEventRepresentationProcessor, possibly without providing the functional API for it (since it brings some complications), and provide a parameter that allows choosing whether to return ints, tuples, or (maybe in the future) strings.

@salu133445 (Owner) commented:

>   • Such a representation cannot be directly used with functions like to_pytorch_dataset (technically it can, but the user would then have to write a wrapper around the dataset to convert the tokens to ints). But that's maybe not such a problem (I'm actually already using a custom factory function anyway, because I need to add the start symbol to each sequence).

That's true. People who want to use a more advanced representation would probably want to use other fancy stuff anyway, I guess. I think that would be fine. Most machine learning frameworks should support an indexer (e.g., Hugging Face).

>   • It's really convenient to already have the vocabulary available in any case. Having to build it from the dataset is an extra (possibly time-consuming) step, and it's less robust (let's say you pre-train on dataset A and then decide to continue training on dataset B, but realize it contains some tokens not contained in A). So I would like at least some way to get the complete list of possible tokens.

That's a trade-off, in fact: having a large vocabulary with some words that never get used, or a smaller vocabulary that depends on the data. It actually depends on how people implement the indexer, but we could certainly have a function that returns the complete list of possible tokens.

>   • By having the vocab available, we also get some kind of sanity check for free (invalid tokens automatically cause an error).

We still have the vocabulary implicitly. The key is that the output itself does not depend on the vocabulary. We could have another function that returns the complete list (e.g., to check a user's inputs).

> So what we could do is merge this as a separate AdvancedEventRepresentationProcessor, possibly without providing the functional API for it (since it brings some complications), and provide a parameter that allows choosing whether to return ints, tuples, or (maybe in the future) strings.

Sounds like a plan! We might also need some documentation.

@cifkao requested a review from salu133445 on March 2, 2021.
@cifkao (Contributor, Author) commented Mar 2, 2021

@salu133445 OK, I moved this to a separate class and also added some tests and docs.

@cifkao (Contributor, Author) commented Mar 2, 2021

I also added the encode_as_tuples and encode_as_strings methods. At first I wanted to have the format as an additional parameter to encode (and/or __init__), but that would not work well with type hints.
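Hypothetical usage, with the class name taken from the earlier proposal and the exact signatures assumed:

```python
import muspy

music = muspy.read("song.mid")  # any input file

processor = AdvancedEventRepresentationProcessor(num_tracks=2)  # assumed name/args
ids = processor.encode(music)                 # integer ids
tuples = processor.encode_as_tuples(music)    # e.g. ('note_on', 1, 87)
strings = processor.encode_as_strings(music)  # e.g. 'note_on_1_87'
```

Separate methods keep the return types precise; a single encode(format=...) would need a Union return type or typing overloads.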
