Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problems with remove_duplicate #32

Open
cifkao opened this issue Dec 26, 2020 · 4 comments
Open

Problems with remove_duplicate #32

cifkao opened this issue Dec 26, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

@cifkao
Copy link
Contributor

cifkao commented Dec 26, 2020

The remove_duplicate method has two problems:

  • It requires the list to be sorted, which is not documented.
  • The sorting done by the sort method is not sufficient for this purpose, because it only sorts by time. To catch all duplicates, one would need to sort by all attributes (including non-primitive ones, which are not comparable). (For example duplicate tracks will get removed iff they are adjacent.)

Since making all MusPy objects (of the same type) comparable in a meaningful way would be complicated, the only reasonable solution seems to be to remove duplicates without sorting. For that, it would be sufficient to make objects hashable. Or JSON strings can be used as keys for a quick solution.

@salu133445
Copy link
Owner

Yes, the current implementation of remove_duplicate only works as expected if the sort method sorts objects with all attributes, which is not the case for now. I agree that making MusPy objects comparable would be complicated and misleading in some way.

I think simple equality check is sufficient as we can do this independently for each time step by finding all objects that have the same time and compare them against one another.

However, we might need a policy (possibly given as an argument) to decide which object, the former one or the latter one, to keep. For example, if we have tempos = [Tempo(time=0, qpm=100), Tempo(time=0, qpm=120), Tempo(time=0, qpm=100)], we probably want to keep the last one, but the result can be either [Tempo(time=0, qpm=120), Tempo(time=0, qpm=100)] or [Tempo(time=0, qpm=100)], which depends on user's preference.

@cifkao
Copy link
Contributor Author

cifkao commented Dec 26, 2020

I think simple equality check is sufficient as we can do this independently for each time step by finding all objects that have the same time and compare them against one another.

Right, that would work.

However, we might need a policy (possibly given as an argument) to decide which object, the former one or the latter one, to keep. For example, if we have tempos = [Tempo(time=0, qpm=100), Tempo(time=0, qpm=120), Tempo(time=0, qpm=100)], we probably want to keep the last one, but the result can be either [Tempo(time=0, qpm=120), Tempo(time=0, qpm=100)] or [Tempo(time=0, qpm=100)], which depends on user's preference.

Hmm, I wasn't even thinking of this case, what I had in mind were exactly duplicate objects that don't get sorted next to each other because there is some other object at the same time step which will end up between them.

@cifkao cifkao mentioned this issue Dec 26, 2020
@cifkao
Copy link
Contributor Author

cifkao commented Dec 27, 2020

I think simple equality check is sufficient as we can do this independently for each time step by finding all objects that have the same time and compare them against one another.

Actually, this doesn't work for objects that don't have a time attribute (e.g. tracks). Right now remove_duplicate will consider such objects, but there is no way to sort them.

@salu133445
Copy link
Owner

I think the high-level idea is to do it for each time step if an object has a time attribute, otherwise do it for all list items.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants