
feat(legacy): implement subset sum solution to show scheduling #3019

Open · wants to merge 9 commits into main

Conversation

@dakriy commented May 21, 2024

Description

When running a radio station it is generally a good idea to reduce dead air time. The current logic for adding tracks to a block/show can leave a lot of dead air at the end because it simply adds random tracks until nothing more fits. Filling a show as completely as possible without exceeding its length is a well-known problem in computer science: the Subset Sum Problem. This PR implements a Randomized Greedy with Local Improvement (RGLI) approximation for the Subset Sum Problem. The new algorithm is only used when the sort type is random, overflow is not enabled, and there is no limit on the number of tracks.
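
A minimal sketch of that gating condition, with hypothetical names (`sortIsRandom`, `overflowEnabled`, `trackLimit` are illustrative only, not the actual legacy PHP identifiers):

```kotlin
// Hypothetical sketch of when the subset-sum fill is used, per the description above.
// The real check lives in the legacy PHP scheduler; names here are illustrative only.
fun shouldUseSubsetSumFill(sortIsRandom: Boolean, overflowEnabled: Boolean, trackLimit: Int?): Boolean =
    sortIsRandom && !overflowEnabled && trackLimit == null
```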

This is a new feature:

Improvement on an existing feature.

I have not updated the documentation to reflect these changes:

I did not update the documentation because the scheduling algorithm is not currently documented and its existing behaviour has not changed.

Testing Notes

What I did:

I first attempted a fully polynomial-time approximation scheme (FPTAS), but it is really bad at finding good solutions for high-density inputs and can get fairly slow as the number of tracks and the target time grow. So I instead implemented an RGLI, which is O(n log n) and has been giving much better results.

I implemented the solution in a separate project and tested and timed it with a normal distribution of 500 songs with a mean length of 3 minutes and a standard deviation of 1 minute. With a show size of 1 hour, the algorithm took around 10-15 ms to run. When adjusting the block size and track lengths the algorithm was still quick to run. I am going to test on an instance with a large number of tracks later and will update the PR when I have done that.

How you can replicate my testing:

How can the reviewer validate this PR?

Links

Closes #3018

https://en.wikipedia.org/wiki/Subset_sum_problem#Fully-polynomial_time_approximation_scheme

https://web.stevens.edu/algebraic/Files/SubsetSum/przydatek99fast.pdf

@dakriy dakriy changed the title feat(legacy): Implement Subset Sum Solution to block scheduling feat(legacy): implement subset sum solution to show scheduling May 21, 2024
@paddatrapper paddatrapper requested a review from jooola May 22, 2024 16:35
@dakriy (Author) commented Jun 3, 2024

Switched to RGLI, as the fully polynomial-time approximation scheme gave unwanted results when the input tracks were tightly clustered around the same length. The RGLI also gives better results at much faster speeds.

@dakriy (Author) commented Jun 3, 2024

@jooola I ran the tests locally, so hopefully the build will pass now.

@dakriy (Author) commented Jun 3, 2024

@jooola The build failed on a file I never touched, and correct me if I'm wrong, but I feel fixing typos in the docs is out of scope for this PR. I can fix it if you want, but I would like to avoid any merge conflicts. I did pull in the most recent master, though.

@jooola (Contributor) commented Jun 4, 2024

Yes, I'll handle this. Do not lose too much time :)

@dakriy (Author) commented Jun 5, 2024

Ok thanks! Was there anything else you needed me to do for this PR?

@paddatrapper (Contributor) commented
Fixed in #3027. Can you merge/rebase on main?

paddatrapper added a commit that referenced this pull request Jun 5, 2024
### Description

In the existing logic of `retrieveMediaFiles`, the time remaining in the show is calculated incorrectly in some scenarios. Each time a duration is subtracted from `showLimit`, it is not the duration of the files just added, but the length of all files scheduled so far. This can cause cases where a smart block set to "time remaining in show" fails to completely fill the program.

For example, given a 30 minute show, and a playlist like follows:

1. a 5 minute track
2. another 5 minute track
3. smart block, set to time remaining in show

When item 1 is added, `showLimit` is reduced by 5 minutes as expected.
When item 2 is added, `showLimit` is reduced by 10 minutes (as both
items 1 and 2 are counted). As a result, the smart block is only run to
fill 15 minutes, leaving 5 minutes unfilled.

This PR resolves this issue by recalculating `showLimit` from the original show duration rather than subtracting from a running total.
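
A minimal sketch of the arithmetic, written in Kotlin rather than the actual PHP, using the 30 minute example above (durations in minutes; `showLimitBuggy`/`showLimitFixed` are illustrative names):

```kotlin
fun main() {
    val showLength = 30.0
    val scheduled = mutableListOf<Double>()
    var showLimitBuggy = showLength

    for (track in listOf(5.0, 5.0)) {
        scheduled += track
        // Old behaviour: subtract the length of *all* scheduled files each time.
        showLimitBuggy -= scheduled.sum()
        // Fixed behaviour: recompute the remaining time from the original show length.
        val showLimitFixed = showLength - scheduled.sum()
        println("buggy remaining = $showLimitBuggy, fixed remaining = $showLimitFixed")
    }
    // After both tracks: buggy leaves 15.0 for the smart block, fixed correctly leaves 20.0.
}
```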

This change does not implement a new feature and should not require any changes to documentation.

### Testing Notes

**What I did:**

- On a dev environment, set up a playlist as described above.
- Before applying this PR, created a show and scheduled playlist, and
confirmed issue was reproducible
- Applied PR and repeated, and confirmed show was filled completely.

Also repeated this testing with sample data from our production
instance.

Here is a sample schedule of the "before" case with sample data, showing
the issue

![image](https://github.com/libretime/libretime/assets/8541186/f91849fb-606f-410e-bef5-a7abc8e7b7f4)
The smartblock that scheduled the music is set to allow last track to
overflow, but 3m55s was left unscheduled.

Using the same playlist and same track library, here is a sample
schedule after the PR applied:

![image](https://github.com/libretime/libretime/assets/8541186/e9d46fbb-50e6-4859-a3de-f5a90a6021c0)
As expected, the show is fully scheduled with the last track
overflowing.
 
Additionally, I've applied this PR as a hot fix to our production
instance, where it has been running for a week without issue.

Also performed spot tests of playlists without smart blocks, smart
blocks scheduled directly (not in playlists) and autoloading playlists,
with no change in behaviour observed as a result of this change.

**How you can replicate my testing:**

Following the same test steps should reproduce the issue and validate the fix on any instance.

### **Links**

Not directly related to the issue fixed by #3019, but this also addresses the issue of dead air left at the end of blocks.

Co-authored-by: Kyle Robbertze <[email protected]>
@dakriy (Author) commented Jun 5, 2024

Merged @paddatrapper

@jooola (Contributor) commented Jun 7, 2024

Would it be possible to have a parameter to overshoot the total length of the block?

Based on the abstract of the document, I understand that the algorithm's goal is to find a subset of tracks with a total closest to the total length of the block, but never greater. But I'd prefer to always overshoot it, so we can always fade out a track if needed.

Ideally, we want to have the length closest to the block length, and always greater than the block length.

@jooola (Contributor) commented Jun 7, 2024

Also, I wonder how random this algorithm actually is. I'd prefer full randomness with overshoot over a closer fit that always picks the same tracks.

I haven't fully understood the algorithm yet, but my understanding is that it tries to add the longest available tracks to the list, and if a track doesn't fit, it tries a shorter one, until the total duration is filled. How can we make sure that the added tracks will still be as random as with a simpler algorithm? Or did I mix up the algorithms?

@dakriy (Author) commented Jun 11, 2024

@jooola sorry for the incoming wall of text. Please let me know if you want me to run any other tests, or if you have questions about or don't understand any of this.

Overshoot

For the “overshoot” concern, I did not remove the old “overflow” parameter, and when it is set the old algorithm is used. The reasoning is that if overshoot or overflow is desired, you don't need an efficient fill algorithm, since you're going to fade out wherever the show ends anyway. If you paired this algorithm with overflow, in most cases you'd be fading out within the first few seconds (or milliseconds) of a song, which wouldn't make much sense. That is why this algorithm is not run when overflow is desired. This algorithm was built for when you DON'T want overshoot and fade, and you just want the scheduler to figure out a selection such that the show ends as close as possible to when the next show is supposed to start. As previously mentioned, you can still overshoot and fade if you want; that functionality isn't going away. This is just an optimization for the case of no overshooting.

Algorithm Details

You had a few incorrect ideas about how the algorithm works. To summarize the paper and simplify a bit, I will walk through building up the algorithm, starting with the simplest case:

The simplest case is what was implemented before: the random greedy (RG) algorithm. You just randomly select files until you hit the target length or run out of files, throwing out any track that would push you over the target. This was how it was previously done.

This is a pretty good way of doing things and gets decent results but it can be improved quite a bit.

The next logical step is to do the same RG selection many times (randomizing the input each time so you get different results) and select the best outcome. So you do the random selection, say, 50 times, randomizing the input track order each time, and keep the result that comes closest to your target length.

That is also pretty good, but it can still be improved by going back over the picked tracks after each trial and replacing some of them with slightly longer unused tracks (if any exist) to get closer to the target time. This is the local improvement (LI) step of the RGLI algorithm: it runs the RG trial many times and, after each trial, tries to make the result a little better. The best trial out of all of these is selected and returned. This does lead to a slight bias towards slightly longer songs in some scenarios, but for the most part the selection should be random.
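
Here is a toy, self-contained sketch of the LI swap with made-up numbers; the real implementation is the RGLISSP code at the end of this comment:

```kotlin
// Toy illustration of a single local improvement (LI) swap; all numbers are made up.
// Target is 600 s and the random greedy pass picked tracks summing to 570 s.
fun main() {
    val target = 600.0
    val picked = mutableListOf(180.0, 210.0, 180.0)   // sum = 570.0
    val unused = listOf(150.0, 205.0, 230.0)          // remaining pool

    for ((i, len) in picked.withIndex()) {
        val slack = target - picked.sum()
        // Look for a longer unused track that still fits within the remaining slack.
        val better = unused.filter { it > len && it - len <= slack }.maxOrNull() ?: continue
        picked[i] = better                             // 180.0 -> 205.0, new sum = 595.0
        break                                          // the full algorithm keeps scanning the remaining picks
    }
    println("Improved sum: ${picked.sum()} of $target")
}
```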

Randomness

I am assuming you are talking about randomness of the set of selected tracks (ignoring order), as that makes the most sense here. The randomness depends heavily on the number of input tracks, their lengths, and the target length. The implemented algorithm does a random search through the space of combinations and returns the best result from all of its trials (with some local improvement). Since it is by nature a random search, the output will be random, unless you have a low number of tracks and a total track length close to (but greater than) the desired target show length.

If the total length of the input tracks is less than the desired length, there is no variation because all tracks get selected every time. No discernible difference from the old algorithm.

If the number of tracks in the database is large (tracks in DB > 500), the DB selects 500 random tracks every time anyway, so you get a completely different set each run. No difference from the old algorithm, except you get much better scheduling.

I did some entropy calculations on some runs, focusing on the more interesting cases where 10 < N < 500. A brief introduction to entropy if you are not familiar:

Entropy is a measure of randomness. The higher the entropy, the more random something is. Told you it was brief!
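
For reference, the number reported below is the Shannon entropy of the empirical distribution of distinct selected track sets across the runs (this matches what `entropyEven` computes in the code further down):

```math
H = -\sum_{i} p_i \log_2 p_i
```

where p_i is the fraction of runs that produced the i-th distinct set of tracks.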

Data

In my testing I computed entropy across 100 runs of each algorithm on the same items. If the set of tracks that came out was exactly the same as in another run (ignoring order), that counts against the randomness. It is important to keep in mind that most of the differences between the algorithms show up where the total length of the tracks is only slightly longer than the target length, so you get a lot of overlap between runs even when they aren't the exact same selection. For example, with a target length of an hour and 20 tracks (around 3+ minutes each), you end up with 15 or 16 chosen per run. There are 20 choose 15 = 15504 ways to pick 15 of 20 tracks; most of those overlap heavily but are not the exact same set. For my testing I treated any two distinct sets as completely different, since we are interested in randomness across distinct selections.

The testing was done with random tracks from my library (roughly 347 hours of audio). In almost all cases the algorithms produced equally random results. The randomness differs in a few edge cases; it is most noticeable when you have a low number of tracks and a total track length close to (but greater than) the desired target show length. For pretty much all other cases the randomness is similar. Let me know if you want the code used to make the data.

Now for the data!!
[Four entropy plots attached (RGLI vs. random selection) for the sweeps generated by the code below.]

Code used to generate data

```kotlin
import com.github.doyaaaaaken.kotlincsv.dsl.csvWriter
import java.io.File
import javax.sound.sampled.AudioSystem
import javax.sound.sampled.UnsupportedAudioFileException
import kotlin.math.log2
import kotlin.random.Random
import kotlin.system.measureNanoTime
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.async
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext
import org.tritonus.share.sampled.file.TAudioFileFormat

private fun <T> benchmark(block: () -> T): T {
    val result: T
    val time = measureNanoTime {
        result = block()
    }
    println("Took ${time / 1000000.0} ms")

    return result
}

fun main() {
//    entropyTesting()
    entropySweepNTarget1H()
    entropySweepNTarget2H()
    entropySweep20NTarget1H()
    entropySweepTargetN100()
    entropySweepTargetN500()
}

// interested in:
// entropy for sweeping item size items compared to entropy for full randomness for a high target
// entropy for sweeping target length with fixed size

// Entropy could be measured in two different ways:
// same set chosen every time (this is what we are doing)
// same song chosen across times

private fun Iterable<Pair<Int, Pair<Double, Double>>>.write(name: String) {
    csvWriter().open(name) {
        writeRow(listOf("N", "RGLI", "Random"))
        forEach { row ->
            writeRow(row.first, row.second.first, row.second.second)
        }
    }
}

private fun entropySweepNTarget1H() {
    val result = sweepItems(60 * 60.0)

    result.write("sweepNTarget1H.csv")
}

private fun entropySweepNTarget2H() {
    val result = sweepItems(60 * 60.0 * 2)

    result.write("sweepNTarget2H.csv")
}

private fun entropySweep20NTarget1H() {
    val result = sweepItems(60 * 60.0, 10..30 step 1)

    result.write("sweep20NTarget1H.csv")
}

private fun entropySweepTargetN20() {
    val result = sweepTarget(20, 2000..6000 step 100)

    result.write("sweepTargetN20.csv")
}

private fun entropySweepTargetN100() {
    val result = sweepTarget(100, 22000..25000 step 100)

    result.write("sweepTargetN100.csv")
}

private fun entropySweepTargetN500() {
    val result = sweepTarget(500)

    result.write("entropySweepTargetN500.csv")
}


private fun entropyTesting() {
    val rand = Random(38462894653)
    val items = 200
    val musics = getMusic(items, rand).map(Music::toItem)

    val (rgliS, randS) = benchmark {
        computeEntropyFor(musics, 3600.0, 50000)
    }
    println("RGLI entropy: $rgliS bits")
    println("Random entropy: $randS bits")
}

private fun sweepItems(target: Double, range: IntProgression = (10..500 step 10), trials: Int = 100) = range.map { num ->
    println("On step $num")
    val items = getMusic(num).map(Music::toItem)
    num to computeEntropyFor(items, target, trials)
}

private fun sweepTarget(itemsCount: Int, range: IntProgression = (60..3600 step 100), trials: Int = 100) = range.map { num ->
    println("On step $num")
    val items = getMusic(itemsCount).map(Music::toItem)
    num to computeEntropyFor(items, num.toDouble(), trials)
}

private fun computeEntropyFor(
    items: List<Item>,
    target: Double,
    trials: Int,
): Pair<Double, Double> {
    val rgliSSP = RGLISSP()

    val (rgli, random) = runBlocking {
        val rgliJob = async {
            withContext(Dispatchers.Default) {
                (1..trials).map {
                    rgliSSP.solveSSP(items, target).items.toSet()
                }.entropyEven()
            }
        }
        val randomJob = async {
            withContext(Dispatchers.Default) {
                (1..trials).map {
                    RandomSSP.solveSSP(items.shuffled(), target).items.toSet()
                }.entropyEven()
            }
        }

        val one = rgliJob.await()
        val two = randomJob.await()
        one to two
    }

    return rgli to random
}

data class Music(
    val name: String,
    val length: Double,
    val path: String,
) {
    fun toItem() = Item(this, length)
}

private fun getMusic(count: Int, random: Random = Random.Default): List<Music> {
    return File(System.getProperty("user.home"), "Music") // java.io.File does not expand "~"
        .walkTopDown()
        .filter {
            it.isFile && it.extension == "mp3"
        }
        .map { file ->
            val fileFormat = AudioSystem.getAudioFileFormat(file)
            val duration = if (fileFormat is TAudioFileFormat) {
                val properties: Map<*, *> = fileFormat.properties()
                val key = "duration"
                val microseconds = properties[key] as Long?
                (microseconds!! / 1000.0 / 1000.0)
            } else {
                throw UnsupportedAudioFileException()
            }

            Music(
                file.name,
                duration,
                file.path,
            )
        }
        .filter {
            it.length in (1.0 * 60)..(20.0 * 60) && random.nextDouble() < 0.1
        }
        .take(count)
        .toList()
}

private fun <T> Iterable<T>.entropy(probabilityDistribution: (T) -> Double): Double =
    toSet().sumOf { item ->
        val prob = probabilityDistribution(item)
        -prob * log2(prob)
    }

fun <T> Collection<T>.entropyEven() = entropy { item ->
    val total = count { item == it }
    total.toDouble() / size.toDouble()
}

private fun distributionTest() = runBlocking {
    val avgSongLength = 60.0 * 3.0
    val songStandardDeviation = 60.0
    val rand = java.util.Random()
    var id = 0
    val nums = buildList {
        repeat(500) {
            add(
                Item(
                    id++,
                    rand.nextGaussian(avgSongLength, songStandardDeviation).coerceAtLeast(60.0)
                )
            )
        }
    }

    val target = 60.0 * 60.0 * 24.0

    val result = benchmark { RGLISSP().solveSSP(nums, target) }

    val resultSum = result.sum
    println("Result Sum: $resultSum")
    println("off by ${target - resultSum} seconds")
    println("elements: $result")
}

data class Solution(
    val items: List<Item> = emptyList(),
    val sum: Double = items.sumOf(Item::length),
) {
    fun add(item: Item) = Solution(items + item, sum + item.length)
    fun isCloseEnough(delta: Double) = delta < 0.25
    fun replace(old: Item, new: Item) = Solution(items.map { if (it == old) new else it })
}

data class Item(
    val id: Any,
    val length: Double,
)

fun interface SSPAlgorithm {
    fun solveSSP(items: List<Item>, target: Double): Solution
}

// The previous behaviour: a single random greedy pass over the (pre-shuffled) track list.
object RandomSSP : SSPAlgorithm {
    override fun solveSSP(items: List<Item>, target: Double): Solution {
        return items.fold(Solution()) { solution, item ->
            val new = solution.add(item)
            if (new.sum <= target) new else solution
        }
    }
}

class RGLISSP(val trials: Int = 50) : SSPAlgorithm {
    override fun solveSSP(items: List<Item>, target: Double): Solution {
        var best = Solution()
        repeat(trials) {
            val shuffled = items.shuffled()

            // Randomized greedy (RG): walk the shuffled tracks and keep any that still fit.
            val initial = shuffled.fold(Solution()) { acc, item ->
                val new = acc.add(item)
                if (new.sum <= target) new else acc
            }

            // Local improvement (LI): try to swap picked tracks for slightly longer
            // unused ones, as long as the total still fits under the target.
            val localImprovement = initial.items.shuffled().fold(initial) { solution, item ->
                val delta = target - solution.sum
                if (solution.isCloseEnough(delta)) return@fold solution

                val replacement = (items - solution.items.toSet()).filter {
                    it.length > item.length && it.length - item.length <= delta
                }.maxByOrNull(Item::length) ?: return@fold solution

                solution.replace(item, replacement)
            }

            if (localImprovement.sum > best.sum) {
                best = localImprovement
            }
            if (best.sum == 0.0) {
                // No single track fits the target, so further trials cannot help.
                return best
            }
        }

        return best
    }
}
```
