Make wchoose and windex use relative weights (normalize by default) #6311
For reference, one of the common Haskell modules for this also normalizes by default:
https://hackage.haskell.org/package/random-extras-0.19/docs/Data-Random-Shuffle-Weighted.html |
The documentation states that the weights must sum to 1.0. The behavior when they do not is undefined in the documentation and not what one would expect: the final item(s) in the list either have a much higher probability or don't occur at all, depending on the sum of the weights. As it is, user code must call `.normalizeSum` first. In short: we eliminate the possibility of non-obvious invalid behavior and make this faster for new code. |
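To make the failure mode concrete, here is a minimal C++ model of a cumulative-scan weighted index. This is an assumption about how such a lookup behaves, not the actual sclang primitive: draw `r` in [0, 1), walk the running sum of the weights, and fall through to the last index when `r` never crosses the running sum.

```cpp
#include <cstddef>
#include <vector>

// Minimal cumulative-scan model (illustrative assumption, not sclang's
// actual primitive): draw r in [0, 1), walk the running sum, and fall
// through to the last index when r never crosses it.
int windex_model(const std::vector<double>& weights, double r) {
    double cum = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        cum += weights[i];
        if (r < cum) return static_cast<int>(i);
    }
    return static_cast<int>(weights.size()) - 1; // fall-through case
}
```

Under this model, with weights `{0.9, 0.9, 0.9}` (sum 2.7) any `r < 0.9` picks index 0 and everything else picks index 1, so index 2 never occurs; with weights `{0.1, 0.1, 0.1}` (sum 0.3) every `r >= 0.3` falls through to the last index, inflating its probability. That matches the "much higher probability or don't occur at all" behavior described above.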
I thought of some cases where nobody loses performance. Does anyone want the honor of submitting the first draft, or should I go ahead? I don't care who does it, I'm just volunteering. I'm indifferent if you just change the code directly. Would you like to be one of the reviewers, @scztt? That would be cool. cheers |
@smoge go ahead for me! I don't know how to
|
While I really like the idea, I wonder how we can make this efficient for high n:
|
https://en.cppreference.com/w/cpp/numeric/random/discrete_distribution

In terms of the best algorithm for generating random numbers with a specified discrete distribution, the consensus seems to be the "alias method" (the one analyzed by Knuth). A site that explains the alias method in detail: http://www.keithschwarz.com/darts-dice-coins/ . Code snippet in C++: https://gist.github.com/Liam0205/0b5786e9bfc73e75eb8180b5400cd1f8 . Reference: Knuth, The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, Sect. 3.4.1.

EDIT: I'm not aware of any guideline on when it would be necessary to create new primitives. I think it's not necessary here, but others will know better than me. |
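For readers who don't want to follow the links, the alias method mentioned above can be sketched as follows (Vose's variant): O(n) setup, O(1) per sample. This is an illustrative sketch, not the implementation proposed for sclang.

```cpp
#include <random>
#include <vector>

// Vose's alias method: after O(n) setup, each sample costs one uniform
// integer, one uniform real, and one comparison.
class AliasTable {
public:
    explicit AliasTable(const std::vector<double>& weights)
        : prob_(weights.size()), alias_(weights.size()) {
        const std::size_t n = weights.size();
        double sum = 0.0;
        for (double w : weights) sum += w;

        // Scale so the average bucket weight is exactly 1.
        std::vector<double> scaled(n);
        for (std::size_t i = 0; i < n; ++i) scaled[i] = weights[i] * n / sum;

        std::vector<std::size_t> small, large;
        for (std::size_t i = 0; i < n; ++i)
            (scaled[i] < 1.0 ? small : large).push_back(i);

        // Pair each under-full bucket with an over-full one.
        while (!small.empty() && !large.empty()) {
            std::size_t s = small.back(); small.pop_back();
            std::size_t l = large.back(); large.pop_back();
            prob_[s] = scaled[s];
            alias_[s] = l;
            scaled[l] = (scaled[l] + scaled[s]) - 1.0;
            (scaled[l] < 1.0 ? small : large).push_back(l);
        }
        // Numerical leftovers: remaining buckets are exactly full.
        for (std::size_t i : large) prob_[i] = 1.0;
        for (std::size_t i : small) prob_[i] = 1.0;
    }

    template <class Rng>
    std::size_t sample(Rng& rng) const {
        std::uniform_int_distribution<std::size_t> pick(0, prob_.size() - 1);
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        std::size_t i = pick(rng);
        return coin(rng) < prob_[i] ? i : alias_[i];
    }

private:
    std::vector<double> prob_;
    std::vector<std::size_t> alias_;
};
```

Note that the setup divides by the sum anyway, so the alias method normalizes "for free"; its payoff over a cumulative scan only shows when many samples are drawn from the same weight array.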
The bottleneck is not the problem here. I would say it is better to make an extra pair of methods:

```supercollider
wchooseN { |weights| ^this.at(weights.windexN) }
windexN { ^this.normalizeSum.windex }
```
|
I fear it might be "overkill" to solve the problem, but I will leave it here for consideration. Summing the list was also expensive, so I tried it in C++. The result is quite efficient, with huge lists and tiny elapsed times:

```cpp
void normalize_array(double* arr, size_t n) {
    double sum = std::accumulate(arr, arr + n, 0.0);
    if (sum != 0) {
        double inv_sum = 1.0 / sum;
        for (size_t i = 0; i < n; ++i) arr[i] *= inv_sum;
    }
}
```

https://gist.github.com/smoge/c998c1a9cc3dc874f2871b8ecd2a3279

// Size: 1000
// Size: 10000
// Size: 100000
// Size: 1000000
|
Yes, good as an improvement of normalizeSum. It just requires that the array consist of numbers only. Maybe one could do it for FloatArray, or bail out if a non-number turns up in the list and then do it in sclang instead. For the current issue, I would say it is better to go with an extra pair of methods. But if you like, it would be good to have an optimised normalizeSum as well. It is a classical candidate for this; only the fact that normal arrays may contain different objects is a minor obstacle to care for. |
Well, generally: what would be the sum of the elements of a string? Characters can't be added, but supposing they could, they might form a string. For other things the above makes sense more easily, for example for UGens. |
windex checks if all items are numbers:
There is a primitive prArrayNormalizeSum, is it just used for FloatArray? I'm just surprised because I've never seen .normalizeSum being used in other contexts... |
Ah yes – I should have read the code. It already does more or less what I thought it should do.
Well, I suppose the point is never what we have seen, but what is made possible. But |
So @Asmatzaile, would you like to open another issue or rename this one? We should not propose to change windex and wchoose, but instead to add two new methods. |
I think we should actually replace the existing functions with the implicitly scaling ones. I think implicit scaling is more user-friendly. Arrays that don't sum to 1 aren't stochastic distributions in the first place, and it's still unclear to me what sclang actually does with them. Instead of unclear behavior or throwing an error, the function could try to recover into "deterministic behavior" by scaling. If we had a proper logging system, we could also emit a warning so that an implicit scaling gets noticed. |
In sclang, checking this would be quite expensive for those large lists (I'm not sure, but I think it was something like 15 times slower). That's why I tried to compare it with C++. The only special case I'm not sure what to do with is a list that sums to zero, which would cause a division by zero here, and how to calculate different tolerances for small and very large lists (because of floating-point errors at both extremes). Especially since negative numbers are allowed, which is not the normal usage for normalizeSum. Of course, real benchmarks would need to be run inside sclang. How would you deal with this? Glad to receive feedback.

```cpp
void normalize_array(std::vector<double>& arr) {
    const double sum = std::accumulate(arr.begin(), arr.end(), 0.0);
    if (sum <= std::numeric_limits<double>::epsilon()) {
        throw std::runtime_error("Cannot normalize array: sum too close to zero.");
    }
    double inv_sum = 1.0 / sum;
    std::transform(arr.begin(), arr.end(), arr.begin(),
                   [inv_sum](double val) { return val * inv_sum; });
}
```

The benchmark I posted before was adding the time to generate the array to the time to normalizeSum it. A normalizeSum-only benchmark confirms linear complexity: Size: 1000 |
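On the tolerance question raised above: one common approach (a sketch of a standard technique, not something the thread settled on) is to scale the zero check by the list length and the largest magnitude, since `n * epsilon * max_abs` is a worst-case bound on the rounding error accumulated by naive summation.

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Size-aware zero check: treat the sum as "effectively zero" when it is
// smaller than the rounding error that could accumulate while naively
// summing n values of this magnitude. The bound n * epsilon * max_abs is
// a standard worst-case estimate, not a value taken from the thread.
bool sum_is_effectively_zero(const std::vector<double>& arr, double sum) {
    double max_abs = 0.0;
    for (double v : arr) max_abs = std::max(max_abs, std::abs(v));
    const double tol =
        arr.size() * std::numeric_limits<double>::epsilon() * max_abs;
    return std::abs(sum) <= tol;
}
```

This handles the negative-weight case mentioned above: `{1.0, -1.0}` is rejected as effectively zero regardless of list length, while any sum that is large relative to the elements passes.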
My issue is that it takes away the possibility of efficiently calling wchoose on a huge array (or at least that it is a waste of energy). This is why I proposed separate methods. You could maybe benchmark with this?

```supercollider
(
var n = 10000;
var weights = Array.fill(n, { |i| i.squared.rand }).normalizeSum;
var freqs = Array.fill(n, { exprand(300, 8000) });
bench { 1000.do { freqs.wchoose(weights) } };
)

(
var n = 10000;
var weights = Array.fill(n, { |i| i.squared.rand });
var freqs = Array.fill(n, { exprand(300, 8000) });
bench { 1000.do { freqs.wchoose(weights.normalizeSum) } };
)
```
|
I am trying to find a discussion around the original implementation of the `wchoose` method. Are there any use cases where the user already has a normalized list without having called `.normalizeSum`? |
Yes, it's the normal case, where you just write out a normalized array, or you calculate it by hand and paste it in. This is at least what I have seen most; of course we don't know what is out there. But while from the perspective of practical use I'd love to have a normalising wchoose, I still would prefer not to touch it and just add a letter to the end of the method name if I want it normalized. Then not everyone has to pay for a feature that is not needed in every case. |
Perhaps a good compromise would be to add a method for those users who don't want the array normalized. I think the best default behavior is the one that is least likely to produce unexpected results. A sophisticated user seeking extra performance could reach for the non-normalizing variant. |
Just to be clear, we mean a method for those users who don't want the array normalized again for each new random value. So the choice could be either:
To decide, benchmarks would be needed, I think, because wchoose is in heavy use already. |
How about adding optional arguments to |
I like this. One less method to have to remember. |
Interesting idea, a little unusual, but possible. |
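The optional-argument idea can be modeled in C++ as follows. The parameter name `normalize` and the overall shape are illustrative assumptions, not the sclang API. Instead of rescaling the weights, drawing `r` in [0, sum) is equivalent to normalizing and avoids a second pass over the array.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical shape of a wchoose with an opt-out flag, modeled in C++.
// The `normalize` parameter is illustrative only.
template <class T, class Rng>
const T& wchoose_model(const std::vector<T>& items,
                       const std::vector<double>& weights, Rng& rng,
                       bool normalize = true) {
    double sum = 1.0; // with normalize=false, trust that weights sum to 1
    if (normalize) {
        sum = 0.0;
        for (double w : weights) sum += w;
    }
    // Drawing in [0, sum) makes the scan behave as if weights were scaled
    // to sum to 1, without touching the array.
    std::uniform_real_distribution<double> dist(0.0, sum);
    const double r = dist(rng);
    double cum = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        cum += weights[i];
        if (r < cum) return items[i];
    }
    return items.back(); // fall-through for rounding leftovers
}
```

With the default, callers pay one extra O(n) summation per call; with `normalize=false`, behavior matches the current raw `wchoose`, which is the trade-off the benchmarks above are meant to quantify.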
What I'm suggesting now is to keep |
ok, that is perhaps really the best solution. But then, why not just pass only one array that will be normalised, and if you want to be efficient you use wchoose? Semantically slightly bent, but it is not uncommon that the most used method is easy but comes at a little cost. (it is easier to add a |
Cool! Will make a PR later. |
Motivation

`wchoose` and `windex` don't normalize; that makes it easy to get unexpected results. Looking at the above code, one would expect to get `\blue` thrice as often and `\green` twice as often as `\red`. In reality, `\red` is the only element that is picked.

The correct option would have been to use `list.wchoose([100,100,100].normalizeSum)`. In practice, all lists with weights that are not normalized (and I assume that they are the majority) will need to have `.normalizeSum` applied.

Description of Proposed Feature

The proposal is to make those functions normalize by default, as other languages such as Python do. If efficiency is a concern, a flag can be added for the arguably rare cases where the sum of the list is already equal to one.