
Make wchoose and windex use relative weights (normalize by default) #6311

Open
Asmatzaile opened this issue May 15, 2024 · 26 comments · May be fixed by #6322

@Asmatzaile

Motivation

wchoose and windex don't normalize their weights; that makes it easy to get unexpected results.

(
var list, pick;
list = [\red, \green, \blue];
pick = list.wchoose([100, 200, 300]);
)

Looking at the above code, one would expect to get \blue thrice as often and \green twice as often as \red. In reality, \red is the only element that is ever picked.

The correct option would have been to use list.wchoose([100, 200, 300].normalizeSum). In practice, every list of weights that is not already normalized (and I assume that is the majority) needs .normalizeSum applied.
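
For illustration, here is a quick tally (a minimal sketch, not part of the proposal) that makes the difference visible:

(
var list = [\red, \green, \blue];
var weights = [100, 200, 300];
var tally = { |w|
    var counts = (red: 0, green: 0, blue: 0);
    10000.do { var pick = list.wchoose(w); counts[pick] = counts[pick] + 1 };
    counts
};
tally.(weights).postln;               // only \red is ever picked
tally.(weights.normalizeSum).postln;  // roughly 1/6 \red, 1/3 \green, 1/2 \blue
)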

Description of Proposed Feature

The proposal is to make those functions normalize by default, as other languages such as Python do. If efficiency is a concern, a flag could be added for the arguably rare case where the sum of the weights is already equal to one.

@smoge
Contributor

smoge commented May 15, 2024

For reference, one of the common modules for this in Haskell also normalizes by default ($\frac{w}{\text{sum of all weights}}$):

Weighted list
A list of pairs (w, a), where w is the weight of element a. The probability of an element getting into the first position is equal to its weight divided by the sum of all weights, and the probability of getting into a position other than the first is equal to the probability of getting into the first position when all elements in prior positions have been removed from the weighted list.

https://hackage.haskell.org/package/random-extras-0.19/docs/Data-Random-Shuffle-Weighted.html

@scztt
Contributor

scztt commented May 15, 2024

The documentation states that the weights must sum to 1.0. The behavior when they do not is not defined in the documentation and not exactly expected (the final item(s) in the list either have much higher probability or don't occur at all, depending on the sum of the weights).

As it is, user code must either (a) call .normalizeSum on the weights, or (b) use weights that naturally sum to 1.0 - almost all code does (a). Performance will be slightly slower for old code in either of these cases, but the output will not change. However, new code doesn't need .normalizeSum, and normalizing internally will be much faster because it can be done in place and doesn't require a full array copy.

In short: we eliminate the possibility of non-obvious invalid behavior and make this faster for new code.

@smoge
Contributor

smoge commented May 16, 2024

I thought of some cases where nobody loses performance. Does anyone want the honor of submitting the first draft, or should I go ahead?

I don't care who does it, I'm just volunteering. I'm indifferent if you just change the code directly.

Would you like to be one of the reviewers, @scztt? That would be cool.

cheers

@Asmatzaile
Author

@smoge go ahead for me! I don't know how to

normaliz[e] internally [so that it] doesn't require a full array copy

@telephon
Member

telephon commented May 17, 2024

While I really like the idea, I wonder how we can make this efficient for high n:

(
var n = 2048;
var weights = Array.fill(n, { |i| i.squared.rand }).normalizeSum;
var freqs = Array.fill(n, { exprand(300, 8000) });
fork {
     loop {
          0.01.wait;
          (freq: freqs.wchoose(weights), sustain: 0.01).play;
     }
}
)

@smoge
Contributor

smoge commented May 18, 2024

While I really like the idea, I wonder how we can make this efficient for high n:

windex is already a primitive; if performance is the priority, this could be an option [see EDIT]:

https://en.cppreference.com/w/cpp/numeric/random/discrete_distribution

EDIT: I'm not aware of any guideline for when it is necessary to create new primitives. I think it's not necessary here, but others will know better than me.

In terms of the best algorithm for generating random numbers from a specified discrete distribution, the consensus seems to be the 'alias method' (the one analyzed by Knuth).

A site that explains the alias method in detail: http://www.keithschwarz.com/darts-dice-coins/

code snippet in cpp: https://gist.github.com/Liam0205/0b5786e9bfc73e75eb8180b5400cd1f8

reference: Knuth, The Art of Computer Programming, Vol 2: Seminumerical Algorithms. Sect 3.4.1.
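
To make the idea concrete, here is a minimal sketch of Vose's alias method in sclang (illustrative only; ~buildAlias and ~aliasSample are made-up names, not library methods, and non-negative weights are assumed). The tables are built once in O(n); after that every sample costs O(1):

(
~buildAlias = { |weights|
    var n = weights.size;
    var scaled = weights.normalizeSum * n;   // mean value is now 1.0
    var prob = 0.0 ! n, alias = 0 ! n;
    var small = List.new, large = List.new;
    scaled.do { |p, i| if(p < 1.0) { small.add(i) } { large.add(i) } };
    while { small.notEmpty and: { large.notEmpty } } {
        var s = small.pop, l = large.pop;
        prob[s] = scaled[s];
        alias[s] = l;                              // the leftover of column s is filled by l
        scaled[l] = scaled[l] + scaled[s] - 1.0;
        if(scaled[l] < 1.0) { small.add(l) } { large.add(l) };
    };
    (small ++ large).do { |i| prob[i] = 1.0 };     // numerical leftovers
    [prob, alias]
};

~aliasSample = { |prob, alias|
    var i = prob.size.rand;
    if(1.0.rand < prob[i]) { i } { alias[i] }
};

// build once, then sample many times in O(1)
~tables = ~buildAlias.([100, 200, 300]);
Array.fill(10, { ~aliasSample.(~tables[0], ~tables[1]) }).postln;
)

The build cost is paid once, so the per-sample cost no longer depends on the length of the weight list.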

@telephon
Member

telephon commented May 19, 2024

The bottleneck is not wchoose but normalizeSum. If you normalize for each single random value, this is an unnecessary overhead, and if you pass a normalized list, it is a wasted effort and may even result in (small) floating point differences.

I would say it is better to make an extra method wchooseN which normalizes:

wchooseN { |weights| ^this.at(weights.windexN) } 
windexN { ^this.normalizeSum.windex }
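
Usage would then look like this (illustrative):

[\red, \green, \blue].wchooseN([100, 200, 300]);  // normalizes the weights internally
[100, 200, 300].windexN;                          // 0, 1 or 2 with probability 1/6, 1/3, 1/2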

@smoge
Contributor

smoge commented May 19, 2024

I fear it might be "overkill" to solve the problem, but I will leave it here for consideration.

Summing the list was also expensive, so I tried std::accumulate, which is a left fold analogous to foldl in Haskell. It seems to work. std::accumulate is used only once in SuperCollider; maybe it should be exposed to the language [unrelated idea].

The result is quite efficient, with huge lists and tiny elapsed times:

#include <cstddef>
#include <numeric>

void normalize_array(double* arr, size_t n) {
    double sum = std::accumulate(arr, arr + n, 0.0);
    if (sum != 0) {
        double inv_sum = 1.0 / sum;
        for (size_t i = 0; i < n; ++i) arr[i] *= inv_sum;
    }
}

https://gist.github.com/smoge/c998c1a9cc3dc874f2871b8ecd2a3279

// Size: 1000
// Sum before normalization: 48108.5
// Sum after normalization: 1
// Elapsed time: 2.0227e-05s

// Size: 10000
// Sum before normalization: 500814
// Sum after normalization: 1
// Elapsed time: 0.000201519s

// Size: 100000
// Sum before normalization: 5.00622e+06
// Sum after normalization: 1
// Elapsed time: 0.002018s

// Size: 1000000
// Sum before normalization: 5.00258e+07
// Sum after normalization: 1
// Elapsed time: 0.0128995s

@telephon
Member

telephon commented May 19, 2024

Yes, good as an improvement of normalizeSum – it just requires that the array consists of numbers only. Maybe one could do it for FloatArray, or bail out if a non-number suddenly turns up in the list and then do it in sclang instead.

For the current issue, I would say that it is better to go with an extra pair of methods.

But if you like, it would be good to have an extra primitive for an optimised normalizeSum. It is a classic candidate for this; only the fact that normal arrays may contain arbitrary objects is a minor obstacle to take care of.

@telephon
Member

Broader question: From a language design perspective, what would a string inside an Array mean for normalizeSum? Does it work right now?

well, generally, normalizeSum = $[x_1 / (x_1 + x_2 + \ldots),\ x_2 / (x_1 + x_2 + \ldots),\ \ldots]$. So we have two things:

  • the sum of the elements
  • the division of each element by the sum

What would be the sum of the elements of a string? Characters can't be added, but suppose they could; they might form a string.
Then what would it mean to divide a character by a string?

For other things, such as UGens, the above makes sense more easily.

@smoge
Contributor

smoge commented May 19, 2024

windex checks if all items are numbers:

    for (i = 0; i < size; ++i) {
        err = getIndexedDouble(obj, i, &w);
        if (err)
            return err;
        sum += w;
        if (sum >= r) {
            j = i;
            break;
        }
    }

There is a primitive prArrayNormalizeSum; is it just used for FloatArray?

I'm just surprised because I've never seen .normalizeSum being used in other contexts...

@telephon
Member

Ah yes – I should have read the code. It already does more or less what I thought it should do.

I'm just surprised because I've never seen .normalizeSum being used in other contexts...

Well, I suppose the point is never what we have seen, but what is made possible.

But windex currently throws errors and is rather "conservative". We use TWindex where UGens are concerned, which is more efficient than directly building the graph.
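
For context, the server-side TWindex UGen already exposes this trade-off through its normalize argument (a minimal sketch; see the TWindex help file for the exact argument defaults):

(
{
    var trig = Impulse.kr(8);
    // normalize: 1 lets the UGen rescale the weights itself
    var index = TWindex.kr(trig, [100, 200, 300], normalize: 1);
    SinOsc.ar(Select.kr(index, [300, 500, 800])) * 0.1
}.play;
)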

@telephon
Member

telephon commented May 20, 2024

So @Asmatzaile – would you like to open another issue or rename this one? We should not propose to change windex and wchoose, but instead to add two new methods.

@capital-G
Contributor

So @Asmatzaile – would you like to open another issue or rename this one? We should not propose to change windex and wchoose, but instead to add two new methods.

I think we should actually replace the existing functions with implicitly scaling ones. I think implicit scaling is more user-friendly. Arrays that don't sum to 1 aren't probability distributions in the first place, and it's still unclear to me what sclang actually does with them. Instead of unclear behavior or throwing an error, the function could recover into well-defined behavior by scaling.
Of course, we could add an argument to turn off automatic scaling (but for what purpose exactly? Is checking that the sum of an array is 1 so expensive? If it doesn't add up to 1, wchoose still won't work as intended).

If we had a proper logging system, we could also emit a warning so that implicit scaling gets noticed.

@smoge
Contributor

smoge commented May 20, 2024

Is checking that the sum of an array is 1 so expensive?

In sclang, checking this would be quite expensive for large lists (I'm not sure, but I think it was something like 15 times slower). That's why I tried to compare it with C++.

The only special case I'm not sure how to handle is a list that sums to zero, which would cause a division by zero here, and how to choose tolerances for small versus very large lists (because of floating-point error). Especially since negative numbers are allowed, which is not the normal usage for normalizeSum.

Of course, real benchmarks would need to run inside sclang.

How would you deal with this? Glad to receive feedback.

#include <algorithm>
#include <limits>
#include <numeric>
#include <stdexcept>
#include <vector>

void normalize_array(std::vector<double>& arr) {
  const double sum = std::accumulate(arr.begin(), arr.end(), 0.0);
  if (sum <= std::numeric_limits<double>::epsilon()) {
    throw std::runtime_error("Cannot normalize array: sum too close to zero.");
  }
  const double inv_sum = 1.0 / sum;
  std::transform(arr.begin(), arr.end(), arr.begin(), [inv_sum](double val) { return val * inv_sum; });
}

The benchmark I posted before added together the time to generate the array and the time to normalizeSum it. Benchmarking just normalizeSum confirms it is a linear, $O(n)$ operation. This output separates the two:

Size: 1000
Normalization time: 1.593e-06s
Size: 10000
Normalization time: 1.5404e-05s
Size: 100000
Normalization time: 0.000155685s
Size: 1000000
Normalization time: 0.00181367s

@telephon
Member

My issue is that it takes away the possibility of efficiently calling wchoose on a huge array (or at least that it wastes effort). This is why I proposed separate methods.

you could maybe benchmark with this?

(
var n = 10000;
var weights = Array.fill(n, { |i| i.squared.rand }).normalizeSum;
var freqs = Array.fill(n, { exprand(300, 8000) });
bench { 1000.do { freqs.wchoose(weights) } };
)

(
var n = 10000;
var weights = Array.fill(n, { |i| i.squared.rand });
var freqs = Array.fill(n, { exprand(300, 8000) });
bench { 1000.do { freqs.wchoose(weights.normalizeSum) } };
)

@Asmatzaile
Author

Asmatzaile commented May 21, 2024

I am trying to find a discussion around the original implementation of the wchoose and windex functions but it seems like they were there from very early on.
FWIW, some discussion about negative numbers and random.choice in python [edit: here python/cpython#81805]

Are there any use cases where the user already has a normalized list without calling .normalizeSum? If not, I really don't see the value of keeping wchoose as it is: it's not a breaking change for older programs, and it's a better API for newer ones. And for users who only want to normalize once and store the normalized array for efficiency, adding an optional argument to skip normalization seems like a good idea.

@telephon
Member

Are there any use cases where the user already has a normalized list without calling .normalizeSum?

Yes, it's the normal case, where you just write out a normalized array, or you calculate it by hand and paste it in. This is at least what I have seen most; of course we don't know what is out there.

But while, from the perspective of practical use, I'd love to have a normalising wchoose, I would still prefer not to touch it and just add a letter to the end of the method name when I want it normalized. Then not everyone has to pay for a feature that isn't needed in every case.

@cdbzb
Contributor

cdbzb commented May 21, 2024

Perhaps a good compromise would be to add a method for those users who don't want the array normalized. I think the best default behavior is the one that is least likely to produce unexpected results. A sophisticated user seeking extra performance could reach for windexFast or somesuch.

@telephon
Member

Just to be clear, we mean a method for those users who don't want the array normalized again for each new random value.

So the choice could be either:

  1. wchoose becomes normalising, and a non-normalising wchoose1 is added
  2. wchoose stays unchanged, and the normalising wchooseN is added

To decide, benchmarks would be needed, I think, because wchoose is in heavy use already.

@Asmatzaile
Author

How about adding optional arguments to .choose: (weights = nil, n_weights = nil)?
When n_weights != nil, it could call wchoose(n_weights); else if weights != nil, wchoose(weights.normalizeSum); and otherwise, the usual choose behavior.
That would keep code that uses wchoose at the same good performance.
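
Roughly, the dispatch inside SequenceableCollection could look like this (an illustrative sketch only, not the linked PR):

choose { |weights, n_weights|
    if(n_weights.notNil) { ^this.wchoose(n_weights) };           // weights assumed to sum to 1
    if(weights.notNil) { ^this.wchoose(weights.normalizeSum) };  // relative weights
    ^this.at(this.size.rand)                                     // unweighted, as before
}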

@cdbzb
Contributor

cdbzb commented May 22, 2024

I like this. One less method to have to remember.

@telephon
Member

Interesting idea, a little unusual, but possible.
You'd then write
.wchoose(n_weights:[1, 2, 3])
instead of
.wchooseN([1, 2, 3])
right?

@Asmatzaile
Author

Asmatzaile commented May 23, 2024

What I'm suggesting now is to keep .wchoose as it is and change .choose instead.
[\one, \two, \three].choose would work as it does now.
But one could also do [\one, \two, \three].choose([1, 2, 3]) for relative weights, and the equivalent [\one, \two, \three].choose(n_weights: [1/6, 1/3, 0.5]) for normalized weights.

@telephon
Member

telephon commented May 23, 2024

OK, that is perhaps really the best solution. But then, why not just pass a single array that will be normalised, and use wchoose if you want to be efficient? Semantically it is slightly bent, but it is not uncommon that the most-used method is easy but comes at a small cost.

(it is easier to prepend a w to choose than to pass a keyword argument n_weights)
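
The two call styles would then read (illustrative):

[\red, \green, \blue].choose([100, 200, 300]);                 // relative weights, normalised per call
[\red, \green, \blue].wchoose([100, 200, 300].normalizeSum);   // normalise once yourself for efficiency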

@Asmatzaile
Author

Cool! Will make a PR later.

Asmatzaile linked a pull request (#6322) on May 23, 2024 that will close this issue.