Add SPEC7: Seeding pseudo-random number generation #180
That's a good idea to try to uniformize all that 👍
(I suppose you meant SPEC and not NEP.)
spec-0007/index.md
Outdated
1. Because `np.random.seed` is so often used in practice, no seed means
   using the global `RandomState` object, `np.random.mtrand._rand`.
2. (Option a) When a seed is provided, a `RandomState` object is initialized with that seed.
3. (Option b) When a seed is provided, a `Generator` object is initialized with that seed.
4. If an instance of `RandomState` is provided, it is used as-is.
5. If an instance of `Generator` is provided, it is used as-is.
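As a rough illustration of the rules listed above, the sketch below normalizes a `seed` argument. The helper name `normalize_seed` and the `use_generator` switch (standing in for the choice between option a and option b) are hypothetical and not part of the SPEC; `np.random.mtrand._rand` is a private NumPy attribute, so relying on it is an assumption about NumPy internals.

```python
import numpy as np


def normalize_seed(seed=None, use_generator=False):
    """Hypothetical helper, not part of any library API."""
    if seed is None:
        # Rule 1: no seed means the global RandomState (private attribute).
        return np.random.mtrand._rand
    if isinstance(seed, (np.random.RandomState, np.random.Generator)):
        # Rules 4 and 5: existing instances are used as-is.
        return seed
    if use_generator:
        # Rule 3 (option b): seed a new Generator.
        return np.random.default_rng(seed)
    # Rule 2 (option a): seed a new RandomState.
    return np.random.RandomState(seed)
```

Note that options a and b produce different streams for the same integer seed, which is part of what the discussion below is about.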
This is describing the current state in some libraries. But is it where we want to see this 10 years from now?
I am personally against any global state and advertising of any "legacy" behaviours.
I also feel we may want to think about a new keyword argument that adopts recommended best practices instead.
I am not sure I see how that could help.
If you add a `rng` to a function which has `seed` or `random_state`, we don't avoid raising some warning about deprecation, etc.
The big problem with `random_state` is that it allows for `None`, which then grabs global state. So, that will always conflict with an `rng=None` kwarg.
To me, the ideal API (using today's tools at least) would be that `random_state=None` would give you `np.random.default_rng()`.
There is also the crazy thought, which I kind of like, from @ilayn: do not accept integers, only a `Generator` (or other object). Point being, you must provide an RNG if you want any reproducibility.
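A minimal sketch of the two ideas in this comment, using only `np.random.default_rng` and `np.random.Generator` from NumPy. The function names `as_generator` and `require_generator` are hypothetical, chosen here for illustration.

```python
import numpy as np


def as_generator(rng=None):
    # None -> fresh Generator seeded from OS entropy (no global state),
    # int  -> Generator seeded with that integer,
    # Generator -> returned unaltered by default_rng.
    return np.random.default_rng(rng)


def require_generator(rng):
    # Stricter variant: integers and None are rejected, so reproducibility
    # always requires the caller to construct and pass a Generator.
    if not isinstance(rng, np.random.Generator):
        raise TypeError("rng must be a numpy.random.Generator instance")
    return rng
```

The first form keeps integer seeds convenient; the second makes the caller's intent explicit at the cost of backward compatibility, which is the trade-off raised in the replies.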
I don't think we can do that, though, because it would be a backward incompatible change?
Accepting only Generator objects could work, but we still have to deal with None.
Both integer and global state behaviour are BC issues yep.
It would be interesting to see how painful it actually was, in real life, to move to `Generator`, taking into account a large sample of projects, folks, etc.
i.e. sometimes I feel like we are concerning ourselves too much with BC, while for users it might be super easy and accepted to make the change. It's mostly a communication issue to me. I keep taking my backend example, but there they do break (intentionally or not) production code all the time. There are complaints, yep, but it's mostly OK, the ball is rolling, and these projects are loved and praised (FastAPI is known for doing this often and is seen as THE thing).
Yes, and don't expect the same favors from commercial code. It's shackles that we choose to put on, sometimes at great time cost to (oftentimes volunteer) developers.
There's a deprecation strategy that can work to migrate from
To make sure I understand, this will change the return values of functions (
Yes, it's a deprecation strategy, not a backwards-compatibility-preserving strategy.
@stefanv thanks for starting to summarize that long and complex discussion!
Can you please elaborate on this? It's not all that obvious, because when you're not seeding the first intuition I'd have is "I am not expecting specific results, only random numbers with a given distribution". Since you're kinda steering towards a large amount of churn due to changing names here, I think it's important to be specific under what circumstances there is a backwards compatibility impact. I guess the point here is:
And then there's the question whether these scenarios matter. It may impact exact reproducibility of some scientific result. However that reproducibility was only ever guaranteed when using the same version of the same libraries on the same hardware. I'd suggest finding the most compelling scenario here, that makes it as easy as possible to say that that's not acceptable, and hence we must change from
The deprecation strategy I outlined does imply a change in semantics of the affected functions above and beyond the change in the precise numbers that come out of them. There are plenty of programs (using
Right, and my take was that this is a desired outcome from our perspective.
By far the most common use of seeding is to fix test suites. Most of those will keep running as-is. The failures that arise will be legitimate failures, and could be fixed by playing with the seed, or by making the underlying code more robust.
I am willing to spend time on the rewriting. The test suite is seriously out-of-date in many places anyways. You can even smell the year just by reading the comments.
I've made more explicit the points you mentioned, Ralf. It may benefit from fleshing out even further as we continue to evolve the document. I don't want to tighten things up before we've agreed on a pathway forward!
Yes, I think so. But I interpreted Ralf's question as whether it was really necessary to go through a deprecation and a name churn to do this instead of just changing what
Yes, I think that's saying exactly the same thing I was saying in my bullet points higher up. I would add that library code doing this is already broken, because it's not robust to (for example) the end user using
Why don't we emit a warning from numpy random.seed?
Because it will create utter havoc in the many valid uses in test suites?
Isn't that what you want, eventually?
Phrased differently: once we deprecate global seeding for the ecosystem, what would be the use of np.random.seed?
Yes. There are plenty of ML programs (in particular) that call
One enormous hurdle at a time, please. 😉
Fair enough :)
There is no plan to do so. Deprecating
Yes, okay - I agree, this summary and rationale is enough to explain why we cannot stay with That also means that item (a) of my reasoning in scipy/scipy#14322 (comment) is not "in the same ballpark" and hence it seems clear now that we should prefer
I think that would not prevent us from having a user warning. Average users don't read docs and keep copy-pasting old code until "something" gets in their way. So until it's visible in their code that something is legacy, they will keep using it, I am afraid. Also, reading NEP 19, it's really not clear to me that the global state would not change. The fate of
There seems to be some vague consensus around the deprecation approach. I don't want to run things ahead, but at the same time scikit-image has to make a calculated guess of what to do for its forthcoming release. So, without holding anyone to the fire, I will propose that we make the I would appreciate it if those involved in the discussion would co-author this SPEC (whether by adding your name to the authors list, or by helping to clarify language). If you want to keep a safe distance, advice on how to solidify the thrust of the argument further would also be welcome. Thanks!
Use `rng` consistently, replacing `random_state` and `seed`. See also scientific-python/specs#180
FWIW, I don't think rng will be hard to teach: users will read it once and get it, especially if we make it a common pattern across the ecosystem:
Rng, the thing we want as input, is already enshrined in the NumPy function name too. prng would have been slightly better from an educational point of view.
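One way the rename to `rng` might look in library code is sketched below. This is not the SPEC's actual decorator: the name `accept_rng`, the warning text, and the example function `jitter` are all hypothetical, and the sketch assumes `rng` is always passed by keyword. It also does not handle legacy `RandomState` instances passed through the old names.

```python
import functools
import warnings

import numpy as np


def accept_rng(func):
    """Hypothetical decorator: accept `rng`, warn on the old keyword names."""

    @functools.wraps(func)
    def wrapper(*args, rng=None, **kwargs):
        for old_name in ("random_state", "seed"):
            if old_name in kwargs:
                if rng is not None:
                    raise TypeError(f"pass either `rng` or `{old_name}`, not both")
                warnings.warn(
                    f"`{old_name}` is deprecated; use `rng` instead",
                    FutureWarning,
                    stacklevel=2,
                )
                rng = kwargs.pop(old_name)
        # Normalize None, integers, and Generator instances to a Generator.
        return func(*args, rng=np.random.default_rng(rng), **kwargs)

    return wrapper


@accept_rng
def jitter(x, rng=None):
    # Illustrative function: add standard-normal noise to `x`.
    return x + rng.normal(size=np.shape(x))
```

With this sketch, `jitter(x, seed=1)` still works but emits a `FutureWarning`, while `jitter(x, rng=1)` is silent. An integer passed through an old name is handed to `default_rng`, so the resulting numbers differ from what a `RandomState`-based implementation would have produced.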
Ah, sorry Robert, didn't see your message there.
For what it is worth, I think
I went over the current document and strategy. I am +1.
I don't like the name 😅 but yes this is something users will eventually get with the warnings; and being written all over in NumPy's doc is helping us to sell it.
The array API discussion about random number generation and the differences between the NumPy-style and JAX-style APIs may or may not be relevant here data-apis/array-api#431. I don't know if it matters for seeding specifically, but I also know a lot of the same people on that discussion are already on this one.
Thanks for this effort! It might be useful to add a Motivation and/or Context section to this spec. I think it makes it a bit clearer why this spec was created and where it came from.

### Concepts

- `BitGenerator`: Generates a stream of pseudo-random bits. The default generator in NumPy (`np.random.default_rng`) uses PCG64.
For simplicity, can this reference to `BitGenerator` be removed? For this SPEC, the user-facing API is around `Generator` and `RandomState`.
spec-0007/index.md
Outdated

(1) it avoids naïve seeding strategies, such as using successive integers, via the underlying [SeedSequence](https://numpy.org/doc/stable/reference/random/parallel.html#seedsequence-spawning);
(2) it avoids using global state for seeding.
Can this be written in a more positive light?
- Spawns generators for parallel random number generation.
- `np.random.default_rng(None)` uses unpredictable entropy pulled from the OS. There is no global state for seeding.
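The two points in this suggestion can be illustrated with a short sketch. It uses `np.random.SeedSequence.spawn` and `np.random.default_rng` from NumPy; the root seed value and the number of child streams are arbitrary choices made for the example.

```python
import numpy as np

# Spawn independent child seeds from one root seed, rather than using
# successive integers such as 0, 1, 2 as per-worker seeds.
root = np.random.SeedSequence(2023)
children = root.spawn(3)
streams = [np.random.default_rng(child) for child in children]
draws = [g.random(4) for g in streams]

# Re-creating the same root reproduces the same child streams.
again = [
    np.random.default_rng(child).random(4)
    for child in np.random.SeedSequence(2023).spawn(3)
]

# With no seed, entropy is pulled from the OS; no global state is involved.
unseeded = np.random.default_rng(None)
```

Each entry of `streams` could be handed to a separate worker, and the whole set remains reproducible from the single root seed.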
(1) it avoids naïve seeding strategies, such as using successive integers, via the underlying [SeedSequence](https://numpy.org/doc/stable/reference/random/parallel.html#seedsequence-spawning);
(2) it avoids using global state for seeding.
(1) it allows safe and powerful parallel streams, via the underlying [SeedSequence](https://numpy.org/doc/stable/reference/random/parallel.html#seedsequence-spawning);
(2) it avoids creating implicit global state.
Not sure how to rewrite (2) in more positive wording. "for seeding" was a distractor there. We're really referring to the implicit global state of `np.random.mtrand._rand` that is used by the `np.random.random()` et al. convenience functions, and the benefit of the new framework is really that we just don't do that anymore. It fundamentally is something that we "don't do" anymore rather than something that we "do better".
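To make the implicit global state concrete, here is a small demonstration using only public NumPy calls. The "hidden" draw stands in for a call made anywhere else in the process, for example inside a library; the seed values are arbitrary.

```python
import numpy as np

# The convenience functions share one hidden, process-wide RandomState.
np.random.seed(42)
first = np.random.random()

np.random.seed(42)
np.random.random()            # a draw made somewhere else in the process
shifted = np.random.random()  # the caller no longer sees `first`

# An explicit Generator carries its own state and ignores np.random.seed.
rng = np.random.default_rng(42)
np.random.seed(0)
a = rng.random()
b = np.random.default_rng(42).random()
```

Here `first` and `shifted` differ even though the same seed was set, while `a` and `b` agree regardless of what happened to the global state in between.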
1. Those who use `np.random.seed`. The proposal will do away with that global seeding mechanism, meaning that code that relies on it will, after a certain deprecation period, start seeing a different stream of random numbers than before.

   Such code will, in effect, go from being seeded to being unseeded.
   To prevent that from happening, the code will have to be modified to explicitly pass in an `rng` argument on each function call.
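The migration described here might look as follows from the user's side. The function `add_noise` is a hypothetical stand-in for a library function that has already moved to the `rng` keyword and normalizes it with `np.random.default_rng`.

```python
import numpy as np


def add_noise(x, rng=None):
    # Hypothetical library function after the deprecation period.
    rng = np.random.default_rng(rng)
    return x + rng.normal(size=np.shape(x))


x = np.zeros(3)

# Old habit: global seeding no longer reaches the function, so these
# two calls are effectively unseeded and give different results.
np.random.seed(0)
unseeded_1 = add_noise(x)
np.random.seed(0)
unseeded_2 = add_noise(x)

# New habit: pass `rng` explicitly on each call for reproducibility.
seeded_1 = add_noise(x, rng=np.random.default_rng(0))
seeded_2 = add_noise(x, rng=np.random.default_rng(0))
```

Only the explicitly seeded pair matches, which is the "seeded to unseeded" transition the text warns about.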
This change should give a `FutureWarning` if the user called `np.random.seed()` at any time. If they did not call `np.random.seed()` then the warning is unnecessary because the stream is not reproducible.
OK, if NumPy can commit to that, that'd be great! I've added it in.
I think I misunderstood what you meant; you meant the decorator should do this check and warn?
Yes, sorry, I tried to fix it a bit in the PR (will update now for the other changes). However, that code does nevertheless reach into NumPy internals, so there is some amount of "commitment" to not break it.
date: 2023-04-19
author:
- "Stéfan van der Walt <[email protected]>"
- Other participants in the discussion <[email protected]>"
- Other participants in the discussion <[email protected]>"
- "Pamphile Roy <[email protected]>"
- "Sebastian Berg <...>"
@seberg 😉
Under discussion at scipy/scipy#14322