
Ideas for some strategy optimizations #3932

Open · hgoldstein95 opened this issue Mar 21, 2024 · 6 comments
Labels: enhancement (it's not broken, but we want it to be better) · performance (go faster! use less memory!)

Comments

@hgoldstein95 (Contributor)

I've been collecting feedback using Tyche lately, and my users have pointed out a few built-in strategies that seem like they may present opportunities for optimization.

Lists

The lists strategy tends to give up fairly frequently, at least relative to what seems normal for a totally vanilla strategy. Running

from hypothesis import given
from hypothesis.strategies import integers, lists

@given(lists(integers()))
def test_noop_lists(l):
    pass

reliably results in around 5% discards, which isn't a ton, but that inefficiency propagates to everything else that uses lists (which is a huge proportion of strategies used in practice, I'd wager). If it's possible to fix, I think it's probably worth it.

Timezone Keys

The timezone keys strategy produces a lot of duplicates, especially when allow_prefix=False.

@given(timezone_keys(allow_prefix=False))
def test_noop_tzkeys(tzk):
    pass

produces around 30% duplicates. Not sure if this one is super fixable, since I think there are only 1024 keys, but it still seems like we could do better.

This suggests a potentially larger design conversation—if someone runs something like

@given(timezone_keys(allow_prefix=False))
@settings(max_examples=1024)
def test_noop_tzkeys(tzk):
    pass

we should theoretically be able to switch to enumeration 100% of the time. This can be a huge optimization in some cases and catch big issues in others. It's easy to estimate the support of a strategy (or at least, I think it should be in many cases), so it may be cool to have a rule that says strategies should exhaustively enumerate if we estimate that the support is smaller than max_examples.
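To make the idea concrete, here's a rough sketch in plain Python (not real Hypothesis internals; run_property and its arguments are hypothetical stand-ins for whatever the engine would actually use):

import random

def run_property(values, max_examples, test):
    if len(values) <= max_examples:
        # The support fits within the budget: enumerate every value exactly once.
        for v in values:
            test(v)
    else:
        # Otherwise fall back to random generation as usual.
        for _ in range(max_examples):
            test(random.choice(values))

# With the ~1024 timezone keys above and max_examples=1024, the first branch
# would cover every key with zero duplicates.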

@tybug (Member) commented Mar 21, 2024

Those discards are actually mostly strategy-independent, and come from our logic of capping the size of early inputs. For roughly the first 10% of inputs, if we generate something too large, we throw it away. This shows up as overruns because we cut generation short by setting a low max_length on the ConjectureData:

max_length = min(len(prefix) + minimal_extension * 10, BUFFER_SIZE)

I don't know whether this behavior is ideal, but thankfully it at least won't compound in the way you were afraid of 😅. Some reasoning behind the change here: #2219.
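To picture how that produces discards, here's a toy model (not the actual engine code; every name and number below is made up for illustration):

import random

MAX_EXAMPLES = 100
EARLY_FRACTION = 0.1   # the first ~10% of inputs get the tight cap
TIGHT_CAP = 64         # stand-in for the low max_length
LOOSE_CAP = 8 * 1024   # stand-in for BUFFER_SIZE

def generated_size():
    # Pretend each generated list costs some random number of "bytes".
    return sum(random.randint(4, 16) for _ in range(random.randint(0, 20)))

discards = 0
for i in range(MAX_EXAMPLES):
    cap = TIGHT_CAP if i < EARLY_FRACTION * MAX_EXAMPLES else LOOSE_CAP
    if generated_size() > cap:
        discards += 1   # counted as an overrun, i.e. a discard

print(f"{discards} discards out of {MAX_EXAMPLES} inputs")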

we should theoretically be able to switch to enumeration 100% of the time.

We actually already try to enumerate all the time! More specifically, we track what inputs we previously generated, and avoid generating them again. If we exhaust this choice tree (called DataTree internally), we terminate early.

This deduplication used to happen at the bitstream level, but now happens at the ir level — one of the big selling points of the ir (#3921). For example, [True, 2, True, 5, False] represents the list [2, 5] from st.lists(st.integers()), where the boolean draws are the lists strategy asking whether it should generate another element.
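A minimal sketch of that decoding (not the real DataTree code, just the shape of it):

def decode_list_ir(draws):
    # Each boolean draw asks "generate another element?"; each integer draw
    # is the next element of the list.
    draws = iter(draws)
    result = []
    while next(draws):
        result.append(next(draws))
    return result

assert decode_list_ir([True, 2, True, 5, False]) == [2, 5]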

This deduplication is better than it used to be: while neither the bitstream ↦ input map nor the ir ↦ input map is injective, the latter is much closer to injective than the former. But we're still in the midst of the ir migration and haven't seen the full benefits yet. My intuition is that timezone_keys — which is basically sampled_from internally — should be perfectly tracked by the ir... but I haven't looked at the details of sampled_from to know for sure.

It's great to have an example of a strategy, timezone_keys(allow_prefix=False), which should have good deduplication, but doesn't — thanks!

Here's a neat illustration of our duplication tracking:

from collections import Counter

from hypothesis import given, settings
import hypothesis.strategies as st

seen = []

@given(st.integers(0, 50))
@settings(database=None)
def f(n):
    seen.append(n)

f()
print(len(set(seen)))
print(Counter(seen))
51
Counter({49: 3, 10: 3, 26: 3, 44: 2, 29: 2, 41: 2, 16: 2, 15: 2, 28: 2, 24: 2, 0: 1, 5: 1, 43: 1, 45: 1, 27: 1, 47: 1, 39: 1, 3: 1, 2: 1, 9: 1, 8: 1, 17: 1, 4: 1, 42: 1, 13: 1, 35: 1, 23: 1, 32: 1, 31: 1, 18: 1, 1: 1, 50: 1, 38: 1, 33: 1, 21: 1, 7: 1, 19: 1, 22: 1, 12: 1, 40: 1, 20: 1, 25: 1, 48: 1, 37: 1, 11: 1, 6: 1, 14: 1, 30: 1, 46: 1, 36: 1, 34: 1})

My guess is 49 is duplicated because we upweight near-endpoints on integers, outside of the ir layer (something I'm looking to improve!). I actually have no clue why other numbers like 10 or 26 are overrepresented, though. Something to look into...

@Zac-HD (Member) commented Mar 21, 2024

timezone_keys(allow_prefix=False) is literally just a sampled_from(), which should be bijective with the IR via draw_integer() if there are no filters on it. Weird 🤔

More generally, thanks for the report, and thanks Liam for covering what I'd say! I suspect that a bunch of this will go away naturally by the time we finish migrating to the IR, but we should definitely track such issues to make sure we fix them.

@Zac-HD added the performance (go faster! use less memory!) and enhancement (it's not broken, but we want it to be better) labels on Mar 21, 2024
@hgoldstein95 (Contributor, Author)

Thanks for all of that context, both of you!

One thing I want to clarify: when I say "enumeration" I mean something slightly different from "remembering what we've generated and not generating that again". If you know ahead of time that you'll want every possible value from a strategy, you should be able to (modulo some implementation details) just list them off, one by one, without making random choices at all. This is much more efficient and effective than generating randomly and remembering previously generated values: if you had 1000 possible values, it'd take 1000 samples to enumerate all of them, but roughly 7000 in expectation if you were generating randomly (the classic coupon-collector bound, assuming a uniform distribution, which Hypothesis doesn't provide). Does that make sense? Or did I misunderstand what you were getting at?
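Here's the arithmetic behind that figure, just as an illustration (this isn't anything Hypothesis computes): the coupon-collector expectation for seeing all n values under uniform sampling is n * H_n.

n = 1000
harmonic = sum(1 / k for k in range(1, n + 1))
print(round(n * harmonic))  # ~7485 random draws in expectation, vs exactly 1000 by enumeration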

Based on what you're both saying, it seems like the timezone_keys issue is an unexpected one, whereas lists is expected, if a little unfortunate. Does that seem right? Should I keep submitting reports like these as they come up?

@tybug (Member) commented Mar 21, 2024

if you had 1000 possible values, it'd take 1000 samples to enumerate all of them but ~7000 in expectation if you were generating randomly

This makes sense! Viewed through this lens, Hypothesis only sort of enumerates. When generating a new value (DataTree.generate_novel_prefix internally), we try rejection sampling first: given a draw_integer(0, 1000) ir node, we randomly and uniformly generate a value in [0, 1000], resampling as necessary. If we ran this to completion, our EV is what you stated. But what we actually do is rejection sample 10 times, and if all 10 fail, exhaustively enumerate (to balance memory and randomness, we enumerate only the first 100 children and then sample from those). The intuition here is to try hard to satisfy the natural requested distribution, but fall back to enumeration if we're at the tail end and things get too difficult. I dunno what the EV of this is, but it feels around 2n? Not great, but not as bad as it could be, and it avoids the tail-end pitfall. 10 retries was a somewhat arbitrary choice here, fwiw.
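Roughly, in plain Python (not the real generate_novel_prefix; the names and limits here are placeholders for the behaviour described above):

import random

def novel_value(seen, lo=0, hi=1000, retries=10, enumerate_limit=100):
    # First, rejection sample: draw uniformly and hope we hit something new.
    for _ in range(retries):
        v = random.randint(lo, hi)
        if v not in seen:
            return v
    # All retries collided; fall back to (partially) enumerating unseen children.
    unseen = [v for v in range(lo, hi + 1) if v not in seen][:enumerate_limit]
    return random.choice(unseen) if unseen else None

seen = set()
while (v := novel_value(seen)) is not None:
    seen.add(v)
print(len(seen))  # 1001: every value covered, without unbounded resampling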

We could potentially take max_examples into account when deciding when to enumerate...but I would be cautious of doing so. We use our max example budget for things other than generating novel elements, like optimizing target(), replaying from the databases, or mutating old inputs. Maybe you would argue we shouldn't be doing any of those things if we could exhaustively enumerate instead?

Based on what you're both saying, it seems like the timezone_keys issue is an unexpected one, whereas lists is expected, if a little unfortunate. Does that seem right? Should I keep submitting reports like these as they come up?

That would be my analysis 👍. And please do keep submitting reports of strategy inefficiencies, whether in discards/overruns or in duplicates! Some of them probably would have been improved in time anyway, but I suspect a fair number of strategies will require manual tweaking even after the ir, either because they were always inefficient and nobody noticed, or because we intentionally chose an inefficient generation scheme for, e.g., good shrinking behavior, a constraint the ir has since alleviated.

@hgoldstein95 (Contributor, Author)

Yeah, I think I would argue that if we can conservatively conclude that exhaustive enumeration would require fewer inputs than our limit, we should just fall back to that — I don't see a downside as long as that estimate is accurate.

@tybug (Member) commented Apr 26, 2024

An update: timezone_keys duplicates are from generate_mutations_from, which still operates on sub-ir examples and bypasses datatree-based duplicate detection. This should be fixed once we migrate generate_mutations_from to the ir, which is in turn easier once we finish the shrinker migration (pr soon!) and remove sub-ir examples.

I expect lots of other strategies are similarly over-duplicated due to this, including vanilla integers.
