Support Series generation with serial dependence #1605
Comments
Pandera strategies are currently quite limited, as you've experienced. The limitation is sort of bounded by the fact that it's leveraging the hypothesis
Yes, so #561 is the issue for improving this in pandera, I just haven't had the time to work on this because it'll pretty much involve a re-write of the pandas_strategy module. I consider this issue, #1220, and #1275 to be problems to be addressed by the re-write (#1275 sounds pretty hard to implement tho, I'd maybe keep that out of the design and rely on docs/recipes on how to generate strategies with a fixed column based on the data generated from another strategy). If you have the time/capacity, would you be able to chime in on #561 with a high-level set of requirements and (ideally) a code sketch of how this might be implemented in pandera? It would involve departing from
From my understanding, we want:
Is your feature request related to a problem? Please describe.
I've been trying - and failing - to generate dataframes with a Pandera strategy that will create a `date` column with values from `pd.date_range()`. I can generate a series via `hypothesis` directly:
However, I can't create a Pandera strategy. The best I could come up with is this:
alternatively:
Neither of the above works, because Pandera assumes the elements are individually generated.
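For comparison, the direct `hypothesis` route mentioned above can be sketched roughly as follows. This is an illustrative sketch, not the snippet from the original report: `make_date_series` is a hypothetical helper, the date bounds and frequencies are arbitrary, and `hs` is assumed to be an alias for `hypothesis.strategies`.

```python
from datetime import date

import hypothesis.strategies as hs  # assumption: `hs` is this alias
import pandas as pd


def make_date_series(start: date, periods: int, freq: str = "D") -> pd.Series:
    """Build a contiguous date series (hypothetical helper for illustration)."""
    index = pd.date_range(start=pd.Timestamp(start), periods=periods, freq=freq)
    return pd.Series(index, name="date")


# The whole series comes from a single call, so each value depends on the
# previous one. Element-wise generation cannot express that.
date_series = hs.builds(
    make_date_series,
    start=hs.dates(min_value=date(2000, 1, 1), max_value=date(2030, 1, 1)),
    periods=hs.integers(min_value=1, max_value=60),
    freq=hs.sampled_from(["D", "W"]),
)
```

The key point is that the strategy draws only the parameters (start, length, frequency) and leaves the serial structure to `pd.date_range()`.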
I have also tried subclassing `pa.Column` to overwrite the `.strategy()` and `.strategy_component()` to return a custom `hs.builds(...)` strategy, but it fails because these are `hypothesis.extra.pandas.impl.column()` passed to `hypothesis.extra.pandas.impl.dataframe()` ... which a custom strategy misses. Oof.
I also ran into #1220 constantly (on 0.18.3), not sure if it's fixed for 0.19.0b3 - didn't check that yet.
Describe the solution you'd like
Ideally, I would like the ability to generate a whole series with a custom function, or at least with the `hs.builds` function.
I've seen #561, which might be the more proper fix. A shorter-term solution would be to allow custom generation in another code path (though with the layers of abstraction, this might be hard to accomplish...).
Since Hypothesis requires the `.dataframe()` to take columns, perhaps any custom columns could be generated alongside it? The custom generator function would have to be given the length of series to generate. More complicated cases would be handled by #561 then.
Describe alternatives you've considered
See above in problem description.
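One further workaround that stays entirely outside pandera is to let `hypothesis` build the independently generated columns and attach the serially dependent column afterwards, once the frame length is known. The sketch below assumes a hypothetical `value` column and arbitrary bounds. It uses only `hypothesis.extra.pandas` and does not touch pandera's strategy machinery.

```python
from datetime import date

import hypothesis.strategies as hs  # assumption: `hs` is this alias
import pandas as pd
from hypothesis.extra.pandas import column, data_frames, range_indexes


@hs.composite
def frames_with_dates(draw, freq: str = "D") -> pd.DataFrame:
    """Draw a frame element-wise, then add a length-aware `date` column."""
    # Independent columns: generated one element at a time, as usual.
    df = draw(
        data_frames(
            columns=[
                column(
                    "value",  # hypothetical column name
                    elements=hs.floats(min_value=0, max_value=100),
                    dtype=float,
                )
            ],
            index=range_indexes(min_size=1, max_size=30),
        )
    )
    # Dependent column: the custom generator receives the series length.
    start = draw(hs.dates(min_value=date(2000, 1, 1), max_value=date(2030, 1, 1)))
    df["date"] = pd.date_range(start=pd.Timestamp(start), periods=len(df), freq=freq)
    return df
```

A frame drawn this way can then be checked against a schema in a property test. The trade-off is that the schema itself no longer drives generation of the `date` column.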
Additional context
Currently, I have a check for data frequency (i.e. if data is daily, weekly, etc.) that I want to generate valid data for.
However, there are more complicated cases, such as ensuring we have ALL dates being contiguous within that frequency (from min to max). Without either this or #561 we can't generate things from the schema.
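To make the contiguity requirement concrete, here is a minimal pandas-only sketch of such a check. `is_contiguous` is a hypothetical name, and the sketch assumes the dates are already aligned to the anchor of the given frequency (for example, weekly data on Sundays for `"W"`).

```python
import pandas as pd


def is_contiguous(dates, freq: str = "D") -> bool:
    """Return True if `dates` covers every step from its min to its max."""
    observed = pd.DatetimeIndex(dates).sort_values()
    if observed.empty:
        return True
    # Every date the frequency implies between the first and last observation.
    expected = pd.date_range(observed[0], observed[-1], freq=freq)
    # Gaps and duplicates both change the length or the values.
    return len(observed) == len(expected) and bool((observed == expected).all())


daily = pd.Series(pd.date_range("2024-01-01", periods=5, freq="D"))
print(is_contiguous(daily))          # True
print(is_contiguous(daily.drop(2)))  # False: 2024-01-03 is missing
```

Data that satisfies this check cannot be produced by drawing each element independently, which is why element-wise strategies fall short here.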
#1275 is also relevant - if I could generate a global column of timestamps (with or without pandera schema), and use that column to be "joinable" with other Pandera-schema-defined dataframes, that would cover most of my use cases as well.
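The "joinable" case can be sketched with plain `hypothesis` by drawing the timestamp column once and reusing it in every frame. This is only an illustration of the idea under assumed column names (`price`, `volume`) and arbitrary bounds, not a pandera feature.

```python
from datetime import date

import hypothesis.strategies as hs  # assumption: `hs` is this alias
import pandas as pd


@hs.composite
def joinable_frames(draw):
    """Draw two frames that share one contiguous `date` column."""
    start = draw(hs.dates(min_value=date(2000, 1, 1), max_value=date(2030, 1, 1)))
    n = draw(hs.integers(min_value=1, max_value=20))
    # Shared key column, generated once for both frames.
    dates = pd.date_range(start=pd.Timestamp(start), periods=n, freq="D")
    prices = draw(
        hs.lists(hs.floats(min_value=0, max_value=1000), min_size=n, max_size=n)
    )
    volumes = draw(
        hs.lists(hs.integers(min_value=0, max_value=10**6), min_size=n, max_size=n)
    )
    left = pd.DataFrame({"date": dates, "price": prices})    # hypothetical columns
    right = pd.DataFrame({"date": dates, "volume": volumes})
    return left, right
```

Because both frames carry the same dates, `left.merge(right, on="date")` keeps every row.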