Context

Note: here is more about terminology.

Definitions (from terminology page)

A partition is a subset of the data corresponding to a given value of the aggregation criterion. Usually we want to aggregate each partition separately. For example, if we count visits to restaurants, the visits to one particular restaurant form a single partition, and the count of visits to that restaurant is the aggregate for that partition.
Public partitions are partition keys that are publicly known and hence don’t leak any user information. An example of public partitions could be the days of the week.
DPEngine.aggregate is the API function that performs DP aggregation. public_partitions is an argument of DPEngine.aggregate(). It can be a Python iterable (when it is small enough to fit in memory and to be efficiently distributed among workers) or a distributed collection (a PCollection for Beam, an RDD for Spark).
In short, public partition selection consists of 2 stages:
1. Filtering out all partition keys which are not in public_partitions (code which does this).
2. Adding "zero" partitions for all elements of public_partitions which are not in the input data (code which does this).
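The two stages above can be sketched in plain Python. This is only an illustrative sketch over in-memory collections (the function name is hypothetical, not PipelineDP's actual API); the real implementation runs these steps over distributed collections.

```python
def select_public_partitions(data, public_partitions):
    """data: iterable of (partition_key, value) pairs.

    Illustrative sketch of public partition selection, not library code.
    """
    public = set(public_partitions)

    # Stage 1: drop every record whose partition key is not public.
    filtered = [(k, v) for k, v in data if k in public]

    # Stage 2: add an empty ("zero") partition for every public key
    # that never appears in the input data.
    present = {k for k, _ in filtered}
    zeros = [(k, 0) for k in public - present]

    return filtered + zeros

visits = [("cafe_a", 1), ("pub_c", 1)]
result = select_public_partitions(visits, ["cafe_a", "cafe_b"])
# "pub_c" is filtered out (stage 1); "cafe_b" gets a zero partition (stage 2).
```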
Downsides of the current state.
Let’s consider the case when partitions are Cartesian products of multiple dimensions, for example (country, date).
The user needs to generate the cross-join themselves: that is an additional step for users, hence more possibilities for bugs, and this cross-join might be very large (resulting in a performance impact).
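For example, with the current state the user has to materialize the full cross-join before passing it as public_partitions (a hypothetical example, not library code):

```python
import itertools

countries = ["US", "DE", "FR"]
dates = ["2022-01-01", "2022-01-02"]

# The user must build the full product themselves. 3 countries x 2 dates
# is only 6 keys, but with realistic dimensions (all countries x a year
# of dates) this product can easily reach millions of elements.
public_partitions = list(itertools.product(countries, dates))
```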
What can be done better?
The user can specify the values of each dimension, and PipelineDP can do the join internally: this would be easier for users, and it can be done much more efficiently from a performance point of view inside PipelineDP.
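One reason an internal join can be more efficient: membership in the product can be checked per dimension, without ever materializing the cross-join. A minimal sketch, with hypothetical names:

```python
def is_public(key, dimension_values):
    """key: a tuple like (country, date);
    dimension_values: one set of allowed values per dimension.

    A key is in the product iff each component is in its dimension's set,
    so no cross-join needs to be built.
    """
    return all(part in values for part, values in zip(key, dimension_values))

dims = [{"US", "DE"}, {"2022-01-01", "2022-01-02"}]
ok = is_public(("US", "2022-01-01"), dims)        # in the product
dropped = is_public(("BR", "2022-01-01"), dims)   # country not allowed
```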
Goals
Allow specifying public_partitions as a product of different dimensions.

Steps to implement (this might be split into several PRs):
1. Devise a nice API for specifying public_partitions in the arguments of DPEngine.aggregate (a separate argument, or maybe some class object which specifies the product). To begin with, we can assume that dimension values are Python iterables.
2. Implement stages 1 & 2 of the public_partitions algorithm (see the section above).
3. Propagate these public partitions to all places where they are used (e.g. in the Beam API and in the Spark API).
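For step 1, one possible shape of the "class object which specifies the product" could look like the following. Everything here (the class name, its fields, how aggregate would accept it) is purely hypothetical, for discussion only:

```python
from dataclasses import dataclass
from typing import Iterable, Sequence
import itertools

@dataclass
class PartitionsProduct:
    """Describes public partitions as a product of per-dimension iterables.

    Hypothetical API sketch; not PipelineDP's actual interface.
    """
    dimensions: Sequence[Iterable]

    def expand(self):
        # Materializing is only reasonable for small products; a
        # distributed backend could instead join dimension by dimension.
        return itertools.product(*self.dimensions)

# Possible call site (hypothetical):
# dp_engine.aggregate(col, params,
#                     public_partitions=PartitionsProduct([countries, dates]))
prod = PartitionsProduct([["US", "DE"], ["2022-01-01"]])
expanded = list(prod.expand())
```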