Context

Note: here is more about terminology.

Definitions (from terminology page)

A partition is a subset of the data corresponding to a given value of the aggregation criterion. Usually we want to aggregate each partition separately. For example, if we count visits to restaurants, the visits to one particular restaurant form a single partition, and the count of visits to that restaurant is the aggregate for that partition.
Public partitions are partition keys that are publicly known and hence don’t leak any user information. An example of public partitions could be the days of the week.
DPEngine.aggregate is the API function that performs DP aggregation. public_partitions is an argument of DPEngine.aggregate(). It can be a Python iterable (when it is small enough to fit in memory and to be efficiently distributed among workers) or a distributed collection (a PCollection for Beam, an RDD for Spark).
In short, public partition selection consists of 2 stages:
1. Filtering out all partition keys which are not in public_partitions (code which does this).
2. Adding "zero" partitions for all elements of public_partitions which are not in the input data (code which does this).
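The two stages above can be sketched in plain Python. This is only an illustrative sketch over in-memory collections (the function name is hypothetical, not PipelineDP's actual API); the real implementation runs these steps over distributed collections.

```python
def select_public_partitions(data, public_partitions):
    """data: iterable of (partition_key, value) pairs.

    Illustrative sketch of public partition selection, not library code.
    """
    public = set(public_partitions)

    # Stage 1: drop every record whose partition key is not public.
    filtered = [(k, v) for k, v in data if k in public]

    # Stage 2: add an empty ("zero") partition for every public key
    # that never appears in the input data.
    present = {k for k, _ in filtered}
    zeros = [(k, 0) for k in public - present]

    return filtered + zeros

visits = [("cafe_a", 1), ("pub_c", 1)]
result = select_public_partitions(visits, ["cafe_a", "cafe_b"])
# "pub_c" is filtered out (stage 1); "cafe_b" gets a zero partition (stage 2).
```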
Downsides of the current state.
Let’s consider the case when partitions are Cartesian products of multiple dimensions, for example (country, date).
The user needs to generate the cross-join themselves: that is an additional step for users, hence more possibilities for bugs, and this cross-join might be very large (resulting in a performance impact).
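For example, with the current state the user has to materialize the full cross-join before passing it as public_partitions (a hypothetical example, not library code):

```python
import itertools

countries = ["US", "DE", "FR"]
dates = ["2022-01-01", "2022-01-02"]

# The user must build the full product themselves. 3 countries x 2 dates
# is only 6 keys, but with realistic dimensions (all countries x a year
# of dates) this product can easily reach millions of elements.
public_partitions = list(itertools.product(countries, dates))
```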
What can be done better?
The user can specify the values of each dimension, and PipelineDP can do the join internally: this would be easier for users, and it can be done much more efficiently from a performance point of view inside PipelineDP.
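One reason an internal join can be more efficient: membership in the product can be checked per dimension, without ever materializing the cross-join. A minimal sketch, with hypothetical names:

```python
def is_public(key, dimension_values):
    """key: a tuple like (country, date);
    dimension_values: one set of allowed values per dimension.

    A key is in the product iff each component is in its dimension's set,
    so no cross-join needs to be built.
    """
    return all(part in values for part, values in zip(key, dimension_values))

dims = [{"US", "DE"}, {"2022-01-01", "2022-01-02"}]
ok = is_public(("US", "2022-01-01"), dims)        # in the product
dropped = is_public(("BR", "2022-01-01"), dims)   # country not allowed
```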
Goals
Allow specifying public_partitions as a product of different dimensions.

Steps to implement (this might be split into several PRs):
1. Devise a nice API for specifying public_partitions in the arguments of DPEngine.aggregate (a separate argument, or maybe some class object which specifies the product). To begin with, we can assume that dimension values are Python iterables.
2. Implement stages 1 & 2 of the public_partitions algorithm (see the section above).
3. Propagate these public partitions to all places where they are used (e.g. in the Beam API and in the Spark API).
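For step 1, one possible shape of the "class object which specifies the product" could look like the following. Everything here (the class name, its fields, how aggregate would accept it) is purely hypothetical, for discussion only:

```python
from dataclasses import dataclass
from typing import Iterable, Sequence
import itertools

@dataclass
class PartitionsProduct:
    """Describes public partitions as a product of per-dimension iterables.

    Hypothetical API sketch; not PipelineDP's actual interface.
    """
    dimensions: Sequence[Iterable]

    def expand(self):
        # Materializing is only reasonable for small products; a
        # distributed backend could instead join dimension by dimension.
        return itertools.product(*self.dimensions)

# Possible call site (hypothetical):
# dp_engine.aggregate(col, params,
#                     public_partitions=PartitionsProduct([countries, dates]))
prod = PartitionsProduct([["US", "DE"], ["2022-01-01"]])
expanded = list(prod.expand())
```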