FEAT-#5394: Reduce amount of remote calls for TreeReduce and GroupByReduce operators #7245
Changes from 2 commits
@@ -29,6 +29,7 @@
 from modin.config import (
     BenchmarkMode,
+    CpuCount,
     Engine,
     MinPartitionSize,
     NPartitions,
@@ -621,20 +622,53 @@ def map_partitions(
         NumPy array
             An array of partitions
         """
-        preprocessed_map_func = cls.preprocess_func(map_func)
-        return np.array(
-            [
-                [
-                    part.apply(
-                        preprocessed_map_func,
-                        *func_args if func_args is not None else (),
-                        **func_kwargs if func_kwargs is not None else {},
-                    )
-                    for part in row_of_parts
-                ]
-                for row_of_parts in partitions
-            ]
-        )
+        if np.prod(partitions.shape) <= 1.5 * CpuCount.get():
+            # block-wise map
+            preprocessed_map_func = cls.preprocess_func(map_func)
+            new_partitions = np.array(
+                [
+                    [
+                        part.apply(
+                            preprocessed_map_func,
+                            *func_args if func_args is not None else (),
+                            **func_kwargs if func_kwargs is not None else {},
+                        )
+                        for part in row_of_parts
+                    ]
+                    for row_of_parts in partitions
+                ]
+            )
+        else:
+            # axis-wise map
+            # we choose an axis for a combination of partitions
+            # whose size is closer to the number of CPUs
+            if abs(partitions.shape[0] - CpuCount.get()) < abs(
+                partitions.shape[1] - CpuCount.get()
+            ):
+                axis = 1
+            else:
+                axis = 0
+
+            column_splits = CpuCount.get() // partitions.shape[1]
+
+            if axis == 0 and column_splits > 1:
+                # splitting by parts of columnar partitions
+                new_partitions = cls.map_partitions_joined_by_column(
+                    partitions, column_splits, map_func, func_args, func_kwargs
+                )
+            else:
+                # splitting by full axis partitions
+                new_partitions = cls.map_axis_partitions(
+                    axis,
+                    partitions,
+                    lambda df: map_func(
+                        df,
+                        *(func_args if func_args is not None else ()),
+                        **(func_kwargs if func_kwargs is not None else {}),
+                    ),
+                    keep_partitioning=True,
+                )
+        return new_partitions

     @classmethod
     @wait_computations_if_benchmark_mode

Review thread on this hunk:

- Using the dataframe level for choosing a suitable strategy seems more appropriate.
- Since the partition manager, rather than the core dataframe, is designed to play around with partitions, maybe we should just update the docstring?
- Ok, I added
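The strategy selection in the hunk above can be summarized as a standalone sketch. This is illustrative only: `choose_map_strategy` and its string return values are hypothetical names for this example, and `shape`/`n_cpus` stand in for `partitions.shape` and `CpuCount.get()` from the diff.

```python
def choose_map_strategy(shape, n_cpus):
    """Sketch of how the new map_partitions picks a mapping strategy.

    Returns ("block", None), ("column_splits", n_splits), or ("axis", axis).
    """
    n_rows, n_cols = shape
    # Few enough block partitions: one remote call per block is cheap.
    if n_rows * n_cols <= 1.5 * n_cpus:
        return ("block", None)
    # Otherwise, map along the axis whose partition count is closest
    # to the number of CPUs.
    axis = 1 if abs(n_rows - n_cpus) < abs(n_cols - n_cpus) else 0
    column_splits = n_cpus // n_cols
    if axis == 0 and column_splits > 1:
        # Join columnar partitions into CPU-sized groups.
        return ("column_splits", column_splits)
    # Map over full axis partitions, keeping the partitioning.
    return ("axis", axis)


# A tall 100x2 grid on 8 CPUs joins columnar partitions into 4 splits.
print(choose_map_strategy((100, 2), 8))  # ("column_splits", 4)
```

With a small 2x2 grid the block-wise branch wins, and with a wide 2x100 grid the full-axis branch is chosen along axis 1, mirroring the three code paths in the diff.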
Review thread:

- Why was the implementation moved to a lower level?
- It was moved here to avoid duplicating logic, and map_partitions in the partition manager is only used in these cases.
- Previously, your implementation was also used instead of the self._partition_mgr_cls.lazy_map_partitions function, under some condition, but now it is not. Is that how it was intended?
- tree_reduce and groupby_reduce call map_partitions at the partition manager. That's why @Retribution98 moved the logic there, I guess.
- Yes, I see it. I think lazy map will be better in a lazy pipeline because some partitions may not be calculated further, so this issue is not relevant for this case.
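The lazy-map point in the last comment can be illustrated with a toy sketch. `LazyPartition` below is a hypothetical stand-in, loosely modeled on the `add_to_apply_calls` pattern of Modin partitions, not Modin's actual class: a lazily mapped function is only paid for when a partition is materialized, so partitions dropped later in the pipeline never compute it.

```python
class LazyPartition:
    """Toy partition that queues mapped functions instead of running them."""

    def __init__(self, data):
        self.data = data
        self._pending = []      # functions queued but not yet executed
        self.compute_count = 0  # how many function applications actually ran

    def add_to_apply_calls(self, func):
        # Lazy map: just record the call, do no work yet.
        self._pending.append(func)
        return self

    def materialize(self):
        # Only now are the queued functions executed.
        for f in self._pending:
            self.data = f(self.data)
            self.compute_count += 1
        self._pending.clear()
        return self.data


parts = [LazyPartition(i) for i in range(4)]
for p in parts:
    p.add_to_apply_calls(lambda x: x * 10)

# A later pipeline step only needs the first partition; the other
# three never execute the mapped function at all.
result = parts[0].materialize()
total_computed = sum(p.compute_count for p in parts)
```

Here `result` is 0 and `total_computed` is 1: with eager mapping, all four applications would have run up front.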