
FEAT-#6301: Simplify usage of algebra operators to define custom functions #6302

Open

wants to merge 10 commits into base: master
Conversation

@dchigarev (Collaborator) commented Jun 27, 2023

What do these changes do?

This PR introduces the Operator.apply(...) method, which takes and returns modin.pandas.DataFrame objects and applies the specified function to high-level dataframes directly, using the operator's scheme.
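For reviewers unfamiliar with the mechanics, the pattern can be sketched with toy stand-ins (all class names below are hypothetical, not modin's real implementation): apply() unwraps the frame's query compiler, runs the caller produced by the existing register() flow, and wraps the result back into a high-level frame.

```python
class ToyQueryCompiler:
    """Stand-in for modin's query compiler: just holds raw row data."""
    def __init__(self, rows):
        self.rows = rows

class ToyDataFrame:
    """Stand-in for modin.pandas.DataFrame."""
    def __init__(self, query_compiler):
        self._query_compiler = query_compiler

class ToyReduce:
    """Stand-in for an algebra operator such as Reduce."""

    @classmethod
    def register(cls, func):
        # The classic flow: produce a caller that works on query compilers.
        def caller(query_compiler, *args, **kwargs):
            reduced = [func(row, *args, **kwargs) for row in query_compiler.rows]
            return ToyQueryCompiler(reduced)
        return caller

    @classmethod
    def apply(cls, df, func, *args, **kwargs):
        # The new convenience layer: unwrap the QC, run the registered
        # caller, and wrap the result back into a high-level frame.
        res_qc = cls.register(func)(df._query_compiler, *args, **kwargs)
        return type(df)(res_qc)

df = ToyDataFrame(ToyQueryCompiler([[1, 2], [3, 4]]))
res = ToyReduce.apply(df, sum)
print(res._query_compiler.rows)  # [3, 7]
```

The point of the convenience layer is that the user never touches `_query_compiler` themselves.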

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Simplify the process of using UDFs algebra operators #6301
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

Signed-off-by: Dmitry Chigarev <[email protected]>
@@ -297,7 +297,7 @@ def register(
"""

def caller(
-        query_compiler, other, broadcast=False, *args, dtypes=None, **kwargs
+        query_compiler, other, *args, broadcast=False, dtypes=None, **kwargs
Collaborator Author

Changed the order so the function's positional arguments won't conflict with the broadcast keyword argument.

@dchigarev dchigarev marked this pull request as ready for review June 27, 2023 23:25
@dchigarev dchigarev requested a review from a team as a code owner June 27, 2023 23:25
-------
The same type as `df`.
"""
from modin.pandas import Series
@YarShev (Collaborator) commented Jun 29, 2023

I don't really like that inner layers depend on upper layers. I don't see any benefit of introducing these changes other than simplifying the registration of a UDF for users, which doesn't happen very often from my point of view. I would like to bring more attention to these changes to decide whether we want them to be merged or not.

cc @modin-project/modin-core

Collaborator

I agree with @YarShev. I'd rather avoid the dependency on a higher layer.

Collaborator Author

Maybe we can make the .apply() method a separate function and place it somewhere in modin.pandas.utils; the function would look something like:

# modin/pandas/utils.py
def apply_operator(df: modin.pandas.DataFrame, operator_cls: type[Operator], func, *args, **kwargs):
    res_qc = operator_cls.register(func)(df._query_compiler, *args, **kwargs)
    return type(df)(query_compiler=res_qc)

# the usage then would be:

from modin.pandas.utils import apply_operator
from modin.core.dataframe.algebra import Reduce

res_df = apply_operator(df, Reduce, func=reduce_func, axis=1)

One of the problems I see here is that managing operator-dependent behavior can be quite a pain, as we can no longer use OOP mechanisms (inheritance and overriding) to align with operator-specific logic:

# modin/pandas/utils.py

def _apply_reduce_op(...):
   ...

def _apply_groupby_op(...):
   ...
...

_operators_dict = {
    Reduce: _apply_reduce_op,
    GroupbyReduce: _apply_groupby_op,
    ...
}

def apply_operator(df: modin.pandas.DataFrame, operator_cls: type[Operator], func, *args, **kwargs):
    return _operators_dict[operator_cls](df, func, *args, **kwargs)

Do you have any other suggestions on how to improve this approach (or maybe you have another approach in mind)?
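The brittleness of the dispatch-dict approach compared with ordinary inheritance can be shown with a runnable toy (all names below are hypothetical): a subclass inherits operator-specific behavior for free, while the dict misses it unless every subclass is registered explicitly.

```python
class Operator:
    @classmethod
    def apply(cls, data, func):
        # Generic behavior: map the function over the elements.
        return [func(x) for x in data]

class Reduce(Operator):
    @classmethod
    def apply(cls, data, func):
        # Operator-specific behavior via ordinary overriding.
        return func(data)

class TreeReduce(Reduce):
    pass  # inherits Reduce's behavior for free

# The dict-dispatch alternative must enumerate every concrete class.
_dispatch = {Operator: Operator.apply, Reduce: Reduce.apply}

def apply_operator(operator_cls, data, func):
    # Lookup fails for TreeReduce unless it is added explicitly.
    return _dispatch[operator_cls](data, func)

print(TreeReduce.apply([1, 2, 3], sum))  # 6 -- inheritance just works
try:
    apply_operator(TreeReduce, [1, 2, 3], sum)
except KeyError:
    print("dispatch dict misses subclasses")
```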

Collaborator

@dchigarev the current approach for using the operators doesn't look very verbose to me. Will this PR make it much easier to use the operators anywhere?

Collaborator Author

> will this PR make it much easier to use the operators anywhere

The PR should make the usage of operators much easier for end-users of modin who would like to define their own distributed functions using modin's operators.

While optimizing customer workloads for modin, we sometimes see places that would perform much better if rewritten from the pandas API to modin's operators; however, the present API the operators provide results in a lot of complex code that customers struggle to understand. That inspired us to create a simple method/function that makes using an operator as simple as calling a single function.

@dchigarev (Collaborator, Author) commented Jul 5, 2023

> I don't really like that inner layers depend on upper layers.

@YarShev @mvashishtha

So I made two versions of how we can eliminate this dependency.

  1. Avoid importing objects from modin.pandas... at the algebra level, but still allow passing such objects to Operator.apply(). This way we get rid of the 'import dependency' on the higher level, meaning that we can easily detach the algebra layer if needed without worrying that it would have to carry stuff from the higher levels for the algebra to work correctly.

    I've made the changes to align with this approach and pushed it to the branch from this PR.

  2. Also avoid passing dataframe objects to Operator.apply() and rework this method to accept query compilers only. Then add a helper function somewhere at the dataframe level that would take series/dataframes, extract their QCs, and pass them to Operator.apply().

    I've implemented this approach in a separate branch. There, users have a function at modin.pandas.utils.apply_operator with the following signature:

    def apply_operator(operator_cls, *args, **kwargs):
        pass
    ...
    # use case example
    from modin.pandas.utils import apply_operator
    from modin.core.dataframe.algebra import Reduce
    
    series_obj = apply_operator(Reduce, df, reduce_func, axis=1)

    I don't really like this approach, as apply_operator() doesn't provide a meaningful signature and requires referring to Operator.apply() for the list of allowed parameters.


Collaborator

@dchigarev I wonder if you could provide an example of a user-defined Modin operator (ideally a real case, even if simplified and anonymized)?

Collaborator Author

> I wonder if you could provide an example of user-defined Modin operator

@vnlitvinov

What we usually use user-defined operators for is to emulate lazy execution for the types of transformations that can't be written as one or two pandas API calls (usually such transformations are performed in a for-loop).

As an example, consider a dataframe with multiple string columns that we want to combine into a single column using a specified separator. Surprisingly, the fastest way to do this in vanilla pandas is simply writing a for-loop:

combining multiple string columns using different approaches in pandas:

for-loop: 17.533942513167858
df.apply(join): 21.597900850698352
df.str.cat(): 38.55254930164665
pandas code
import pandas as pd
import numpy as np
from timeit import default_timer as timer

NCOLS = 16
NROWS = 5_000_000

df = pd.DataFrame({f"col{i}": [f"col{i}-{j}" for j in range(NROWS)] for i in range(NCOLS)})

t1 = timer()
res = df.iloc[:, 0]
for col in df.columns[1:]:
    res += "_" + df[col]
print(f"for-loop: {timer() - t1}")

t1 = timer()
res = df.apply(lambda row: "_".join(row), axis=1)
print(f"df.apply(join): {timer() - t1}")

t1 = timer()
res = df.iloc[:, 0].str.cat(df.iloc[:, 1:], sep="_")
print(f"df.str.cat(): {timer() - t1}")

Then, when adapting this code to modin, the for-loop approach turns out to be very slow due to the large number of kernels being submitted to Ray, which overwhelms it (each iteration of the for-loop results in 3 separate kernels: 1. df[col]; 2. "_" + df[col]; 3. res +=). It then turns out that the most performant approach is to submit this for-loop as a single kernel using the Reduce operator:

combining multiple string columns using different approaches in modin:

reduction operator: 2.6848975336179137
batch pipeline API: 2.945119895040989
for-loop: 36.92861177679151
df.apply(join): 8.54124379903078
df.str.cat(): 43.84469765238464
modin code
import modin.pandas as pd
import modin.config as cfg
import numpy as np
from timeit import default_timer as timer

cfg.BenchmarkMode.put(True)
# start all the workers
pd.DataFrame([np.arange(cfg.MinPartitionSize.get()) for _ in range(cfg.NPartitions.get() ** 2)]).to_numpy()

NCOLS = 16
NROWS = 5_000_000

df = pd.DataFrame({f"col{i}": [f"col{i}-{j}" for j in range(NROWS)] for i in range(NCOLS)})

from modin.core.dataframe.algebra import Reduce

def reduction(df):
    res = df.iloc[:, 0]
    for col in df.columns[1:]:
        res += "_" + df[col]
    return res

t1 = timer()
res = Reduce.apply(df, reduction, axis=1)
print(f"reduction operator: {timer() - t1}")

from modin.experimental.batch import PandasQueryPipeline

t1 = timer()
pipeline = PandasQueryPipeline(df)
pipeline.add_query(reduction, is_output=True)
res = pipeline.compute_batch()
print(f"batch pipeline API: {timer() - t1}")

t1 = timer()
res = df.iloc[:, 0]
for col in df.columns[1:]:
    res += "_" + df[col]
print(f"for-loop: {timer() - t1}")

t1 = timer()
res = df.apply(lambda row: "_".join(row), axis=1)
print(f"df.apply(join): {timer() - t1}")

t1 = timer()
res = df.iloc[:, 0].str.cat(df.iloc[:, 1:], sep="_")
print(f"df.str.cat(): {timer() - t1}")

(As I was writing this comment, I found out about the batch API in modin, which is supposed to serve exactly the same purpose of "emulating" lazy execution. However, it doesn't seem to provide a way to specify the scheme by which the kernels should actually be submitted (map, row-wise, column-wise, ...), and it also has some slight overhead compared with the pure user-defined operator approach.)
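The kernel-count argument above can be put into rough numbers. This is a back-of-the-envelope sketch only: the partition count is an assumption for illustration, not a measured value from the benchmark.

```python
NCOLS = 16        # column count from the benchmark above
NPARTITIONS = 48  # assumed partition count, for illustration only

# for-loop: each of the NCOLS-1 iterations submits ~3 kernels
# (df[col], "_" + df[col], res +=), each fanning out across partitions
loop_kernels = (NCOLS - 1) * 3 * NPARTITIONS

# reduction operator: one full-axis kernel per row partition
reduce_kernels = NPARTITIONS

print(loop_kernels, reduce_kernels)  # 2160 48
```

Even with conservative assumptions, the scheduler sees orders of magnitude fewer tasks with the single-kernel approach, which is consistent with the timings above.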

Comment on lines +426 to +429
left : modin.pandas.DataFrame or modin.pandas.Series
Left operand.
right : modin.pandas.DataFrame or modin.pandas.Series
Right operand.
Collaborator

This layer depends on the upper layer in every apply.

Collaborator

The lower layer still takes object(s) from the API layer, namely modin.pandas.DataFrame/Series. Then there is func, which takes pandas.DataFrame/Series. Also, there is a kwargs argument that needs to be passed to cls.register(). What is cls.register for the user? Doesn't this look a little complicated for the user? So many things to understand. I am also thinking that we are trying to simplify things not for the user, but for ourselves, when we rewrite a customer workload to get better performance.

Left operand.
right : modin.pandas.DataFrame or modin.pandas.Series
Right operand.
func : callable(pandas.DataFrame, pandas.DataFrame, \*args, axis, \*\*kwargs) -> pandas.DataFrame
Collaborator

the signature is wrong here, as the implementation explicitly passes Query Compilers as arguments...

Collaborator Author

Nope, func here is a kernel that will be applied to deserialized partitions (pandas dataframes), so the signature is correct.

follow the track of the func:

  1. register(func)
  2. register(func) -> modin_dataframe.apply_full_axis(func, ...)
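That track can be modeled with a minimal toy (all names below are hypothetical stand-ins, not modin's real classes) to show that func ultimately receives materialized partition data, never a query compiler:

```python
class ToyQC:
    """Stand-in for a query compiler holding partitioned data."""
    def __init__(self, partitions):
        # lists stand in for the pandas DataFrames a real partition holds
        self.partitions = partitions

    def apply_full_axis(self, func):
        # func is applied to each materialized partition, not to the QC itself
        return ToyQC([func(part) for part in self.partitions])

def register(func):
    # register() produces a caller operating on query-compiler-like objects,
    # but the user's func only ever sees the deserialized partition payloads
    def caller(query_compiler):
        return query_compiler.apply_full_axis(func)
    return caller

qc = ToyQC([[1, 2], [3, 4]])
res = register(lambda part: [x * 10 for x in part])(qc)
print(res.partitions)  # [[10, 20], [30, 40]]
```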

-------
The same type as `df`.
"""
operator = cls.register(func, **kwargs)
Collaborator

I wonder what is the purpose of .register for a one-off thing

Collaborator Author

Unfortunately, that's the only way we can get the caller function.

modin/core/dataframe/algebra/binary.py (outdated, resolved)
modin/core/dataframe/algebra/fold.py (outdated, resolved)
modin/core/dataframe/algebra/binary.py (outdated, resolved)
@vnlitvinov (Collaborator) left a comment

@dchigarev so this is a way to implement some user-defined functions, right? I wonder if we can actually try to take a leaf out of the pandas/numpy book in the way they support UDFs...
