[FEA] Make line terminator sequence handling in regular expression engine a configurable option #15746
Comments
This has been requested before: #11979
Thank you @NVnavkumar for raising this topic. Would you please share more information about this?
Addressing these questions here:
I will try to update with some performance numbers soon.
Is your feature request related to a problem? Please describe.
Some notes from #11979 here: The `$` anchor matches at the position right before a line terminator in regular expressions. In cuDF (and in Python), this is right before a newline `\n`. However, in Spark (or rather the JDK), the line terminator can be any one of the following sequences: `\r`, `\n`, `\r\n`, `\u0085`, `\u2028`, or `\u2029`, unless UNIX_LINES mode is activated (see https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#lt).
Describe the solution you'd like
It would be useful if we could configure the concept of line terminator sequences in cuDF. Ideally, this would be an optional parameter that accepts a simple array of strings for line terminator sequences. Alternatively, it could be a flag that enables a `JDK_MODE`, which would turn on the more complex handling and could be set when calling the corresponding methods from the cuDF Java library.
Describe alternatives you've considered
Currently, spark-rapids handles `$` by doing a heavy translation from a JDK regular expression to another regular expression supported by cuDF that handles the multiple possible line terminator sequences that the JDK uses. With this translation, we are limited to using `$` only in simple scenarios at the end of the regular expression. We cannot currently use it inside a choice (`|`), among other constructions, because of the complexity (see NVIDIA/spark-rapids#10764).
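To make the gap concrete, here is a minimal sketch using Python's `re` module, whose `$` behaves like the `\n`-only case described above. It emulates the JDK's non-MULTILINE `$` with lookarounds. The names `JDK_LINE_TERMINATOR`, `JDK_DOLLAR`, and `search_with_jdk_dollar` are hypothetical and exist only for this illustration; this is not the translation that spark-rapids actually emits, and it only covers a `$` at the very end of a pattern.

```python
import re

# Line terminators recognized by java.util.regex.Pattern when UNIX_LINES
# is not set. "\r\n" comes first so it is consumed as a single terminator.
JDK_LINE_TERMINATOR = r"(?:\r\n|[\n\r\u0085\u2028\u2029])"

# Hypothetical emulation of the JDK's non-MULTILINE `$`:
#   * match at end of input, or just before one final line terminator
#   * never match between the "\r" and "\n" of a "\r\n" pair
JDK_DOLLAR = rf"(?:(?!(?<=\r)\n)(?={JDK_LINE_TERMINATOR}?\Z))"


def search_with_jdk_dollar(body: str, text: str):
    """Search for `body` followed by a JDK-style `$` (illustrative helper)."""
    return re.search(body + JDK_DOLLAR, text)


if __name__ == "__main__":
    samples = ["a\n", "a\r\n", "a\r", "a\u0085", "a\u2028", "a\nb"]
    for text in samples:
        python_match = re.search(r"a$", text) is not None
        jdk_like_match = search_with_jdk_dollar("a", text) is not None
        print(f"{text!r:12} python_dollar={python_match} jdk_like={jdk_like_match}")
```

Even this single trailing anchor needs nested lookarounds, which gives a sense of why supporting `$` inside alternations through pattern rewriting alone becomes hard to manage, and why a built-in option for line terminators would be simpler for callers.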