
feat: Implement Spark-compatible CAST from String to Date #383

Merged
merged 21 commits on May 23, 2024

Conversation

vidyasankarv
Contributor

@vidyasankarv vidyasankarv commented May 5, 2024

Which issue does this PR close?

Closes #327.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

@andygrove changed the title from "support casting DateType in comet" to "feat: Implement Spark-compatible CAST from String to Date" on May 5, 2024
@@ -107,7 +108,23 @@ macro_rules! cast_utf8_to_timestamp {
result
}};
}

macro_rules! cast_utf8_to_date {
Contributor

Any reason why this is a macro and not a function? I see only one usage (maybe I missed something).

Contributor Author

Hi @parthchandra, thanks for the review. I am new to Rust (and this is also my first attempt at an OSS contribution).
It was my first stab at this issue and I was imitating the code from cast_utf8_to_timestamp. Removed it now.

Contributor

> I am new to Rust

So am I :). Welcome to the community!

{
Self::spark_cast_int_to_int(&array, self.eval_mode, from_type, to_type)?
}
if self.eval_mode != EvalMode::Try =>
Contributor

Seems like unnecessary re-formatting (multiple places).

Contributor Author

cleaned this up now with cargo fmt

@vidyasankarv
Contributor Author

vidyasankarv commented May 8, 2024

This PR is still in progress. I added support for String to Date32.

  • Spark supports dates in the formats YYYY and YYYY-MM and DataFusion does not - now supported
  • Spark supports a trailing T, as in 2024-01-01T, and DataFusion does not - now supported
  • DataFusion doesn't throw an exception for invalid inputs in ANSI mode - now returns an error in ANSI mode if the date can't be parsed

Hi @parthchandra @andygrove Can you please review if this is going in the right direction.

Questions:

@andygrove
Member

andygrove commented May 8, 2024

> This PR is still in progress. I added support for String to Date32.
>
>   • Spark supports dates in the formats YYYY and YYYY-MM and DataFusion does not - now supported
>   • Spark supports a trailing T, as in 2024-01-01T, and DataFusion does not - now supported
>   • DataFusion doesn't throw an exception for invalid inputs in ANSI mode - now returns an error in ANSI mode if the date can't be parsed
>
> Hi @parthchandra @andygrove Can you please review if this is going in the right direction.
>
> Questions:

@vidyasankarv Yes, I would say this is going in a good direction based on a very quick review. I will try and find more time tomorrow for a deeper look. To answer your questions:

  • Yes, ideally we should support the edge cases mentioned in the issue. We could also choose to leave that for a future PR and leave the current support marked as incompatible and provide some documentation on what is not supported (as we have done for string -> timestamp).
  • No need to support Date64 for this issue

@@ -954,13 +993,63 @@ fn parse_str_to_time_only_timestamp(value: &str) -> CometResult<Option<i64>> {
Ok(Some(timestamp))
}

fn date_parser(value: &str, eval_mode: EvalMode) -> CometResult<Option<i32>> {
Contributor

I wasn't familiar with Spark's string to date conversion so I took a look. (https://github.com/apache/spark/blob/9d79ab42b127d1a12164cec260bfbd69f6da8b74/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L312)
From the comment the allowed formats are -

  * `[+-]yyyy*`
  * `[+-]yyyy*-[m]m`
  * `[+-]yyyy*-[m]m-[d]d`
  * `[+-]yyyy*-[m]m-[d]d `
  * `[+-]yyyy*-[m]m-[d]d *`
  * `[+-]yyyy*-[m]m-[d]dT*`

I honestly don't know what a string with a 'plus/minus' at the beginning of the date even means but you might want to handle that case.
Also, the max number of digits allowed for the year is 7.
Finally, once you've got the 'day' segment of the date you may have a ' ' or 'T' (you're only handling the latter) and the characters after that are discarded.
It looks to me like Spark's custom implementation might be slightly faster since it manages to achieve the split of the string into segments and the parsing of the digits in a single pass. (also does not need to prepare the parser with the format string). You might want to consider doing the same.
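The single-pass approach described above can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: it splits the string into year/month/day segments and parses the digits in one scan, with no allocation and no format string. Spark's additional rules (for example, rejecting too-short years) and month/day range validation are assumed to happen in a later step.

```rust
// Simplified sketch of a Spark-style single-pass date segmenter for
// "[+-]yyyy*-[m]m-[d]d[ |T]*". Names are illustrative only.
fn parse_date_segments(s: &str) -> Option<(i32, u32, u32)> {
    let bytes = s.trim().as_bytes();
    if bytes.is_empty() {
        return None;
    }
    // optional leading sign
    let (sign, start) = match bytes[0] {
        b'+' => (1i32, 1),
        b'-' => (-1i32, 1),
        _ => (1i32, 0),
    };
    let mut segments = [0i32; 3]; // year, month, day
    let mut seg = 0;
    let mut digits = 0;
    for &b in &bytes[start..] {
        if b == b'-' && seg < 2 {
            // segment separator: advance to month, then day
            if digits == 0 {
                return None;
            }
            seg += 1;
            digits = 0;
        } else if (b == b' ' || b == b'T') && seg == 2 {
            // a trailing ' ' or 'T' after the day: discard the rest
            break;
        } else if b.is_ascii_digit() {
            digits += 1;
            // year allows up to 7 digits, month/day up to 2
            if (seg == 0 && digits > 7) || (seg > 0 && digits > 2) {
                return None;
            }
            segments[seg] = segments[seg] * 10 + (b - b'0') as i32;
        } else {
            return None;
        }
    }
    if digits == 0 {
        return None;
    }
    // missing month/day default to 1, covering the `yyyy` and `yyyy-mm` forms
    let month = if seg >= 1 { segments[1] as u32 } else { 1 };
    let day = if seg >= 2 { segments[2] as u32 } else { 1 };
    Some((sign * segments[0], month, day))
}
```

For example, `"2024-01-01T12:00"` segments to `(2024, 1, 1)` and `"3/"` is rejected because `/` is neither a digit nor an allowed separator.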

@parthchandra
Contributor

You're pretty close to handling these cases (see my comment in the review)

@vidyasankarv
Contributor Author

vidyasankarv commented May 11, 2024

Hi @parthchandra @andygrove, I made the changes as suggested and
ported the date parsing logic from SparkDateTimeUtils.
The previous version reads much simpler, though it was missing a couple of features like handling trailing/extra spaces (diff).

This PR does not include the fuzz test in CometCastSuite for casting String to Date, because NaiveDate only supports dates in the range below, and the dates generated by the fuzz test for matching with Spark fall outside this range, causing a mismatch between results.

/// The minimum possible `NaiveDate` (January 1, 262145 BCE).
#[deprecated(since = "0.4.20", note = "Use NaiveDate::MIN instead")]
pub const MIN_DATE: NaiveDate = NaiveDate::MIN;
/// The maximum possible `NaiveDate` (December 31, 262143 CE).
#[deprecated(since = "0.4.20", note = "Use NaiveDate::MAX instead")]
pub const MAX_DATE: NaiveDate = NaiveDate::MAX;

In ANSI mode, any format-validation failure returns an error, whereas other modes return None.
However, when all validations pass, dates beyond the range supported by NaiveDate return None in all modes.

@parthchandra Regarding these points

  • I honestly don't know what a string with a 'plus/minus' at the beginning of the date even means but you might want to handle that case - found something here:

    • If the year is between 9999 BCE and 9999 CE, the result is of the form -YYYY-MM-DD and YYYY-MM-DD respectively. For years prior to or after this range, the necessary number of digits are added to the year component and + is used for CE.
  • Also, the max number of digits allowed for the year is 7 - supporting years of up to 7 digits would be beyond NaiveDate's supported range, so this will be a point of difference between native Spark and Spark with Comet.

Can you please take another look at the PR? Thank you.

@andygrove
Member

Thanks @vidyasankarv. I plan on carefully reviewing this later today.

current_segment += 1;
} else {
//increment value of current segment by the next digit
let parsed_value = (b - b'0') as i32;
Member

This line triggers an overflow panic if b is less than b'0' (the subtraction underflows). It looks like this code assumes that b is a digit, but with the input 3/, it failed here on processing /. Perhaps check that b is a digit first?

Contributor Author

@andygrove added check for ascii digits and some negative test cases around this in rust code. thank you
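The guarded conversion can be illustrated in isolation with a hypothetical helper (not the PR's code). Since `b` is a `u8`, `b - b'0'` underflows, and panics in debug builds, whenever `b` is a byte below `b'0'`, such as `/` (0x2F):

```rust
// Hypothetical helper illustrating the fix: convert an ASCII byte to a digit
// only after confirming it actually is one. The unguarded `b - b'0'` would
// underflow (panicking in debug builds) for bytes like b'/'.
fn ascii_digit(b: u8) -> Option<i32> {
    if b.is_ascii_digit() {
        Some((b - b'0') as i32)
    } else {
        None
    }
}
```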

castTest(generateStrings(datePattern, 8).toDF("a"), DataTypes.DateType)
castTest(
Seq(
"262142-01-01",
Member

Could you add some invalid entries here as well so that we can ensure that Comet throws errors for invalid inputs when ANSI is enabled?

Some suggestions:

        "",
        "0",
        "not_a_date",

Contributor Author

added additional negative test cases as suggested.

@@ -119,7 +119,7 @@ object CometCast {
Unsupported
case DataTypes.DateType =>
// https://github.com/apache/datafusion-comet/issues/327
Unsupported
Compatible()
Member

Although it seems we are compatible for most common use cases, we are not 100% compatible, so should add a note here.

Suggested change
Compatible()
Compatible(Some("Only supports years between 262143 BC and 262142 AD"))

Contributor

If we do not cover these cases in this PR, we should add a test with ignore

Contributor Author

changed Compatible as suggested
also added back an ignore test case for fuzz test as a placeholder

// https://github.com/apache/datafusion-comet/issues/327
castTest(generateStrings(datePattern, 8).toDF("a"), DataTypes.DateType)
Contributor

Should we keep generateStrings because that covers random value tests?

Contributor Author

The generateStrings method wasn't removed, just the fuzz test for dates. The fuzz test for dates is now added back as an ignored test since some dates are not supported.

Member

I left a suggestion for adding the fuzzing back but filtering out values that we know are not supported

Member

I think that we can remove the ignored test if everyone is happy with the suggested changes

Contributor Author

removed now and added a filtered fuzzy test

@vidyasankarv
Contributor Author

vidyasankarv commented May 14, 2024

Hi @andygrove
I need some help
https://github.com/apache/datafusion-comet/pull/383/files#diff-41ecdd113d7a7afe33447e34f1ff0b5ed3033a89bfbcefa9e7e259d7a6e4daecR593

These particular test cases result in failure - for example, with the sample input 2020-10-010T, the same test in Rust returns a CometError in ANSI mode https://github.com/apache/datafusion-comet/pull/383/files#diff-b7339cca414a6315488506dd33654946f62c229feb8ad0d4abeda683ca75b4b5R1717

However, when the CometCastSuite test for String to Date runs, it fails for the combination of Comet ANSI enabled without try_cast.

I have tried to debug concurrently in CLion https://github.com/apache/datafusion-comet/blob/main/docs/source/contributor-guide/debugging.md - however the breakpoints show as disabled and aren't hitting; I tried switching to lldb in the CLion toolchain too, but no luck. I added some additional logging locally and can see that the Rust side returns a CometError for the invalid value as expected, but the Comet side returns None when running from the Scala test suite.

Is there something else I could be missing in terms of any configuration. Appreciate any help when you get some time.

Thank you

Comment on lines 662 to 663
} else if let Ok(Some(cast_value)) =
date_parser(string_array.value(i), eval_mode)
Member

@vidyasankarv This is the fix you need to make your current test pass.

The problem was that we were ignoring any error here when running in ANSI mode.

Suggested change
} else if let Ok(Some(cast_value)) =
date_parser(string_array.value(i), eval_mode)
} else if let Some(cast_value) = date_parser(string_array.value(i), eval_mode)?

Contributor Author

Oh sorry, my mistake - I missed checking here.
Handling all cases of date_parser's return value with a match clause now.
Thank you @andygrove
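The bug pattern fixed here generalizes: `if let Ok(Some(v)) = f()` matches only the success case and silently drops an `Err`, while applying `?` first propagates it to the caller. A minimal standalone illustration with hypothetical names (not the PR's code):

```rust
// A parser whose Err must reach the caller (e.g. in ANSI mode).
fn parse(s: &str) -> Result<Option<i32>, String> {
    if s.is_empty() {
        return Err("malformed input".to_string());
    }
    Ok(s.parse::<i32>().ok())
}

// Buggy shape: the Err case falls into the else branch and is lost.
fn swallow_errors(s: &str) -> Result<Option<i32>, String> {
    if let Ok(Some(v)) = parse(s) {
        Ok(Some(v))
    } else {
        Ok(None) // an Err from parse() is silently mapped to None here
    }
}

// Fixed shape: `?` propagates the Err before matching on the Option.
fn propagate_errors(s: &str) -> Result<Option<i32>, String> {
    match parse(s)? {
        Some(v) => Ok(Some(v)),
        None => Ok(None),
    }
}
```

With this shape, `swallow_errors("")` quietly yields `Ok(None)`, while `propagate_errors("")` surfaces the error as intended.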

"2020-mar-20",
"not_a_date",
"T2")
castTest((validDates ++ invalidDates).toDF("a"), DataTypes.DateType)
Member

Let's add fuzzing back here, but filter out values that we know that we cannot support.

Suggested change
castTest((validDates ++ invalidDates).toDF("a"), DataTypes.DateType)
// due to limitations of NaiveDate we only support years between 262143 BC and 262142 AD
// we can't test all possible fuzz dates
val unsupportedYearPattern: Regex = "^\\s*[0-9]{5,}".r
val fuzzDates = generateStrings(datePattern, 8)
.filterNot(str => unsupportedYearPattern.findFirstMatchIn(str).isDefined)
castTest((validDates ++ invalidDates ++ fuzzDates).toDF("a"), DataTypes.DateType)

Contributor Author

Thank you for the suggestion @andygrove - incorporated as suggested.

@@ -563,8 +565,54 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper {
castTest(generateStrings(numericPattern, 8).toDF("a"), DataTypes.BinaryType)
}

ignore("cast StringType to DateType") {
test("cast StringType to DateType") {
// https://github.com/apache/datafusion-comet/issues/327
Contributor

nit: let's move this comment with the issue number to "cast StringType to DateType - Fuzz Test"

Contributor Author

The ignored test has been removed per @andygrove's suggestion above in favor of a filtered fuzz test, so this comment is removed now.

// a string-to-date parser - a port of Spark's SparkDateTimeUtils#stringToDate.
fn date_parser(date_str: &str, eval_mode: EvalMode) -> CometResult<Option<i32>> {
// local functions
fn get_trimmed_start(bytes: &[u8]) -> usize {
Contributor

Is this defined because we cannot use String#trim() or trim_matches?

Member

My understanding is that this is a direct port of Spark's logic. Spark skips characters rather than calling trim because it is more efficient (avoids extra memory allocation). It is possible that the code could be optimized more to take advantage of zero-cost abstractions in Rust, but I think we should look at optimizations as a follow up if we determine that performance needs improving.

Contributor Author

@kazuyukitanimura yes, this is a port of the Scala implementation https://github.com/apache/spark/blob/7e79e91dc8c531ee9135f0e32a9aa2e1f80c4bbf/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L312 as suggested by @parthchandra in a previous comment.
Leaving it as is for now based on the above comment from @andygrove; hope that's ok with you too.

Contributor

> Spark skips characters rather than calling trim because it is more efficient (avoids extra memory allocation).

effectively a str slice
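For context, Rust's own `str::trim` is already allocation-free: it returns a subslice borrowing the original buffer, so the manual index-skipping is primarily a faithful port of Spark's logic rather than a performance necessity in Rust. A quick illustration (hypothetical helper, not the PR's code):

```rust
// str::trim returns a borrowed subslice of the original string; no new
// buffer is allocated, the slice just starts past the whitespace.
fn trim_is_a_slice(s: &str) -> (&str, bool) {
    let trimmed = s.trim();
    // check whether the trimmed slice points into the original buffer
    let in_place = s.as_bytes().as_ptr_range().contains(&trimmed.as_ptr());
    (trimmed, in_place)
}
```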

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 34.17%. Comparing base (14494d3) to head (ae575e5).
Report is 11 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #383      +/-   ##
============================================
+ Coverage     34.02%   34.17%   +0.15%     
+ Complexity      857      850       -7     
============================================
  Files           116      116              
  Lines         38565    38547      -18     
  Branches       8517     8523       +6     
============================================
+ Hits          13120    13174      +54     
+ Misses        22691    22608      -83     
- Partials       2754     2765      +11     


@andygrove
Member

There is one test failure with JDK 8 / Spark 3.2:

- cast StringType to DateType *** FAILED *** (349 milliseconds)
  "[CAST_INVALID_INPUT] The value '0' of the type "STRING" cannot be cast to "DATE" because it is malformed. Correct the value as per the syntax, or change its target type. Use `try_cast` to tolerate malformed input and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error." did not contain "annot cast 0 to DateType." (CometCastSuite.scala:1069)

@andygrove
Member

@vidyasankarv I would suggest that we skip the test for now when running against Spark 3.2 and file a follow-on issue to fix 3.2 compatibility (this may not be a high priority since 3.2 is quite old and we should consider dropping support for it at some point).

You can add an assume call to the test to skip for certain criteria:

  test("cast StringType to DateType") {
    assume(CometSparkSessionExtensions.isSpark33Plus)

It would be good to add a comment in here as well with a link to the follow on issue (could you file that?)

@vidyasankarv
Contributor Author

> @vidyasankarv I would suggest that we skip the test for now when running against Spark 3.2 and file a follow-on issue to fix 3.2 compatibility (this may not be a high priority since 3.2 is quite old and we should consider dropping support for it at some point).
>
> You can add an assume call to the test to skip for certain criteria:
>
>   test("cast StringType to DateType") {
>     assume(CometSparkSessionExtensions.isSpark33Plus)
>
> It would be good to add a comment in here as well with a link to the follow on issue (could you file that?)

@andygrove thank you for the suggestions - filed issue #440 and linked it in the test.

@vidyasankarv
Contributor Author

vidyasankarv commented May 20, 2024

https://github.com/apache/datafusion-comet/suites/23883332179/logs?attempt=2

@andygrove
This build included the fuzz test cast String to DateType - 19fa952 - as recommended here #383 (comment)

From the logs for ubuntu-latest/java 17-spark-3.4-scala-2.12/ - https://github.com/apache/datafusion-comet/suites/23883332179/logs?attempt=2

  • all cast return values on the Comet side are showing null.

Additionally, the same sample dates from above pass when the fuzz test is excluded https://github.com/apache/datafusion-comet/pull/383/files#diff-41ecdd113d7a7afe33447e34f1ff0b5ed3033a89bfbcefa9e7e259d7a6e4daecR577-R585, yet the test report shows them as null on the Comet side in the comparison when the fuzz tests are included.

https://github.com/apache/datafusion-comet/actions/runs/9123082449/job/25132235801

  ![262142-01-01,2142-01-01]               [262142-01-01,null]
  ![262142-01-01 ,2142-01-01]              [262142-01-01 ,null]
  ![262142-01-01T ,2142-01-01]             [262142-01-01T ,null]
  ![262142-01-01T 123123123,2142-01-01]    [262142-01-01T 123123123,null]
   [263,null]                              [263,null]

if you search for 262142-01-01 in the logs you can see it reports as failing on lines 14035 to 14038 as above

similarly if you also search for dates -262143-12-31 on lines 10167 to 10171

   [--262143-12-31,null]                   [--262143-12-31,null]
   [--262143-12-31T 1234 ,null]            [--262143-12-31T 1234 ,null]
  ![-262143-12-31,2144-12-31]              [-262143-12-31,null]
  ![-262143-12-31 ,2144-12-31]             [-262143-12-31 ,null]
  ![-262143-12-31T,2144-12-31]             [-262143-12-31T,null]
  ![-262143-12-31T ,2144-12-31]            [-262143-12-31T ,null]
  ![-262143-12-31T 123123123,2144-12-31]   [-262143-12-31T 123123123,null]

I have spent a fair amount of time trying to understand why this is happening, but am unable to identify the issue.

I have added some more sample fuzz dates to my current unit tests in the Rust tests and CometCastSuite;
all of them pass locally without the fuzz test.
2b4c204

So I have pushed this build with the fuzz test removed, to see whether the build passes, and I might need some help identifying the issue with the fuzz test.
Apologies for taking your time on this again. Thank you for your help.

@parthchandra (Contributor) left a comment

LGTM
Thanks @vidyasankarv


@andygrove
Member

> So I have pushed this build removing the fuzz test now to see if the build passes. So I might need some help in trying to identify the issue with fuzz test.
> Apologies for taking your time on this again. Thank you for your help

Thanks @vidyasankarv. I plan on looking into this tomorrow. Overall, the PR looks good.

@andygrove
Member

@vidyasankarv I am also very confused .. values that fail in the fuzz test work in the other test 🤔

I am debugging and will let you know when I get to the bottom of this mystery

@andygrove
Member

@vidyasankarv I figured out what the issue is.

I don't fully understand why, but when the fuzz test creates the DataFrame, the cast operation that gets performed is from a dictionary array not a string array:

cast_array(from=Dictionary(Int32, Utf8), to_type=Date32)

This means that we are not even calling your native date_parser but instead falling through to this catchall logic:

_ => {
    // when we have no Spark-specific casting we delegate to DataFusion
    cast_with_options(&array, to_type, &CAST_OPTIONS)?
}

The solution is to add a specific match for casting dictionary to date:

            (
                DataType::Dictionary(key_type, value_type),
                DataType::Date32,
            ) if key_type.as_ref() == &DataType::Int32
                && (value_type.as_ref() == &DataType::Utf8
                || value_type.as_ref() == &DataType::LargeUtf8) =>
            {
                match value_type.as_ref() {
                    DataType::Utf8 => {
                        let unpacked_array =
                            cast_with_options(&array, &DataType::Utf8, &CAST_OPTIONS)?;
                        Self::cast_string_to_date(&unpacked_array, to_type, self.eval_mode)?
                    }
                    DataType::LargeUtf8 => {
                        let unpacked_array =
                            cast_with_options(&array, &DataType::LargeUtf8, &CAST_OPTIONS)?;
                        Self::cast_string_to_date(&unpacked_array, to_type, self.eval_mode)?
                    }
                    dt => unreachable!(
                        "{}",
                        format!("invalid value type {dt} for dictionary-encoded string array")
                    ),
                }
            },

@vidyasankarv
Contributor Author

@andygrove thank you very much for looking into this. Tested the fuzz test with your suggestions and it is working now. Pushed the changes in the latest commit 88af45c.

@andygrove (Member) left a comment

LGTM pending CI. Thank you for your patience @vidyasankarv.

Once this is merged I will rebase #461 which would have prevented some of the issues we ran into on this PR

@andygrove andygrove merged commit a7272b9 into apache:main May 23, 2024
40 checks passed
@vidyasankarv vidyasankarv deleted the #327 branch May 23, 2024 17:41
@vidyasankarv
Contributor Author

vidyasankarv commented May 23, 2024

@andygrove @parthchandra @kazuyukitanimura thank you for the reviews and support in helping me through my first open source contribution. It's been a great learning experience, and I'm still trying to grasp all the new things I learnt from this seemingly simple good first issue: my first exposure to Rust, and seeing JNI in action for the interaction between Spark and Comet using Arrow. Hope to keep contributing. And waiting for @andygrove's updated version of his How Query Engines Work. Thank you all.

Successfully merging this pull request may close these issues.

Implement Spark-compatible CAST from String to Date