Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make pipelines aware of a timezone configuration #249

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

roelschr
Copy link
Contributor

@roelschr roelschr commented Sep 23, 2020

Why? 馃摉

While Spark's TimestampType timezone is controlled by the spark.sql.session.timeZone configuration option, python's datetime objects have their timezone controlled by the system's timezone (when they don't have a fixed tz suffix). This means some transformations can have their timestamps converted in different ways when running on different systems.

An example of possible irregular results happens when we automatically set the start_date of AggregatedFeatureSets (here). Sometimes the spark and the system can have different timezones, meaning that the timestamp coming from the spark dataframe, when collected into plain python as a datetime object can change, generating a start_date different then expected.

What? 馃敡

This PR proposes to apply a timezone configuration that should be aware by each pipeline and that should be the same between spark and system. This timezone is configurable.

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How everything was tested? 馃搹

TODO.

Checklist

  • My code follows the style guidelines of this project (docstrings, type hinting and linter compliance);
  • I have performed a self-review of my own code;
  • I have made corresponding changes to the documentation;
  • I have added tests that prove my fix is effective or that my feature works;
  • New and existing unit tests pass locally with my changes;
  • Add labels to distinguish the type of pull request. Available labels are bug, enhancement, feature, and review.

Attention Points 鈿狅笍

Replace me for what the reviewer will need to pay attention to in the PR or just to cover any concerns after the merge.

@roelschr roelschr added the enhancement New feature or request label Sep 23, 2020
@roelschr roelschr requested a review from a team as a code owner September 23, 2020 18:16
@roelschr roelschr self-assigned this Sep 23, 2020
@sonarcloud
Copy link

sonarcloud bot commented Sep 23, 2020

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities (and Security Hotspot 0 Security Hotspots to review)
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

Copy link
Contributor

@AlvaroMarquesAndrade AlvaroMarquesAndrade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! You could, however, add a test. What do you think?

@roelschr roelschr added the work in progress Work in progress label Sep 23, 2020
Comment on lines +21 to +22
timezone: timestamp feature transformations will assume this timezone
when they don't have a tz suffix.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just inform here that spark con config (spark.sql.session.timeZone) and an env variable (TZ) will be set with this value.

@@ -1,4 +1,6 @@
"""FeatureSetPipeline entity."""
import os

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

14.4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request work in progress Work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants