
Use back_populates instead of sqlalchemy.orm.backref in bidirectional relationships #39430

Open · wants to merge 6 commits into base: main from use-explicit-back-populates
Conversation

Taragolis (Contributor):
backref is a legacy approach to constructing relationships; the preferred way is to use the back_populates keyword argument in relationship() to build the bidirectional relationship explicitly.
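A minimal, self-contained sketch of the difference (toy models, not Airflow's actual classes; names mirror the PR only for illustration):

```python
# Toy sketch, not Airflow's real models. The legacy style would be
#   dag_runs = relationship("DagRun", backref="creating_job")
# which implicitly creates DagRun.creating_job on the other class.
# With back_populates, both directions are declared explicitly.
from sqlalchemy import Column, ForeignKey, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Job(Base):
    __tablename__ = "job"
    id = Column(Integer, primary_key=True)
    dag_runs = relationship("DagRun", back_populates="creating_job")

class DagRun(Base):
    __tablename__ = "dag_run"
    id = Column(Integer, primary_key=True)
    creating_job_id = Column(Integer, ForeignKey("job.id"))
    creating_job = relationship("Job", back_populates="dag_runs")

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    job = Job(dag_runs=[DagRun()])
    session.add(job)
    session.flush()
    # back_populates keeps both directions in sync in memory.
    assert job.dag_runs[0].creating_job is job
```

The behaviour is the same as with backref; the gain is that a reader of either class sees the relationship declared there, instead of having one attribute appear "magically" from the other side.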


Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added area:Scheduler Scheduler or dag parsing Issues area:serialization labels May 6, 2024
@Taragolis Taragolis changed the title Use back_populates instead of backref in bidirectional relationships Use back_populates instead of sqlalchemy.orm.backref in bidirectional relationships May 6, 2024
@Taragolis Taragolis force-pushed the use-explicit-back-populates branch 2 times, most recently from 1a1e18d to 168fc11 Compare May 6, 2024 14:58
Comment on lines +86 to +90
# This is never executed, but tricks static analyzers (PyDev, PyCharm,)
# into knowing the types of these symbols, and what they contain.
from airflow.models.dag import DAG # noqa: TCH004
from airflow.models.dataset import Dataset # noqa: TCH004
from airflow.models.xcom_arg import XComArg # noqa: TCH004
Member:
I think instead of doing noqa you could add these to __all__. Not sure; that would be a lot better if it works.

Contributor Author:

They are already in __all__; as a result, ruff complains that they should be moved outside of the TYPE_CHECKING block (TCH004), because it is unable to detect PEP 562 imports.
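For context, a stripped-down sketch of the PEP 562 pattern in question (illustrative names, not the real airflow/models/__init__.py): a module-level __getattr__ resolves exported names lazily at runtime, which static import analyzers such as ruff's TCH004 rule cannot see.

```python
# Sketch of a PEP 562 lazy-export module (illustrative, not Airflow's code).
# Static tools cannot tell that __getattr__ will provide "DAG" at runtime,
# hence the TCH004 false positive on imports that live under TYPE_CHECKING
# purely to inform type checkers.
import sys
import types

lazy_module = types.ModuleType("lazy_models")
lazy_module.__all__ = ["DAG"]

def _module_getattr(name):
    # Import the real object only on first attribute access.
    if name == "DAG":
        import datetime  # stand-in for "from airflow.models.dag import DAG"
        return datetime.date
    raise AttributeError(f"module 'lazy_models' has no attribute {name!r}")

lazy_module.__getattr__ = _module_getattr
sys.modules["lazy_models"] = lazy_module

import lazy_models
print(lazy_models.DAG)  # resolved lazily via the module-level __getattr__
```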

Contributor Author:

As an option, I thought of adding this rule to the exclusion list in pyproject.toml, the same way as we already did for airflow/models/__init__.py:

airflow/pyproject.toml

Lines 346 to 348 in 242a169

"airflow/__init__.py" = ["F401"]
"airflow/models/__init__.py" = ["F401", "TCH004"]
"airflow/models/sqla_models.py" = ["F401"]

Contributor:

This should be done in a separate PR

Contributor Author:

Already: #39750

Contributor Author:

As soon as we merge #39750, I will drop the commit which adds this fix from this PR.

@Taragolis Taragolis force-pushed the use-explicit-back-populates branch 3 times, most recently from 1e76071 to 3a0db83 Compare May 16, 2024 12:08


Only makes sense for SchedulerJob and BackfillJob instances.
"""

Contributor:

Why remove this? Let's move it to the task_instances_enqueued. wdyt?

Contributor Author:

This is not required anymore; you can achieve it with explicit bidirectional relationship expressions.

Contributor Author:

IMHO this construction looks clearer:

primaryjoin="Job.id == foreign(DagRun.creating_job_id)"

rather than this:

primaryjoin=lambda: Job.id == foreign(_resolve_dagrun_model().creating_job_id)
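To illustrate the point (a hedged sketch with toy models, not the actual Airflow code): SQLAlchemy evaluates the string form lazily against the declarative registry when mappers are configured, so neither a runtime import of DagRun nor a resolver helper like _resolve_dagrun_model() is needed, and foreign() is available inside the string as well.

```python
# Toy sketch: a string primaryjoin evaluated lazily, with foreign() marking
# the join column because it carries no ForeignKey (mirroring the PR).
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Job(Base):
    __tablename__ = "job"
    id = Column(Integer, primary_key=True)
    # DagRun is not defined yet; the string is resolved at mapper
    # configuration time, so no lambda or model-resolver helper is needed.
    dag_runs = relationship(
        "DagRun",
        primaryjoin="Job.id == foreign(DagRun.creating_job_id)",
        back_populates="creating_job",
    )

class DagRun(Base):
    __tablename__ = "dag_run"
    id = Column(Integer, primary_key=True)
    creating_job_id = Column(Integer)  # deliberately no ForeignKey
    creating_job = relationship(
        "Job",
        primaryjoin="Job.id == foreign(DagRun.creating_job_id)",
        back_populates="dag_runs",
    )

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
with Session(engine) as session:
    session.add(Job(id=1, dag_runs=[DagRun(id=10)]))
    session.commit()
    run = session.get(DagRun, 10)
    assert run.creating_job.id == 1
```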

Contributor:

I'm talking about the comment

Contributor Author:

Oh... my bad. It has already been moved to line 104:

dag_runs = relationship( # Only makes sense for SchedulerJob and BackfillJob instances.

Contributor Author:

The original comment stored information about two different relationships:

"TaskInstances which have been enqueued by this Job." -> task_instances_enqueued
"Only makes sense for SchedulerJob and BackfillJob" -> dag_runs

Contributor Author (@Taragolis, May 22, 2024):

Some proof:

DAG

from __future__ import annotations

from datetime import datetime, timezone

from airflow.decorators import task
from airflow.models.dag import DAG


START_DATE = datetime(2024, 2, 1, tzinfo=timezone.utc)


with DAG("pr_39430", schedule="@once", tags=["pr", "39430", "bi-directional"], start_date=START_DATE):
    @task
    def do_nothing(): ...

    do_nothing()

One run by the scheduler plus one manual run (triggering the DAG from the UI) result in the following tables:

Table `job`

| column | job 3 | job 4 | job 1 | job 2 |
|---|---|---|---|---|
| id | 3 | 4 | 1 | 2 |
| dag_id | pr_39430 | pr_39430 | null | null |
| state | success | success | running | running |
| job_type | LocalTaskJob | LocalTaskJob | TriggererJob | SchedulerJob |
| start_date | 2024-05-22 11:45:23.867096 +00:00 | 2024-05-22 11:45:29.782799 +00:00 | 2024-05-22 11:44:49.709191 +00:00 | 2024-05-22 11:44:49.813874 +00:00 |
| end_date | 2024-05-22 11:45:24.143576 +00:00 | 2024-05-22 11:45:30.040894 +00:00 | null | null |
| latest_heartbeat | 2024-05-22 11:45:23.853896 +00:00 | 2024-05-22 11:45:29.766046 +00:00 | 2024-05-22 11:45:35.885734 +00:00 | 2024-05-22 11:45:37.722701 +00:00 |
| executor_class | null | null | null | null |
| hostname | cd2591016c8c | cd2591016c8c | cd2591016c8c | cd2591016c8c |
| unixname | root | root | root | root |

Table `task_instance`

| column | TI 1 | TI 2 |
|---|---|---|
| task_id | do_nothing | do_nothing |
| dag_id | pr_39430 | pr_39430 |
| run_id | scheduled__2024-02-01T00:00:00+00:00 | manual__2024-05-22T11:45:28.453199+00:00 |
| map_index | -1 | -1 |
| start_date | 2024-05-22 11:45:23.921193 +00:00 | 2024-05-22 11:45:29.821869 +00:00 |
| end_date | 2024-05-22 11:45:24.085409 +00:00 | 2024-05-22 11:45:29.979149 +00:00 |
| duration | 0.164216 | 0.15728 |
| state | success | success |
| try_number | 1 | 1 |
| max_tries | 0 | 0 |
| hostname | cd2591016c8c | cd2591016c8c |
| unixname | root | root |
| job_id | 3 | 4 |
| pool | default_pool | default_pool |
| pool_slots | 1 | 1 |
| queue | default | default |
| priority_weight | 1 | 1 |
| operator | _PythonDecoratedOperator | _PythonDecoratedOperator |
| custom_operator_name | @task | @task |
| queued_dttm | 2024-05-22 11:45:23.394469 +00:00 | 2024-05-22 11:45:29.546020 +00:00 |
| queued_by_job_id | 2 | 2 |
| pid | 2375 | 2377 |
| executor | null | null |
| executor_config | 0x80057D942E | 0x80057D942E |
| updated_at | 2024-05-22 11:45:23.930163 +00:00 | 2024-05-22 11:45:29.831787 +00:00 |
| rendered_map_index | null | null |
| external_executor_id | null | null |
| trigger_id | null | null |
| trigger_timeout | null | null |
| next_method | null | null |
| next_kwargs | null | null |
| task_display_name | do_nothing | do_nothing |

Table `dag_run`

| column | run 1 | run 2 |
|---|---|---|
| id | 1 | 2 |
| dag_id | pr_39430 | pr_39430 |
| queued_at | 2024-05-22 11:45:23.351323 +00:00 | 2024-05-22 11:45:28.490357 +00:00 |
| execution_date | 2024-02-01 00:00:00.000000 +00:00 | 2024-05-22 11:45:28.453199 +00:00 |
| start_date | 2024-05-22 11:45:23.373351 +00:00 | 2024-05-22 11:45:29.516604 +00:00 |
| end_date | 2024-05-22 11:45:24.476332 +00:00 | 2024-05-22 11:45:30.609079 +00:00 |
| state | success | success |
| run_id | scheduled__2024-02-01T00:00:00+00:00 | manual__2024-05-22T11:45:28.453199+00:00 |
| creating_job_id | 2 | null |
| external_trigger | false | true |
| run_type | scheduled | manual |
| conf | 0x80057D942E | 0x80057D942E |
| data_interval_start | 2024-02-01 00:00:00.000000 +00:00 | 2024-05-22 11:45:28.453199 +00:00 |
| data_interval_end | 2024-02-01 00:00:00.000000 +00:00 | 2024-05-22 11:45:28.453199 +00:00 |
| last_scheduling_decision | 2024-05-22 11:45:24.474830 +00:00 | 2024-05-22 11:45:30.606623 +00:00 |
| dag_hash | cd6f55b5c736bf50d2f341663c0ad83f | cd6f55b5c736bf50d2f341663c0ad83f |
| log_template_id | 2 | 2 |
| updated_at | 2024-05-22 11:45:24.477225 +00:00 | 2024-05-22 11:45:30.609766 +00:00 |
| clear_number | 0 | 0 |

Relationships

- LocalTaskJob (id: 3) created TI (run_id: scheduled__2024-02-01T00:00:00+00:00)
- LocalTaskJob (id: 4) created TI (run_id: manual__2024-05-22T11:45:28.453199+00:00)
- SchedulerJob (id: 2) queued TI (run_id: scheduled__2024-02-01T00:00:00+00:00)
- SchedulerJob (id: 2) queued TI (run_id: manual__2024-05-22T11:45:28.453199+00:00), even though it is a manual DAG run
- DagRun (run_id: scheduled__2024-02-01T00:00:00+00:00) was created by SchedulerJob (id: 2)
- DagRun (run_id: manual__2024-05-22T11:45:28.453199+00:00) was not created by any Job

queued_by_job = relationship(
"Job",
back_populates="task_instances_enqueued",
primaryjoin="Job.id == foreign(TaskInstance.queued_by_job_id)",
Contributor:

I think we should have uselist=False since it was removed from backref at the other end?

Contributor Author:

> it was removed from backref at the other end?

TBH it's the other way round; the references are now set up on both sides.

> I think we should have uselist=False

Maybe uselist=False should be set on both sides if it is a one-to-one relationship.
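If the pair really were one-to-one, a sketch of what that could look like (toy models with illustrative names, not the PR's actual code); note that the many-to-one side holding the ForeignKey is already scalar by default, so uselist=False only changes behaviour on the other side:

```python
# Toy one-to-one sketch: uselist=False makes the "one" side scalar instead
# of a list; the ForeignKey side is scalar by default. Illustrative names.
from sqlalchemy import Column, ForeignKey, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Job(Base):
    __tablename__ = "job"
    id = Column(Integer, primary_key=True)
    # Without uselist=False this attribute would be a list of TaskInstance.
    task_instance = relationship(
        "TaskInstance", back_populates="queued_by_job", uselist=False
    )

class TaskInstance(Base):
    __tablename__ = "task_instance"
    id = Column(Integer, primary_key=True)
    queued_by_job_id = Column(Integer, ForeignKey("job.id"))
    # Many-to-one side: scalar by default, no uselist needed here.
    queued_by_job = relationship("Job", back_populates="task_instance")

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
with Session(engine) as session:
    ti = TaskInstance(queued_by_job=Job(id=7))
    session.add(ti)
    session.flush()
    # Scalar access in both directions:
    assert ti.queued_by_job.task_instance is ti
```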

airflow/providers_manager.py (conversation resolved)
Labels: area:Scheduler (Scheduler or dag parsing Issues), area:serialization
Projects: None yet
Development: Successfully merging this pull request may close these issues: None yet
3 participants