SparkSubmitOperator does not mark task as success after Spark job completes #39524

Open
1 of 2 tasks
dannyeuu opened this issue May 9, 2024 · 3 comments
Labels
area:providers, kind:bug, pending-response, provider:apache-spark, stale

Comments


dannyeuu commented May 9, 2024

Apache Airflow Provider(s)

apache-spark

Versions of Apache Airflow Providers

apache-airflow-providers-apache-hive==6.1.2
apache-airflow-providers-apache-spark==4.1.1
apache-airflow-providers-cncf-kubernetes==7.3.0
apache-airflow-providers-common-sql==1.6.0

Apache Airflow version

2.6.3

Operating System

Debian GNU/Linux 11 (bullseye)

Deployment

Official Apache Airflow Helm Chart

Deployment details

Dockerfile FROM apache/airflow:2.6.3-python3.10
Airflow Helm chart

What happened

We run lots of jobs every day, and every day one or two tasks hit this: Spark completes the job, but the Airflow task stays in the running state with no new logs until we manually change its state to success.
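For reference, this is roughly how these tasks are defined (the application path, connection id and conf values below are placeholders, not our real DAG code; the execution_timeout/retries at the end are just a stop-gap so a hung task eventually fails instead of running forever):

from datetime import timedelta
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

load_client_product = SparkSubmitOperator(
    task_id="load_client_product",                    # placeholder; real task ids differ
    application="/opt/jobs/load_client_product.py",   # hypothetical application path
    conn_id="spark_k8s",                              # hypothetical connection pointing at a k8s:// master
    name="load-client-product",
    conf={
        # Spark-on-Kubernetes settings; the driver pod in the logs below comes from this mode
        "spark.kubernetes.namespace": "spark",
        "spark.kubernetes.container.image": "xxxx.com/xxxxxx/pyspark:v2_patch13",
    },
    verbose=True,
    # Stop-gap: fail and retry a hung task instead of leaving it in "running" indefinitely
    execution_timeout=timedelta(hours=2),
    retries=1,
)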

For a task with this wrong behavior, the logs end like this:

[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:13 INFO LoggingPodStatusWatcherImpl: Application status for spark-7edea139decf4f90bd5d0fbac97f8869 (phase: Running)
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:13 INFO LoggingPodStatusWatcherImpl: State changed, new state:
[2024-05-09, 06:25:13 -03] {spark_submit.py:476} INFO - Identified spark driver pod: reverent-mcclintock-2cf0078f5caa813e-driver
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - pod name: reverent-mcclintock-2cf0078f5caa813e-driver
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - namespace: spark
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - labels: Client -> 3irmaos, Environment -> production, Name -> etl-ira-ciss---3irmaos, Product -> ira, Role -> process, Step -> refinement, airflow-attempt -> 1, airflow-map-index -> , airflow-spark-task-build-version -> spark-k8s-1.0.0, airflow_dag -> cluster_ira_ciss_daily_3irmaos, airflow_task -> load.load_client_product_3irmaos, airflow_task_uuid -> ff01d563-5ea8-555c-8444-413b1f29a5ee, spark-affinty-label -> c8cd4607d6b14f3a81bcce821d0054bb, spark-app-name -> reverent-mcclintock, spark-app-selector -> spark-7edea139decf4f90bd5d0fbac97f8869, spark-role -> driver, spark-version -> 3.4.2, spark_job_execution_date -> 2024-05-09, spark_job_name -> load.load_client_product_3irmaos-2024-05-09, spotinst.io/restrict-scale-down -> true
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - pod uid: 9ed68d10-c28e-4a4b-8136-e68b9be49b9f
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - creation time: 2024-05-09T09:23:51Z
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - service account name: spark
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - volumes: pod-template-volume, spark-local-dir-1, spark-conf-volume-driver, kube-api-access-69jld
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - node name: i-0097e957e30fe4dd5
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - start time: 2024-05-09T09:23:51Z
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - phase: Running
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container status:
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container name: spark-kubernetes-driver
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container image: xxxx.com/xxxxxx/pyspark:v2_patch13
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container state: terminated
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container started at: 2024-05-09T09:23:53Z
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - container finished at: 2024-05-09T09:25:13Z
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - exit code: 0
[2024-05-09, 06:25:13 -03] {spark_submit.py:492} INFO - termination reason: Completed
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: Application status for spark-7edea139decf4f90bd5d0fbac97f8869 (phase: Running)
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: State changed, new state:
[2024-05-09, 06:25:14 -03] {spark_submit.py:476} INFO - Identified spark driver pod: reverent-mcclintock-2cf0078f5caa813e-driver
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - pod name: reverent-mcclintock-2cf0078f5caa813e-driver
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - namespace: spark
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - labels: Client -> 3irmaos, Environment -> production, Name -> etl-ira-ciss---3irmaos, Product -> ira, Role -> process, Step -> refinement, airflow-attempt -> 1, airflow-map-index -> , airflow-spark-task-build-version -> spark-k8s-1.0.0, airflow_dag -> cluster_ira_ciss_daily_3irmaos, airflow_task -> load.load_client_product_3irmaos, airflow_task_uuid -> ff01d563-5ea8-555c-8444-413b1f29a5ee, spark-affinty-label -> c8cd4607d6b14f3a81bcce821d0054bb, spark-app-name -> reverent-mcclintock, spark-app-selector -> spark-7edea139decf4f90bd5d0fbac97f8869, spark-role -> driver, spark-version -> 3.4.2, spark_job_execution_date -> 2024-05-09, spark_job_name -> load.load_client_product_3irmaos-2024-05-09, spotinst.io/restrict-scale-down -> true
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - pod uid: 9ed68d10-c28e-4a4b-8136-e68b9be49b9f
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - creation time: 2024-05-09T09:23:51Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - service account name: spark
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - volumes: pod-template-volume, spark-local-dir-1, spark-conf-volume-driver, kube-api-access-69jld
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - node name: i-0097e957e30fe4dd5
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - start time: 2024-05-09T09:23:51Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - phase: Succeeded
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container status:
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container name: spark-kubernetes-driver
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container image: xxxx.com/xxxxxx/pyspark:v2_patch13
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container state: terminated
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container started at: 2024-05-09T09:23:53Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container finished at: 2024-05-09T09:25:13Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - exit code: 0
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - termination reason: Completed
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: Application status for spark-7edea139decf4f90bd5d0fbac97f8869 (phase: Succeeded)
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: Container final statuses:
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container name: spark-kubernetes-driver
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container image: xxxx.com/xxxxxx/pyspark:v2_patch13
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container state: terminated
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container started at: 2024-05-09T09:23:53Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - container finished at: 2024-05-09T09:25:13Z
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - exit code: 0
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - termination reason: Completed
[2024-05-09, 06:25:14 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:14 INFO LoggingPodStatusWatcherImpl: Application reverent-mcclintock with submission ID spark:reverent-mcclintock-2cf0078f5caa813e-driver finished

What you think should happen instead

Expected behavior: a normal task ends with taskinstance.py marking the task as SUCCESS:

[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - container state: terminated
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - container started at: 2024-05-09T09:24:04Z
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - container finished at: 2024-05-09T09:25:07Z
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - exit code: 0
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - termination reason: Completed
[2024-05-09, 06:25:08 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:08 INFO LoggingPodStatusWatcherImpl: Application elegant-rubin with submission ID spark:elegant-rubin-91cab58f5caaaab2-driver finished
[2024-05-09, 06:25:09 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:09 INFO ShutdownHookManager: Shutdown hook called
[2024-05-09, 06:25:09 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:09 INFO ShutdownHookManager: Deleting directory /tmp/spark-175bae5e-77d3-45b1-aa3c-cc88bdd8a462
[2024-05-09, 06:25:09 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:09 INFO ShutdownHookManager: Deleting directory /tmp/spark-70c6860c-1925-4858-be1f-23e9c6d5723a
[2024-05-09, 06:25:09 -03] {spark_submit.py:492} INFO - 24/05/09 09:25:09 INFO ShutdownHookManager: Deleting directory /tmp/spark-236d1121-2075-4b6e-bb54-2e3e66decd6a
[2024-05-09, 06:25:09 -03] {taskinstance.py:1345} INFO - Marking task as SUCCESS. dag_id=cluster_ira_ciss_daily_3irmaos, task_id=load.load_competitor_store_3irmaos, execution_date=20240508T070000, start_date=20240509T092347, end_date=20240509T092509
[2024-05-09, 06:25:09 -03] {local_task_job_runner.py:225} INFO - Task exited with return code 0
[2024-05-09, 06:25:09 -03] {taskinstance.py:2653} INFO - 0 downstream tasks scheduled from follow-on schedule check
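As far as I can tell, that SUCCESS line only appears once the spark-submit subprocess exits; the operator tails its output (the spark_submit.py:492 lines above) and blocks until the process returns, so if spark-submit never exits the task never leaves the running state. A simplified sketch of that pattern, not the actual provider code:

import subprocess

# Simplified illustration: launch spark-submit, relay each stdout line to the task log,
# and only return (so Airflow can mark the task SUCCESS) after the process exits.
proc = subprocess.Popen(
    ["spark-submit", "--master", "k8s://https://kubernetes.default.svc", "app.py"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)
for line in proc.stdout:
    print(line.rstrip())  # the provider forwards these lines via its task logger
returncode = proc.wait()
if returncode != 0:
    raise RuntimeError(f"spark-submit failed with exit code {returncode}")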

How to reproduce

It's totally random; it happens in different DAGs with different tasks.

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

dannyeuu added the area:providers, kind:bug and needs-triage labels May 9, 2024

boring-cyborg bot commented May 9, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

RNHTTR (Collaborator) commented May 9, 2024

Can you see if this is occurring on the latest version of the spark provider package (4.8.0)?

RNHTTR added the pending-response and provider:apache-spark labels and removed the needs-triage label May 9, 2024
github-actions bot commented May 24, 2024

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

github-actions bot added the stale label May 24, 2024