[Q] Offline currently running jobs shows as "Finished" in the UI, am I doing it wrong? #7376

ankile · 2024-04-13T20:16:32Z

Hi

I'm running model training on a compute cluster where the compute nodes do not have access to the internet. Therefore, while the jobs are running, I'm calling wandb sync at regular intervals from a node with internet access that shares the same file system, so that I can follow the training in the UI. When I do this, however, the jobs that are running are listed as "Finished" in the UI throughout the whole run, which makes downstream evaluation problematic as it's set to only run eval on finished jobs.
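The periodic-sync workflow described above might look roughly like the following sketch, run on the node with internet access. It assumes the `wandb` CLI is on that node's `PATH` and that the offline run directory is visible through the shared file system; the helper names `build_sync_command` and `sync_periodically` are hypothetical and not part of wandb.

```python
import subprocess
import time
from pathlib import Path


def build_sync_command(run_dir):
    """Return the CLI invocation that uploads one offline run directory."""
    return ["wandb", "sync", str(Path(run_dir))]


def sync_periodically(run_dir, interval_s, rounds,
                      runner=subprocess.run, sleep=time.sleep):
    """Call `wandb sync` on run_dir `rounds` times, waiting interval_s between calls.

    `runner` and `sleep` are injectable (an assumption of this sketch) so the
    loop can be exercised without network access or real waiting.
    """
    return_codes = []
    for i in range(rounds):
        result = runner(build_sync_command(run_dir), check=False)
        return_codes.append(result.returncode)
        # Skip the wait after the final round.
        if i < rounds - 1:
            sleep(interval_s)
    return return_codes
```

For example, `sync_periodically("wandb/offline-run-<timestamp>-<id>", 300, 12)` would attempt a sync every five minutes for about an hour; the directory name here is a placeholder for whichever offline run directory the training job writes.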

My question is: is this the expected behavior, or am I doing something wrong?

Best, Lars


anmolmann · 2024-04-19T21:56:42Z

Hi @ankile, this does seem like a bug on our end where a live run is being marked as finished upon being synced. We fixed this issue a few years ago, so this looks like a regression. Here's the condition in our code which handles this. We'll investigate further internally and identify the root cause.

I was able to reproduce this with the following script as well:

import time

import numpy as np
import pandas as pd
import wandb

run = wandb.init(project="<project_name>")
for _ in range(100):
    # Log a small random 3x3 table every 5 seconds so the run stays
    # active long enough to be synced while it is still in progress.
    df = pd.DataFrame(np.random.randint(0, 1001, size=(3, 3)),
                      columns=["a", "b", "c"])
    run.log({"table": wandb.Table(dataframe=df)})
    time.sleep(5)

run.finish()

On a side note, W&B recommends syncing an offline run only once it finishes (after run.finish() is called) to avoid running into any error states. The reason is that when an active run is synced, its config and some other files are not yet up to date at the point of syncing, so only partial data is uploaded. You should see a warning such as:

 WARNING .wandb file is incomplete (record checksum is invalid, data may be corrupt), be sure to sync this run again once it's finished

in your console when syncing an active run.
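One way to act on the recommendation above is for the training script to leave a marker on the shared file system right after `run.finish()` returns, so that the syncing node (or a downstream eval job) can tell a completed run from an active one. The sketch below is only an illustration of that idea: the marker file name `RUN_FINISHED` and both helper functions are hypothetical conventions of this example, not anything wandb itself writes or reads.

```python
from pathlib import Path

# Hypothetical marker file name; wandb does not create or check this file.
MARKER_NAME = "RUN_FINISHED"


def mark_finished(run_dir):
    """Create the marker in run_dir; call this after run.finish() returns."""
    marker = Path(run_dir) / MARKER_NAME
    marker.touch()
    return marker


def is_finished(run_dir):
    """Return True if the training script has already left its marker."""
    return (Path(run_dir) / MARKER_NAME).exists()
```

On the syncing side, a job could check `is_finished(run_dir)` before treating a synced run as complete, regardless of the state shown in the UI.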

ankile · 2024-05-08T21:29:48Z

Thank you so much for your response and this info!

Have you had a chance to look more into why this is happening and how to fix it @anmolmann? Alternatively, are there any escape hatches I could implement directly in the wandb code to fix it while we wait for a fix to be released?

kptkin added a:cli Area: Client c:sync Component: Synchronization labels Apr 16, 2024
