[Q] Offline currently running jobs shows as "Finished" in the UI, am I doing it wrong? #7376

Open
ankile opened this issue Apr 13, 2024 · 2 comments
Labels
a:cli Area: Client c:sync Component: Synchronization

Comments

ankile commented Apr 13, 2024

Hi

I'm training models on a compute cluster whose compute nodes have no internet access. To follow the training in the UI while the jobs run, I call wandb sync at regular intervals from a node that does have internet access and shares the same file system. When I do this, however, the running jobs are listed as "Finished" in the UI for the whole run. That makes downstream evaluation problematic, because it is set up to run eval only on finished jobs.

My question is: is this the expected behavior, or am I doing something wrong?

Best, Lars
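
For readers who want to reproduce the setup described above, here is a minimal sketch of the periodic sync loop as it might run on the node with internet access. It assumes the default layout in which offline runs live in `offline-run-*` directories under a shared `wandb/` directory, and that the `wandb` CLI is on the PATH. The helper names (`find_offline_runs`, `sync_once`, `sync_forever`) and the 300-second interval are hypothetical and not taken from the original post.

```python
import subprocess
import time
from pathlib import Path


def find_offline_runs(wandb_dir):
    """Return offline run directories under `wandb_dir`, sorted by name."""
    # Assumption: offline runs are stored as `offline-run-*` directories.
    return sorted(p for p in Path(wandb_dir).glob("offline-run-*") if p.is_dir())


def sync_once(wandb_dir, runner=subprocess.run):
    """Call `wandb sync <run_dir>` once for each offline run directory found."""
    synced = []
    for run_dir in find_offline_runs(wandb_dir):
        # check=False: a failed sync of one run should not stop the others.
        runner(["wandb", "sync", str(run_dir)], check=False)
        synced.append(run_dir.name)
    return synced


def sync_forever(wandb_dir, interval_s=300):
    """Re-sync every `interval_s` seconds until interrupted (e.g. Ctrl-C)."""
    while True:
        sync_once(wandb_dir)
        time.sleep(interval_s)
```

Calling `sync_forever("/path/to/shared/wandb")` would then re-sync all offline runs every five minutes. The `runner` parameter exists only so the loop can be exercised without the CLI installed.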

@kptkin kptkin added a:cli Area: Client c:sync Component: Synchronization labels Apr 16, 2024

anmolmann commented Apr 19, 2024

Hi @ankile, this does seem like a bug on our end: a live run is being marked as finished when it is synced. We fixed this issue a few years ago, so this looks like a regression. Here's the condition in our code which handles this. We'll investigate further internally and identify the root cause.

I was able to reproduce this with the following script as well:

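import time

import numpy as np
import pandas as pd
import wandb

# Assumption: run in offline mode (e.g. WANDB_MODE=offline) and call `wandb sync`
# on the run directory while this loop is still going.
run = wandb.init(project="<project_name>")
for _ in range(100):
    # Log a fresh 3x3 table of random integers in [0, 1000] every 5 seconds.
    df = pd.DataFrame(np.random.randint(0, 1001, size=(3, 3)), columns=["a", "b", "c"])
    run.log({"table": wandb.Table(dataframe=df)})
    time.sleep(5)

run.finish()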

On a side note, W&B recommends syncing an offline run only once it has finished (after run.finish() is called) to avoid running into error states. If you sync an active run, its config and some other files are not yet up to date at the time of syncing, so only partial data is uploaded. You should see a warning such as:

 WARNING .wandb file is incomplete (record checksum is invalid, data may be corrupt), be sure to sync this run again once it's finished

in your console when syncing an active run.
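
One way to act on this recommendation from a separate sync node is to treat a sync as final only when the training job has signalled that run.finish() returned and the sync output does not contain the warning above. The sketch below is an assumption-laden illustration, not a W&B mechanism: the marker file, its path, and the helper names `mark_finished` and `sync_is_final` are hypothetical, and since it is not stated here whether the warning goes to stdout or stderr, both are checked.

```python
import subprocess
from pathlib import Path

# Substring of the warning quoted above.
INCOMPLETE_WARNING = ".wandb file is incomplete"


def mark_finished(marker):
    """Training side: call right after run.finish() returns.

    `marker` is a hypothetical file on the shared file system; wandb does
    not create or read it.
    """
    Path(marker).touch()


def sync_is_final(run_dir, marker, runner=subprocess.run):
    """Sync `run_dir` and report whether the upload can be treated as final."""
    # Check the marker before syncing, so a run that finishes mid-sync
    # is not reported as final until the next pass.
    finished = Path(marker).exists()
    result = runner(
        ["wandb", "sync", str(run_dir)],
        capture_output=True,
        text=True,
        check=False,
    )
    output = (result.stdout or "") + (result.stderr or "")
    return finished and INCOMPLETE_WARNING not in output
```

A downstream evaluation step could then gate on `sync_is_final(...)` returning True rather than on the run state shown in the UI.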

ankile commented May 8, 2024

Thank you so much for your response and this info!

Have you had a chance to look more into why this is happening and how to fix it @anmolmann? Alternatively, are there any escape hatches I could implement directly in the wandb code to fix it while we wait for a fix to be released?
