[App]: hardware logging in multi-node setting #7470

Open
BramVanroy opened this issue Apr 24, 2024 · 6 comments
Labels
a:cli (Area: Client), c:system-metrics, ty:feature_request (type of the issue is a feature request)

Comments

@BramVanroy

Current Behavior

Currently, in the run overview, we can get an idea of the system hardware, specifically the GPU count and CPU count. However, as far as I can tell, this does not account for multi-node settings and only reports what the current node is equipped with. While I understand why that is the case, it may be confusing because the reported counts do not reflect the hardware the run actually uses.

Expected Behavior

Correct hardware information. To be honest I am not sure how feasible it is to collect this information without integration with distributed communication frameworks or something else custom.
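
To make the request concrete, here is a minimal sketch of what pool-wide aggregation could look like. It is not how wandb works today. The function names `local_hardware_report` and `aggregate_reports` are hypothetical, the GPU count is passed in by the caller rather than detected, and the step that brings every process's report to one place (for example through a distributed communication framework) is assumed to exist and is not shown. It also assumes hostnames are unique per node.

```python
import os
import socket


def local_hardware_report(gpu_count):
    """Describe the node this process runs on (hypothetical helper).

    gpu_count is supplied by the caller; detecting GPUs is out of scope here.
    """
    return {
        "hostname": socket.gethostname(),
        # os.cpu_count() may return None on some platforms, so fall back to 0.
        "cpu_count_logical": os.cpu_count() or 0,
        "gpu_count": gpu_count,
    }


def aggregate_reports(reports):
    """Merge per-process reports into pool-wide totals (hypothetical helper).

    Several processes usually share one node, so each hostname is counted
    only once. Assumes hostnames are unique across nodes.
    """
    per_node = {}
    for report in reports:
        per_node.setdefault(report["hostname"], report)
    nodes = per_node.values()
    return {
        "node_count": len(per_node),
        "cpu_count_logical": sum(r["cpu_count_logical"] for r in nodes),
        "gpu_count": sum(r["gpu_count"] for r in nodes),
    }
```

With one report per process gathered from every node, `aggregate_reports` would return totals for the whole pool instead of the counts of a single node.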

Steps To Reproduce

No response

Screenshots

No response

Environment

OS: Linux

Browsers: Edge

Additional Context

No response

@thanos-wandb
Contributor

Hi @BramVanroy, thank you for reporting this. May I ask for some more context: what is your current compute infrastructure, and which ML frameworks do you mostly use?

@kptkin added the c:system-metrics, ty:feature_request, and a:cli labels on Apr 27, 2024
@thanos-wandb
Contributor

Hi @BramVanroy, just following up on this. Could you provide some additional information on your current multi-node infrastructure so that we can include it in a feature request for our engineers? Thank you!

@BramVanroy
Author

Hi Thanos

My jobs range from 1 node with 1 GPU up to 10 nodes with 4 GPUs each. It seems to me that wandb does not log hardware correctly in multi-node settings.

@thanos-wandb
Contributor

Perfect, thank you @BramVanroy for the additional context. I was wondering what is reported in those runs: if you navigate to the Files view and open the wandb-metadata.json file, what do the "cpu_count" and "cpu_count_logical" entries show? Do they fail to reflect the correct hardware info when the run is multi-node?
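
If it is easier than browsing the Files view, the two entries can also be read from a downloaded copy of the file. This is only a sketch: the helper names are hypothetical, the path is whatever location you saved the file to, and the only keys it relies on are the "cpu_count" and "cpu_count_logical" entries mentioned above.

```python
import json
from pathlib import Path

CPU_KEYS = ("cpu_count", "cpu_count_logical")


def extract_cpu_counts(metadata):
    """Pick the two CPU entries out of an already parsed metadata dict.

    Missing keys come back as None instead of raising.
    """
    return {key: metadata.get(key) for key in CPU_KEYS}


def read_cpu_counts(metadata_path):
    """Load a local copy of wandb-metadata.json and return its CPU entries.

    metadata_path is an assumption: wherever the file was downloaded to.
    """
    metadata = json.loads(Path(metadata_path).read_text())
    return extract_cpu_counts(metadata)
```

Comparing these values with the per-node CPU count of the cluster shows whether the file describes one node or the whole pool.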

@BramVanroy
Author

Correct. It only reports the main node hardware configuration, but not the whole pool.

@thanos-wandb
Contributor

Great, thank you @BramVanroy for the clarification. I have logged this feature request with our engineers, and we will keep you updated here on any progress.

3 participants