Read {run_id}_progress from DHT manually throws exceptions #533
Comments
So I found the issue: additional validators have to be added to the DHT before it can parse LocalTrainingProgress.

from typing import Dict, Optional

from hivemind.dht.schema import (
    BytesWithPublicKey,
    RSASignatureValidator,
    SchemaValidator,
)
from pydantic import BaseModel, StrictBool, StrictFloat, confloat, conint


# Schema of the progress record that each peer publishes.
class LocalTrainingProgress(BaseModel):
    peer_id: bytes
    epoch: conint(ge=0, strict=True)
    samples_accumulated: conint(ge=0, strict=True)
    samples_per_second: confloat(ge=0.0, strict=True)
    time: StrictFloat
    client_mode: StrictBool


# Schema of the "{run_id}_progress" key: one optional record per signed subkey.
class TrainingProgressSchema(BaseModel):
    progress: Dict[BytesWithPublicKey, Optional[LocalTrainingProgress]]


run_id = (...get run_id)
dht = (...init dht)

# Register the schema and signature validators so the DHT can decode the progress records.
signature_validator = RSASignatureValidator(None)
local_public_key = signature_validator.local_public_key
dht.add_validators(
    [
        SchemaValidator(TrainingProgressSchema, prefix=f"{run_id}"),
        signature_validator,
    ]
)

metadata, expiration = dht.get(key=f"{run_id}_progress", return_future=False)

I'm planning to create a pull request to update the documentation with a full example to access the GlobalTrainingProgress. You're welcome to either keep the issue open for me to reference it or close it.
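For anyone landing here before that documentation exists, here is a rough sketch of how the looked-up `metadata` could be folded into a global summary. It does not use hivemind itself: `Entry`, `unpack_lookup` and `summarize_progress` are hypothetical names, and the sketch assumes that the lookup returns nothing when the key is missing and otherwise a dictionary whose entries carry a `.value` holding the LocalTrainingProgress fields as a plain dict (or None). Please check both assumptions against your hivemind version.

```python
from collections import namedtuple
from typing import Any, Dict, Optional, Tuple

# Hypothetical stand-in for the (value, expiration_time) pairs stored under each
# subkey, so that the sketch runs without hivemind installed.
Entry = namedtuple("Entry", ["value", "expiration_time"])


def unpack_lookup(result: Optional[Tuple[Any, float]]) -> Tuple[Any, Optional[float]]:
    """Unpack a lookup result, tolerating a missing key (assumed to come back as None)."""
    if result is None:
        return None, None
    value, expiration = result
    return value, expiration


def summarize_progress(metadata: Optional[Dict[bytes, Entry]]) -> Dict[str, Any]:
    """Aggregate per-peer progress records into one summary dictionary."""
    summary = {
        "num_peers": 0,
        "num_clients": 0,
        "epoch": 0,
        "samples_accumulated": 0,
        "samples_per_second": 0.0,
    }
    if not isinstance(metadata, dict):
        return summary
    # Skip subkeys whose record is empty.
    records = [entry.value for entry in metadata.values() if entry.value is not None]
    if not records:
        return summary
    summary["num_peers"] = len(records)
    summary["num_clients"] = sum(1 for record in records if record["client_mode"])
    summary["epoch"] = max(record["epoch"] for record in records)
    for record in records:
        # Only samples from peers that are on the latest epoch count towards it.
        if record["epoch"] == summary["epoch"]:
            summary["samples_accumulated"] += record["samples_accumulated"]
        summary["samples_per_second"] += record["samples_per_second"]
    return summary


# Made-up records for four peers; the last one has published nothing yet.
example_metadata = {
    b"peer-a": Entry({"epoch": 3, "samples_accumulated": 128, "samples_per_second": 10.0, "client_mode": False}, 0.0),
    b"peer-b": Entry({"epoch": 3, "samples_accumulated": 64, "samples_per_second": 5.0, "client_mode": True}, 0.0),
    b"peer-c": Entry({"epoch": 2, "samples_accumulated": 300, "samples_per_second": 2.5, "client_mode": False}, 0.0),
    b"peer-d": Entry(None, 0.0),
}
value, expiration = unpack_lookup((example_metadata, 0.0))
summary = summarize_progress(value)
print(summary)
```

Unpacking through a helper avoids the TypeError that a bare `metadata, expiration = ...` would raise if the lookup came back empty, which can happen before any peer has reported progress.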
Hi,
I can't seem to read the training information (like here) from the DHT that hivemind created.
I can connect to the DHT and run the following:
However, when training with hivemind, I can't seem to get the data: calling the get function twice in a row produces two different behaviors. Only the second call shows some actual training progress data, but it is incomplete (1 out of 4 peers) and not in a form that I can access the way the documentation describes.
It seems that there is some issue with the get call being run asynchronously and not being able to decode the returned LocalTrainingProgress.
How does the tutorial data get/store differ from what hivemind does with the LocalTrainingProgress?
First call to get
Second call to get