-
-
Notifications
You must be signed in to change notification settings - Fork 449
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Model Fit deadlocks when training in SageMaker with PipeModeDataset #186
Comments
@murphycrosby have you ever had the chance to try on a more recent version of TF? |
@murphycrosby Have you tried swapping out the TCN for another model to see if the issue persists? I'd be curious to see if it were actually related to the TCN training in particular or if it's an oddity related to PipeModeDataset. I would also like to see how you are setting up model training in SageMaker and what instance, options, etc. you are using. It sounds sort of similar to this issue here which is external to the TCN portion of the code. I don't know that it matters much for a normal TFRecordsDataset but order of operations might matter on the PipeModeDataset. You could try rearranging the parse/prefetch/batch ops to match the AWS example here. Lastly, you could swap to the new FastFile Mode with something like this example here. |
Model Fit deadlocks when training on SageMaker with PipeModeDataset. CPUUtilization, MemoryUtilization, DiskUtilization all drop to 0 on the training instance. The model works fine when you swap out PipeModeDataset with tf.data.TFRecordDataset. The for loop proves that the dataset batch has been downloaded.
Example TCN Model
tensorflow==2.3.1
Output
The text was updated successfully, but these errors were encountered: