
Training Job "Successful" despite failing due to 100% disk usage #204

Open
david-waterworth opened this issue Nov 8, 2023 · 0 comments

Describe the bug
I ran a training job as part of a SageMaker pipeline. The model wrote checkpoints by default, and after epoch 2 of 10 disk utilisation reached 100%.

Despite the abnormal exit from the training script, the training job (and hence the pipeline step) was reported as successful.

To reproduce
I used the HuggingFace estimator with the following parameters:

from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    # entry_point, source_dir, etc. omitted here
    instance_type="ml.g4dn.xlarge",
    role=role,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
)

The model is a sentence-transformers model (installed via requirements.txt). I inadvertently enabled checkpointing, hence the out-of-disk issue.
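For context, checkpointing in sentence-transformers is controlled by arguments to SentenceTransformer.fit(); a hypothetical reconstruction of the relevant call (train_dataloader, train_loss and the step count are placeholders, but the path matches the "Save model to /opt/ml/checkpoints/..." lines in the logs below):

# Hypothetical reconstruction of the training-script call; only the
# checkpoint arguments matter for this issue.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    checkpoint_path="/opt/ml/checkpoints",
    checkpoint_save_steps=500,       # library default
    checkpoint_save_total_limit=0,   # library default: 0 keeps every checkpoint
)

With checkpoint_save_total_limit left at 0 every checkpoint is retained, so a long run eventually fills the volume. That explains the disk usage; the bug reported here is the status handling, not the checkpointing itself.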

CloudWatch logs indicate abnormal termination:

2023-11-07T11:49:54.665000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 2023-11-07 11:49:54 - Save model to /opt/ml/checkpoints/242000
2023-11-07T11:49:54.665000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 Epoch:  20%|██        | 2/10 [13:42:59<39:37:08, 17828.58s/it]
2023-11-07T11:49:54.665000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 Iteration:  77%|███████▋  | 67255/87372 [3:48:53<1:08:41,  4.88it/s]
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 Iteration:  77%|███████▋  | 67255/87372 [3:48:54<1:08:28,  4.90it/s]
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 Epoch:  20%|██        | 2/10 [13:43:01<54:52:04, 24690.56s/it]
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 ╭───────────────────── Traceback (most recent call last) ──────────────────────╮
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ /opt/conda/lib/python3.10/site-packages/torch/serialization.py:441 in save   │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │                                                                              │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │    438 │                                                                     │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │    439 │   if _use_new_zipfile_serialization:                                │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │    440 │   │   with _open_zipfile_writer(f) as opened_zipfile:               │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ ❱  441 │   │   │   _save(obj, opened_zipfile, pickle_module, pickle_protocol │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │    442 │   │   │   return                                                    │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │    443 │   else:                                                             │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │    444 │   │   with _open_file_like(f, 'wb') as opened_file:                 │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │                                                                              │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ /opt/conda/lib/python3.10/site-packages/torch/serialization.py:668 in _save  │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │                                                                              │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │    665 │   │   │   storage = storage.cpu()                                   │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │    666 │   │   # Now that it is on the CPU we can directly copy it into the  │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │    667 │   │   num_bytes = storage.nbytes()                                  │

The training job charts show the disk utilisation hitting 100%:

[image: disk utilisation chart peaking at 100%]

But the training job status is "Completed"; the abnormal termination wasn't detected.

[image: training job status showing "Completed"]
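For anyone triaging this, the status can also be checked programmatically; describe_training_job should show the same thing (job name taken from the CloudWatch log prefix above):

import boto3

sm = boto3.client("sagemaker")
desc = sm.describe_training_job(
    TrainingJobName="pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu"
)
print(desc["TrainingJobStatus"])   # "Completed", despite the crash
print(desc.get("FailureReason"))   # None -- no failure recorded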

Expected behavior
SageMaker pipeline steps shouldn't report success unless the training script terminated normally.
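As a possible workaround in the meantime: SageMaker is supposed to mark a job Failed when the entry point exits non-zero, and it surfaces the contents of /opt/ml/output/failure as the job's FailureReason. Wrapping the training script gives the platform both signals explicitly (a minimal sketch, assuming a main() entry point):

import sys
import traceback

def main():
    ...  # existing training code (model.fit(...) etc.)

if __name__ == "__main__":
    try:
        main()
    except Exception:
        # SageMaker reports this file's contents as the job's FailureReason.
        with open("/opt/ml/output/failure", "w") as f:
            f.write(traceback.format_exc())
        sys.exit(1)  # non-zero exit should fail the job and the pipeline step

That shouldn't be necessary, though; an uncaught exception in the entry point already ought to fail the job, which is the point of this issue.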
