
Training Job "Successful" despite failing due to 100% disk usage #204

Open
david-waterworth opened this issue Nov 8, 2023 · 0 comments

Describe the bug
I ran a training job as part of a SageMaker pipeline. The model wrote checkpoints by default, and after epoch 2 of 10 disk utilisation reached 100%.

Despite the abnormal exit from the training script, the training job (and hence the pipeline step) was reported as successful.

To reproduce
I used the HuggingFace estimator with the following parameters:

from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    # entry_point, source_dir, etc. omitted here
    instance_type="ml.g4dn.xlarge",
    role=role,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
)

The model is a sentence-transformers model (installed via requirements.txt). I inadvertently enabled checkpointing, hence the out-of-disk issue.
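For context, checkpointing in sentence-transformers is controlled by arguments to SentenceTransformer.fit(); a hypothetical reconstruction of the relevant call (train_dataloader, train_loss and the step count are placeholders, but the path matches the "Save model to /opt/ml/checkpoints/..." lines in the logs below):

# Hypothetical reconstruction of the training-script call; only the
# checkpoint arguments matter for this issue.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    checkpoint_path="/opt/ml/checkpoints",
    checkpoint_save_steps=500,       # library default
    checkpoint_save_total_limit=0,   # library default: 0 keeps every checkpoint
)

With checkpoint_save_total_limit left at 0 every checkpoint is retained, so a long run eventually fills the volume. That explains the disk usage; the bug reported here is the status handling, not the checkpointing itself.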

CloudWatch logs indicate abnormal termination:

2023-11-07T11:49:54.665000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 2023-11-07 11:49:54 - Save model to /opt/ml/checkpoints/242000
2023-11-07T11:49:54.665000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 Epoch:  20%|██        | 2/10 [13:42:59<39:37:08, 17828.58s/it]
2023-11-07T11:49:54.665000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 Iteration:  77%|███████▋  | 67255/87372 [3:48:53<1:08:41,  4.88it/s]
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 Iteration:  77%|███████▋  | 67255/87372 [3:48:54<1:08:28,  4.90it/s]
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 Epoch:  20%|██        | 2/10 [13:43:01<54:52:04, 24690.56s/it]
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 ╭───────────────────── Traceback (most recent call last) ──────────────────────╮
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ /opt/conda/lib/python3.10/site-packages/torch/serialization.py:441 in save   │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │                                                                              │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │    438 │                                                                     │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │    439 │   if _use_new_zipfile_serialization:                                │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │    440 │   │   with _open_zipfile_writer(f) as opened_zipfile:               │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ ❱  441 │   │   │   _save(obj, opened_zipfile, pickle_module, pickle_protocol │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │    442 │   │   │   return                                                    │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │    443 │   else:                                                             │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │    444 │   │   with _open_file_like(f, 'wb') as opened_file:                 │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │                                                                              │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ /opt/conda/lib/python3.10/site-packages/torch/serialization.py:668 in _save  │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │                                                                              │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │    665 │   │   │   storage = storage.cpu()                                   │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │    666 │   │   # Now that it is on the CPU we can directly copy it into the  │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │    667 │   │   num_bytes = storage.nbytes()                                  │

The training job charts show the disk utilisation hitting 100%:

[image: disk utilisation chart peaking at 100%]

But the training job status is "Completed"; the abnormal termination wasn't detected.

[image: training job status showing "Completed"]
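For anyone triaging this, the status can also be checked programmatically; describe_training_job should show the same thing (job name taken from the CloudWatch log prefix above):

import boto3

sm = boto3.client("sagemaker")
desc = sm.describe_training_job(
    TrainingJobName="pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu"
)
print(desc["TrainingJobStatus"])   # "Completed", despite the crash
print(desc.get("FailureReason"))   # None -- no failure recorded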

Expected behavior
SageMaker pipeline steps shouldn't report success unless the training script terminated normally.
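As a possible workaround in the meantime: SageMaker is supposed to mark a job Failed when the entry point exits non-zero, and it surfaces the contents of /opt/ml/output/failure as the job's FailureReason. Wrapping the training script gives the platform both signals explicitly (a minimal sketch, assuming a main() entry point):

import sys
import traceback

def main():
    ...  # existing training code (model.fit(...) etc.)

if __name__ == "__main__":
    try:
        main()
    except Exception:
        # SageMaker reports this file's contents as the job's FailureReason.
        with open("/opt/ml/output/failure", "w") as f:
            f.write(traceback.format_exc())
        sys.exit(1)  # non-zero exit should fail the job and the pipeline step

That shouldn't be necessary, though; an uncaught exception in the entry point already ought to fail the job, which is the point of this issue.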
