Training the SwinTransformer model: loss does not decrease #3098
Comments
Partial training log: [2024/03/01 09:07:55] ppcls INFO: [Train][Epoch 65/100][Iter: 0/5005]lr(LinearWarmup): 0.00022245, CELoss: 3.61639, loss: 3.61639, batch_cost: 0.53438s, reader_cost: 0.00619, ips: 119.76526 samples/s, eta: 1 day, 2:44:44
training_script_args: ['-c', 'ppcls/configs/ImageNet/SwinTransformer/SwinTransformer_base_patch4_window7_224.yaml', '-o', 'Global.device=gpu', '-o', 'Global.use_dali=False', '-o', 'Global.epochs=100', '-o', 'Global.save_interval=10', '-o', 'Global.use_visualdl=True', '-o', 'DataLoader.Train.sampler.batch_size=64', '-o', 'Global.output_dir=./output/SwinTransformer_O1_mlu_16chips']
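To check whether the loss is really trending up or down across a whole run, it can help to pull the `lr` and `loss` fields out of every `[Train]` line instead of reading individual entries. The following is a minimal sketch; the regular expression is written against the single log line quoted above, so it is an assumption that other ppcls versions format their lines the same way, and `parse_train_line` is a hypothetical helper name, not part of PaddleClas.

```python
import re

# Pattern derived from the one log line quoted in this thread (assumption:
# other lines in the same log follow the same layout).
LOG_RE = re.compile(
    r"\[Train\]\[Epoch (?P<epoch>\d+)/\d+\]"
    r"\[Iter: (?P<iter>\d+)/\d+\]"
    r".*?lr\([^)]*\): (?P<lr>[\d.eE+-]+)"
    r".*?\bloss: (?P<loss>[\d.eE+-]+)"
)


def parse_train_line(line):
    """Return epoch, iter, lr and loss from one training log line, or None."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    return {
        "epoch": int(m["epoch"]),
        "iter": int(m["iter"]),
        "lr": float(m["lr"]),
        "loss": float(m["loss"]),
    }


# The log line from the comment above, copied verbatim.
SAMPLE = (
    "[2024/03/01 09:07:55] ppcls INFO: [Train][Epoch 65/100][Iter: 0/5005]"
    "lr(LinearWarmup): 0.00022245, CELoss: 3.61639, loss: 3.61639, "
    "batch_cost: 0.53438s, reader_cost: 0.00619, ips: 119.76526 samples/s, "
    "eta: 1 day, 2:44:44"
)

record = parse_train_line(SAMPLE)
```

Applying the helper to every line of the training log and plotting `loss` against the epoch and iteration would show the overall trend more reliably than a single entry.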
Which dataset are you training on?
Dataset: ILSVRC2012 (imagenet_train).
It looks like you halved batch_size, so learning_rate needs to be halved accordingly.
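The advice above follows the common linear scaling rule: when the global batch size changes by some factor, the learning rate is changed by the same factor. A minimal sketch, assuming the global batch size is the per-card batch size times the number of cards; `scaled_lr` is a hypothetical helper, and the numbers in the example are placeholders for illustration, not values read from the SwinTransformer config.

```python
def scaled_lr(base_lr, base_global_batch, per_card_batch, num_cards):
    """Linear scaling rule: keep lr proportional to the global batch size."""
    if min(base_global_batch, per_card_batch, num_cards) <= 0:
        raise ValueError("batch sizes and card count must be positive")
    global_batch = per_card_batch * num_cards
    return base_lr * global_batch / base_global_batch


# Placeholder values (assumptions): a reference lr tuned for 8 cards x 128,
# and a new run on 8 cards x 64, i.e. half the global batch size.
reference_lr = 1e-3
new_lr = scaled_lr(reference_lr, base_global_batch=8 * 128,
                   per_card_batch=64, num_cards=8)
# new_lr is half of reference_lr because the global batch size was halved.
```

The number of cards matters as much as the per-card batch size: running the same per-card batch size on more or fewer devices also changes the global batch size, so the rule should be applied to the product of the two.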
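Could you provide reference material for this model, such as training logs, test scripts, and a loss curve that decreases?
For that I'd suggest following the official examples. Alternatively, paste your configuration here and I'll take a look.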
python -m paddle.distributed.launch --ips="127.0.0.1" --devices="0,1,2,3,4,5,6,7," tools/train.py
So you have AMP O1 enabled, is that right?
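Using the code from https://github.com/PaddlePaddle/PaddleClas (develop branch) to train the SwinTransformer model, training does not converge and the loss curve keeps rising.
Hardware: 8x A100 80G and 4x A10
CUDA versions: 12.0 and 11.7
Paddle version: paddlepaddle-gpu 2.6.0.post117
Operating system: Ubuntu 18.04
Training script: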
python \
  -m paddle.distributed.launch \
  --ips="127.0.0.1" \
  --devices="0,1,2,3,4,5,6,7" \
  tools/train.py \
  -c ppcls/configs/ImageNet/SwinTransformer/SwinTransformer_base_patch4_window7_224.yaml
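Symptom: once training starts, the loss begins at about 2.6 and then rises steadily.
Troubleshooting so far: I added pretrain_mode to the training hyperparameters and tried both null and a specific model. The behavior was the same in both cases.
Please investigate this issue so that training runs normally and the loss decreases.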