
DBNet trains very slowly on Ascend 910B3 #683

Open
zx214 opened this issue Mar 19, 2024 · 6 comments
zx214 commented Mar 19, 2024

1. Environment: Ascend 910B3, MindSpore 2.2.0, CANN 7.0.RC1; training set of 1012 images, test set of 253 images, batch size 8.
2. Training log:
dbnet_log.log
3. YAML config file:
```yaml
system:
  mode: 0 # 0 for graph mode, 1 for pynative mode in MindSpore
  distribute: False
  amp_level: 'O0'
  seed: 42
  log_interval: 100
  val_while_train: True
  drop_overflow_update: False
  ckpt_max_keep: 0

common:
  num_epochs: &num_epochs 1
  batch_size: &batch_size 1
  num_workers: &num_workers 8
  ckpt_save_dir: &ckpt_save_dir './tmp_det'
  ckpt_load_path: &ckpt_load_path ./tmp_det/best.ckpt
  dataset_root: &dataset_root ./dataset_ic15/det
  resume: &resume True
  lr: &lr 0.007
  resume_epochs: 0
  data_shape: (736,1280)

model:
  type: det
  transform: null
  backbone:
    name: det_resnet50
    pretrained: True
  neck:
    name: DBFPN
    out_channels: 256
    bias: False
  head:
    name: DBHead
    k: 50
    bias: False
    adaptive: True
  resume: True
  pretrained: False

postprocess:
  name: DBPostprocess
  box_type: quad # whether to output a polygon or a box
  binary_thresh: 0.3 # binarization threshold
  box_thresh: 0.6 # box score threshold
  max_candidates: 1000
  expand_ratio: 1.5 # coefficient for expanding predictions

metric:
  name: DetMetric
  main_indicator: f-score

loss:
  name: DBLoss
  eps: 1.0e-6
  l1_scale: 10
  bce_scale: 5
  bce_replace: bceloss

scheduler:
  scheduler: polynomial_decay
  lr: *lr
  num_epochs: *num_epochs
  decay_rate: 0.9
  warmup_epochs: 3

optimizer:
  opt: SGD
  filter_bias_and_bn: false
  momentum: 0.9
  weight_decay: 1.0e-4

# only used for mixed precision training
loss_scaler:
  type: dynamic
  loss_scale: 512
  scale_factor: 2
  scale_window: 1000

train:
  ckpt_save_dir: *ckpt_save_dir
  dataset_sink_mode: False
  gradient_accumulation_steps: 1
  dataset:
    type: DetDataset
    dataset_root: *dataset_root
    data_dir: train/images
    label_file: train_gt.txt
    sample_ratio: 1.0
    transform_pipeline:
      - DecodeImage:
          img_mode: RGB
          to_float32: False
      - DetLabelEncode:
      - RandomColorAdjust:
          brightness: 0.1255 # 32.0 / 255
          saturation: 0.5
      - RandomHorizontalFlip:
          p: 0.5
      - RandomRotate:
          degrees: [ -10, 10 ]
          expand_canvas: False
          p: 1.0
      - RandomScale:
          scale_range: [ 0.5, 3.0 ]
          p: 1.0
      - RandomCropWithBBox:
          max_tries: 10
          min_crop_ratio: 0.1
          crop_size: [ 640, 640 ]
          p: 1.0
      - ValidatePolygons:
      - ShrinkBinaryMap:
          min_text_size: 8
          shrink_ratio: 0.4
      - BorderMap:
          shrink_ratio: 0.4
          thresh_min: 0.3
          thresh_max: 0.7
      - NormalizeImage:
          bgr_to_rgb: False
          is_hwc: True
          mean: imagenet
          std: imagenet
      - ToCHWImage:
    # the order of the dataloader list, matching the network input and the input labels for the loss function, and optional data for debug/visualize
    output_columns: [ 'image', 'binary_map', 'mask', 'thresh_map', 'thresh_mask' ] #'img_path']
    # output_columns: ['image'] # for debug op performance
    net_input_column_index: [0] # input indices for network forward func in output_columns
    label_column_index: [1, 2, 3, 4] # input indices marked as label

  loader:
    shuffle: True
    batch_size: *batch_size
    drop_remainder: True
    num_workers: *num_workers

eval:
  ckpt_load_path: tmp_det/best.ckpt
  dataset_sink_mode: False
  dataset:
    type: DetDataset
    dataset_root: *dataset_root
    data_dir: val/images
    label_file: test_gt.txt
    sample_ratio: 1.0
    transform_pipeline:
      - DecodeImage:
          img_mode: RGB
          to_float32: False
      - DetLabelEncode:
      - DetResize: # GridResize 32
          target_size: [ 736, 1280 ]
          keep_ratio: False
          limit_type: none
          divisor: 32
      - NormalizeImage:
          bgr_to_rgb: False
          is_hwc: True
          mean: imagenet
          std: imagenet
      - ToCHWImage:
    # the order of the dataloader list, matching the network input and the labels for evaluation
    output_columns: [ 'image', 'polys', 'ignore_tags', 'shape_list' ]
    net_input_column_index: [0] # input indices for network forward func in output_columns
    label_column_index: [1, 2] # input indices marked as label

  loader:
    shuffle: False
    batch_size: 1 # TODO: due to dynamic shape of polygons (num of boxes varies), BS has to be 1
    drop_remainder: False
    num_workers: 2
```

4. I have narrowed the slowness down to two parts: ① compiling the kernel_meta information takes over an hour; ② training itself is slow. How can these be resolved?

@zx214 zx214 closed this as completed Mar 19, 2024
@zx214 zx214 reopened this Mar 19, 2024
panshaowu (Collaborator) commented Mar 21, 2024

@zx214 Hello, and thanks for the feedback.

  1. Graph compilation taking more than an hour is not expected.
    We recommend upgrading MindSpore to r2.2.12 or r2.2.11 and installing the matching Ascend CANN package; compared with r2.2.0, r2.2.12 fixes several bugs and improves performance. On the same hardware, training DBNet ResNet-50 with MindSpore r2.2.11, compilation takes about 1-2 minutes for us.

  2. Slow training.
    First, note that num_epochs is 1 in your config: the first epoch is significantly slower than later ones because it includes compilation and loading.
    Second, set train.dataset_sink_mode to True; enabling data sinking noticeably improves training throughput.
    Third, if single-core CPU performance is weak, try setting train.dataset.use_minddata to True, or moderately increase train.loader.num_workers (too high is counterproductive), to speed up data preprocessing.
    Finally, try changing system.amp_level to O2 or O3, which moves part of the computation to fp16, a mode with better affinity for Ascend hardware; however, some networks then converge with more difficulty (or not at all), so hyperparameters and options such as train.clip_grad and train.clip_norm must be adjusted accordingly.
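Taken together, the suggestions above correspond roughly to the following config overrides (a sketch only: the key names follow the option paths mentioned above, and the example values for clip_norm and num_workers are assumptions to verify against your MindOCR version):

```yaml
system:
  amp_level: 'O2'            # try O2/O3; fp16 has better Ascend affinity
train:
  dataset_sink_mode: True    # enable data sinking
  clip_grad: True            # lowering precision may need gradient clipping
  clip_norm: 5.0             # assumed example value; tune as needed
  dataset:
    use_minddata: True       # offload preprocessing if CPU cores are weak
  loader:
    num_workers: 8           # raise moderately; too high backfires
```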

Below is a reference log from training DBNet ResNet-50 with MindSpore r2.2.11 on the same hardware:

(MindSpore) [ma-user mindocr-main]$DEVICE_ID=2 python tools/train.py -c configs/det/dbnet/db_r50_icdar15.yaml
[2024-03-21 16:42:06] mindocr.train INFO - Standalone training. Device id: 2, specified by environment variable 'DEVICE_ID'.
[2024-03-21 16:42:11] mindocr.data.builder INFO - Creating dataloader (training=True) for device 0. Number of data samples: 1000 per device (1000 total).
[2024-03-21 16:42:39] mindocr.models.backbones.mindcv_models.utils INFO - Finish loading model checkpoint from: /home/ma-user/.mindspore/models/resnet50-e0733ab8.ckpt
[2024-03-21 16:42:39] mindocr.utils.loss_scaler INFO - gradient_accumulation_steps > 1, scale_factor and scale_window are adjusted accordingly for dynamic loss scaler
[2024-03-21 16:42:39] mindocr.optim.param_grouping INFO - no parameter grouping is applied.
[2024-03-21 16:42:48] mindocr.train INFO -
========================================
Distribute: False
Model: det_resnet50-DBFPN-DBHead
Total number of parameters: 25463362
Total number of trainable parameters: 25409730
Data root: /home/ma-user/work/evaluation-sh/MindOCR/mindocr-main/dataset/
Optimizer: SGD
Weight decay: 0.0001
Batch size: 10
Num devices: 1
Gradient accumulation steps: 2
Global batch size: 10x1x2=20
LR: 0.007
Scheduler: polynomial_decay
Steps per epoch: 100
Num epochs: 1200
Clip gradient: False
EMA: False
AMP level: O0
Loss scaler: {'type': 'dynamic', 'loss_scale': 512, 'scale_factor': 2, 'scale_window': 1000}
Drop overflow update: False
========================================

Start training... (The first epoch takes longer, please wait...)

[2024-03-21 16:44:03] mindocr.utils.callbacks INFO - epoch: [1/1200], loss: 5.056544, epoch time: 74.750 s, per step time: 747.501 ms, fps per card: 13.38 img/s
[2024-03-21 16:44:03] mindocr.utils.callbacks INFO - => Best train_loss: 5.056544303894043, checkpoint saved.
[2024-03-21 16:44:22] mindocr.utils.callbacks INFO - epoch: [2/1200], loss: 4.930840, epoch time: 15.967 s, per step time: 159.667 ms, fps per card: 62.63 img/s
[2024-03-21 16:44:22] mindocr.utils.callbacks INFO - => Best train_loss: 4.930840015411377, checkpoint saved.
[2024-03-21 16:44:41] mindocr.utils.callbacks INFO - epoch: [3/1200], loss: 4.721155, epoch time: 15.963 s, per step time: 159.631 ms, fps per card: 62.64 img/s
[2024-03-21 16:44:42] mindocr.utils.callbacks INFO - => Best train_loss: 4.721155166625977, checkpoint saved.
[2024-03-21 16:45:00] mindocr.utils.callbacks INFO - epoch: [4/1200], loss: 4.198533, epoch time: 15.969 s, per step time: 159.689 ms, fps per card: 62.62 img/s
[2024-03-21 16:45:01] mindocr.utils.callbacks INFO - => Best train_loss: 4.198533058166504, checkpoint saved.
[2024-03-21 16:45:19] mindocr.utils.callbacks INFO - epoch: [5/1200], loss: 4.210309, epoch time: 15.959 s, per step time: 159.593 ms, fps per card: 62.66 img/s
[2024-03-21 16:45:38] mindocr.utils.callbacks INFO - epoch: [6/1200], loss: 3.534603, epoch time: 15.959 s, per step time: 159.588 ms, fps per card: 62.66 img/s
[2024-03-21 16:45:38] mindocr.utils.callbacks INFO - => Best train_loss: 3.5346031188964844, checkpoint saved.
[2024-03-21 16:45:57] mindocr.utils.callbacks INFO - epoch: [7/1200], loss: 3.849508, epoch time: 15.965 s, per step time: 159.652 ms, fps per card: 62.64 img/s
[2024-03-21 16:46:15] mindocr.utils.callbacks INFO - epoch: [8/1200], loss: 3.630200, epoch time: 15.958 s, per step time: 159.576 ms, fps per card: 62.67 img/s
[2024-03-21 16:46:34] mindocr.utils.callbacks INFO - epoch: [9/1200], loss: 3.526414, epoch time: 15.963 s, per step time: 159.625 ms, fps per card: 62.65 img/s
[2024-03-21 16:46:34] mindocr.utils.callbacks INFO - => Best train_loss: 3.526414394378662, checkpoint saved.
[2024-03-21 16:46:53] mindocr.utils.callbacks INFO - epoch: [10/1200], loss: 3.429994, epoch time: 15.957 s, per step time: 159.572 ms, fps per card: 62.67 img/s
[2024-03-21 16:46:53] mindocr.utils.callbacks INFO - => Best train_loss: 3.4299941062927246, checkpoint saved.
[2024-03-21 16:47:12] mindocr.utils.callbacks INFO - epoch: [11/1200], loss: 2.924218, epoch time: 15.966 s, per step time: 159.657 ms, fps per card: 62.63 img/s
[2024-03-21 16:47:12] mindocr.utils.callbacks INFO - => Best train_loss: 2.92421817779541, checkpoint saved.
[2024-03-21 16:47:31] mindocr.utils.callbacks INFO - epoch: [12/1200], loss: 3.281451, epoch time: 15.956 s, per step time: 159.565 ms, fps per card: 62.67 img/s
[2024-03-21 16:47:49] mindocr.utils.callbacks INFO - epoch: [13/1200], loss: 3.589340, epoch time: 15.958 s, per step time: 159.577 ms, fps per card: 62.67 img/s
[2024-03-21 16:48:07] mindocr.utils.callbacks INFO - epoch: [14/1200], loss: 3.343097, epoch time: 15.958 s, per step time: 159.584 ms, fps per card: 62.66 img/s
[2024-03-21 16:48:26] mindocr.utils.callbacks INFO - epoch: [15/1200], loss: 3.465981, epoch time: 15.956 s, per step time: 159.561 ms, fps per card: 62.67 img/s
[2024-03-21 16:48:44] mindocr.utils.callbacks INFO - epoch: [16/1200], loss: 3.380215, epoch time: 15.960 s, per step time: 159.596 ms, fps per card: 62.66 img/s
[2024-03-21 16:49:03] mindocr.utils.callbacks INFO - epoch: [17/1200], loss: 3.600936, epoch time: 15.959 s, per step time: 159.589 ms, fps per card: 62.66 img/s
[2024-03-21 16:49:21] mindocr.utils.callbacks INFO - epoch: [18/1200], loss: 3.199931, epoch time: 15.958 s, per step time: 159.581 ms, fps per card: 62.66 img/s
[2024-03-21 16:49:40] mindocr.utils.callbacks INFO - epoch: [19/1200], loss: 3.428349, epoch time: 15.957 s, per step time: 159.572 ms, fps per card: 62.67 img/s
[2024-03-21 16:49:59] mindocr.utils.callbacks INFO - epoch: [20/1200], loss: 3.225470, epoch time: 15.958 s, per step time: 159.576 ms, fps per card: 62.67 img/s

zx214 (Author) commented Mar 22, 2024

Thanks, I'll give it a try.

zx214 (Author) commented Mar 22, 2024

Hello. I installed MindSpore 2.2.12 and CANN 7.0.0.beta1 following the official website. With the unmodified source code, model compilation time is fine. For my task I added Summary recording and modified the data-loading part, leaving the model untouched, yet compilation still takes very long. What factors could slow down model compilation?

panshaowu (Collaborator) commented Mar 22, 2024

@zx214 Hello. From your description, it seems the long delay runs from launching the Python script to the start of training computation, which is not necessarily MindSpore's static-graph compilation time.

To isolate the code that actually takes long, consider:

  1. Extracting the newly added Summary and data-loading code from the project and unit-testing it to measure its cost;
  2. Instrumenting tools/main.py with Python's time module to time each stage.
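The second step can be sketched with a small stdlib-only helper (hypothetical names; where exactly you wrap calls inside tools/main.py depends on your modifications):

```python
import time


class StageTimer:
    """Records wall-clock time per named stage; wrap suspect sections
    of the startup path to see where the time actually goes."""

    def __init__(self):
        self.records = {}  # stage name -> elapsed seconds

    def measure(self, stage):
        # Used as: with timer.measure("build_dataset"): ...
        return _Stage(self, stage)


class _Stage:
    def __init__(self, timer, stage):
        self.timer, self.stage = timer, stage

    def __enter__(self):
        self.t0 = time.perf_counter()

    def __exit__(self, *exc):
        self.timer.records[self.stage] = time.perf_counter() - self.t0
        return False  # do not swallow exceptions from the timed block


timer = StageTimer()
with timer.measure("build_dataset"):
    time.sleep(0.01)   # placeholder for dataloader construction
with timer.measure("build_model"):
    time.sleep(0.01)   # placeholder for model construction / ckpt loading

# Print the slowest stages first.
for stage, secs in sorted(timer.records.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {secs:.3f} s")
```

Replacing the `time.sleep` placeholders with the real dataset, model, and Summary setup calls gives a per-stage breakdown, which makes it clear whether graph compilation or your new code dominates.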

From your description, likely causes include:

  1. The Summary writes too much redundant information, which takes long;
  2. Data loading is inefficient, e.g. the dataset's per-sample resolution is very high, or disk I/O is slow;
  3. In MindSpore Graph (static-graph) mode, the first run compiles the graph, which takes relatively long; later runs reuse the compilation cache. If you manually delete the cache before each run, compilation time goes up.

zx214 (Author) commented Mar 22, 2024

OK, I'll track that down. One more question: in 2-card training I get the log output below. What causes it, and how can I fix it?
(screenshot of error log attached)

panshaowu (Collaborator) commented:

@zx214 Hello. Judging from the error log you posted: since the MindSpore source code does not contain that error message, the problem is probably not introduced by MindSpore. We suggest checking your Ascend environment configuration, including whether the hardware is installed properly and whether the CANN package is installed correctly (for example, whether the te and hccl whl packages were installed after installing CANN).
