
Launching libai for multi-node, multi-GPU training on Kubernetes with PyTorchJob #367

Answered by xiezipeng-ML
strint asked this question in Q&A

Following 李jing's document:

Running libai distributed training with training-operator

1. Install libai in the mounted directory /home/data and download the BERT training data:

cd /home/data
git clone https://github.com/Oneflow-Inc/libai.git

cd libai
pip install -e .

mkdir -p data_test/bert_data
cd data_test/bert_data

wget https://oneflow-static.oss-cn-beijing.aliyuncs.com/ci-files/dataset/libai/bert_dataset/bert-base-chinese-vocab.txt
wget https://oneflow-static.oss-cn-beijing.aliyuncs.com/ci-files/dataset/libai/bert_dataset/loss_compara_content_sentence.bin
wget https://oneflow-static.oss-cn-beijing.aliyuncs.com/ci-files/dataset/libai/bert_dataset/loss_compara_content_sentence.idx
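
Before submitting a job it can help to confirm that the three downloads above actually landed in the mounted directory. The following is a minimal sketch, not part of libai; the helper name `missing_bert_files` is hypothetical, and the path assumes the directory layout created by the commands above.

```python
from pathlib import Path

# File names taken from the wget commands in step 1.
EXPECTED_FILES = (
    "bert-base-chinese-vocab.txt",
    "loss_compara_content_sentence.bin",
    "loss_compara_content_sentence.idx",
)


def missing_bert_files(data_dir):
    """Return the expected dataset files that are absent or empty in data_dir."""
    root = Path(data_dir)
    missing = []
    for name in EXPECTED_FILES:
        path = root / name
        # An interrupted download can leave a zero-byte file behind.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Path assumed from step 1: /home/data/libai/data_test/bert_data
    problems = missing_bert_files("/home/data/libai/data_test/bert_data")
    print("missing or empty:", problems if problems else "none")
```

Running the check inside the container rather than on the host also confirms that /home/data is mounted where the training job expects it.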

2. RDMA cannot be used here, so add the following at the end of configs/bert_large_pretrain.py:

train.rdma_enabled = False
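
The override goes at the end of the file because the config is ordinary Python executed top to bottom, so the last assignment to a field is the one that takes effect. A minimal sketch of that behaviour, using a `SimpleNamespace` as a stand-in for the real `train` object (the stand-in and its initial value of `True` are assumptions for illustration only):

```python
from types import SimpleNamespace

# Stand-in for the `train` config object; the real one comes from
# libai's config system. The initial value is an assumption.
train = SimpleNamespace(rdma_enabled=True)

# ... the rest of the config file would appear here ...

# Appended as the last line, as in step 2: the later assignment wins.
train.rdma_enabled = False

print(train.rdma_enabled)  # prints False
```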

3. Modify the YAML file:

apiVersion: 
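
The manifest is cut off at this point, so the author's actual YAML is not reproduced here. For orientation only, the sketch below builds the general shape of a Kubeflow training-operator PyTorchJob (one Master replica plus Workers) as a Python dict and prints it as JSON, which `kubectl apply -f` also accepts. Everything specific in it is an assumption rather than the original configuration: the job name, the image `example.com/libai:latest`, the GPU count, the `hostPath` volume for /home/data, and the launch command, which assumes `oneflow.distributed.launch` is driven by the `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE` and `RANK` environment variables that the operator injects.

```python
import json


def make_pytorchjob(name, image, command, workers=1, gpus_per_node=8):
    """Build a PyTorchJob manifest as a dict: one Master plus `workers` Workers."""

    def replica(count):
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [{
                        "name": "pytorch",  # container name the operator expects
                        "image": image,
                        "command": command,
                        "resources": {"limits": {"nvidia.com/gpu": gpus_per_node}},
                        # Assumed mount so every pod sees the libai checkout and data.
                        "volumeMounts": [{"name": "data", "mountPath": "/home/data"}],
                    }],
                    # hostPath is an assumption; a PVC or NFS volume works the same way.
                    "volumes": [{"name": "data", "hostPath": {"path": "/home/data"}}],
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical launch command; check flag values against your cluster.
    launch = (
        "cd /home/data/libai && python3 -m oneflow.distributed.launch "
        "--nproc_per_node 8 --nnodes $WORLD_SIZE --node_rank $RANK "
        "--master_addr $MASTER_ADDR --master_port $MASTER_PORT "
        "tools/train_net.py --config-file configs/bert_large_pretrain.py"
    )
    job = make_pytorchjob(
        name="libai-bert",                 # hypothetical job name
        image="example.com/libai:latest",  # hypothetical image
        command=["bash", "-c", launch],
        workers=1,
    )
    print(json.dumps(job, indent=2))
```

Redirecting the output to a file and applying it with `kubectl apply -f` would submit the job. Whether `WORLD_SIZE` counts nodes or processes can depend on the operator version, so that mapping in particular should be verified before use.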

Replies: 8 comments 2 replies

strint
Aug 24, 2022
Maintainer Author


strint
Aug 29, 2022
Maintainer Author

@strint
strint Aug 30, 2022
Maintainer Author

@xiezipeng-ML
Answer selected by strint