
TypeError: __init__() got an unexpected keyword argument 'flags' #463

Open
michelleqyhqyh opened this issue Feb 22, 2023 · 5 comments

@michelleqyhqyh

michelleqyhqyh commented Feb 22, 2023

I want to run the T5 example. This is my command, but it produces an error. How can I fix it?

export CUDA_VISIBLE_DEVICES=2,3
bash tools/train.sh tools/train_net.py projects/T5/configs/mt5_pretrain.py 2

bash: /home/qyh/anaconda3/envs/syl-env/lib/libtinfo.so.6: no version information available (required by bash)


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


W20230222 16:59:05.227458 4178164 rpc_client.cpp:190] LoadServer 127.0.0.1 Failed at 0 times error_code 14 error_message failed to connect to all addresses
Traceback (most recent call last):
  File "/resources/qyh/big-model/libai/tools/train_net.py", line 25, in <module>
    from libai.config import LazyConfig, default_argument_parser, try_get_key
  File "/resources/qyh/big-model/libai/libai/__init__.py", line 20, in <module>
    from libai import data
  File "/resources/qyh/big-model/libai/libai/data/__init__.py", line 17, in <module>
    from .build import (
  File "/resources/qyh/big-model/libai/libai/data/build.py", line 35, in <module>
    train_sampler=LazyCall(CyclicSampler)(shuffle=True),
  File "/resources/qyh/big-model/libai/libai/config/lazy.py", line 123, in __call__
    return DictConfig(content=kwargs, flags={"allow_objects": True})
TypeError: __init__() got an unexpected keyword argument 'flags'
Traceback (most recent call last):
  File "/resources/qyh/big-model/libai/tools/train_net.py", line 25, in <module>
    from libai.config import LazyConfig, default_argument_parser, try_get_key
  File "/resources/qyh/big-model/libai/libai/__init__.py", line 20, in <module>
    from libai import data
  File "/resources/qyh/big-model/libai/libai/data/__init__.py", line 17, in <module>
    from .build import (
  File "/resources/qyh/big-model/libai/libai/data/build.py", line 35, in <module>
    train_sampler=LazyCall(CyclicSampler)(shuffle=True),
  File "/resources/qyh/big-model/libai/libai/config/lazy.py", line 123, in __call__
    return DictConfig(content=kwargs, flags={"allow_objects": True})
TypeError: __init__() got an unexpected keyword argument 'flags'
Killing subprocess 4172867
Killing subprocess 4172868
Traceback (most recent call last):
  File "/home/qyh/anaconda3/envs/py39/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/qyh/anaconda3/envs/py39/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/qyh/anaconda3/envs/py39/lib/python3.9/site-packages/oneflow/distributed/launch.py", line 240, in <module>
    main()
  File "/home/qyh/anaconda3/envs/py39/lib/python3.9/site-packages/oneflow/distributed/launch.py", line 228, in main
    sigkill_handler(signal.SIGTERM, None)
  File "/home/qyh/anaconda3/envs/py39/lib/python3.9/site-packages/oneflow/distributed/launch.py", line 196, in sigkill_handler
    raise subprocess.CalledProcessError(
subprocess.CalledProcessError: Command '['/home/qyh/anaconda3/envs/py39/bin/python3', '-u', 'tools/train_net.py', '--config-file', 'projects/T5/configs/mt5_pretrain.py']' returned non-zero exit status 1.
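For context, this kind of `TypeError` simply means the called `__init__` does not declare the keyword being passed — here, `DictConfig` in the installed omegaconf has no `flags` parameter. A minimal stdlib-only sketch of the failure mode (`OldDictConfig` is a hypothetical stand-in for the old omegaconf class, not real omegaconf code):

```python
# Hypothetical stand-in for DictConfig in an older omegaconf release:
# its __init__ accepts only `content`, so passing `flags=` raises TypeError.
class OldDictConfig:
    def __init__(self, content):
        self.content = content

try:
    OldDictConfig(content={}, flags={"allow_objects": True})
except TypeError as e:
    # Message names the unexpected keyword, matching the traceback above.
    print(e)
```

Upgrading to an omegaconf release whose `DictConfig.__init__` does accept `flags` makes the call in `libai/config/lazy.py` succeed.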

@xiezipeng-ML
Contributor

Could you check your omegaconf version? It should be ==2.1.0.
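A quick stdlib-only way to check which omegaconf is installed before re-running. The 2.1.0 requirement follows the suggestion above; `parse_version` is a simplified helper for dotted versions, not a packaging-grade parser:

```python
from importlib.metadata import PackageNotFoundError, version

def parse_version(v: str) -> tuple:
    """Turn a dotted version like '2.1.0' into (2, 1, 0) for simple
    comparisons; non-numeric segments (e.g. pre-release tags) are ignored."""
    return tuple(int(p) for p in v.split(".")[:3] if p.isdigit())

def omegaconf_at_least(required: str = "2.1.0") -> bool:
    """True if omegaconf is installed and its version is >= `required`."""
    try:
        return parse_version(version("omegaconf")) >= parse_version(required)
    except PackageNotFoundError:
        return False
```

If this returns False, `pip install omegaconf==2.1.0` matches the version suggested above.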

@michelleqyhqyh
Author


I updated omegaconf to 2.1.0, but now there is another error:

bash tools/train.sh tools/train_net.py projects/T5/configs/mt5_pretrain.py 2
bash: /home/qyh/anaconda3/envs/syl-env/lib/libtinfo.so.6: no version information available (required by bash)


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[02/22 17:17:52 libai]: Rank of current process: 0. World size: 2
[02/22 17:17:52 libai]: Command line arguments: Namespace(config_file='projects/T5/configs/mt5_pretrain.py', resume=False, eval_only=False, fast_dev_run=False, opts=[])
[02/22 17:17:53 libai]: Contents of args.config_file=projects/T5/configs/mt5_pretrain.py:
from libai import evaluation
from libai.data.build import build_nlp_train_loader
from omegaconf import OmegaConf

from libai.config import LazyCall
from libai.evaluation import PPLEvaluator, evaluator
from libai.scheduler import WarmupExponentialLR

from configs.common.train import train
from configs.common.models.graph import graph

from projects.T5.configs.optim import optim
from projects.T5.configs.t5_model_config import cfg
from projects.T5.datasets.dataset import UnsuperviseT5Dataset, collate_fn
from projects.T5.models.t5_model import T5ForPreTraining

train_data_path = "projects/T5/data/training_data/part_0"
pretrained_model_path = "projects/T5"

micro_batch_size = 64
optim["lr"] = 1e-4

# dataloader

dataloader = OmegaConf.create()
dataloader.train = LazyCall(build_nlp_train_loader)(
    dataset=[
        LazyCall(UnsuperviseT5Dataset)(
            data_path=train_data_path,
        )
    ],
    collate_fn=collate_fn(
        vocab_size=12902,
        max_seq_length=512,
        noise_density=0.15,
        mean_noise_span_length=3,
        eos_token_id=12801,
        pad_token_id=0,
        decoder_start_token_id=12800,
    ),
)

model = LazyCall(T5ForPreTraining)(cfg=cfg)

# model config

model.cfg.vocab_size = 12902
model.cfg.hidden_size = 512
model.cfg.hidden_layers = 8
model.cfg.num_attention_heads = 6
model.cfg.head_size = 64
model.cfg.intermediate_size = 1024
model.cfg.hidden_dropout_prob = 0.0
model.cfg.attention_probs_dropout_prob = 0.0
model.cfg.embedding_dropout_prob = 0.0
model.cfg.layernorm_eps = 1e-6
model.cfg.model_type = "mt5"
model.cfg.pretrained_model_path = pretrained_model_path
train.update(
    dict(
        output_dir="projects/T5/output/mt5_output",
        train_micro_batch_size=micro_batch_size,
        train_epoch=1,
        train_iter=24000,
        log_period=10,
        amp=dict(enabled=False),
        warmup_ratio=1 / 24,
        # checkpointer=dict(period=10, max_to_keep=20),
        dist=dict(
            data_parallel_size=2,
            tensor_parallel_size=2,
            pipeline_parallel_size=1,
            pipeline_num_layers=2 * model.cfg.hidden_layers,
        ),
        scheduler=LazyCall(WarmupExponentialLR)(
            warmup_factor=0.001,
            gamma=1.0,
            warmup_method="linear",
            warmup_iter=0.0,
        ),
        evaluation=dict(
            evaluator=LazyCall(PPLEvaluator)(),
            enabled=True,
            eval_iter=1e5,
            eval_period=5000,
        ),
    )
)

train.zero_optimization.enabled = True
train.zero_optimization.stage = 2

[02/22 17:17:53 libai]: Full config saved to projects/T5/output/mt5_output/config.yaml
[02/22 17:17:53 lb.engine.default]: > compiling dataset index builder ...
make: Entering directory '/resources/qyh/big-model/libai/libai/data/data_utils'
g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/home/qyh/anaconda3/envs/py39/include/python3.9 -I/home/qyh/anaconda3/envs/py39/lib/python3.9/site-packages/pybind11/include helpers.cpp -o helpers.cpython-39-x86_64-linux-gnu.so
make: Leaving directory '/resources/qyh/big-model/libai/libai/data/data_utils'
[02/22 17:18:00 lb.engine.default]: >>> done with dataset index builder. Compilation time: 7.723 seconds
[02/22 17:18:00 lb.engine.default]: >>> done with compiling. Compilation time: 7.725 seconds
[02/22 17:18:02 lb.engine.default]: Prepare training, validating, testing set
libi40iw-i40iw_ucreate_cq: failed to initialize CQ, status -16
F20230222 17:18:08.801565 443 ibverbs_comm_network.cpp:140] Check failed: cq_ : No space left on device [28]
*** Check failure stack trace: ***
@ 0x7fdb2e63f9ca google::LogMessage::Fail()
@ 0x7fdb2e63fcb2 google::LogMessage::SendToLog()
@ 0x7fdb2e63f537 google::LogMessage::Flush()
@ 0x7fdb2e640b76 google::ErrnoLogMessage::~ErrnoLogMessage()
@ 0x7fdb23f13167 oneflow::IBVerbsCommNet::IBVerbsCommNet()
@ 0x7fdb27dc7c60 oneflow::InitRDMA()
@ 0x7fdc05bc6733 (unknown)
@ 0x7fdc05b61e0d (unknown)
@ 0x507457 cfunction_call
@ 0x4f068c _PyObject_MakeTpCall
@ 0x4ec9fb _PyEval_EvalFrameDefault
@ 0x4f7f33 function_code_fastcall
@ 0x4ec4d4 _PyEval_EvalFrameDefault
@ 0x4e689a _PyEval_EvalCode
@ 0x4efefe _PyObject_FastCallDictTstate
@ 0x502691 slot_tp_init
@ 0x4f06a3 _PyObject_MakeTpCall
@ 0x4ec2fa _PyEval_EvalFrameDefault
@ 0x4f7f33 function_code_fastcall
@ 0x4e7b6c _PyEval_EvalFrameDefault
@ 0x4e689a _PyEval_EvalCode
@ 0x4e6527 _PyEval_EvalCodeWithName
@ 0x4e64d9 PyEval_EvalCodeEx
@ 0x59329b PyEval_EvalCode
@ 0x5c0ad7 run_eval_code_obj
@ 0x5bcb00 run_mod
@ 0x4566f4 pyrun_file.cold
@ 0x5b67e2 PyRun_SimpleFileExFlags
@ 0x5b3d5e Py_RunMain
@ 0x587349 Py_BytesMain
@ 0x7fdc1d219083 __libc_start_main
@ 0x5871fe (unknown)
libi40iw-i40iw_ucreate_cq: failed to initialize CQ, status -16
F20230222 17:18:09.039657 444 ibverbs_comm_network.cpp:140] Check failed: cq_ : No space left on device [28]
*** Check failure stack trace: ***
@ 0x7fb5005949ca google::LogMessage::Fail()
@ 0x7fb500594cb2 google::LogMessage::SendToLog()
@ 0x7fb500594537 google::LogMessage::Flush()
@ 0x7fb500595b76 google::ErrnoLogMessage::~ErrnoLogMessage()
@ 0x7fb4f5e68167 oneflow::IBVerbsCommNet::IBVerbsCommNet()
@ 0x7fb4f9d1cc60 oneflow::InitRDMA()
@ 0x7fb5d7b1b733 (unknown)
@ 0x7fb5d7ab6e0d (unknown)
@ 0x507457 cfunction_call
@ 0x4f068c _PyObject_MakeTpCall
@ 0x4ec9fb _PyEval_EvalFrameDefault
@ 0x4f7f33 function_code_fastcall
@ 0x4ec4d4 _PyEval_EvalFrameDefault
@ 0x4e689a _PyEval_EvalCode
@ 0x4efefe _PyObject_FastCallDictTstate
@ 0x502691 slot_tp_init
@ 0x4f06a3 _PyObject_MakeTpCall
@ 0x4ec2fa _PyEval_EvalFrameDefault
@ 0x4f7f33 function_code_fastcall
@ 0x4e7b6c _PyEval_EvalFrameDefault
@ 0x4e689a _PyEval_EvalCode
@ 0x4e6527 _PyEval_EvalCodeWithName
@ 0x4e64d9 PyEval_EvalCodeEx
@ 0x59329b PyEval_EvalCode
@ 0x5c0ad7 run_eval_code_obj
@ 0x5bcb00 run_mod
@ 0x4566f4 pyrun_file.cold
@ 0x5b67e2 PyRun_SimpleFileExFlags
@ 0x5b3d5e Py_RunMain
@ 0x587349 Py_BytesMain
@ 0x7fb5ef16e083 __libc_start_main
@ 0x5871fe (unknown)
Killing subprocess 443
Killing subprocess 444
Traceback (most recent call last):
  File "/home/qyh/anaconda3/envs/py39/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/qyh/anaconda3/envs/py39/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/qyh/anaconda3/envs/py39/lib/python3.9/site-packages/oneflow/distributed/launch.py", line 240, in <module>
    main()
  File "/home/qyh/anaconda3/envs/py39/lib/python3.9/site-packages/oneflow/distributed/launch.py", line 228, in main
    sigkill_handler(signal.SIGTERM, None)
  File "/home/qyh/anaconda3/envs/py39/lib/python3.9/site-packages/oneflow/distributed/launch.py", line 196, in sigkill_handler
    raise subprocess.CalledProcessError(
subprocess.CalledProcessError: Command '['/home/qyh/anaconda3/envs/py39/bin/python3', '-u', 'tools/train_net.py', '--config-file', 'projects/T5/configs/mt5_pretrain.py']' died with <Signals.SIGABRT: 6>.
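As background on the line that originally failed (`train_sampler=LazyCall(CyclicSampler)(shuffle=True)`): `LazyCall` records a callable and its keyword arguments in the config instead of constructing the object immediately, so the object can be built later during setup. A rough stdlib-only sketch of the idea — not libai's actual implementation, which stores the node in an omegaconf `DictConfig` created with `allow_objects=True` (hence the version sensitivity):

```python
class LazyCall:
    """Wrap a target callable; calling the wrapper captures kwargs into a
    plain dict node instead of instantiating the target right away."""
    def __init__(self, target):
        self.target = target

    def __call__(self, **kwargs):
        return {"_target_": self.target, **kwargs}

def instantiate(node):
    """Build the object a lazy node describes, recursing into nested nodes."""
    if isinstance(node, dict) and "_target_" in node:
        kwargs = {k: instantiate(v) for k, v in node.items() if k != "_target_"}
        return node["_target_"](**kwargs)
    return node

# Usage: defer construction of a sampler-like object until training setup.
class CyclicSampler:
    def __init__(self, shuffle=False):
        self.shuffle = shuffle

node = LazyCall(CyclicSampler)(shuffle=True)   # just a config node
sampler = instantiate(node)                    # actual object built here
```

The benefit is that the whole training setup stays a plain, serializable config until `instantiate` is called.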

@xiezipeng-ML
Contributor

You could try installing the latest oneflow: python3 -m pip install --pre oneflow -f https://staging.oneflow.info/branch/master/cu116
Then install libai with pip install -e . and see if that helps.

@michelleqyhqyh
Author

  1. I finished training T5 on a single GPU, but doesn't T5 have 11 billion parameters? How could a single V100 load such a large model?
  2. After training, what code should I use to evaluate the results?

@xiezipeng-ML
Contributor

  1. I finished training T5 on a single GPU, but doesn't T5 have 11 billion parameters? How could a single V100 load such a large model?

Please check what model-size configuration you actually used for training.

  1. After training, what code should I use to evaluate the results?

What libai provides, like other libraries such as Megatron, is the model pretraining task, so to evaluate you can run the pretraining metrics on a test set. If you want a fully trained T5 — that is, one whose weights can be used for inference tasks in libai — you still need to finetune the pretrained model on several downstream tasks and then evaluate it there.
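On the evaluation point: the config above uses a `PPLEvaluator`, and perplexity for a language model is just the exponential of the mean per-token negative log-likelihood, so it can be computed from the test-set loss alone. A minimal sketch (not libai's evaluator, which handles batching and distributed reduction):

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns every token probability 1/4 has per-token NLL ln(4),
# so its perplexity is 4 (up to floating-point rounding).
nlls = [math.log(4)] * 100
print(perplexity(nlls))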
