
[BUG] LoRA training: Missing key in state_dict #193

Open
didadida-r opened this issue May 14, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@didadida-r

Feel free to ask any kind of questions in the issues page, but please use English since other users may find your questions valuable.

Describe the bug
Hi,

I followed the fine-tuning doc and added the LoRA parameter, but training fails with a missing key error in the state_dict. Thanks!

If you want to use LoRA, please add the following parameter: lora@model.lora_config=r_8_alpha_16

To Reproduce
Steps to reproduce the behavior:

python fish_speech/train.py \
    --config-name text2semantic_ntes_finetune_44k_ar2 \
    model@model.model=dual_ar_2_codebook_medium \
    lora@model.lora_config=r_8_alpha_16
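A quick way to confirm whether the resume checkpoint already contains LoRA weights (a sketch only; it assumes the checkpoint path shown in the log below):

import torch

# Path taken from the training log below; adjust if your results dir differs.
ckpt_path = "results/text2semantic_finetune_44k_ar2/checkpoints/step_000001000.ckpt"
ckpt = torch.load(ckpt_path, map_location="cpu")

state_dict = ckpt["state_dict"]
lora_keys = [k for k in state_dict if "lora_" in k]
print(f"total keys: {len(state_dict)}, keys containing 'lora_': {len(lora_keys)}")
# 0 LoRA keys means the checkpoint was saved without LoRA modules,
# which would explain the strict load_state_dict failure below.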

Expected behavior
LoRA fine-tuning starts and training resumes from the checkpoint without errors.

Screenshots / log

[2024-05-13 16:57:07,956][__main__][INFO] - [rank: 0] Instantiating datamodule <fish_speech.datasets.text.TextDataModule>
[2024-05-13 16:57:09,639][datasets][INFO] - PyTorch version 2.2.0 available.
[2024-05-13 16:57:10,409][__main__][INFO] - [rank: 0] Instantiating model <fish_speech.models.text2semantic.TextToSemantic>
[2024-05-13 16:57:16,370][__main__][INFO] - [rank: 0] Instantiating callbacks...
[2024-05-13 16:57:16,371][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating callback <lightning.pytorch.callbacks.ModelCheckpoint>
[2024-05-13 16:57:16,377][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating callback <lightning.pytorch.callbacks.ModelSummary>
[2024-05-13 16:57:16,377][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating callback <lightning.pytorch.callbacks.LearningRateMonitor>
[2024-05-13 16:57:16,378][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating callback <fish_speech.callbacks.GradNormMonitor>
[2024-05-13 16:57:16,389][__main__][INFO] - [rank: 0] Instantiating loggers...
[2024-05-13 16:57:16,390][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating logger <lightning.pytorch.loggers.tensorboard.TensorBoardLogger>
[2024-05-13 16:57:16,395][__main__][INFO] - [rank: 0] Instantiating trainer <lightning.pytorch.trainer.Trainer>
Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[2024-05-13 16:57:18,794][__main__][INFO] - [rank: 0] Logging hyperparameters!
[2024-05-13 16:57:19,240][__main__][INFO] - [rank: 0] Starting training!
[2024-05-13 16:57:19,245][__main__][INFO] - [rank: 0] Resuming from checkpoint: results/text2semantic_finetune_44k_ar2/checkpoints/step_000001000.ckpt
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:653: Checkpoint directory /home/test/code/TTS/llm_tts/egs/gpt/_tuned/results/text2semantic_finetune_44k_ar2/checkpoints exists and is not empty.
Restoring states from the checkpoint path at results/text2semantic_finetune_44k_ar2/checkpoints/step_000001000.ckpt
[2024-05-13 16:57:34,357][fish_speech.utils.utils][ERROR] - [rank: 0] 
Traceback (most recent call last):
  File "/home/test/code/TTS/llm_tts/fish_speech/utils/utils.py", line 66, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
  File "/home/test/code/TTS/llm_tts/egs/gpt/_tuned/fish_speech/train.py", line 108, in train
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=ckpt_path)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 956, in _run
    self._checkpoint_connector._restore_modules_and_callbacks(ckpt_path)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 398, in _restore_modules_and_callbacks
    self.restore_model()
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 275, in restore_model
    self.trainer.strategy.load_model_state_dict(
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 372, in load_model_state_dict
    self.lightning_module.load_state_dict(checkpoint["state_dict"], strict=strict)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TextToSemantic:
	Missing key(s) in state_dict: "model.embeddings.lora_A", "model.embeddings.lora_B", "model.layers.0.attention.wqkv.lora_A", "model.layers.0.attention.wqkv.lora_B", "model.layers.0.attention.wo.lora_A", "model.layers.0.attention.wo.lora_B", "model.layers.0.feed_forward.w1.lora_A", "model.layers.0.feed_forward.w1.lora_B", "model.layers.0.feed_forward.w3.lora_A", "model.layers.0.feed_forward.w3.lora_B", "model.layers.0.feed_forward.w2.lora_A", "model.layers.0.feed_forward.w2.lora_B", "model.layers.1.attention.wqkv.lora_A", "model.layers.1.attention.wqkv.lora_B", "model.layers.1.attention.wo.lora_A", "model.layers.1.attention.wo.lora_B", "model.layers.1.feed_forward.w1.lora_A", "model.layers.1.feed_forward.w1.lora_B", "model.layers.1.feed_forward.w3.lora_A", "model.layers.1.feed_forward.w3.lora_B", "model.layers.1.feed_forward.w2.lora_A", "model.layers.1.feed_forward.w2.lora_B", "model.layers.2.attention.wqkv.lora_A", "model.layers.2.attention.wqkv.lora_B", "model.layers.2.attention.wo.lora_A", "model.layers.2.attention.wo.lora_B", "model.layers.2.feed_forward.w1.lora_A", "model.layers.2.feed_forward.w1.lora_B", "model.layers.2.feed_forward.w3.lora_A", "model.layers.2.feed_forward.w3.lora_B", "model.layers.2.feed_forward.w2.lora_A", "model.layers.2.feed_forward.w2.lora_B", "model.layers.3.attention.wqkv.lora_A", "model.layers.3.attention.wqkv.lora_B", "model.layers.3.attention.wo.lora_A", "model.layers.3.attention.wo.lora_B", "model.layers.3.feed_forward.w1.lora_A", "model.layers.3.feed_forward.w1.lora_B", "model.layers.3.feed_forward.w3.lora_A", "model.layers.3.feed_forward.w3.lora_B", "model.layers.3.feed_forward.w2.lora_A", "model.layers.3.feed_forward.w2.lora_B", "model.layers.4.attention.wqkv.lora_A", "model.layers.4.attention.wqkv.lora_B", "model.layers.4.attention.wo.lora_A", "model.layers.4.attention.wo.lora_B", "model.layers.4.feed_forward.w1.lora_A", "model.layers.4.feed_forward.w1.lora_B", "model.layers.4.feed_forward.w3.lora_A", "model.layers.4.feed_forward.w3.lora_B", "model.layers.4.feed_forward.w2.lora_A", "model.layers.4.feed_forward.w2.lora_B", "model.layers.5.attention.wqkv.lora_A", "model.layers.5.attention.wqkv.lora_B", "model.layers.5.attention.wo.lora_A", "model.layers.5.attention.wo.lora_B", "model.layers.5.feed_forward.w1.lora_A", "model.layers.5.feed_forward.w1.lora_B", "model.layers.5.feed_forward.w3.lora_A", "model.layers.5.feed_forward.w3.lora_B", "model.layers.5.feed_forward.w2.lora_A", "model.layers.5.feed_forward.w2.lora_B", "model.layers.6.attention.wqkv.lora_A", "model.layers.6.attention.wqkv.lora_B", "model.layers.6.attention.wo.lora_A", "model.layers.6.attention.wo.lora_B", "model.layers.6.feed_forward.w1.lora_A", "model.layers.6.feed_forward.w1.lora_B", "model.layers.6.feed_forward.w3.lora_A", "model.layers.6.feed_forward.w3.lora_B", "model.layers.6.feed_forward.w2.lora_A", "model.layers.6.feed_forward.w2.lora_B", "model.layers.7.attention.wqkv.lora_A", "model.layers.7.attention.wqkv.lora_B", "model.layers.7.attention.wo.lora_A", "model.layers.7.attention.wo.lora_B", "model.layers.7.feed_forward.w1.lora_A", "model.layers.7.feed_forward.w1.lora_B", "model.layers.7.feed_forward.w3.lora_A", "model.layers.7.feed_forward.w3.lora_B", "model.layers.7.feed_forward.w2.lora_A", "model.layers.7.feed_forward.w2.lora_B", "model.layers.8.attention.wqkv.lora_A", "model.layers.8.attention.wqkv.lora_B", "model.layers.8.attention.wo.lora_A", "model.layers.8.attention.wo.lora_B", "model.layers.8.feed_forward.w1.lora_A", "model.layers.8.feed_forward.w1.lora_B", 
"model.layers.8.feed_forward.w3.lora_A", "model.layers.8.feed_forward.w3.lora_B", "model.layers.8.feed_forward.w2.lora_A", "model.layers.8.feed_forward.w2.lora_B", "model.layers.9.attention.wqkv.lora_A", "model.layers.9.attention.wqkv.lora_B", "model.layers.9.attention.wo.lora_A", "model.layers.9.attention.wo.lora_B", "model.layers.9.feed_forward.w1.lora_A", "model.layers.9.feed_forward.w1.lora_B", "model.layers.9.feed_forward.w3.lora_A", "model.layers.9.feed_forward.w3.lora_B", "model.layers.9.feed_forward.w2.lora_A", "model.layers.9.feed_forward.w2.lora_B", "model.layers.10.attention.wqkv.lora_A", "model.layers.10.attention.wqkv.lora_B", "model.layers.10.attention.wo.lora_A", "model.layers.10.attention.wo.lora_B", "model.layers.10.feed_forward.w1.lora_A", "model.layers.10.feed_forward.w1.lora_B", "model.layers.10.feed_forward.w3.lora_A", "model.layers.10.feed_forward.w3.lora_B", "model.layers.10.feed_forward.w2.lora_A", "model.layers.10.feed_forward.w2.lora_B", "model.layers.11.attention.wqkv.lora_A", "model.layers.11.attention.wqkv.lora_B", "model.layers.11.attention.wo.lora_A", "model.layers.11.attention.wo.lora_B", "model.layers.11.feed_forward.w1.lora_A", "model.layers.11.feed_forward.w1.lora_B", "model.layers.11.feed_forward.w3.lora_A", "model.layers.11.feed_forward.w3.lora_B", "model.layers.11.feed_forward.w2.lora_A", "model.layers.11.feed_forward.w2.lora_B", "model.layers.12.attention.wqkv.lora_A", "model.layers.12.attention.wqkv.lora_B", "model.layers.12.attention.wo.lora_A", "model.layers.12.attention.wo.lora_B", "model.layers.12.feed_forward.w1.lora_A", "model.layers.12.feed_forward.w1.lora_B", "model.layers.12.feed_forward.w3.lora_A", "model.layers.12.feed_forward.w3.lora_B", "model.layers.12.feed_forward.w2.lora_A", "model.layers.12.feed_forward.w2.lora_B", "model.layers.13.attention.wqkv.lora_A", "model.layers.13.attention.wqkv.lora_B", "model.layers.13.attention.wo.lora_A", "model.layers.13.attention.wo.lora_B", "model.layers.13.feed_forward.w1.lora_A", "model.layers.13.feed_forward.w1.lora_B", "model.layers.13.feed_forward.w3.lora_A", "model.layers.13.feed_forward.w3.lora_B", "model.layers.13.feed_forward.w2.lora_A", "model.layers.13.feed_forward.w2.lora_B", "model.layers.14.attention.wqkv.lora_A", "model.layers.14.attention.wqkv.lora_B", "model.layers.14.attention.wo.lora_A", "model.layers.14.attention.wo.lora_B", "model.layers.14.feed_forward.w1.lora_A", "model.layers.14.feed_forward.w1.lora_B", "model.layers.14.feed_forward.w3.lora_A", "model.layers.14.feed_forward.w3.lora_B", "model.layers.14.feed_forward.w2.lora_A", "model.layers.14.feed_forward.w2.lora_B", "model.layers.15.attention.wqkv.lora_A", "model.layers.15.attention.wqkv.lora_B", "model.layers.15.attention.wo.lora_A", "model.layers.15.attention.wo.lora_B", "model.layers.15.feed_forward.w1.lora_A", "model.layers.15.feed_forward.w1.lora_B", "model.layers.15.feed_forward.w3.lora_A", "model.layers.15.feed_forward.w3.lora_B", "model.layers.15.feed_forward.w2.lora_A", "model.layers.15.feed_forward.w2.lora_B", "model.layers.16.attention.wqkv.lora_A", "model.layers.16.attention.wqkv.lora_B", "model.layers.16.attention.wo.lora_A", "model.layers.16.attention.wo.lora_B", "model.layers.16.feed_forward.w1.lora_A", "model.layers.16.feed_forward.w1.lora_B", "model.layers.16.feed_forward.w3.lora_A", "model.layers.16.feed_forward.w3.lora_B", "model.layers.16.feed_forward.w2.lora_A", "model.layers.16.feed_forward.w2.lora_B", "model.layers.17.attention.wqkv.lora_A", "model.layers.17.attention.wqkv.lora_B", 
"model.layers.17.attention.wo.lora_A", "model.layers.17.attention.wo.lora_B", "model.layers.17.feed_forward.w1.lora_A", "model.layers.17.feed_forward.w1.lora_B", "model.layers.17.feed_forward.w3.lora_A", "model.layers.17.feed_forward.w3.lora_B", "model.layers.17.feed_forward.w2.lora_A", "model.layers.17.feed_forward.w2.lora_B", "model.layers.18.attention.wqkv.lora_A", "model.layers.18.attention.wqkv.lora_B", "model.layers.18.attention.wo.lora_A", "model.layers.18.attention.wo.lora_B", "model.layers.18.feed_forward.w1.lora_A", "model.layers.18.feed_forward.w1.lora_B", "model.layers.18.feed_forward.w3.lora_A", "model.layers.18.feed_forward.w3.lora_B", "model.layers.18.feed_forward.w2.lora_A", "model.layers.18.feed_forward.w2.lora_B", "model.layers.19.attention.wqkv.lora_A", "model.layers.19.attention.wqkv.lora_B", "model.layers.19.attention.wo.lora_A", "model.layers.19.attention.wo.lora_B", "model.layers.19.feed_forward.w1.lora_A", "model.layers.19.feed_forward.w1.lora_B", "model.layers.19.feed_forward.w3.lora_A", "model.layers.19.feed_forward.w3.lora_B", "model.layers.19.feed_forward.w2.lora_A", "model.layers.19.feed_forward.w2.lora_B", "model.layers.20.attention.wqkv.lora_A", "model.layers.20.attention.wqkv.lora_B", "model.layers.20.attention.wo.lora_A", "model.layers.20.attention.wo.lora_B", "model.layers.20.feed_forward.w1.lora_A", "model.layers.20.feed_forward.w1.lora_B", "model.layers.20.feed_forward.w3.lora_A", "model.layers.20.feed_forward.w3.lora_B", "model.layers.20.feed_forward.w2.lora_A", "model.layers.20.feed_forward.w2.lora_B", "model.layers.21.attention.wqkv.lora_A", "model.layers.21.attention.wqkv.lora_B", "model.layers.21.attention.wo.lora_A", "model.layers.21.attention.wo.lora_B", "model.layers.21.feed_forward.w1.lora_A", "model.layers.21.feed_forward.w1.lora_B", "model.layers.21.feed_forward.w3.lora_A", "model.layers.21.feed_forward.w3.lora_B", "model.layers.21.feed_forward.w2.lora_A", "model.layers.21.feed_forward.w2.lora_B", "model.layers.22.attention.wqkv.lora_A", "model.layers.22.attention.wqkv.lora_B", "model.layers.22.attention.wo.lora_A", "model.layers.22.attention.wo.lora_B", "model.layers.22.feed_forward.w1.lora_A", "model.layers.22.feed_forward.w1.lora_B", "model.layers.22.feed_forward.w3.lora_A", "model.layers.22.feed_forward.w3.lora_B", "model.layers.22.feed_forward.w2.lora_A", "model.layers.22.feed_forward.w2.lora_B", "model.layers.23.attention.wqkv.lora_A", "model.layers.23.attention.wqkv.lora_B", "model.layers.23.attention.wo.lora_A", "model.layers.23.attention.wo.lora_B", "model.layers.23.feed_forward.w1.lora_A", "model.layers.23.feed_forward.w1.lora_B", "model.layers.23.feed_forward.w3.lora_A", "model.layers.23.feed_forward.w3.lora_B", "model.layers.23.feed_forward.w2.lora_A", "model.layers.23.feed_forward.w2.lora_B", "model.output.lora_A", "model.output.lora_B", "model.fast_embeddings.lora_A", "model.fast_embeddings.lora_B", "model.fast_layers.0.attention.wqkv.lora_A", "model.fast_layers.0.attention.wqkv.lora_B", "model.fast_layers.0.attention.wo.lora_A", "model.fast_layers.0.attention.wo.lora_B", "model.fast_layers.0.feed_forward.w1.lora_A", "model.fast_layers.0.feed_forward.w1.lora_B", "model.fast_layers.0.feed_forward.w3.lora_A", "model.fast_layers.0.feed_forward.w3.lora_B", "model.fast_layers.0.feed_forward.w2.lora_A", "model.fast_layers.0.feed_forward.w2.lora_B", "model.fast_layers.1.attention.wqkv.lora_A", "model.fast_layers.1.attention.wqkv.lora_B", "model.fast_layers.1.attention.wo.lora_A", "model.fast_layers.1.attention.wo.lora_B", 
"model.fast_layers.1.feed_forward.w1.lora_A", "model.fast_layers.1.feed_forward.w1.lora_B", "model.fast_layers.1.feed_forward.w3.lora_A", "model.fast_layers.1.feed_forward.w3.lora_B", "model.fast_layers.1.feed_forward.w2.lora_A", "model.fast_layers.1.feed_forward.w2.lora_B", "model.fast_layers.2.attention.wqkv.lora_A", "model.fast_layers.2.attention.wqkv.lora_B", "model.fast_layers.2.attention.wo.lora_A", "model.fast_layers.2.attention.wo.lora_B", "model.fast_layers.2.feed_forward.w1.lora_A", "model.fast_layers.2.feed_forward.w1.lora_B", "model.fast_layers.2.feed_forward.w3.lora_A", "model.fast_layers.2.feed_forward.w3.lora_B", "model.fast_layers.2.feed_forward.w2.lora_A", "model.fast_layers.2.feed_forward.w2.lora_B", "model.fast_layers.3.attention.wqkv.lora_A", "model.fast_layers.3.attention.wqkv.lora_B", "model.fast_layers.3.attention.wo.lora_A", "model.fast_layers.3.attention.wo.lora_B", "model.fast_layers.3.feed_forward.w1.lora_A", "model.fast_layers.3.feed_forward.w1.lora_B", "model.fast_layers.3.feed_forward.w3.lora_A", "model.fast_layers.3.feed_forward.w3.lora_B", "model.fast_layers.3.feed_forward.w2.lora_A", "model.fast_layers.3.feed_forward.w2.lora_B", "model.fast_layers.4.attention.wqkv.lora_A", "model.fast_layers.4.attention.wqkv.lora_B", "model.fast_layers.4.attention.wo.lora_A", "model.fast_layers.4.attention.wo.lora_B", "model.fast_layers.4.feed_forward.w1.lora_A", "model.fast_layers.4.feed_forward.w1.lora_B", "model.fast_layers.4.feed_forward.w3.lora_A", "model.fast_layers.4.feed_forward.w3.lora_B", "model.fast_layers.4.feed_forward.w2.lora_A", "model.fast_layers.4.feed_forward.w2.lora_B", "model.fast_layers.5.attention.wqkv.lora_A", "model.fast_layers.5.attention.wqkv.lora_B", "model.fast_layers.5.attention.wo.lora_A", "model.fast_layers.5.attention.wo.lora_B", "model.fast_layers.5.feed_forward.w1.lora_A", "model.fast_layers.5.feed_forward.w1.lora_B", "model.fast_layers.5.feed_forward.w3.lora_A", "model.fast_layers.5.feed_forward.w3.lora_B", "model.fast_layers.5.feed_forward.w2.lora_A", "model.fast_layers.5.feed_forward.w2.lora_B", "model.fast_output.lora_A", "model.fast_output.lora_B". 
[2024-05-13 16:57:34,423][fish_speech.utils.utils][INFO] - [rank: 0] Output dir: results/text2semantic_finetune_44k_ar2
Error executing job with overrides: ['model@model.model=dual_ar_2_codebook_medium', 'lora@model.lora_config=r_8_alpha_16']
Traceback (most recent call last):
  File "/home/test/code/TTS/llm_tts/egs/gpt/_tuned/fish_speech/train.py", line 135, in main
    train(cfg)
  File "/home/test/code/TTS/llm_tts/fish_speech/utils/utils.py", line 77, in wrap
    raise ex
  File "/home/test/code/TTS/llm_tts/fish_speech/utils/utils.py", line 66, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
  File "/home/test/code/TTS/llm_tts/egs/gpt/_tuned/fish_speech/train.py", line 108, in train
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=ckpt_path)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 956, in _run
    self._checkpoint_connector._restore_modules_and_callbacks(ckpt_path)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 398, in _restore_modules_and_callbacks
    self.restore_model()
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 275, in restore_model
    self.trainer.strategy.load_model_state_dict(
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 372, in load_model_state_dict
    self.lightning_module.load_state_dict(checkpoint["state_dict"], strict=strict)
  File "/home/test/python_env/anaconda3/envs/llm_fisher/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TextToSemantic:
	Missing key(s) in state_dict: "model.embeddings.lora_A", "model.embeddings.lora_B", "model.layers.0.attention.wqkv.lora_A", "model.layers.0.attention.wqkv.lora_B", "model.layers.0.attention.wo.lora_A", "model.layers.0.attention.wo.lora_B", "model.layers.0.feed_forward.w1.lora_A", "model.layers.0.feed_forward.w1.lora_B", "model.layers.0.feed_forward.w3.lora_A", "model.layers.0.feed_forward.w3.lora_B", "model.layers.0.feed_forward.w2.lora_A", "model.layers.0.feed_forward.w2.lora_B", "model.layers.1.attention.wqkv.lora_A", "model.layers.1.attention.wqkv.lora_B", "model.layers.1.attention.wo.lora_A", "model.layers.1.attention.wo.lora_B", "model.layers.1.feed_forward.w1.lora_A", "model.layers.1.feed_forward.w1.lora_B", "model.layers.1.feed_forward.w3.lora_A", "model.layers.1.feed_forward.w3.lora_B", "model.layers.1.feed_forward.w2.lora_A", "model.layers.1.feed_forward.w2.lora_B", "model.layers.2.attention.wqkv.lora_A", "model.layers.2.attention.wqkv.lora_B", "model.layers.2.attention.wo.lora_A", "model.layers.2.attention.wo.lora_B", "model.layers.2.feed_forward.w1.lora_A", "model.layers.2.feed_forward.w1.lora_B", "model.layers.2.feed_forward.w3.lora_A", "model.layers.2.feed_forward.w3.lora_B", "model.layers.2.feed_forward.w2.lora_A", "model.layers.2.feed_forward.w2.lora_B", "model.layers.3.attention.wqkv.lora_A", "model.layers.3.attention.wqkv.lora_B", "model.layers.3.attention.wo.lora_A", "model.layers.3.attention.wo.lora_B", "model.layers.3.feed_forward.w1.lora_A", "model.layers.3.feed_forward.w1.lora_B", "model.layers.3.feed_forward.w3.lora_A", "model.layers.3.feed_forward.w3.lora_B", "model.layers.3.feed_forward.w2.lora_A", "model.layers.3.feed_forward.w2.lora_B", "model.layers.4.attention.wqkv.lora_A", "model.layers.4.attention.wqkv.lora_B", "model.layers.4.attention.wo.lora_A", "model.layers.4.attention.wo.lora_B", "model.layers.4.feed_forward.w1.lora_A", "model.layers.4.feed_forward.w1.lora_B", "model.layers.4.feed_forward.w3.lora_A", "model.layers.4.feed_forward.w3.lora_B", "model.layers.4.feed_forward.w2.lora_A", "model.layers.4.feed_forward.w2.lora_B", "model.layers.5.attention.wqkv.lora_A", "model.layers.5.attention.wqkv.lora_B", "model.layers.5.attention.wo.lora_A", "model.layers.5.attention.wo.lora_B", "model.layers.5.feed_forward.w1.lora_A", "model.layers.5.feed_forward.w1.lora_B", "model.layers.5.feed_forward.w3.lora_A", "model.layers.5.feed_forward.w3.lora_B", "model.layers.5.feed_forward.w2.lora_A", "model.layers.5.feed_forward.w2.lora_B", "model.layers.6.attention.wqkv.lora_A", "model.layers.6.attention.wqkv.lora_B", "model.layers.6.attention.wo.lora_A", "model.layers.6.attention.wo.lora_B", "model.layers.6.feed_forward.w1.lora_A", "model.layers.6.feed_forward.w1.lora_B", "model.layers.6.feed_forward.w3.lora_A", "model.layers.6.feed_forward.w3.lora_B", "model.layers.6.feed_forward.w2.lora_A", "model.layers.6.feed_forward.w2.lora_B", "model.layers.7.attention.wqkv.lora_A", "model.layers.7.attention.wqkv.lora_B", "model.layers.7.attention.wo.lora_A", "model.layers.7.attention.wo.lora_B", "model.layers.7.feed_forward.w1.lora_A", "model.layers.7.feed_forward.w1.lora_B", "model.layers.7.feed_forward.w3.lora_A", "model.layers.7.feed_forward.w3.lora_B", "model.layers.7.feed_forward.w2.lora_A", "model.layers.7.feed_forward.w2.lora_B", "model.layers.8.attention.wqkv.lora_A", "model.layers.8.attention.wqkv.lora_B", "model.layers.8.attention.wo.lora_A", "model.layers.8.attention.wo.lora_B", "model.layers.8.feed_forward.w1.lora_A", "model.layers.8.feed_forward.w1.lora_B", 
"model.layers.8.feed_forward.w3.lora_A", "model.layers.8.feed_forward.w3.lora_B", "model.layers.8.feed_forward.w2.lora_A", "model.layers.8.feed_forward.w2.lora_B", "model.layers.9.attention.wqkv.lora_A", "model.layers.9.attention.wqkv.lora_B", "model.layers.9.attention.wo.lora_A", "model.layers.9.attention.wo.lora_B", "model.layers.9.feed_forward.w1.lora_A", "model.layers.9.feed_forward.w1.lora_B", "model.layers.9.feed_forward.w3.lora_A", "model.layers.9.feed_forward.w3.lora_B", "model.layers.9.feed_forward.w2.lora_A", "model.layers.9.feed_forward.w2.lora_B", "model.layers.10.attention.wqkv.lora_A", "model.layers.10.attention.wqkv.lora_B", "model.layers.10.attention.wo.lora_A", "model.layers.10.attention.wo.lora_B", "model.layers.10.feed_forward.w1.lora_A", "model.layers.10.feed_forward.w1.lora_B", "model.layers.10.feed_forward.w3.lora_A", "model.layers.10.feed_forward.w3.lora_B", "model.layers.10.feed_forward.w2.lora_A", "model.layers.10.feed_forward.w2.lora_B", "model.layers.11.attention.wqkv.lora_A", "model.layers.11.attention.wqkv.lora_B", "model.layers.11.attention.wo.lora_A", "model.layers.11.attention.wo.lora_B", "model.layers.11.feed_forward.w1.lora_A", "model.layers.11.feed_forward.w1.lora_B", "model.layers.11.feed_forward.w3.lora_A", "model.layers.11.feed_forward.w3.lora_B", "model.layers.11.feed_forward.w2.lora_A", "model.layers.11.feed_forward.w2.lora_B", "model.layers.12.attention.wqkv.lora_A", "model.layers.12.attention.wqkv.lora_B", "model.layers.12.attention.wo.lora_A", "model.layers.12.attention.wo.lora_B", "model.layers.12.feed_forward.w1.lora_A", "model.layers.12.feed_forward.w1.lora_B", "model.layers.12.feed_forward.w3.lora_A", "model.layers.12.feed_forward.w3.lora_B", "model.layers.12.feed_forward.w2.lora_A", "model.layers.12.feed_forward.w2.lora_B", "model.layers.13.attention.wqkv.lora_A", "model.layers.13.attention.wqkv.lora_B", "model.layers.13.attention.wo.lora_A", "model.layers.13.attention.wo.lora_B", "model.layers.13.feed_forward.w1.lora_A", "model.layers.13.feed_forward.w1.lora_B", "model.layers.13.feed_forward.w3.lora_A", "model.layers.13.feed_forward.w3.lora_B", "model.layers.13.feed_forward.w2.lora_A", "model.layers.13.feed_forward.w2.lora_B", "model.layers.14.attention.wqkv.lora_A", "model.layers.14.attention.wqkv.lora_B", "model.layers.14.attention.wo.lora_A", "model.layers.14.attention.wo.lora_B", "model.layers.14.feed_forward.w1.lora_A", "model.layers.14.feed_forward.w1.lora_B", "model.layers.14.feed_forward.w3.lora_A", "model.layers.14.feed_forward.w3.lora_B", "model.layers.14.feed_forward.w2.lora_A", "model.layers.14.feed_forward.w2.lora_B", "model.layers.15.attention.wqkv.lora_A", "model.layers.15.attention.wqkv.lora_B", "model.layers.15.attention.wo.lora_A", "model.layers.15.attention.wo.lora_B", "model.layers.15.feed_forward.w1.lora_A", "model.layers.15.feed_forward.w1.lora_B", "model.layers.15.feed_forward.w3.lora_A", "model.layers.15.feed_forward.w3.lora_B", "model.layers.15.feed_forward.w2.lora_A", "model.layers.15.feed_forward.w2.lora_B", "model.layers.16.attention.wqkv.lora_A", "model.layers.16.attention.wqkv.lora_B", "model.layers.16.attention.wo.lora_A", "model.layers.16.attention.wo.lora_B", "model.layers.16.feed_forward.w1.lora_A", "model.layers.16.feed_forward.w1.lora_B", "model.layers.16.feed_forward.w3.lora_A", "model.layers.16.feed_forward.w3.lora_B", "model.layers.16.feed_forward.w2.lora_A", "model.layers.16.feed_forward.w2.lora_B", "model.layers.17.attention.wqkv.lora_A", "model.layers.17.attention.wqkv.lora_B", 
"model.layers.17.attention.wo.lora_A", "model.layers.17.attention.wo.lora_B", "model.layers.17.feed_forward.w1.lora_A", "model.layers.17.feed_forward.w1.lora_B", "model.layers.17.feed_forward.w3.lora_A", "model.layers.17.feed_forward.w3.lora_B", "model.layers.17.feed_forward.w2.lora_A", "model.layers.17.feed_forward.w2.lora_B", "model.layers.18.attention.wqkv.lora_A", "model.layers.18.attention.wqkv.lora_B", "model.layers.18.attention.wo.lora_A", "model.layers.18.attention.wo.lora_B", "model.layers.18.feed_forward.w1.lora_A", "model.layers.18.feed_forward.w1.lora_B", "model.layers.18.feed_forward.w3.lora_A", "model.layers.18.feed_forward.w3.lora_B", "model.layers.18.feed_forward.w2.lora_A", "model.layers.18.feed_forward.w2.lora_B", "model.layers.19.attention.wqkv.lora_A", "model.layers.19.attention.wqkv.lora_B", "model.layers.19.attention.wo.lora_A", "model.layers.19.attention.wo.lora_B", "model.layers.19.feed_forward.w1.lora_A", "model.layers.19.feed_forward.w1.lora_B", "model.layers.19.feed_forward.w3.lora_A", "model.layers.19.feed_forward.w3.lora_B", "model.layers.19.feed_forward.w2.lora_A", "model.layers.19.feed_forward.w2.lora_B", "model.layers.20.attention.wqkv.lora_A", "model.layers.20.attention.wqkv.lora_B", "model.layers.20.attention.wo.lora_A", "model.layers.20.attention.wo.lora_B", "model.layers.20.feed_forward.w1.lora_A", "model.layers.20.feed_forward.w1.lora_B", "model.layers.20.feed_forward.w3.lora_A", "model.layers.20.feed_forward.w3.lora_B", "model.layers.20.feed_forward.w2.lora_A", "model.layers.20.feed_forward.w2.lora_B", "model.layers.21.attention.wqkv.lora_A", "model.layers.21.attention.wqkv.lora_B", "model.layers.21.attention.wo.lora_A", "model.layers.21.attention.wo.lora_B", "model.layers.21.feed_forward.w1.lora_A", "model.layers.21.feed_forward.w1.lora_B", "model.layers.21.feed_forward.w3.lora_A", "model.layers.21.feed_forward.w3.lora_B", "model.layers.21.feed_forward.w2.lora_A", "model.layers.21.feed_forward.w2.lora_B", "model.layers.22.attention.wqkv.lora_A", "model.layers.22.attention.wqkv.lora_B", "model.layers.22.attention.wo.lora_A", "model.layers.22.attention.wo.lora_B", "model.layers.22.feed_forward.w1.lora_A", "model.layers.22.feed_forward.w1.lora_B", "model.layers.22.feed_forward.w3.lora_A", "model.layers.22.feed_forward.w3.lora_B", "model.layers.22.feed_forward.w2.lora_A", "model.layers.22.feed_forward.w2.lora_B", "model.layers.23.attention.wqkv.lora_A", "model.layers.23.attention.wqkv.lora_B", "model.layers.23.attention.wo.lora_A", "model.layers.23.attention.wo.lora_B", "model.layers.23.feed_forward.w1.lora_A", "model.layers.23.feed_forward.w1.lora_B", "model.layers.23.feed_forward.w3.lora_A", "model.layers.23.feed_forward.w3.lora_B", "model.layers.23.feed_forward.w2.lora_A", "model.layers.23.feed_forward.w2.lora_B", "model.output.lora_A", "model.output.lora_B", "model.fast_embeddings.lora_A", "model.fast_embeddings.lora_B", "model.fast_layers.0.attention.wqkv.lora_A", "model.fast_layers.0.attention.wqkv.lora_B", "model.fast_layers.0.attention.wo.lora_A", "model.fast_layers.0.attention.wo.lora_B", "model.fast_layers.0.feed_forward.w1.lora_A", "model.fast_layers.0.feed_forward.w1.lora_B", "model.fast_layers.0.feed_forward.w3.lora_A", "model.fast_layers.0.feed_forward.w3.lora_B", "model.fast_layers.0.feed_forward.w2.lora_A", "model.fast_layers.0.feed_forward.w2.lora_B", "model.fast_layers.1.attention.wqkv.lora_A", "model.fast_layers.1.attention.wqkv.lora_B", "model.fast_layers.1.attention.wo.lora_A", "model.fast_layers.1.attention.wo.lora_B", 
"model.fast_layers.1.feed_forward.w1.lora_A", "model.fast_layers.1.feed_forward.w1.lora_B", "model.fast_layers.1.feed_forward.w3.lora_A", "model.fast_layers.1.feed_forward.w3.lora_B", "model.fast_layers.1.feed_forward.w2.lora_A", "model.fast_layers.1.feed_forward.w2.lora_B", "model.fast_layers.2.attention.wqkv.lora_A", "model.fast_layers.2.attention.wqkv.lora_B", "model.fast_layers.2.attention.wo.lora_A", "model.fast_layers.2.attention.wo.lora_B", "model.fast_layers.2.feed_forward.w1.lora_A", "model.fast_layers.2.feed_forward.w1.lora_B", "model.fast_layers.2.feed_forward.w3.lora_A", "model.fast_layers.2.feed_forward.w3.lora_B", "model.fast_layers.2.feed_forward.w2.lora_A", "model.fast_layers.2.feed_forward.w2.lora_B", "model.fast_layers.3.attention.wqkv.lora_A", "model.fast_layers.3.attention.wqkv.lora_B", "model.fast_layers.3.attention.wo.lora_A", "model.fast_layers.3.attention.wo.lora_B", "model.fast_layers.3.feed_forward.w1.lora_A", "model.fast_layers.3.feed_forward.w1.lora_B", "model.fast_layers.3.feed_forward.w3.lora_A", "model.fast_layers.3.feed_forward.w3.lora_B", "model.fast_layers.3.feed_forward.w2.lora_A", "model.fast_layers.3.feed_forward.w2.lora_B", "model.fast_layers.4.attention.wqkv.lora_A", "model.fast_layers.4.attention.wqkv.lora_B", "model.fast_layers.4.attention.wo.lora_A", "model.fast_layers.4.attention.wo.lora_B", "model.fast_layers.4.feed_forward.w1.lora_A", "model.fast_layers.4.feed_forward.w1.lora_B", "model.fast_layers.4.feed_forward.w3.lora_A", "model.fast_layers.4.feed_forward.w3.lora_B", "model.fast_layers.4.feed_forward.w2.lora_A", "model.fast_layers.4.feed_forward.w2.lora_B", "model.fast_layers.5.attention.wqkv.lora_A", "model.fast_layers.5.attention.wqkv.lora_B", "model.fast_layers.5.attention.wo.lora_A", "model.fast_layers.5.attention.wo.lora_B", "model.fast_layers.5.feed_forward.w1.lora_A", "model.fast_layers.5.feed_forward.w1.lora_B", "model.fast_layers.5.feed_forward.w3.lora_A", "model.fast_layers.5.feed_forward.w3.lora_B", "model.fast_layers.5.feed_forward.w2.lora_A", "model.fast_layers.5.feed_forward.w2.lora_B", "model.fast_output.lora_A", "model.fast_output.lora_B". 

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Additional context
Add any other context about the problem here.

@didadida-r didadida-r added the bug Something isn't working label May 14, 2024
@leng-yue
Member

It seems you are resuming training, which is not currently supported for LoRA.
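The checkpoint at step_000001000.ckpt was saved without LoRA, so the new lora_A / lora_B parameters have no counterpart in its state_dict and the strict resume fails. A rough sketch of a possible workaround (not an officially supported path; it assumes `model` is the LoRA-enabled TextToSemantic LightningModule already instantiated from the config) is to load the old weights non-strictly and start a fresh run instead of passing ckpt_path to trainer.fit:

import torch

# Hypothetical workaround sketch: initialize the LoRA-wrapped model from a
# non-LoRA checkpoint instead of resuming through trainer.fit(ckpt_path=...).
# `model` here stands for the instantiated TextToSemantic module (assumption).
ckpt = torch.load(
    "results/text2semantic_finetune_44k_ar2/checkpoints/step_000001000.ckpt",
    map_location="cpu",
)

# strict=False loads the matching base weights; the missing lora_A / lora_B
# parameters simply keep their fresh initialization.
missing, unexpected = model.load_state_dict(ckpt["state_dict"], strict=False)
print(f"missing keys (expected to be the LoRA params): {len(missing)}")
print(f"unexpected keys: {len(unexpected)}")

# Then train without ckpt_path so Lightning does not attempt a strict resume:
# trainer.fit(model=model, datamodule=datamodule)

Note that this only reuses the model weights; the optimizer and scheduler state from the earlier run is discarded.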

@yixian3500

Seems you are resuming training, which is not currently supported for LoRA

I got the same error here. Could you point out the correct steps? So the right way isn't resuming training? Thanks!
