Basic iwslt config train failure due to directory errors #214

Closed
mcognetta opened this issue Aug 24, 2023 · 1 comment · May be fixed by #216

mcognetta (Contributor) commented Aug 24, 2023

Describe the bug
When trying to run one of the iwslt example config training runs, I repeatedly got errors because the BPE codes files were not copied to the model directory properly.

The fix is to ensure that the correct tokenizer data directory structure exists before training begins. The relevant code is

def copy_cfg_file(self, model_dir: Path) -> None:
"""Copy config file to model_dir"""
shutil.copy2(self.codes, (model_dir / self.codes.name).as_posix())

which I replaced with:

        if not os.path.exists(os.path.dirname(model_dir / self.codes.name)):
            os.makedirs(os.path.dirname(model_dir / self.codes.name))
        shutil.copy2(self.codes.name, (model_dir / self.codes.name))

to fix my specific case. It probably needs to be added elsewhere, though (at least to the other tokenizer classes). Note: I removed the as_posix() call for another reason during testing, but that is not relevant to this bug.
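
For reference, a minimal sketch of the same directory-creation idea as a standalone helper (copy_with_parents is a hypothetical name, not part of joeynmt) that the other tokenizer classes could reuse:

from pathlib import Path
import shutil

def copy_with_parents(src: Path, model_dir: Path, rel_name: str) -> None:
    """Copy src to model_dir / rel_name, creating missing parent directories first."""
    dst = model_dir / rel_name
    dst.parent.mkdir(parents=True, exist_ok=True)  # no-op if the directories already exist
    shutil.copy2(src, dst)

# e.g. copy_with_parents(Path("test/data/iwslt14/bpe.6000.de"),
#                        Path("models/transformer_iwslt14_deen_bpe"),
#                        "test/data/iwslt14/bpe.6000.de")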

I was able to reproduce this bug as well as the fix on two different machines. I am happy to contribute the patch, if this is truly a bug and I am not missing something simple.

To Reproduce
Steps to reproduce the behavior:

  1. Run the scripts/get_iwslt14_bpe.sh script
  2. Use the config file copied below (a slight modification from the base configs/iwslt14_deen_bpe.yaml)
  3. Run python scripts/build_vocab.py configs/iwslt14_deen_bpe.yaml
  4. Run python3 -m joeynmt train configs/iwslt14_deen_bpe.yaml

Logged output
Relevant log, showing that the BPE code files were not properly copied over:

Traceback (most recent call last):
  File "/home/marco/anaconda3/envs/joeynmt/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/marco/anaconda3/envs/joeynmt/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/marco/github/joeynmt/joeynmt/__main__.py", line 61, in <module>
    main()
  File "/home/marco/github/joeynmt/joeynmt/__main__.py", line 41, in main
    train(cfg_file=args.config_path, skip_test=args.skip_test)
  File "/home/marco/github/joeynmt/joeynmt/training.py", line 814, in train
    train_data.tokenizer[train_data.src_lang].copy_cfg_file(model_dir)
  File "/home/marco/github/joeynmt/joeynmt/tokenizers.py", line 315, in copy_cfg_file
    shutil.copy2(self.codes.name, (model_dir / self.codes.name).as_posix())
  File "/home/marco/anaconda3/envs/joeynmt/lib/python3.10/shutil.py", line 439, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/home/marco/anaconda3/envs/joeynmt/lib/python3.10/shutil.py", line 261, in copyfile
    with open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/home/marco/github/joeynmt/models/transformer_iwslt14_deen_bpe/test/data/iwslt14/bpe.6000.de'
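
For context, this is the standard behavior of shutil.copy2: it opens the destination for writing but never creates missing directories. A minimal standalone reproduction (placeholder paths, unrelated to the joeynmt layout):

import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "bpe.codes"
    src.write_text("dummy codes")
    # destination whose parent directories do not exist yet
    dst = Path(tmp) / "model_dir" / "data" / "bpe.codes"
    try:
        shutil.copy2(src, dst)
    except FileNotFoundError as err:
        print(err)  # same failure mode as the traceback above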

Expected behavior
All tokenizer information is copied over and training runs as normal.

System (please complete the following information):

  • OS: Linux
  • CPU (the failure happens before training starts)
  • Python 3.10

Config file:

name: "transformer_iwslt14_deen_bpe"
joeynmt_version: "2.0.0"

data:
    train: "test/data/iwslt14/train"
    dev: "test/data/iwslt14/valid"
    test: "test/data/iwslt14/test"
    dataset_type: "plain"
    src:
        lang: "de"
        max_length: 62
        lowercase: True
        normalize: False
        level: "bpe"
        voc_min_freq: 1
        voc_file: "test/data/iwslt14/vocab.de.txt"
        tokenizer_type: "subword-nmt"
        tokenizer_cfg:
            num_merges: 6000
            codes: "test/data/iwslt14/bpe.6000.de"
            pretokenizer: "none"
    trg:
        lang: "en"
        max_length: 62
        lowercase: True
        normalize: False
        level: "bpe"
        voc_min_freq: 1
        voc_file: "test/data/iwslt14/vocab.en.txt"
        tokenizer_type: "subword-nmt"
        tokenizer_cfg:
            num_merges: 6000
            codes: "test/data/iwslt14/bpe.6000.en"
            pretokenizer: "none"

testing:
    n_best: 1
    beam_size: 5
    beam_alpha: 1.0
    batch_size: 1024
    batch_type: "token"
    max_output_length: 100
    eval_metrics: ["bleu"]
    return_prob: "none"
    return_attention: False
    sacrebleu_cfg:
        tokenize: "13a"
        lowercase: True

training:
    #load_model: "models/transformer_iwslt14_deen_bpe/best.ckpt"
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999]
    scheduling: "warmupinversesquareroot"
    warmup: 4000
    loss: "crossentropy"
    learning_rate: 0.0005
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    early_stopping_metric: "bleu"
    epochs: 100
    validation_freq: 1000
    logging_freq: 100
    model_dir: "models/transformer_iwslt14_deen_bpe"
    overwrite: False
    shuffle: True
    use_cuda: True
    print_valid_sents: [0, 1, 2, 3, 4]
    keep_best_ckpts: 5

model:
    initializer: "xavier_uniform"
    embed_initializer: "xavier_uniform"
    embed_init_gain: 1.0
    init_gain: 1.0
    bias_initializer: "zeros"
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4
        embeddings:
            embedding_dim: 512
            scale: True
            dropout: 0.
        # typically ff_size = 4 x hidden_size
        hidden_size: 512
        ff_size: 1024
        dropout: 0.3
        layer_norm: "pre"
        activation: "relu"
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4
        embeddings:
            embedding_dim: 512
            scale: True
            dropout: 0.
        # typically ff_size = 4 x hidden_size
        hidden_size: 512
        ff_size: 1024
        dropout: 0.3
        layer_norm: "pre"
        activation: "relu"
may- (Collaborator) commented Jan 18, 2024

A comment was added to the script:

############## https://github.com/joeynmt/joeynmt/pull/216
# Usage:
# $ cd /path/to/joeynmt/scripts # Call this script from /path/to/joeynmt/scripts dir
# $ bash get_iwslt14_bpe.sh # This will create /path/to/joeynmt/test/data/iwslt14/{train | valid | test}.{en | de}
# # Make sure that /path/to/joeynmt/test/data/iwslt14/bpe.32000 exists, too.
# $ cd .. # now back to /path/to/joeynmt/
#
# Train: comment out the `voc_file` lines in the data section -> vocab files will be created in the training process
# $ python -m joeynmt train configs/iwslt14_deen_bpe.yaml --skip-test
##############

may- closed this as completed Jan 18, 2024