Basic iwslt config train failure due to directory errors #214

Closed
mcognetta opened this issue Aug 24, 2023 · 1 comment · May be fixed by #216

mcognetta (Contributor) commented Aug 24, 2023

Describe the bug
When trying to run one of the iwslt example config training runs, I repeatedly got errors because the BPE codes files were not copied to the model directory properly.

The fix is to ensure that the correct tokenizer data directory structure exists before training begins. The relevant code is

def copy_cfg_file(self, model_dir: Path) -> None:
"""Copy config file to model_dir"""
shutil.copy2(self.codes, (model_dir / self.codes.name).as_posix())

which I replaced with:

        if not os.path.exists(os.path.dirname(model_dir / self.codes.name)):
            os.makedirs(os.path.dirname(model_dir / self.codes.name))
        shutil.copy2(self.codes.name, (model_dir / self.codes.name))

to fix my specific case. It probably needs to be added elsewhere, though (at least to the other tokenizer classes). Note: I removed the as_posix() call for another reason during testing, but that is not relevant to this bug.
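
For reference, a minimal sketch of the same directory-creation idea as a standalone helper (copy_with_parents is a hypothetical name, not part of joeynmt) that the other tokenizer classes could reuse:

from pathlib import Path
import shutil

def copy_with_parents(src: Path, model_dir: Path, rel_name: str) -> None:
    """Copy src to model_dir / rel_name, creating missing parent directories first."""
    dst = model_dir / rel_name
    dst.parent.mkdir(parents=True, exist_ok=True)  # no-op if the directories already exist
    shutil.copy2(src, dst)

# e.g. copy_with_parents(Path("test/data/iwslt14/bpe.6000.de"),
#                        Path("models/transformer_iwslt14_deen_bpe"),
#                        "test/data/iwslt14/bpe.6000.de")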

I was able to reproduce this bug as well as the fix on two different machines. I am happy to contribute the patch, if this is truly a bug and I am not missing something simple.

To Reproduce
Steps to reproduce the behavior:

  1. Run the scripts/get_iwslt14_bpe.sh script
  2. Use the config file copied below (a slight modification from the base configs/iwslt14_deen_bpe.yaml)
  3. Run python scripts/build_vocab.py configs/iwslt14_deen_bpe.yaml
  4. Run python3 -m joeynmt train configs/iwslt14_deen_bpe.yaml

Logged output
Relevant log, showing that the BPE code files were not properly copied over:

Traceback (most recent call last):
  File "/home/marco/anaconda3/envs/joeynmt/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/marco/anaconda3/envs/joeynmt/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/marco/github/joeynmt/joeynmt/__main__.py", line 61, in <module>
    main()
  File "/home/marco/github/joeynmt/joeynmt/__main__.py", line 41, in main
    train(cfg_file=args.config_path, skip_test=args.skip_test)
  File "/home/marco/github/joeynmt/joeynmt/training.py", line 814, in train
    train_data.tokenizer[train_data.src_lang].copy_cfg_file(model_dir)
  File "/home/marco/github/joeynmt/joeynmt/tokenizers.py", line 315, in copy_cfg_file
    shutil.copy2(self.codes.name, (model_dir / self.codes.name).as_posix())
  File "/home/marco/anaconda3/envs/joeynmt/lib/python3.10/shutil.py", line 439, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/home/marco/anaconda3/envs/joeynmt/lib/python3.10/shutil.py", line 261, in copyfile
    with open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/home/marco/github/joeynmt/models/transformer_iwslt14_deen_bpe/test/data/iwslt14/bpe.6000.de'
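
For context, this is the standard behavior of shutil.copy2: it opens the destination for writing but never creates missing directories. A minimal standalone reproduction (placeholder paths, unrelated to the joeynmt layout):

import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "bpe.codes"
    src.write_text("dummy codes")
    # destination whose parent directories do not exist yet
    dst = Path(tmp) / "model_dir" / "data" / "bpe.codes"
    try:
        shutil.copy2(src, dst)
    except FileNotFoundError as err:
        print(err)  # same failure mode as the traceback above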

Expected behavior
All tokenizer information is copied over and training runs as normal.

System (please complete the following information):

  • OS: Linux
  • CPU (the failure happens before training starts)
  • Python 3.10

Config file:

name: "transformer_iwslt14_deen_bpe"
joeynmt_version: "2.0.0"

data:
    train: "test/data/iwslt14/train"
    dev: "test/data/iwslt14/valid"
    test: "test/data/iwslt14/test"
    dataset_type: "plain"
    src:
        lang: "de"
        max_length: 62
        lowercase: True
        normalize: False
        level: "bpe"
        voc_min_freq: 1
        voc_file: "test/data/iwslt14/vocab.de.txt"
        tokenizer_type: "subword-nmt"
        tokenizer_cfg:
            num_merges: 6000
            codes: "test/data/iwslt14/bpe.6000.de"
            pretokenizer: "none"
    trg:
        lang: "en"
        max_length: 62
        lowercase: True
        normalize: False
        level: "bpe"
        voc_min_freq: 1
        voc_file: "test/data/iwslt14/vocab.en.txt"
        tokenizer_type: "subword-nmt"
        tokenizer_cfg:
            num_merges: 6000
            codes: "test/data/iwslt14/bpe.6000.en"
            pretokenizer: "none"

testing:
    n_best: 1
    beam_size: 5
    beam_alpha: 1.0
    batch_size: 1024
    batch_type: "token"
    max_output_length: 100
    eval_metrics: ["bleu"]
    return_prob: "none"
    return_attention: False
    sacrebleu_cfg:
        tokenize: "13a"
        lowercase: True

training:
    #load_model: "models/transformer_iwslt14_deen_bpe/best.ckpt"
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999]
    scheduling: "warmupinversesquareroot"
    warmup: 4000
    loss: "crossentropy"
    learning_rate: 0.0005
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    early_stopping_metric: "bleu"
    epochs: 100
    validation_freq: 1000
    logging_freq: 100
    model_dir: "models/transformer_iwslt14_deen_bpe"
    overwrite: False
    shuffle: True
    use_cuda: True
    print_valid_sents: [0, 1, 2, 3, 4]
    keep_best_ckpts: 5

model:
    initializer: "xavier_uniform"
    embed_initializer: "xavier_uniform"
    embed_init_gain: 1.0
    init_gain: 1.0
    bias_initializer: "zeros"
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4
        embeddings:
            embedding_dim: 512
            scale: True
            dropout: 0.
        # typically ff_size = 4 x hidden_size
        hidden_size: 512
        ff_size: 1024
        dropout: 0.3
        layer_norm: "pre"
        activation: "relu"
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4
        embeddings:
            embedding_dim: 512
            scale: True
            dropout: 0.
        # typically ff_size = 4 x hidden_size
        hidden_size: 512
        ff_size: 1024
        dropout: 0.3
        layer_norm: "pre"
        activation: "relu"
may- (Collaborator) commented Jan 18, 2024

A comment was added to the script:

############## https://github.com/joeynmt/joeynmt/pull/216
# Usage:
# $ cd /path/to/joeynmt/scripts # Call this script from /path/to/joeynmt/scripts dir
# $ bash get_iwslt14_bpe.sh # This will create /path/to/joeynmt/test/data/iwslt14/{train | valid | test}.{en | de}
# # Make sure that /path/to/joeynmt/test/data/iwslt14/bpe.32000 exists, too.
# $ cd .. # now back to /path/to/joeynmt/
#
# Train: comment out the `voc_file` lines in the data section -> vocab files will be created in the training process
# $ python -m joeynmt train configs/iwslt14_deen_bpe.yaml --skip-test
##############

may- closed this as completed Jan 18, 2024