Ensure correct model tokenizer directory structure exists prior to copying #216

mcognetta · 2023-08-25T05:03:33Z

This resolves #214 by checking if the correct directory structure exists within the model directory before copying the tokenizer information over. It also standardizes the two tokenizer copy_cfg_file implementations to check that they are not overwriting anything.

One note is that I changed self.codes to self.codes.name in the return statement of the subword nmt tokenizer. I found that it is always a TextIOWrapper, but perhaps there is another case where this isn't true and we should check for it. Any ideas?

may- · 2023-09-03T23:55:41Z

@mcognetta
Thank you for catching the error and the correction!
You are right, we forgot to add the file existence check in subword-nmt class, and yes, it should be self.codes.name, not self.codes.

But I hesitate to add os.makedirs() here. I'd like to limit the directory creation in

joeynmt/joeynmt/helpers.py

Line 37 in 0968187

def make_model_dir(model_dir: Path, overwrite: bool = False) -> Path:

only.
We intentionally separated the vocab generation from train(), which means the codes file should locate outside the model dir after scripts/build_vocab.py. I've seen your error message refers the path under the model dir /home/marco/github/joeynmt/models/transformer_iwslt14_deen_bpe/test/data/iwslt14/bpe.6000.de, but we don't want to allow that the model dir exits before train() call.

I personally believe the core of this problem would be the relative path in the config file. We should rather encourage the user to edit the file path (i.e. force to use the absolute path /home/marco/github/joeynmt/test/data/iwslt14/bpe.6000.de) in the config file. What do you think?

mcognetta · 2023-09-04T08:49:03Z

Ah, yes an absolute path would solve this. However, are there other instances where you force an absolute rather than relative path in JoeyNMT? If not, maybe it is too rare of a requirement and will be a rough edge.

may- · 2023-09-04T14:43:01Z

Actually, I didn't face the error in my environment, the relative path worked nicely...
I don't know why it happened to you.

... and I'm wondering why this assertion couldn't catch the error:

joeynmt/joeynmt/tokenizers.py

Lines 249 to 250 in 0968187

 codes_file = Path(kwargs["codes"]) 

 assert codes_file.is_file(), f"codes file {codes_file} not found."

mcognetta · 2023-09-15T00:53:48Z

Sorry for the slow reply. Hmm, it is interesting that it works on your machine but not mine (actually, I was able to reproduce it on three separate machines). I will do some more digging! Could I see your exact config file? It could be that in mine, the bpe vocabs are explicitly created in the data directory (codes: "test/data/iwslt14/bpe.32000"). Then, the model directory should also have a test/data/iwslt directory to store them. If, for example, they were specfiied just as codes: "bpe.32000" , then they should be copied directly to the base model directory. I will test this out.

joeynmt/joeynmt/tokenizers.py

Lines 249 to 250 in 0968187

 codes_file = Path(kwargs["codes"]) 

 assert codes_file.is_file(), f"codes file {codes_file} not found."

Regarding these lines: If I understand correctly, this is checking that the vocab file exists in the data directory, before it is copied over to the model directory (which is what my PR ensures exists). What do you think?

Understood about not wanting to do too much directory creation. If my guess about the codes being in a nested directory is correct, maybe the middle ground solution is to check that voc_path, model_file, etc are all just file names and not relative paths (or perhaps we always put them in a tokenizer directory or something). This would avoid arbitrary directory structures and still solve the problem.

may- · 2023-09-15T22:30:01Z

@mcognetta
Now I think the problem was rather in get_iwslt14_bpe.sh script. Please have a look at main...may-:joeynmt:dev
And I also forgot to cast the codes file IO object to pathlib's Path object here:

joeynmt/joeynmt/tokenizers.py

Line 268 in f610d78

self.codes: Path = bpe_args.codes

It must be: self.codes: Path = Path(bpe_args.codes.name), instead. 😵‍💫 (Alternatively self.codes: Path = codes_file using the object codes_file assigned in line 249.)

The preprocessing went well with the fix above.
Steps to reproduce:

git clone https://github.com/may-/joeynmt.git
cd joeynmt
git checkout dev
python -m pip install -e .
cd scripts                 # don't call `bash scripts/get_iwslt14_bpe.sh` from the /path/to/joeynmt/ dir!
bash get_iwslt14_bpe.sh    # this will create /path/to/joeynmt/test/data/iwslt14/{train | valid | test}.{en | de}
cd ..                      # now back to /path/to/joeynmt/
python scripts/build_vocab.py configs/iwslt14_deen_bpe.yaml --joint # this will create /path/to/joeynmt/test/data/iwslt14/{bpe.32000 | bpe_vocab.txt}
python -m joeynmt train configs/iwslt14_deen_bpe.yaml

Sorry for the confusion. I hope now the point is clearer. ... or does your question still remain unanswered??

add directory check and generation

a64d0f6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure correct model tokenizer directory structure exists prior to copying #216

Ensure correct model tokenizer directory structure exists prior to copying #216

mcognetta commented Aug 25, 2023

may- commented Sep 3, 2023 •

edited

mcognetta commented Sep 4, 2023

may- commented Sep 4, 2023

mcognetta commented Sep 15, 2023

may- commented Sep 15, 2023 •

edited

Ensure correct model tokenizer directory structure exists prior to copying #216

Are you sure you want to change the base?

Ensure correct model tokenizer directory structure exists prior to copying #216

Conversation

mcognetta commented Aug 25, 2023

may- commented Sep 3, 2023 • edited

mcognetta commented Sep 4, 2023

may- commented Sep 4, 2023

mcognetta commented Sep 15, 2023

may- commented Sep 15, 2023 • edited

may- commented Sep 3, 2023 •

edited

may- commented Sep 15, 2023 •

edited