Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure correct model tokenizer directory structure exists prior to copying #216

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mcognetta
Copy link
Contributor

This resolves #214 by checking if the correct directory structure exists within the model directory before copying the tokenizer information over. It also standardizes the two tokenizer copy_cfg_file implementations to check that they are not overwriting anything.

One note is that I changed self.codes to self.codes.name in the return statement of the subword nmt tokenizer. I found that it is always a TextIOWrapper, but perhaps there is another case where this isn't true and we should check for it. Any ideas?

@may-
Copy link
Collaborator

may- commented Sep 3, 2023

@mcognetta
Thank you for catching the error and the correction!
You are right, we forgot to add the file existence check in subword-nmt class, and yes, it should be self.codes.name, not self.codes.

But I hesitate to add os.makedirs() here. I'd like to limit the directory creation in

def make_model_dir(model_dir: Path, overwrite: bool = False) -> Path:
only.
We intentionally separated the vocab generation from train(), which means the codes file should locate outside the model dir after scripts/build_vocab.py. I've seen your error message refers the path under the model dir /home/marco/github/joeynmt/models/transformer_iwslt14_deen_bpe/test/data/iwslt14/bpe.6000.de, but we don't want to allow that the model dir exits before train() call.

I personally believe the core of this problem would be the relative path in the config file. We should rather encourage the user to edit the file path (i.e. force to use the absolute path /home/marco/github/joeynmt/test/data/iwslt14/bpe.6000.de) in the config file. What do you think?

@mcognetta
Copy link
Contributor Author

Ah, yes an absolute path would solve this. However, are there other instances where you force an absolute rather than relative path in JoeyNMT? If not, maybe it is too rare of a requirement and will be a rough edge.

@may-
Copy link
Collaborator

may- commented Sep 4, 2023

Actually, I didn't face the error in my environment, the relative path worked nicely...
I don't know why it happened to you.

... and I'm wondering why this assertion couldn't catch the error:

codes_file = Path(kwargs["codes"])
assert codes_file.is_file(), f"codes file {codes_file} not found."

@mcognetta
Copy link
Contributor Author

Sorry for the slow reply. Hmm, it is interesting that it works on your machine but not mine (actually, I was able to reproduce it on three separate machines). I will do some more digging! Could I see your exact config file? It could be that in mine, the bpe vocabs are explicitly created in the data directory (codes: "test/data/iwslt14/bpe.32000"). Then, the model directory should also have a test/data/iwslt directory to store them. If, for example, they were specfiied just as codes: "bpe.32000" , then they should be copied directly to the base model directory. I will test this out.

codes_file = Path(kwargs["codes"])
assert codes_file.is_file(), f"codes file {codes_file} not found."

Regarding these lines: If I understand correctly, this is checking that the vocab file exists in the data directory, before it is copied over to the model directory (which is what my PR ensures exists). What do you think?

Understood about not wanting to do too much directory creation. If my guess about the codes being in a nested directory is correct, maybe the middle ground solution is to check that voc_path, model_file, etc are all just file names and not relative paths (or perhaps we always put them in a tokenizer directory or something). This would avoid arbitrary directory structures and still solve the problem.

@may-
Copy link
Collaborator

may- commented Sep 15, 2023

@mcognetta
Now I think the problem was rather in get_iwslt14_bpe.sh script. Please have a look at main...may-:joeynmt:dev
And I also forgot to cast the codes file IO object to pathlib's Path object here:

self.codes: Path = bpe_args.codes

It must be: self.codes: Path = Path(bpe_args.codes.name), instead. 😵‍💫 (Alternatively self.codes: Path = codes_file using the object codes_file assigned in line 249.)

The preprocessing went well with the fix above.
Steps to reproduce:

git clone https://github.com/may-/joeynmt.git
cd joeynmt
git checkout dev
python -m pip install -e .
cd scripts                 # don't call `bash scripts/get_iwslt14_bpe.sh` from the /path/to/joeynmt/ dir!
bash get_iwslt14_bpe.sh    # this will create /path/to/joeynmt/test/data/iwslt14/{train | valid | test}.{en | de}
cd ..                      # now back to /path/to/joeynmt/
python scripts/build_vocab.py configs/iwslt14_deen_bpe.yaml --joint # this will create /path/to/joeynmt/test/data/iwslt14/{bpe.32000 | bpe_vocab.txt}
python -m joeynmt train configs/iwslt14_deen_bpe.yaml

Sorry for the confusion. I hope now the point is clearer. ... or does your question still remain unanswered??

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Basic iwslt config train failure due to directory errors
2 participants