convert-hf-to-gguf-update.py breaks #7207

Open
CrispStrobe opened this issue May 10, 2024 · 15 comments
@CrispStrobe
Contributor

Just realized that some recent changes seemingly make the script break on creating the llama-spm contents. It runs through without that line, which is my quick and lazy workaround atm (also in a quickly hacked Kaggle script that runs through the steps to fix the pre-tokenizer issue). Sorry I cannot look into this further; maybe it is just some intermediate inconsistency that gets resolved in the process of the current edits in the repo, or maybe you want to look into it.

@ProjectAtlantis-dev

What is the error? Was it trying to download a tokenizer from HF? I know that dbrx fails.

The older convert throws a NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()") when trying to do llama3 refuel

@ProjectAtlantis-dev

I get this error from convert-hf-to-gguf-update.py using Python 3.11 when trying to convert llama3 refuel:

OSError: models/tokenizers/llama-spm does not appear to have a file named config.json. Checkout 'https://huggingface.co/models/tokenizers/llama-spm/tree/None' for available files.

@CrispStrobe
Contributor Author

CrispStrobe commented May 10, 2024

Yes, the very same error, or also: FileNotFoundError: [Errno 2] No such file or directory: 'models/tokenizers/llama-spm/tokenizer.json'
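Both tracebacks point at the same underlying problem: the tokenizer files never landed in models/tokenizers/llama-spm, so later steps fail on whichever file they look for first. A minimal sketch of a pre-flight check (the file names are taken from the two error messages; the helper itself is hypothetical, not part of the script):

```python
from pathlib import Path

# Files the conversion scripts expect, per the errors quoted above.
REQUIRED_FILES = ("config.json", "tokenizer.json")

def missing_files(model_dir: str) -> list[str]:
    """Return the required tokenizer files absent from model_dir.

    An empty list means the download step completed; a non-empty list
    means the later conversion steps will raise OSError/FileNotFoundError.
    """
    base = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (base / name).exists()]
```

For a gated or 404'd repo the download silently produces an empty (or missing) directory, so `missing_files("models/tokenizers/llama-spm")` would return both names.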

@ProjectAtlantis-dev

ProjectAtlantis-dev commented May 10, 2024

Tried downloading llama-spm from HF directly but got a 404 error - though I think we could also steal one from another llama SPM based model.

@CrispStrobe
Contributor Author

Why though? (a) You said you want to work with llama3, so for that you can ignore llama-spm. (b) You do not want the original HF files anyway; you want what the update script will build for you, if it works.

@ProjectAtlantis-dev

ProjectAtlantis-dev commented May 10, 2024

I don't understand all the logic tbh, but it seems to be pulling configs from HF on the fly. Also, I think llama 3 refuel is BPE, so yeah, why should I even care.

@CrispStrobe
Contributor Author

I just realized: maybe it will work if you just fill out the license form on https://huggingface.co/meta-llama/Llama-2-7b-hf
But I am 99% sure this was not an issue a few days ago, hm...

@ProjectAtlantis-dev

ProjectAtlantis-dev commented May 10, 2024

I just deleted the dbrx and llama-spm entries in the model list below line 61 and it seems to work - but then it also says I need to run a bunch of scripts to build vocabs, which is something the other script would do automagically.

I think your above license form is for llama2, which is SPM, not BPE.

@CrispStrobe
Contributor Author

CrispStrobe commented May 10, 2024

Yes, that is as intended by the devs atm. It sounds more difficult than it is; you only need the one vocab, actually. You can also check out the Kaggle script linked above, which does it all on the fly too.

@ProjectAtlantis-dev

From convert-hf-to-gguf.py line 367:

 # NOTE: if you get an error here, you need to update the convert-hf-to-gguf-update.py script
 #       or pull the latest version of the model from Huggingface
 #       don't edit the hashes manually!

So the entry for the BPE tokenizer presumably needs to be added to the convert-hf-to-gguf-update.py script.
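For context, the check that NOTE sits next to identifies the pre-tokenizer by hashing the token ids the tokenizer produces for a fixed probe string, and comparing the digest against a table of known values that the update script regenerates. A hedged sketch of the hashing part only (the function name here is made up; the actual probe string and the lookup table live in the scripts):

```python
from hashlib import sha256

def pre_tokenizer_hash(token_ids: list[int]) -> str:
    """Hash the token ids produced for a fixed probe string.

    Two tokenizers that pre-tokenize identically yield the same ids for
    the probe, hence the same digest; an unknown digest is what triggers
    the NotImplementedError mentioned earlier in this thread.
    """
    return sha256(str(token_ids).encode()).hexdigest()
```

This is why editing the hashes by hand is warned against: the digest is only meaningful if it was computed from a real tokenizer run by the update script.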

@CrispStrobe
Contributor Author

indeed so

@CrispStrobe
Contributor Author

CrispStrobe commented May 10, 2024

OK, I just checked it with license access, and that is most probably indeed the cause. The same goes for dbrx. So there are two options atm: either ask for access to both repos, or delete/comment out both lines. But I would rather change the update script so that this does not break it. OK, here is a PR for that.
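The shape of that fix is simply to catch per-model download failures instead of letting one gated or missing repo abort the whole run. A minimal sketch under that assumption (the `fetch` callable is a hypothetical stand-in for the script's actual per-model download logic):

```python
def download_all(models: list[str], fetch) -> list[str]:
    """Try to fetch each model's tokenizer; skip failures instead of aborting.

    `fetch(name)` is assumed to raise (e.g. OSError on a 404/403 from a
    gated repo) when the download fails. Returns the names that were skipped
    so the user can request access or comment them out.
    """
    skipped = []
    for name in models:
        try:
            fetch(name)
        except Exception as exc:
            print(f"WARNING: could not fetch {name}: {exc} -- skipping")
            skipped.append(name)
    return skipped
```

With this shape, users without access to the gated llama-spm and dbrx repos still get every other tokenizer processed, plus a warning naming the ones that need a license request.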

@ProjectAtlantis-dev

I think the overall intention is to emulate what python AutoTokenizer apply_chat_template() already does - it goes out to hf and pulls down the template automagically

@CrispStrobe
Contributor Author

The similarity ends after the pulling down, though.

@oldmanjk

I don't understand all the logic tbh

You and me both
