Error when using the roberta_zh tokenizer for a Chinese NER task #93
Comments
Please refer to the baseline code here: https://github.com/cluebenchmark/cluener
Thanks. I would also like to know how to load an NER-format dataset via datasets.load_dataset, since the project is built on the transformers framework overall.
I want to use the roberta_zh tokenizer for a Chinese NER task, using the official Hugging Face transformers run_ner.py script as a template to run a local Chinese model and dataset. After the local dataset is read in via datasets.load_dataset(), the following error is raised:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/app/ner_longformer/run_ner.py", line 600, in
main()
File "/home/app/ner_longformer/run_ner.py", line 427, in main
desc="Running tokenizer on train dataset",
File "/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 1673, in map
desc=desc,
File "/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 185, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/datasets/fingerprint.py", line 397, in wrapper
out = func(self, *args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 2010, in _map_single
offset=offset,
File "/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 1896, in apply_function_on_filtered_inputs
function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
File "/home/app/ner_longformer/run_ner.py", line 394, in tokenize_and_align_labels
word_ids = tokenized_inputs.word_ids(batch_index=i)
File "/opt/conda/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 353, in word_ids
raise ValueError("word_ids() is not available when using Python-based tokenizers")
ValueError: word_ids() is not available when using Python-based tokenizers
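The error itself is literal: word_ids() is only implemented on the "fast" (Rust-backed) tokenizers, and the local model here was evidently loaded with a Python-based one. Since roberta_zh is a BERT-architecture checkpoint with a BERT-style vocab.txt, one fix is to load the vocab through BertTokenizerFast (or pass use_fast=True to AutoTokenizer.from_pretrained when the checkpoint directory supports it). A minimal sketch, using a toy vocabulary built on the fly so it runs offline; in practice you would point at the roberta_zh checkpoint directory instead:

```python
import os
import tempfile

from transformers import BertTokenizerFast

# Toy WordPiece vocab standing in for roberta_zh's vocab.txt; the real
# file ships with the checkpoint.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "北", "京", "欢", "迎"]
tmpdir = tempfile.mkdtemp()
vocab_file = os.path.join(tmpdir, "vocab.txt")
with open(vocab_file, "w", encoding="utf-8") as f:
    f.write("\n".join(vocab))

# A fast tokenizer exposes word_ids(), which run_ner.py's
# tokenize_and_align_labels() relies on.
tokenizer = BertTokenizerFast(vocab_file=vocab_file)
enc = tokenizer(["北", "京", "欢", "迎"], is_split_into_words=True)
print(enc.word_ids())  # [None, 0, 1, 2, 3, None]
```

The None entries mark the [CLS]/[SEP] special tokens, and each remaining position maps a subword back to the index of the input word it came from, which is exactly what the label-alignment step in run_ner.py needs.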
load_dataset() reads the .json dataset via a loading script; the data produced by _generate_examples() looks like this:
{'id': '5', 'tokens': '此处为文本内容【2352JF987】夹杂一些编号信息。', 'ner_tags': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-lo', 'I-lo', 'I-lo', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-con', 'I-con', 'I-con', 'I-con', 'I-con', 'I-con', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']}
i.e.:
{
"id": str(guid),
"tokens": tokens,
"ner_tags": ner_tags,
}
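One thing worth noting about the example above: "tokens" is a single string while "ner_tags" is a per-character list, but run_ner.py tokenizes with is_split_into_words=True, which expects "tokens" to already be a list of words. For Chinese character-level tagging, a minimal sketch of the conversion inside _generate_examples (the sample text and tags below are made up for illustration):

```python
# One-tag-per-character data, so split the raw string into a list of
# single-character tokens to match the length of ner_tags.
text = "北京欢迎你"
ner_tags = ["B-lo", "I-lo", "O", "O", "O"]

tokens = list(text)  # one token per character
assert len(tokens) == len(ner_tags)
print(tokens)  # ['北', '京', '欢', '迎', '你']
```

Keeping tokens and ner_tags the same length is required for the label alignment in tokenize_and_align_labels to work, independently of the fast-tokenizer error above.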