
Error when using the roberta_zh tokenizer for a Chinese NER task #93

Open
Honma-Rika opened this issue Aug 11, 2021 · 2 comments

Honma-Rika commented Aug 11, 2021

I want to use the roberta_zh tokenizer for a Chinese NER task, using the official HuggingFace transformers run_ner.py script as a template to run a local Chinese model and local data. After the local dataset is read in with datasets.load_dataset(), the following error is raised:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/app/ner_longformer/run_ner.py", line 600, in <module>
    main()
  File "/home/app/ner_longformer/run_ner.py", line 427, in main
    desc="Running tokenizer on train dataset",
  File "/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 1673, in map
    desc=desc,
  File "/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 185, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/datasets/fingerprint.py", line 397, in wrapper
    out = func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 2010, in _map_single
    offset=offset,
  File "/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 1896, in apply_function_on_filtered_inputs
    function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "/home/app/ner_longformer/run_ner.py", line 394, in tokenize_and_align_labels
    word_ids = tokenized_inputs.word_ids(batch_index=i)
  File "/opt/conda/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 353, in word_ids
    raise ValueError("word_ids() is not available when using Python-based tokenizers")
ValueError: word_ids() is not available when using Python-based tokenizers
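
For reference, the ValueError at the bottom of the trace is raised by the slow, Python-based tokenizer class: word_ids() is only implemented on the BatchEncoding returned by fast (Rust-backed) tokenizers. A minimal sketch of one possible workaround, assuming the checkpoint path below is a placeholder and that the roberta_zh checkpoint can be loaded with a fast BERT-style tokenizer (it ships a BERT vocab.txt, so this is usually the case):

from transformers import BertTokenizerFast

# Placeholder path; roberta_zh uses a BERT-style vocab.txt, so the fast BERT
# tokenizer should be able to load it (an assumption, not verified here).
tokenizer = BertTokenizerFast.from_pretrained("/path/to/roberta_zh")

# run_ner.py passes pre-split tokens with is_split_into_words=True
batch = tokenizer([["此", "处", "为", "文", "本"]], is_split_into_words=True)
print(batch.word_ids(batch_index=0))  # e.g. [None, 0, 1, 2, 3, 4, None]

If run_ner.py itself is kept unchanged, the same idea applies: the tokenizer it constructs has to be a fast one for tokenized_inputs.word_ids() in tokenize_and_align_labels to work.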

load_dataset() reads the .json dataset through a loading script; the examples produced by _generate_examples() have the following format:

{'id': '5', 'tokens': '此处为文本内容【2352JF987】夹杂一些编号信息。', 'ner_tags': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-lo', 'I-lo', 'I-lo', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-con', 'I-con', 'I-con', 'I-con', 'I-con', 'I-con', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']}

That is:

{
    "id": str(guid),
    "tokens": tokens,
    "ner_tags": ner_tags,
}
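
A rough sketch of what _generate_examples() could look like for this format, assuming a JSON-Lines input file with hypothetical "text" and "labels" keys; note that run_ner.py calls the tokenizer on the tokens column with is_split_into_words=True, so tokens is expected to be a list of strings (for Chinese, typically one character per element) aligned one-to-one with ner_tags:

import json

def _generate_examples(filepath):
    # The input layout (one JSON object per line with "text" and "labels"
    # keys) is an assumption, not the issue author's actual file format.
    with open(filepath, encoding="utf-8") as f:
        for guid, line in enumerate(f):
            record = json.loads(line)
            tokens = list(record["text"])   # split the sentence into single characters
            ner_tags = record["labels"]     # one tag string per character
            assert len(tokens) == len(ner_tags)
            yield guid, {
                "id": str(guid),
                "tokens": tokens,
                "ner_tags": ner_tags,
            }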

brightmart (Owner) commented

Please refer to the baseline code here: https://github.com/cluebenchmark/cluener

Honma-Rika (Author) commented

Thanks. I would also still like to know how to load an NER-format dataset with datasets.load_dataset, since the whole pipeline uses the transformers framework.
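
One possible way to do that without writing a full loading script, assuming the data can be exported as JSON Lines with "id", "tokens" and "ner_tags" fields (the file names and label list below are placeholders):

from datasets import load_dataset

# Placeholder label list and file names; adjust to the actual data.
label_list = ["O", "B-lo", "I-lo", "B-con", "I-con"]
label_to_id = {label: i for i, label in enumerate(label_list)}

raw_datasets = load_dataset(
    "json",
    data_files={"train": "train.jsonl", "validation": "dev.jsonl"},
)

def encode_tags(example):
    # convert tag strings such as "B-lo" into integer ids
    example["ner_tags"] = [label_to_id[tag] for tag in example["ner_tags"]]
    return example

raw_datasets = raw_datasets.map(encode_tags)
print(raw_datasets["train"][0])

If the script already handles string tags (recent versions of run_ner.py build the label list themselves when ner_tags is not a ClassLabel feature), the encode_tags step can be dropped. The other common option is a custom loading script that declares ner_tags as Sequence(ClassLabel(names=...)), as the conll2003 dataset script does.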
