Some entities in the tmvar_v3 dataset have wrong entity offsets

Describe the bug
Some entities in the tmvar_v3 dataset have wrong entity offsets.

Steps to reproduce the bug
Python Code
```python
import re

from datasets import load_dataset

dataset_name = "tmvar_v3"
dataset = load_dataset(f"bigbio/{dataset_name}", name=f"{dataset_name}_bigbio_kb")
split = "train"  # adjust to the split that contains the document
doc_id = "21904390"


def check_offsets(doc_id):
    doc = dataset[split].filter(lambda x: x["document_id"] == doc_id)
    text = doc[0]["passages"][0]["text"][0] + " " + doc[0]["passages"][1]["text"][0]
    sentences = text.split(". ")
    # Each match is the two-character suffix ". ", so the next sentence
    # starts two characters after the match start.
    sentence_indexes = [m.start() + 2 for m in re.finditer(r"\. ", text)]
    sentence_indexes = [0] + sentence_indexes
    doc_entities = doc[:]["entities"][0]
    print(sentence_indexes)
    print(len(sentences))
    print(text)
    print(doc_entities)

    sentence_index = 0
    entity_index = 0
    current_offset = 0
    next_sentence_offset = 0
    next_entity_offset = text[current_offset:].find(doc_entities[entity_index]["text"][0])
    while True:
        if sentence_index >= len(sentence_indexes) and entity_index >= len(doc_entities):
            break
        if next_sentence_offset <= next_entity_offset:
            sentence_end = sentence_indexes[sentence_index] + len(sentences[sentence_index]) + 2
            print(f"Sentence {sentence_index} @ offsets {sentence_indexes[sentence_index]} to {sentence_end}")
            print(sentences[sentence_index] + ". \n")
            sentence_index += 1
            if sentence_index >= len(sentence_indexes):
                next_sentence_offset = len(text)
            else:
                next_sentence_offset = sentence_indexes[sentence_index]
            # print(f"DEBUG next_sentence_offset: {next_sentence_offset}")
        else:  # next_entity_offset < next_sentence_offset
            entity = doc_entities[entity_index]
            entity_name = entity["text"][0]
            given_offset_start = entity["offsets"][0][0]
            given_offset_end = entity["offsets"][0][1]
            print(f"    {entity_name} @ offsets (real) {next_entity_offset} to {next_entity_offset + len(entity_name)}")
            print(f"    {text[given_offset_start:given_offset_end]} @ offsets (given) {given_offset_start} to {given_offset_end}")
            current_offset = next_entity_offset + len(entity_name)
            entity_index += 1
            if entity_index >= len(doc_entities):
                next_entity_offset = len(text)
            else:
                next_entity_offset = current_offset + text[current_offset:].find(doc_entities[entity_index]["text"][0])
            # print(f"DEBUG next_entity_offset: {next_entity_offset}")


check_offsets(doc_id)
```
Expected results
In PubMed ID 21904390, the expected offsets for these seven entities, written as "entity name (offset_start, offset_end)", are:

PAX6 (342, 346)
PAX6 (751, 755)
PAX6 (1153, 1157)
PAX6 (1483, 1487)
PAX6 (1627, 1631)
DKFZ p686k1684 (1640, 1654)
PAX6 (2037, 2041)
Actual results
The offsets stored in the tmvar_v3 dataset for the same seven entities are:

PAX6 (343, 347)
PAX6 (753, 757)
PAX6 (1156, 1160)
PAX6 (1487, 1491)
PAX6 (1631, 1635)
DKFZ p686k1684 (1645, 1659)
PAX6 (2043, 2047)

All the remaining entity offsets in the tmvar_v3 dataset appear to be correct.
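Note that the error is not a constant shift: the drift between the stored and expected start offsets grows through the document (1, 2, 3, 4, 4, 5, 6), which would be consistent with a few extra characters being introduced at different points, for example in how the passages are concatenated, though the exact cause is not confirmed here. The deltas can be computed directly from the two lists above:

```python
# Offset pairs copied from the expected/actual lists above.
expected = [(342, 346), (751, 755), (1153, 1157), (1483, 1487),
            (1627, 1631), (1640, 1654), (2037, 2041)]
actual = [(343, 347), (753, 757), (1156, 1160), (1487, 1491),
          (1631, 1635), (1645, 1659), (2043, 2047)]

# Per-entity drift of the start offset (actual minus expected).
deltas = [a[0] - e[0] for e, a in zip(expected, actual)]
print(deltas)  # -> [1, 2, 3, 4, 4, 5, 6]
```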