Create dataset loader for CHEBI (Chapati) #113

Open
leonweber opened this issue Feb 17, 2022 · 9 comments · May be fixed by #525

@leonweber
Collaborator

Task: NER
License: Creative Commons
Format: custom
Language: English
Citation: ???

Referenced and used by: Habibi, Maryam, et al. "Deep learning with word embeddings improves biomedical named entity recognition." Bioinformatics.

Source: http://chebi.cvs.sourceforge.net/viewvc/chebi/chapati/

@napsternxg

#self-assign

@hakunanatasha
Collaborator

Hi @napsternxg, can you let us know if you are still working on this so we can update our project board? Please let us know the status by Friday, April 8; no worries if you are not finished but intend to keep working on this. Please either ping me here at @hakunanatasha or ping the Discord admins (with @admins).

@napsternxg

Hi @hakunanatasha, yes, I plan to work on this over the weekend.

napsternxg added a commit to napsternxg/biomedical that referenced this issue Apr 11, 2022
@napsternxg

I have started working on this dataset. I will send a PR soon.

@napsternxg

Hi @hakunanatasha and @leonweber, I have a few questions on how to parse the data. The code related to my questions is at: https://colab.research.google.com/drive/1Ne8A76yn0vxwKkpU7l_OzGI968B-YieJ?usp=sharing

  • The data is in a modified HTML format. I am able to parse it via the Beautiful Soup library, but that library is not part of our requirements file. What would be the best way to proceed? E.g., if I try to load the file via:
filepath = "./scrapbook/WO2007000651/source.xml"
reader = biocxml.BioCXMLDocumentReader(str(filepath))

I get the error:

AttributeError: 'BioCXMLDocumentReader' object has no attribute '_BioCXMLDocumentReader__document'
  • The data download requires CVS to be installed. How should I address this? Should I include a note about installing it, or is it better to just process the data and upload the processed data to the Hugging Face dataset hub?
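
A note on the error message above: the `_BioCXMLDocumentReader__document` name has the shape Python produces when it name-mangles a double-underscore attribute, so one plausible reading is that the reader looked up a private `__document` attribute that was never set for this non-BioC input. The sketch below reproduces that message shape with a stand-in class. `HypotheticalReader` and `describe_failure` are made-up names for illustration and are not part of the bioc library, whose internals are not shown in this thread.

```python
class HypotheticalReader:
    """Stand-in class for illustration only; not the real bioc reader."""

    def __init__(self, source):
        self.source = source
        # Assumption: the private attribute is never assigned, for example
        # because the input was not in the format the reader expected.

    def get_document(self):
        # Inside a class body, self.__document is rewritten by Python to
        # self._HypotheticalReader__document (name mangling).
        return self.__document


def describe_failure(reader):
    """Return the AttributeError message raised by the reader, if any."""
    try:
        reader.get_document()
    except AttributeError as exc:
        return str(exc)
    return "no error"


if __name__ == "__main__":
    print(describe_failure(HypotheticalReader("source.xml")))
```

The mangled name in the message points at a private attribute of the class rather than at anything in the calling code.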

@jason-fries
Member

Hi @napsternxg
Sorry about the delay in responding!

  • Let's remove the CVS dependency. The original gold data is open ("This work is distributed under the Creative Commons license: http://creativecommons.org/licenses/by/3.0/"), so I would download the files and put them somewhere open (e.g., a Google Drive link), and then we can eventually host the files on the biomedical community hub (see our BIOSSES example, which does this).
  • The BioCXMLDocumentReader assumes you are using a BioC formatted file, so it won't work (that I know of) with standard or nonstandard XML files. The XML package available by default in Python might work here. If not, go ahead and use BeautifulSoup and we can discuss adding it to our supported packages.
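
As a rough illustration of the standard-library route, here is a minimal sketch using `xml.etree.ElementTree` to recover plain text and entity offsets from inline-annotated XML. The sample markup, the `ne` tag, and the `type` attribute are assumptions made up for this example; the actual tag names in the gold standard files may differ. `ElementTree` also requires well-formed XML, so files that are closer to loose HTML may still need BeautifulSoup.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real files may use different tags and attributes.
SAMPLE = (
    '<doc><p>Treatment with <ne type="CHEMICAL">aspirin</ne> and '
    '<ne type="CHEMICAL">sodium chloride</ne>.</p></doc>'
)


def extract_entities(xml_string, entity_tag="ne"):
    """Return (plain_text, entities) for inline-annotated XML.

    Offsets are character positions in the reconstructed plain text.
    """
    root = ET.fromstring(xml_string)
    pieces = []    # text fragments in document order
    entities = []  # one dict per annotated span
    offset = 0     # running length of the text collected so far

    def walk(node):
        nonlocal offset
        start = offset
        if node.text:
            pieces.append(node.text)
            offset += len(node.text)
        for child in node:
            walk(child)
            # Text that follows a child's closing tag belongs to the parent.
            if child.tail:
                pieces.append(child.tail)
                offset += len(child.tail)
        if node.tag == entity_tag:
            entities.append({"type": node.get("type"), "offsets": (start, offset)})

    walk(root)
    text = "".join(pieces)
    for entity in entities:
        begin, end = entity["offsets"]
        entity["text"] = text[begin:end]
    return text, entities


if __name__ == "__main__":
    text, entities = extract_entities(SAMPLE)
    print(text)
    for entity in entities:
        print(entity)
```

Tracking a running offset while walking `text` and `tail` keeps the entity spans aligned with the reconstructed passage, which is what an NER schema usually needs.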

@napsternxg

Hi @jason-fries, thanks for the response.
I will download the files and upload them somewhere.
I will try the XML parser in Python first; if that doesn't work, I will add BeautifulSoup.

I plan to submit it early next week.

@napsternxg

I downloaded the files from CVS and am uploading them here for use. We can later move them to HF datasets and update the URL in the code.
PatentAnnotations_GoldStandard.tar.gz
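
For anyone working with the archive locally, a minimal sketch of unpacking a `.tar.gz` file with the standard library might look like the following. The function name `extract_archive` and the paths in the usage example are hypothetical, and the directory layout inside the archive is not described in this thread, so nothing here assumes it.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive and return the extracted file paths.

    Paths are returned relative to dest_dir, sorted, with forward slashes.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Refuse members that would be written outside the destination.
        for member in tar.getmembers():
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest)
    return sorted(
        p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    # Hypothetical local paths; adjust to wherever the archive was saved.
    for name in extract_archive("PatentAnnotations_GoldStandard.tar.gz", "data"):
        print(name)
```

Listing the extracted files first makes it easier to see which documents and annotation files the loader has to iterate over.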

@napsternxg

Added PR: #525

napsternxg linked a pull request May 5, 2022 that will close this issue
Labels
None yet
Projects
Status: PR in Progress

4 participants