Walk through to convert PMC OAS Dataset into a text corpus

TextCorpusLabs/oas

OAS To Text Corpus

Python MIT license Last Updated

The National Institutes of Health has provided an excellent data source for text mining. It covers not only medical journals but also fields ranging from mathematics to chemistry. The purpose of this repo is to convert the PMC Open Access Subset from the given format into the text corpus format we use. That is:

  • The full corpus consisting of one or more TXT files in a single folder
  • One or more articles in a single TXT file
  • Each article will have a header in the form:
    --- {id} ---
    --- {journal} ---
    --- {title} ---
    
  • Each article will have its abstract and body extracted
  • One sentence per line
  • Paragraphs are separated by a blank line
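
To make the layout above concrete, here is a minimal sketch that renders one article in that shape. It is an illustration only, not the code this package uses: `format_article` and its parameters are hypothetical names, the input is assumed to be already split into sentences, and the blank line after the header is my reading of the layout.

```python
from typing import Iterable, List


def format_article(article_id: str, journal: str, title: str,
                   paragraphs: Iterable[List[str]]) -> str:
    """Render one article: three header lines, then one sentence per line.

    `paragraphs` is a sequence of paragraphs, each given as a list of
    sentences. Sentence splitting is out of scope for this sketch.
    """
    lines = [
        f"--- {article_id} ---",
        f"--- {journal} ---",
        f"--- {title} ---",
        "",  # assumption: a blank line separates the header from the text
    ]
    for paragraph in paragraphs:
        lines.extend(paragraph)  # one sentence per line
        lines.append("")         # blank line closes the paragraph
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_article(
        "PMC0000000", "Example Journal", "An Example Title",
        [["First sentence.", "Second sentence."], ["A new paragraph."]],
    ))
```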

Operation

Install

You can install the package using the following steps:

Install with pip from an admin prompt:

pip uninstall oas
python -OO -m pip install -v git+https://github.com/TextCorpusLabs/oas.git

Or, if you have the code locally:

pip uninstall oas
python -OO -m pip install -v c:/repos/TextCorpusLabs/oas

Run

You are responsible for getting the source files. They can be found on this FTP site. You will need to further navigate into the three sub-folders: oa_comm, oa_noncomm, and oa_other. I recommend using FileZilla. I installed my copy using Chocolatey.

You are responsible for un-compressing and validating the source files. I recommend using 7zip. I installed my copy using Chocolatey.

You are responsible because the server the NIH keeps the files on is fickle. Sometimes it will serve corrupted files. Those files need to be re-downloaded and re-verified, and the file inside (the downloads are .tar.gz) needs to be verified too. OAS is also HUGE. As of 2024/03/25 it is almost 500 GB in .tar form. You must make sure you have enough space before you start.
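
If you would rather script the check than click through 7zip, the following is a rough sketch of one way to do it with the Python standard library. It is not part of this package. `is_valid_tar_gz` is a hypothetical helper that reads every member of a .tar.gz and reports whether any read fails. It catches files that are not gzip at all and most truncated or damaged archives, but it is not a substitute for a checksum comparison.

```python
import tarfile
import zlib
from pathlib import Path


def is_valid_tar_gz(path: Path) -> bool:
    """Return True if every member of the .tar.gz at `path` can be read."""
    try:
        # "r:gz" decompresses while reading, so walking the members
        # touches both the gzip layer and the tar structure inside it.
        with tarfile.open(path, mode="r:gz") as archive:
            for member in archive:
                if not member.isfile():
                    continue
                extracted = archive.extractfile(member)
                # Read in 1 MiB chunks and discard the data; we only care
                # whether the read succeeds.
                while extracted.read(1 << 20):
                    pass
        return True
    except (tarfile.TarError, EOFError, zlib.error, OSError):
        return False


if __name__ == "__main__":
    # Assumption: the downloads live in c:/data/oas, as in the commands below.
    for candidate in sorted(Path("c:/data/oas").glob("*.tar.gz")):
        status = "ok" if is_valid_tar_gz(candidate) else "CORRUPT, download again"
        print(f"{candidate.name}: {status}")
```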

All the commands below assume the corpus is a folder of .tar files.

  1. Extract the metadata from the corpus.
oas metadata -source c:/data/oas -dest c:/data/oas.meta.csv

The following are required parameters:

  • source is the folder containing the .tar'ed JATS files.
  • dest is the CSV file used to store the metadata.

The following are optional parameters:

  • log is the folder of raw JATS files that did not process. It defaults to empty (not saved).

  2. Convert the data to our standard format.
oas convert -source c:/data/oas -dest c:/data/oas.std

The following are required parameters:

  • source is the folder containing the .tar'ed JATS files.
  • dest is the folder for the converted TXT files.

The following are optional parameters:

  • lines is the number of lines per TXT file. The default is 250000.
  • dest_pattern is the format of the TXT file name. It defaults to {source}.{id:04}.txt. source is the stem of the source file's name. id is an increasing value that increments each time lines lines have been stored in a file.
  • log is the folder of raw JATS files that did not process. It defaults to empty (not saved).
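
The default dest_pattern reads like a Python format string, so a small sketch can show how it would expand. This is my illustration under that assumption, not the package's code: `chunk_file_name` is a hypothetical helper, `example.tar` is a made-up file name, and whether id starts at 0 or 1 is not stated here.

```python
from pathlib import Path

# The default pattern quoted above. "{id:04}" zero-pads the id to 4 digits.
DEST_PATTERN = "{source}.{id:04}.txt"


def chunk_file_name(source_file: str, chunk_id: int,
                    pattern: str = DEST_PATTERN) -> str:
    """Expand `pattern` for one output file.

    `source` is filled with the stem of the source file name, that is,
    the name without its folder and without its final extension.
    """
    stem = Path(source_file).stem
    return pattern.format(source=stem, id=chunk_id)


if __name__ == "__main__":
    print(chunk_file_name("c:/data/oas/example.tar", 1))   # example.0001.txt
    print(chunk_file_name("c:/data/oas/example.tar", 42))  # example.0042.txt
```

With the default of 250000 lines per file, a large source .tar would produce a run of such files with increasing ids.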

Debug/Test

The code in this repo is set up as a module. Debugging and testing assume that the module is already installed. To debug (F5) or run the tests (Ctrl + ; Ctrl + A), install the module as editable (see below).

pip uninstall oas
python -m pip install -e c:/repos/TextCorpusLabs/oas
