Using Natural Language Transformers for Classification #7

Open
trisongz opened this issue Mar 27, 2020 · 4 comments

@trisongz

Glad I stumbled upon this project - was working on a theory using the same base dataset.

Since proteins/genes are essentially sequences of letters, it led me to the idea of using Transformer models like BERT to classify sequences by their structure. If that theory holds, I'd want to try a multi-task approach: pair the valid treatment sequence with the virus sequence and look at whether the model can predict the treatment sequence given the input virus sequence.

I haven't studied the structure as much as you guys probably have - so I'd defer to you on whether this would be plausible/feasible given what we know so far.

Here are a few other starting points I've looked at:

ReSimNet: Drug Response Similarity Prediction using Siamese Neural Networks
Jeon and Park et al., 2018

https://github.com/dmis-lab/ReSimNet

BERN is a BioBERT-based multi-type NER tool that also supports normalization of extracted entities.

https://github.com/dmis-lab/bern

@geohot (Owner) commented Mar 27, 2020

Hmm, so I don't know what you mean by "treatment sequence." Usually, I've seen these transformer models trained as big unsupervised predictors of the next character.

@trisongz (Author)

The idea would be to model it after something like the SQuAD/SWAG datasets for question answering, where you typically have a large body of text as the initial context (virus sequence), followed by the answer and the positions of its spans, if found in the text (vaccine/cure sequence).
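
For concreteness, here's a minimal sketch of a SQuAD v1.1-style record with the virus sequence as the context and a treatment span as the answer - the field names follow SQuAD v1.1, but the sequences, IDs, and offsets are placeholders:

```python
# Minimal SQuAD v1.1-style record; sequences/IDs/offsets are placeholders
example = {
    "data": [{
        "title": "example-virus",
        "paragraphs": [{
            "context": "ATGGAGAGAATAAAAGAACTG...",  # virus sequence as context
            "qas": [{
                "id": "pair-0001",
                "question": "What is the treatment sequence for this virus?",
                "answers": [{
                    "text": "GGCAATAACAAAAGAG",  # candidate treatment span
                    "answer_start": 512          # character offset into context
                }]
            }]
        }]
    }]
}
```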

Example of a BioBERT dataset formatted for SQuAD:
https://storage.googleapis.com/ce-covid-public/BioASQ-6b/train/Full-Abstract/BioASQ-train-factoid-6b-full-annotated.json

Additional dataset from BioASQ:
https://storage.cloud.google.com/ce-covid-public/2ndYearDatasetTask2b.json

I also compiled additional sequence data which may or may not overlap with the download script you had.

https://drive.google.com/drive/folders/18aAuP3OhGMLKV8jZpt_8vpLY5JSqOS9E?usp=sharing

There are 3 sets: Coronaviruses, Influenza viruses, and SARS-related. The jsonl files are the raw metadata, compiled by filtering for complete sequences and the virus families; the accession codes were then used to download the sequences into the json files, so they should match the format of your allseq.json file (a sketch of that accession-based fetch step follows the counts below).

  • 11132 sequences for Influenza
  • 3002 sequences for Coronavirus
  • 2023 sequences for SARS
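
For anyone reproducing the download step, here's a minimal sketch of fetching one sequence by accession code, assuming Biopython's Entrez interface - the email is a placeholder, and NC_045512.2 is just an example accession (the SARS-CoV-2 reference genome):

```python
from Bio import Entrez, SeqIO

Entrez.email = "you@example.com"  # NCBI requires a contact email

def fetch_sequence(accession):
    # Pull the FASTA record for one accession from the nucleotide database
    handle = Entrez.efetch(db="nucleotide", id=accession,
                           rettype="fasta", retmode="text")
    record = SeqIO.read(handle, "fasta")
    handle.close()
    return str(record.seq)

print(len(fetch_sequence("NC_045512.2")))  # example: SARS-CoV-2 reference
```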

@amoux commented Mar 28, 2020

@trisongz I downloaded the files and put something together - let me know if it's similar to what you're suggesting. By the way, I am familiar with the transformers library, and I don't think you can use the pre-trained language models' vocabulary for these types of sequences. Anyway, here's the Colab link to what I put together - let me know if it's related!

Colab-Notebook

@trisongz (Author)

@amoux That's pretty awesome! I hadn't thought of using a node graph, mainly because I don't work with them as often as I'd like to.

So I've been messing around with different methods, and out of the box, transformers won't necessarily work. You pointed out the first issue, which is creating the vocabulary. There wasn't a single number that every sequence was divisible by, so what I did instead was process each sequence to find the lowest prime factor of its length, and split the sequence into chunks by that prime.
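
A minimal sketch of that chunking step, assuming "lowest prime" means the smallest prime factor of the sequence length (the prime list below includes odd composites, so the exact rule may differ):

```python
def smallest_prime_factor(n):
    # Smallest prime dividing n (returns n itself when n is prime)
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p
        p += 1
    return n

def split_by_prime(seq):
    # Chop the sequence into equal chunks whose length is the
    # smallest prime factor of the full sequence length
    p = smallest_prime_factor(len(seq))
    return [seq[i:i + p] for i in range(0, len(seq), p)]

chunks = split_by_prime("ATGGAGAGAATA")  # len 12 -> factor 2 -> six 2-mers
unique_chunks = sorted(set(chunks))      # deduplicate into the working vocab
```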

## working file - covseq.json

Total Non-Unique Primes: 8297

Total Unique Primes: 1998

Unique Primes: [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31,
35, 37, 41, 43, 45, 47, 49, 50, 53, 54, 57, 59, 61, 62, 63, 67, 71, 73, 
77, 79, 83, 85, 89, 91, 95, 97, 100, 101, 103, 106, 107, 108, 109, 113, 
115, 119, 121, 123, 124, 125, 126, 127, 129, 131, 133, 135, 137, 139, 
143, 145, 149, 151, 155, 157, 161, 163, 167, 171, 173, 175, 179, 181, 
183, 187, 189, 191, 193, 197, 199, 200, 201, 203, 205, 209, 211..]

Afterwards, I compiled all the split sequence chunks into a list and deduplicated it, leaving a list of unique sequence chunks.

fluseq.json has 251607 tokens

covseq.json has 215855 tokens

sarseq.json has 96971 tokens

Total Non-Unique Tokens: 564433

Total Unique Tokens: 208565


ATGGAGAGAATAAAAGAACTGAGAGATCTAATGTCGCAGTCCCGCACTCGCGAGATACTCACTAAGACCACTGTGGACCATATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCCCGCACTCAGAATGAAGTGGATGATGGCAATGAGATACCCAATTACAGCAGACAAGAGAATAATGGACATGATTCCAGAGAGGAATGAACAAGGACAAACCCTCTGGAGCAAAACAAACGATGCTGGATCAGACCGAGTGATGGTATCACCTCTGGCCGTAACATGGTGGAATAGGAATGGCCCAACAACAAGTACAGTTCATTACCCTAAGGTATATAAAACTTATTTCGAAAAGGTCGAAAGGTTGAAACATGGTACCTTCGGCCCTGTCCACTTCAGAAATCAAGTTAAAATAAGGAGGAGAGTTGATACAAACCCTGGCCATGCAGATCTCAGTGCCAAGGAGGCACAGGATGTGATTATGGAAGTTGTTTTCCCAAATGAAGTGGGGGCAAGAATACTGACATCAGAGTCACAGCTGGCAATAACAAAAGAGAAGAAAGAAGAGCTCCAGGATTGTAAAATTGCTCCCTTGATGGTGGCGTACATGCTAGAAAGAGAATTGGTCCGTAAAACAAGGTTTCTCCCAGTAGCCGGCGGAACAGGCAGTGTTTATATTGAAGTGTTGCACTTAACCCAAGGGACGTGCTGGGAGCAGATGTACACTCCAGGAGGAGAAGTGAGAAATGATGATGTTGACCAAAGTTTGATTATCGCTGCTAGAAACATAGTAAGAAGAGCAGCAGTGTCAGCAGACCCATTAGCATCTCTCTTGGAAATGTGCCACAGCACACAGATTGGAGGAGTAAGGATGGTGGACATCCTTAGACAGAATCCAACTGAGGAACAAGCCGTAGACATATGCAAGGCAGCAATAGGGTTGAGGATTAGCTCATCTTTCAGTTTTGGTGGGTTCACTTTCAAAAGGACAAGCGGATCATCAGTCAAGAAAGAAGAAGAAGTGCTAACGGGCAACCTCCAAACACTGAAAATAAGAGTACATGAAGGGTATGAAGAATTCACAATGGTTGGGAGAAGAGCAACAGCTATTCTCAGAAAGGCAACCAGGAGA

Still a massive vocab for most models, so I tried using XLNet (the values are a bit off here - I realized I had counted 1 as a prime, as seen above, which led to a much smaller vocab).

import torch
from transformers import XLNetTokenizer, XLNetModel

tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetModel.from_pretrained('xlnet-base-cased')

# complete_tokens is the list of deduplicated sequence chunks from above
num_added_toks = tokenizer.add_tokens(complete_tokens)
print('We have added', num_added_toks, 'tokens')
# grow the embedding matrix to cover the newly added tokens
model.resize_token_embeddings(len(tokenizer))

>> We have added 65134 tokens
>> Embedding(97134, 768)

This is where I'm currently at. My first goal is to attempt sequence classification/entailment; I'm stuck on how to pre-process the data into the correct format for that task.
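
One possible shape for that preprocessing, sketched with XLNetForSequenceClassification on a hypothetical (virus chunk, treatment chunk) pair - this assumes a recent transformers version, and the sequences and labels are placeholders:

```python
import torch
from transformers import XLNetTokenizer, XLNetForSequenceClassification

tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetForSequenceClassification.from_pretrained('xlnet-base-cased',
                                                       num_labels=2)
# if sequence-chunk tokens were added as above, resize here too:
# model.resize_token_embeddings(len(tokenizer))

virus_chunk = "ATGGAGAGAATAAAAGAACTG"  # placeholder chunk
treatment_chunk = "GGCAATAACAAAAGAG"   # placeholder chunk

# Encode the pair as a single example, as in textual entailment
enc = tokenizer(virus_chunk, treatment_chunk,
                truncation=True, padding='max_length',
                max_length=128, return_tensors='pt')

labels = torch.tensor([1])  # hypothetical: 1 = valid pairing, 0 = invalid
outputs = model(**enc, labels=labels)
print(outputs.loss, outputs.logits)
```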

Also - I realized that the flu dataset is a lot smaller than it should be, so I'll reupload the updated version in the folder soon.
