Janarish Saju C
AI/ML Engineer
Named Entity Recognition
10th December 2022
Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, and locations.
- EDA (Exploratory Data Analysis)
- How many solutions can you think of and why are you choosing your version of the solution?
- Error Analysis
There are several NER libraries available for implementation in Python:
- BERT: https://huggingface.co/
- spaCy: https://spacy.io/usage/linguistic-features
- NLTK: https://www.nltk.org/book/ch07.html
- Stanford CoreNLP: https://stanfordnlp.github.io/CoreNLP/
- Polyglot: http://polyglot.readthedocs.io/en/latest/NamedEntityRecognition.html
- Apache OpenNLP: https://opennlp.apache.org/
Among these, I go with the first two, BERT and spaCy, as they are the ones I am most comfortable with.
There exist other popular frameworks such as NLTK, Stanford CoreNLP, and Apache OpenNLP as well.
| Methods | Advantages | Disadvantages |
|---|---|---|
| BERT | Time efficiency; pretrained on large datasets; knowledge transfer (transfer learning) | Computationally expensive; not so good for domain-based NER |
| spaCy | Faster, since its core is implemented in Cython/C at a low level; good for domain-based NER | Requires more training data; more complexity in data structures |
| NLTK | Good for base-level analysis | Requires implementation from scratch |
| Stanford CoreNLP | No idea, since I have never used these tools in any of my former projects | |
| Apache OpenNLP | No idea, as above | |
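For illustration, here is a minimal spaCy NER sketch (it assumes the `en_core_web_sm` model has been installed via `python -m spacy download en_core_web_sm`):

```python
import spacy

# Load the small English pipeline (assumed to be installed already)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
for ent in doc.ents:
    # Print each detected entity span and its predicted label
    print(ent.text, ent.label_)
```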
(*All the necessary steps carried out are documented in the shared code)
- Average Word Length
- Histogram and Bar Charts
- Most Influential Words
- Most Influential Entities
- Most Dominant Entity Labels
- N-gram Exploration
(*From the above analysis we can see that the corpus is heavily influenced by social networks such as Twitter, Facebook, and YouTube; please see the Colab Notebook. A short sketch of these checks follows below.)
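A minimal sketch of such EDA checks in plain Python; the `sentences` variable is an assumed stand-in for the token lists loaded from the training corpus:

```python
from collections import Counter

# Stand-in for the token lists loaded from the corpus (assumed name)
sentences = [["I", "use", "Twitter", "and", "Facebook"],
             ["Youtube", "is", "popular"]]
tokens = [tok for sent in sentences for tok in sent]

# Average word length
print(sum(len(t) for t in tokens) / len(tokens))

# Most influential (frequent) words
print(Counter(t.lower() for t in tokens).most_common(5))

# Bigram (n-gram) exploration, computed per sentence
bigrams = Counter(b for s in sentences for b in zip(s, s[1:]))
print(bigrams.most_common(5))
```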
- Found some outliers/uncommon behaviors when relating entities to special characters (see the Colab Notebook).
- As discussed in the outlier analysis, we need to take care of the following (a small sketch for surfacing these cases follows below):
- Mismatches in name tagging
- Special characters overlapping with entities
(See the Colab Notebook)
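A small sketch for flagging special-character/entity overlaps, assuming `tokens` and `tags` are parallel lists from the annotated training data (both names are assumptions):

```python
import re

# Assumed parallel lists of tokens and their NER tags
tokens = ["@john_doe", "visited", "New", "York"]
tags = ["B-Person", "O", "B-Location", "I-Location"]

for tok, tag in zip(tokens, tags):
    # Flag entity tokens that contain non-alphanumeric characters
    if tag != "O" and re.search(r"[^A-Za-z0-9]", tok):
        print("possible overlap:", tok, tag)
```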
- Data Augmentation
- An effective approach when we have fewer training samples
- Data Annotation
- A better data-annotation pipeline/tool avoids faulty datasets arising from human error
- Ensembled Algorithms
- Effectively utilizing ensemble algorithms achieves better performance.
(*Discussed in last section)
(*All the necessary steps carried out are documented in the shared code)
- Data read/import
- Handle data encoding issues
- Data conversion as per model requirements
- Data partition
- Unique input and output label features
- Encode the labels to Numeric representation
- Tokenize and embed the datasets
- Initialize the BERT model
- Define the Task Name
- Define the Tokenizer method
- The following parameters were used (see the fine-tuning sketch after this list)
- evaluation_strategy = "epoch",
- learning_rate=1e-4,
- per_device_train_batch_size=16,
- per_device_eval_batch_size=16,
- num_train_epochs=6,
- weight_decay=1e-5,
- Train the model with the following arguments
- train_dataset,
- eval_dataset,
- tokenizer,
- compute_metrics
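Putting the above together, a minimal fine-tuning sketch using the Hugging Face Trainer API; `tokenized_train`, `tokenized_eval`, `label_list`, and `compute_metrics` are assumed to come from the preparation steps above, and `bert-base-cased` is an assumed checkpoint:

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-cased"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(label_list))  # label_list: assumed tag list

# Training arguments using the parameters listed above
args = TrainingArguments(
    output_dir="bert-ner",
    evaluation_strategy="epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=6,
    weight_decay=1e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,   # assumed prepared dataset
    eval_dataset=tokenized_eval,     # assumed held-out split
    tokenizer=tokenizer,
    compute_metrics=compute_metrics, # assumed metric function (sketched below)
)
trainer.train()
```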
- Evaluation was done on the 20 percent of the training data that was held out for validation purposes.
- Accuracy on Validation Dataset
- Confusion Matrix / Cross Table
- Precision, Recall, F-Measure
- K-fold cross-validation can be applied for more advanced analysis.
(*It is explicitly seen that the entities I-Location, B-Location, and O have more mismatches. We should analyze and look more deeply into those entities. Please see the Colab Notebook.)
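For reference, a sketch of the `compute_metrics` function passed to the Trainer above; it assumes the seqeval library, that `label_list` maps label ids to tag strings, and that special tokens were labelled -100 during tokenization alignment:

```python
import numpy as np
from seqeval.metrics import f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Drop special tokens, which are labelled -100 during alignment
    true_tags = [[label_list[l] for l in row if l != -100]
                 for row in labels]
    pred_tags = [[label_list[p] for p, l in zip(prow, lrow) if l != -100]
                 for prow, lrow in zip(preds, labels)]
    return {
        "precision": precision_score(true_tags, pred_tags),
        "recall": recall_score(true_tags, pred_tags),
        "f1": f1_score(true_tags, pred_tags),
    }
```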
- Read the test data from disk
- Handle Encoding and Alignment issues
- Data Conversion
- Feed the converted test data to the fine-tuned model and get predictions
- Get label predictions using the argmax function
- Get probabilistic prediction scores using the softmax function
- Store all results in a DataFrame
- Export the test results to a text file separated by "\t" (a sketch follows below)
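A minimal sketch of these inference steps, reusing the assumed `trainer`, `tokenized_test`, and `label_list` names from above; subword-to-word realignment is omitted for brevity:

```python
import numpy as np
import pandas as pd
from scipy.special import softmax

# Run the fine-tuned model on the converted test data
logits = trainer.predict(tokenized_test).predictions
label_ids = np.argmax(logits, axis=-1)          # label predictions (argmax)
scores = softmax(logits, axis=-1).max(axis=-1)  # probabilistic scores (softmax)

# Store all results in a DataFrame and export as tab-separated text
df = pd.DataFrame({
    "label_id": label_ids.flatten(),
    "score": scores.flatten(),
})
df["label"] = df["label_id"].map(lambda i: label_list[i])
df.to_csv("test_results.txt", sep="\t", index=False)
```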
- BERT has an advantage over other machine learning and deep learning models: as a transformer technique pretrained on huge datasets, it saves us a lot of training time through transfer learning.
- It does have a disadvantage, though: the heavier BERT models are computationally expensive.
- https://github.com/huggingface/pytorch-pretrained-BERT
- https://spacy.io/usage/linguistic-features
- https://github.com/dmoonat/Named-Entity-Recognition/blob/main/Fine_tune_NER.ipynb
- https://medium.com/@andrewmarmon/fine-tuned-named-entity-recognition-with-hugging-face-bert-d51d4cb3d7b5
- https://pub.towardsai.net/top-5-approaches-to-named-entity-recognition-ner-in-2022-38afdf022bf1
- https://neptune.ai/blog/exploratory-data-analysis-natural-language-processing-tools
BERT NER:
Exploratory Data Analysis: