Janarish Saju C
AI/ML Engineer
Named Entity Recognition
10th December 2022
Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, and locations.
- EDA (Exploratory Data Analysis)
- How many solutions can you think of and why are you choosing your version of the solution?
- Error Analysis
There are several NER libraries available for implementation in Python:
- BERT: https://huggingface.co/
- spaCy: https://spacy.io/usage/linguistic-features
- NLTK: https://www.nltk.org/book/ch07.html
- Stanford CoreNLP: https://stanfordnlp.github.io/CoreNLP/
- Polyglot: http://polyglot.readthedocs.io/en/latest/NamedEntityRecognition.html
- Apache OpenNLP: https://opennlp.apache.org/
Among these, I go with the first two, BERT and spaCy, as they are the ones I am most comfortable with.
There exist other popular frameworks such as NLTK, Stanford CoreNLP, and Apache OpenNLP as well.
| Methods | Advantages | Disadvantages |
|---|---|---|
| BERT | Time efficiency; pretrained on large datasets; knowledge transfer (transfer learning) | Computationally expensive; not so good for domain-based NER |
| spaCy | Faster, since its core is implemented in Cython/C at a low level; good for domain-based NER | Requires more training data; more complexity in data structures |
| NLTK | Good for base-level analysis | Requires implementation from scratch |
| Stanford CoreNLP | No idea, since I have never used these tools in any of my former projects | |
| Apache OpenNLP | No idea, as above | |
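For illustration, here is a minimal spaCy NER sketch (it assumes the `en_core_web_sm` model has been installed via `python -m spacy download en_core_web_sm`):

```python
import spacy

# Load the small English pipeline (assumed to be installed already)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
for ent in doc.ents:
    # Print each detected entity span and its predicted label
    print(ent.text, ent.label_)
```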
(*All the necessary steps carried out are documented in the shared code)
- Average Word Length
- Histogram and Bar Charts
- Most Influential Words
- Most Influential Entities
- Most Dominant Entity Labels
- N-gram Exploration
(*From the above analysis we can see that the corpus is heavily influenced by social networks such as Twitter, Facebook, and YouTube; please see the Colab Notebook. A short sketch of these checks follows below.)
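A minimal sketch of such EDA checks in plain Python; the `sentences` variable is an assumed stand-in for the token lists loaded from the training corpus:

```python
from collections import Counter

# Stand-in for the token lists loaded from the corpus (assumed name)
sentences = [["I", "use", "Twitter", "and", "Facebook"],
             ["Youtube", "is", "popular"]]
tokens = [tok for sent in sentences for tok in sent]

# Average word length
print(sum(len(t) for t in tokens) / len(tokens))

# Most influential (frequent) words
print(Counter(t.lower() for t in tokens).most_common(5))

# Bigram (n-gram) exploration, computed per sentence
bigrams = Counter(b for s in sentences for b in zip(s, s[1:]))
print(bigrams.most_common(5))
```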
- Found some outliers/uncommon behaviors when relating entities to special characters (see the Colab Notebook).
- As discussed in the outlier analysis, we need to take care of the following (a small sketch for surfacing these cases follows below):
- Mismatches in name tagging
- Special characters overlapping with entities
(See the Colab Notebook)
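A small sketch for flagging special-character/entity overlaps, assuming `tokens` and `tags` are parallel lists from the annotated training data (both names are assumptions):

```python
import re

# Assumed parallel lists of tokens and their NER tags
tokens = ["@john_doe", "visited", "New", "York"]
tags = ["B-Person", "O", "B-Location", "I-Location"]

for tok, tag in zip(tokens, tags):
    # Flag entity tokens that contain non-alphanumeric characters
    if tag != "O" and re.search(r"[^A-Za-z0-9]", tok):
        print("possible overlap:", tok, tag)
```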
- Data Augmentation
- An effective approach when we have fewer training samples
- Data Annotation
- A better data-annotation pipeline/tool avoids faulty datasets arising from human error
- Ensembled Algorithms
- Effectively utilizing ensemble algorithms achieves better performance.
(*Discussed in last section)
(*All the necessary steps carried out are documented in the shared code)
- Data read/import
- Handle data encoding issues
- Data conversion as per model requirements
- Data partition
- Unique input and output label features
- Encode the labels to Numeric representation
- Tokenize and embed the datasets
- Initialize the BERT model
- Define the Task Name
- Define the Tokenizer method
- The following parameters were used (see the fine-tuning sketch after this list)
- evaluation_strategy = "epoch",
- learning_rate=1e-4,
- per_device_train_batch_size=16,
- per_device_eval_batch_size=16,
- num_train_epochs=6,
- weight_decay=1e-5,
- Train the model with the following arguments
- train_dataset,
- eval_dataset,
- tokenizer,
- compute_metrics
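Putting the above together, a minimal fine-tuning sketch using the Hugging Face Trainer API; `tokenized_train`, `tokenized_eval`, `label_list`, and `compute_metrics` are assumed to come from the preparation steps above, and `bert-base-cased` is an assumed checkpoint:

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-cased"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(label_list))  # label_list: assumed tag list

# Training arguments using the parameters listed above
args = TrainingArguments(
    output_dir="bert-ner",
    evaluation_strategy="epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=6,
    weight_decay=1e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,   # assumed prepared dataset
    eval_dataset=tokenized_eval,     # assumed held-out split
    tokenizer=tokenizer,
    compute_metrics=compute_metrics, # assumed metric function (sketched below)
)
trainer.train()
```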
- Evaluation was done on the 20 percent of the training data that was held out for validation purposes.
- Accuracy on Validation Dataset
- Confusion Matrix / Cross Table
- Precision, Recall, F-Measure
- K-fold cross-validation can be applied for more advanced analysis.
(*It is explicitly seen that the entities I-Location, B-Location, and O have more mismatches. We should analyze and look more deeply into those entities. Please see the Colab Notebook.)
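For reference, a sketch of the `compute_metrics` function passed to the Trainer above; it assumes the seqeval library, that `label_list` maps label ids to tag strings, and that special tokens were labelled -100 during tokenization alignment:

```python
import numpy as np
from seqeval.metrics import f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Drop special tokens, which are labelled -100 during alignment
    true_tags = [[label_list[l] for l in row if l != -100]
                 for row in labels]
    pred_tags = [[label_list[p] for p, l in zip(prow, lrow) if l != -100]
                 for prow, lrow in zip(preds, labels)]
    return {
        "precision": precision_score(true_tags, pred_tags),
        "recall": recall_score(true_tags, pred_tags),
        "f1": f1_score(true_tags, pred_tags),
    }
```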
- Read the test data from disk
- Handle Encoding and Alignment issues
- Data Conversion
- Feed the converted test data to the fine-tuned model and get predictions
- Get label predictions using the argmax function
- Get probabilistic prediction scores using the softmax function
- Store all results in a DataFrame
- Export the test results to a text file separated by "\t" (a sketch follows below)
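A minimal sketch of these inference steps, reusing the assumed `trainer`, `tokenized_test`, and `label_list` names from above; subword-to-word realignment is omitted for brevity:

```python
import numpy as np
import pandas as pd
from scipy.special import softmax

# Run the fine-tuned model on the converted test data
logits = trainer.predict(tokenized_test).predictions
label_ids = np.argmax(logits, axis=-1)          # label predictions (argmax)
scores = softmax(logits, axis=-1).max(axis=-1)  # probabilistic scores (softmax)

# Store all results in a DataFrame and export as tab-separated text
df = pd.DataFrame({
    "label_id": label_ids.flatten(),
    "score": scores.flatten(),
})
df["label"] = df["label_id"].map(lambda i: label_list[i])
df.to_csv("test_results.txt", sep="\t", index=False)
```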
- BERT has an advantage over other machine learning and deep learning models: as a transformer technique pretrained on huge datasets, it saves us a lot of training time through transfer learning.
- It does have a disadvantage, though: the heavier BERT models are computationally expensive.
- https://github.com/huggingface/pytorch-pretrained-BERT
- https://spacy.io/usage/linguistic-features
- https://github.com/dmoonat/Named-Entity-Recognition/blob/main/Fine_tune_NER.ipynb
- https://medium.com/@andrewmarmon/fine-tuned-named-entity-recognition-with-hugging-face-bert-d51d4cb3d7b5
- https://pub.towardsai.net/top-5-approaches-to-named-entity-recognition-ner-in-2022-38afdf022bf1
- https://neptune.ai/blog/exploratory-data-analysis-natural-language-processing-tools
BERT NER:
Exploratory Data Analysis: