"ArabNizer" | An Arabic NER Tagger

in Natural Language Processing, NER stands for Named Entity Recognition. It is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into predefined categories such as the names of persons, organizations, locations, etc. NER systems typically use machine learning techniques to automatically identify and categorize entities within a body of text. It's widely used in various natural language processing applications like information retrieval, question answering, text summarization, and more.

Here are some common named entity tags:

Tag	Arabic Tag	Description
PER	اسم شخص	refers to named entities that represent individual people or characters, whether real or fictional, mentioned within a text.
LOC	اسم عنوان	represents named entities that denote specific locations, such as countries, cities, landmarks, and geographical features.
ORG	اسم مؤسسة	denotes named entities that refer to organizations, institutions, companies, agencies, or any group of people with a common purpose or affiliation.

Running Demo:

ArabNizer.mp4

Usage:

Demo link: https://arabnizer.streamlit.app/

Hugging Face model card: https://huggingface.co/mohammedaly22/arabnizer-xlmr-panx-ar

Downloading and Running ArabNizer Locally:

Clone this repository

git clone https://github.com/MohammedAly22/ArabNizer

Inside the project folder, run the demo using the following command

streamlit run main.py

Use the Model as a Hugging Face pipeline:

from transformers import pipeline

ner_tagger = pipeline("token-classification", "mohammedaly22/arabnizer-xlmr-panx-ar")
text = 'اسمي محمد، اعمل في شركة اورانج و اسكن في القاهرة.'
ner_tagger(text, grouped_entities=True)

output:

[{'entity_group': 'PER',
  'score': 0.9486102,
  'word': 'محمد',
  'start': 5,
  'end': 9},
 {'entity_group': 'ORG',
  'score': 0.8212871,
  'word': 'اورانج',
  'start': 24,
  'end': 30},
 {'entity_group': 'LOC',
  'score': 0.9967932,
  'word': 'القاهرة',
  'start': 41,
  'end': 48}]

Dataset

The Cross-lingual Natural Language Inference (XNLI) corpus is a crowd-sourced collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. This results in 112.5k annotated pairs. Each premise can be associated with the corresponding hypothesis in the 15 languages, summing up to more than 1.5M combinations. The corpus is made to evaluate how to perform inference in any language (including low-resources ones like Swahili or Urdu) when only English NLI data is available at training time. One solution is cross-lingual sentence encoding, for which XNLI is an evaluation benchmark. The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages (spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks, and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil (spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the Niger-Congo languages Swahili and Yoruba, spoken in Africa.

Column	Description
tokens	A list of tokens.
ner_tags	A list of the associated NER tags for each token.
langs	A list of the language of each token.

Results

Here is the table of results after fine-tuning for 3 epochs:

Epoch	Training Loss	Validation Loss	F1	Accuracy
1	0.203100	0.207782	0.859793	0.942558
2	0.155900	0.204332	0.873610	0.948951
3	0.099200	0.207264	0.888544	0.953322

confusion matrix:

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
ArabNeizer_XLMR_PAN_X.ipynb		ArabNeizer_XLMR_PAN_X.ipynb
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt
style.css		style.css
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ArabNeizer_XLMR_PAN_X.ipynb

ArabNeizer_XLMR_PAN_X.ipynb

README.md

README.md

main.py

main.py

requirements.txt

requirements.txt

style.css

style.css

utils.py

utils.py

Repository files navigation

"ArabNizer" | An Arabic NER Tagger

Running Demo:

Usage:

Downloading and Running ArabNizer Locally:

Use the Model as a Hugging Face pipeline:

Dataset

Results

About

Releases

Packages

Languages

MohammedAly22/ArabNizer

Folders and files

Latest commit

History

Repository files navigation

"ArabNizer" | An Arabic NER Tagger

Running Demo:

Usage:

Downloading and Running ArabNizer Locally:

Use the Model as a Hugging Face pipeline:

Dataset

Results

About

Topics

Resources

Stars

Watchers

Forks

Languages