
🍜VPhoBertTagger

Token classification using PhoBERT models for 🇻🇳Vietnamese

🏞️Environments🏞️

Get started in seconds with verified environments. Run the script below to install all dependencies.

bash ./install_dependencies.sh

📚Dataset📚

The input data format of 🍜VPhoBertTagger follows the VLSP-2016 format: four columns separated by a tab character, containing the word, POS tag, chunk tag, and named-entity tag. Each (word-segmented) word is placed on a separate line, and an empty line follows each sentence. For details, see the sample data in the 'datasets/samples' directory. The table below shows an example Vietnamese sentence from the dataset.

| Word | POS | Chunk | NER |
| --- | --- | --- | --- |
| Dương | Np | B-NP | B-PER |
| là | V | B-VP | O |
| một | M | B-NP | O |
| chủ | N | B-NP | O |
| cửa hàng | N | B-NP | O |
| lâu | A | B-AP | O |
| năm | N | B-NP | O |
| ở | E | B-PP | O |
| Hà Nội | Np | B-NP | B-LOC |
| . | CH | O | O |

The dataset must be placed in a directory with the structure below.

├── data_dir
|  └── train.txt
|  └── dev.txt
|  └── test.txt
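If you need to read this format outside the provided scripts, a minimal loader might look like the sketch below. This is an illustrative helper, not part of this repository, and the file path in the usage comment is an assumption based on the layout above.

    from typing import List, Tuple

    def read_vlsp2016(path: str) -> List[List[Tuple[str, str, str, str]]]:
        """Parse a VLSP-2016 style file: four tab-separated columns
        (word, POS, chunk, NER), one token per line, with an empty
        line between sentences."""
        sentences, current = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:  # an empty line closes the current sentence
                    if current:
                        sentences.append(current)
                        current = []
                    continue
                word, pos, chunk, ner = line.split("\t")
                current.append((word, pos, chunk, ner))
        if current:  # handle a file that does not end with an empty line
            sentences.append(current)
        return sentences

    # Hypothetical usage, assuming the directory layout above:
    # sentences = read_vlsp2016("data_dir/train.txt")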

🎓Training🎓

The commands below fine-tune PhoBERT for the token-classification task. Pre-trained weights are downloaded automatically from the Hugging Face Hub (a minimal loading sketch follows the argument list).

python main.py train --task vlsp2016 --run_test --data_dir ./datasets/vlsp2016 --model_name_or_path vinai/phobert-base --model_arch softmax --output_dir outputs --max_seq_length 256 --train_batch_size 32 --eval_batch_size 32 --learning_rate 3e-5 --epochs 20 --early_stop 2 --overwrite_data

or

bash ./train.sh

Arguments:

  • type (str, *required): The process type to run. Must be one of [train, test, predict, demo].
  • task (str, *optional): Training task selected in the list: [vlsp2016, vlsp2018_l1, vlsp2018_l2, vlsp2018_join]. Default=vlsp2016.
  • data_dir (Union[str, os.PathLike], *required): The input data directory. Should contain the data files for the task (e.g. train.txt, dev.txt, test.txt).
  • overwrite_data (bool, *optional): Whether to overwrite the previously split dataset. Default=False.
  • load_weights (Union[str, os.PathLike], *optional): Path to a pre-trained weights file.
  • model_name_or_path (str, *required): Pre-trained model selected in the list: [vinai/phobert-base, vinai/phobert-large, ...]. Default=vinai/phobert-base.
  • model_arch (str, *required): Model architecture selected in the list: [softmax, crf, lstm_crf].
  • output_dir (Union[str, os.PathLike], *required): The output directory where the model predictions and checkpoints will be written.
  • max_seq_length (int, *optional): The maximum total input sequence length after subword tokenization. Longer sequences are truncated, and shorter sequences are padded. Default=190.
  • train_batch_size (int, *optional): Total batch size for training. Default=32.
  • eval_batch_size (int, *optional): Total batch size for evaluation. Default=32.
  • learning_rate (float, *optional): The initial learning rate for Adam. Default=1e-4.
  • classifier_learning_rate (float, *optional): The initial learning rate for the classifier head. Default=5e-4.
  • epochs (float, *optional): Total number of training epochs to perform. Default=100.0.
  • weight_decay (float, *optional): Weight decay to apply, if any. Default=0.01.
  • adam_epsilon (float, *optional): Epsilon for the Adam optimizer. Default=5e-8.
  • max_grad_norm (float, *optional): Maximum gradient norm. Default=1.0.
  • early_stop (float, *optional): Number of evaluation rounds without improvement before training stops early. Default=10.0.
  • no_cuda (bool, *optional): Whether to avoid using CUDA even when it is available. Default=False.
  • run_test (bool, *optional): Whether to evaluate the best model on the test set after training. Default=False.
  • seed (int, *optional): Random seed for initialization. Default=42.
  • num_workers (int, *optional): How many subprocesses to use for data loading. 0 means the data is loaded in the main process. Default=0.
  • save_step (int, *optional): The number of steps between model checkpoints. Default=10000.
  • gradient_accumulation_steps (int, *optional): Number of update steps to accumulate before performing a backward/update pass (the effective batch size is train_batch_size × gradient_accumulation_steps). Default=1.
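For orientation, the sketch below shows how a PhoBERT checkpoint can be loaded for token classification with the Hugging Face transformers library. It is a minimal, repo-independent illustration rather than the project's actual training code, and the BIO label list is a hypothetical example.

    from transformers import AutoTokenizer, AutoModelForTokenClassification

    # Hypothetical BIO label set for illustration; the real labels come
    # from the tag scheme of the selected task.
    labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

    tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
    model = AutoModelForTokenClassification.from_pretrained(
        "vinai/phobert-base", num_labels=len(labels)
    )

    # PhoBERT expects word-segmented input (multi-syllable words joined
    # by underscores, as in the dataset section above).
    inputs = tokenizer("Dương là một chủ cửa_hàng lâu năm ở Hà_Nội .",
                       return_tensors="pt")
    logits = model(**inputs).logits  # shape: (1, sequence_length, num_labels)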

📈Tensorboard📈

The command below starts TensorBoard so you can follow the fine-tuning process.

tensorboard --logdir runs --host 0.0.0.0 --port=6006

🥇Performances🥇

All experiments were performed on an RTX 3090 with 24GB VRAM and a Xeon® E5-2678 v3 CPU with 64GB RAM, both of which are available for rent on vast.ai. The pre-trained models used for comparison are available on Hugging Face.

VLSP 2016

| Model | Arch | BIO Acc. | BIO P. | BIO R. | BIO F1 | BIO Acc. (w/o 'O') | NE Acc. | NE P. | NE R. | NE F1 | Log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bert-base-multilingual-cased [1] | Softmax | 0.9905 | 0.9239 | 0.8776 | 0.8984 | 0.9068 | 0.9905 | 0.8938 | 0.8941 | 0.8939 | Matrix / Log |
| | CRF | 0.9903 | 0.9241 | 0.8880 | 0.9048 | 0.9087 | 0.9903 | 0.8951 | 0.8945 | 0.8948 | Matrix / Log |
| | LSTM_CRF | 0.9905 | 0.9183 | 0.8898 | 0.9027 | 0.9178 | 0.9905 | 0.8879 | 0.8992 | 0.8935 | Matrix / Log |
| PhoBert-base [2] | Softmax | 0.9950 | 0.9312 | 0.9404 | 0.9348 | 0.9570 | 0.9950 | 0.9434 | 0.9466 | 0.9450 | Matrix / Log |
| | CRF | 0.9949 | 0.9497 | 0.9248 | 0.9359 | 0.9525 | 0.9949 | 0.9516 | 0.9456 | 0.9486 | Matrix / Log |
| | LSTM_CRF | 0.9949 | 0.9535 | 0.9181 | 0.9349 | 0.9456 | 0.9949 | 0.9520 | 0.9396 | 0.9457 | Matrix / Log |
| viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

VLSP 2018

Level 1

| Model | Arch | BIO Acc. | BIO P. | BIO R. | BIO F1 | BIO Acc. (w/o 'O') | NE Acc. | NE P. | NE R. | NE F1 | Log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bert-base-multilingual-cased [1] | Softmax | 0.9828 | 0.7421 | 0.7980 | 0.7671 | 0.8510 | 0.9828 | 0.7302 | 0.8339 | 0.7786 | Matrix / Log |
| | CRF | 0.9824 | 0.7716 | 0.7619 | 0.7601 | 0.8284 | 0.9824 | 0.7542 | 0.8127 | 0.7824 | Matrix / Log |
| | LSTM_CRF | 0.9829 | 0.7533 | 0.7750 | 0.7626 | 0.8296 | 0.9829 | 0.7612 | 0.8122 | 0.7859 | Matrix / Log |
| PhoBert-base [2] | Softmax | 0.9896 | 0.7970 | 0.8404 | 0.8170 | 0.8892 | 0.9896 | 0.8421 | 0.8942 | 0.8674 | Matrix / Log |
| | CRF | 0.9903 | 0.8124 | 0.8428 | 0.8260 | 0.8834 | 0.9903 | 0.8695 | 0.8943 | 0.8817 | Matrix / Log |
| | LSTM_CRF | 0.9901 | 0.8240 | 0.8278 | 0.8241 | 0.8715 | 0.9901 | 0.8671 | 0.8773 | 0.8721 | Matrix / Log |
| viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Level 2

| Model | Arch | BIO Acc. | BIO P. | BIO R. | BIO F1 | BIO Acc. (w/o 'O') | NE Acc. | NE P. | NE R. | NE F1 | Log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bert-base-multilingual-cased [1] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| PhoBert-base [2] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Join

| Model | Arch | BIO Acc. | BIO P. | BIO R. | BIO F1 | BIO Acc. (w/o 'O') | NE Acc. | NE P. | NE R. | NE F1 | Log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bert-base-multilingual-cased [1] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| PhoBert-base [2] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

References

[1] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT (pp. 4171–4186).

[2] Nguyen, D. Q., & Nguyen, A. T. (2020). PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 1037–1042).

[3] The, V. B., Thi, O. T., & Le-Hong, P. (2020). Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models. arXiv preprint arXiv:2006.15994.

🧠Inference🧠

The command below loads your fine-tuned model and runs inference on your text input (a decoding sketch follows the argument list).

python main.py predict --model_path outputs/best_model.pt

Arguments:

  • type (str, *required): The process type to run. Must be one of [train, test, predict, demo].
  • model_path (Union[str, os.PathLike], *optional): Path to the fine-tuned model file.
  • no_cuda (bool, *optional): Whether to avoid using CUDA even when it is available. Default=False.
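To give a sense of what decoding involves, the sketch below continues the loading example from the Training section and maps per-token logits to BIO tags with a simple argmax. This applies only to the softmax head (the crf and lstm_crf architectures decode with Viterbi instead) and is an illustration, not the repository's predict code.

    import torch

    # Continuing the Training-section sketch (tokenizer, model, labels, inputs):
    with torch.no_grad():
        logits = model(**inputs).logits      # (1, seq_len, num_labels)
    pred_ids = logits.argmax(dim=-1)[0]      # greedy per-token label ids
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    for token, idx in zip(tokens, pred_ids.tolist()):
        print(f"{token}\t{labels[idx]}")     # includes special and subword tokens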

🌟Demo🌟

The command below loads your fine-tuned model and starts the demo page.

python main.py demo --model_path outputs/best_model.pt

Arguments:

  • type (str, *required): The process type to run. Must be one of [train, test, predict, demo].
  • model_path (Union[str, os.PathLike], *optional): Path to the fine-tuned model file.
  • no_cuda (bool, *optional): Whether to avoid using CUDA even when it is available. Default=False.

💡Acknowledgements💡

Pre-trained PhoBERT model by VinAI Research and PyTorch implementation by Hugging Face.