OpenJarbas/ModelZoo

ModelZoo

trained models (with training scripts) for use across different projects

pip install JarbasModelZoo

Models

This package includes utility methods to download and load models.

Training scripts can be found in the train folder.

NER

| model_id | language | dataset | accuracy |
|---|---|---|---|
| nltk_clftagger_conll2003_NER | en | CONLL2003 | 0.874% |
| nltk_clftagger_gmb_NER | en | GMB 2.2.0 | 0% |
| nltk_clftagger_slsmovies_NER | en | MIT Movie Corpus | 0% |
| nltk_clftagger_slstrivia10k13_NER | en | MIT Movie Corpus - Trivia | 0.806% |
| nltk_clftagger_slsrestaurants_NER | en | MIT Restaurant Corpus | 0% |
| nltk_clftagger_onto5_NER | en | OntoNotes-5.0-NER-BIO | 0.910% |
| nltk_clftagger_paramopama_NER | pt | Paramopama | 0% |
| nltk_clftagger_paramopama+harem_NER | pt | Paramopama + HAREM (v2) | 0% |
| nltk_clftagger_WNUT17_NER | en | WNUT17 | 0% |
| nltk_clftagger_leNERbr_NER | pt-br | leNER-Br | 0% |

POSTAG

| model_id | language | dataset | tagset | accuracy |
|---|---|---|---|---|
| nltk_floresta_macmorpho_brill_tagger | pt | floresta + macmorpho | universal | 0% |
| nltk_brown_brill_tagger | en | brown | brown | 0.941% |
| nltk_brown_maxent_tagger | en | brown | brown | 0% |
| nltk_brown_ngram_tagger | en | brown | brown | 0.930% |
| nltk_floresta_brill_tagger | pt | floresta | VISL (Portuguese) | 0.938% |
| nltk_floresta_ngram_tagger | pt | floresta | VISL (Portuguese) | 0.925% |
| nltk_cess_cat_udep_brill_tagger | ca | cess_cat_udep | Universal Dependencies | 0.974% |
| nltk_cess_esp_udep_brill_tagger | es | cess_esp_udep | Universal Dependencies | 0.975% |
| nltk_macmorpho_unvtagset_brill_tagger | pt | macmorpho | Universal Dependencies | 0.966% |
| nltk_onto5_brill_tagger | en | OntoNotes-5.0-NER-BIO | Penn Treebank | 0% |
| nltk_treebank_clftagger | en | treebank | Penn Treebank | 0% |
| nltk_treebank_brill_tagger | en | treebank | Penn Treebank | 0% |
| nltk_treebank_ngram_tagger | en | treebank | Penn Treebank | 0% |
| nltk_treebank_maxent_tagger | en | treebank | Penn Treebank | 0% |
| nltk_treebank_tnt_tagger | en | treebank | Penn Treebank | 0% |
| nltk_nilc_brill_tagger | pt-br | NILC_taggers | NILC | 0.881% |
| nltk_nilc_ngram_tagger | pt-br | NILC_taggers | NILC | 0.869% |
| nltk_cess_cat_brill_tagger | ca | cess_cat | EAGLES | 0.939% |
| nltk_cess_esp_brill_tagger | es | cess_esp | EAGLES | 0.926% |
| nltk_macmorpho_brill_tagger | pt | macmorpho | | 0% |

Security Concerns With the Python pickle Module

The serialization process is very convenient when you need to save your object’s state to disk or to transmit it over a network.

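However, there’s one more thing you need to know about the Python pickle module: it’s not secure. The __setstate__ method is great for doing more initialization while unpickling, but it can also be used to execute arbitrary code during the unpickling process! The models in this package are distributed as pickle files (.pkl), so loading a downloaded model means trusting whoever produced that file.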
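To make the risk concrete, here is a minimal, harmless sketch of code running during unpickling. It uses the __reduce__ hook rather than __setstate__, because __reduce__ lets the pickle name any importable callable and its arguments; the Payload class and the "6 * 7" expression are made up for this illustration and are not part of JarbasModelZoo.

```python
import pickle


class Payload:
    """Hypothetical class, used only to illustrate the risk."""

    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('6 * 7')".
        # A malicious file could name any importable callable here.
        return (eval, ("6 * 7",))


data = pickle.dumps(Payload())

# Loading the bytes calls eval() instead of rebuilding a Payload instance.
result = pickle.loads(data)
print(result)
```

The value returned by pickle.loads is whatever the embedded call produced, not a Payload object, and the call happens before your code gets a chance to inspect anything.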

So, what can you do to reduce this risk? Train the models yourself with the provided scripts!
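If you still have to load a pickle you did not create, the standard library's pickle documentation describes a way to limit which globals a pickle may import, by overriding Unpickler.find_class. The sketch below follows that pattern; it is not something JarbasModelZoo is known to do, and the SAFE_GLOBALS allow-list and Payload class are hypothetical. A real tagger pickle would need the relevant nltk classes on the allow-list, and an allow-list reduces the risk rather than eliminating it.

```python
import io
import pickle

# Hypothetical allow-list, chosen only for this illustration.
SAFE_GLOBALS = {("builtins", "range"), ("builtins", "complex")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle asks to import a global.
        if (module, name) in SAFE_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads, but only allow-listed globals may be imported."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Payload:
    """Hypothetical hostile object: asks the unpickler to call eval()."""

    def __reduce__(self):
        return (eval, ("6 * 7",))


# Plain data and allow-listed globals load normally.
safe = restricted_loads(pickle.dumps([1, 2, range(3)]))

# The hostile pickle is rejected before eval() is ever called.
try:
    restricted_loads(pickle.dumps(Payload()))
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```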

Usage

Postag

from nltk import word_tokenize
from JarbasModelZoo import load_model

# the model is downloaded automatically if it is missing, then cached at
# ~/.local/share/JarbasModelZoo/brill_tagger_floresta_mcmorpho_pt.pkl
tagger = load_model("brill_tagger_floresta_mcmorpho_pt")
tokens = word_tokenize("Olá, o meu nome é Joaquim")
postagged = tagger.tag(tokens)
# [('Olá', 'NOUN'), (',', '.'), ('o', 'DET'), ('meu', 'PRON'), ('nome', 'NOUN'), ('é', 'VERB'), ('Joaquim', 'NOUN')]

# ~/.local/share/JarbasModelZoo/brill_tagger_cess_es.pkl
tagger = load_model("brill_tagger_cess_es")
tokens = word_tokenize("Hola, mi nombre es Daniel")
postagged = tagger.tag(tokens)
# [('Hola', 'NOUN'), (',', 'fc'), ('mi', 'DET'), ('nombre', 'NOUN'), ('es', 'VERB'), ('Daniel', 'NOUN')]

# ~/.local/share/JarbasModelZoo/brill_tagger_cess_ca.pkl
tagger = load_model("brill_tagger_cess_ca")
tokens = word_tokenize("Quién es el presidente de Cataluña?")
postagged = tagger.tag(tokens)
# [('Quién', 'NOUN'), ('es', 'PRON'), ('el', 'DET'), ('presidente', 'NOUN'), ('de', 'ADP'), ('Cataluña', 'NOUN'), ('?', 'fit')]