wooorm/trigrams

trigrams

Trigrams for 460+ languages.

What is this?

This package exposes trigrams for natural languages. They are based on the most translated copyright-free document on this planet: the UDHR (Universal Declaration of Human Rights).

When should I use this?

When you are dealing with natural language detection.

Install

This package is ESM only. In Node.js (version 14.14+, 16.0+), install with npm:

npm install trigrams

In Deno with esm.sh:

import {top, min} from 'https://esm.sh/trigrams@5'

In browsers with esm.sh:

<script type="module">
  import {top, min} from 'https://esm.sh/trigrams@5?bundle'
</script>

Use

import {top, min} from 'trigrams'

console.log((await top()).pam)
console.log((await min()).nld)

Yields:

{ // 300 top trigrams.
  'isa': 6,
  'upa': 6,
  'i k': 6,
  // …
  'ang': 273,
  'ing': 282,
  'ng ': 572 // Most common trigram with how often it was found.
}
[ // 300 top trigrams.
  ' ar',
  'eer',
  'tij',
  // …
  'de ',
  'an ',
  'en ' // Most common trigram.
]
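The counts object can be turned into a ranked list when only the order matters. The following is a minimal sketch, not part of this package: `byFrequency` is a hypothetical helper, and `sample` holds just four of the entries from the output above rather than a full result of `top()`.

```javascript
// Sample shaped like one language entry of `await top()`.
// The four counts are copied from the output shown above; the rest is omitted.
const sample = {'isa': 6, 'ang': 273, 'ing': 282, 'ng ': 572}

// Hypothetical helper: list trigrams from most to least frequent.
function byFrequency(counts) {
  return Object.entries(counts)
    .sort((a, b) => b[1] - a[1]) // Highest count first.
    .map(([trigram]) => trigram)
}

console.log(byFrequency(sample)) // [ 'ng ', 'ing', 'ang', 'isa' ]
```

In real use, the argument would be something like `(await top()).pam`.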

API

This package exports the identifiers top and min. There is no default export.

top()

Get the top trigrams mapped to their occurrence counts.

Returns

A promise resolving to an object that maps UDHR in Unicode codes to objects, each of which maps the top 300 trigrams to their occurrence counts (Promise<Record<string, Record<string, number>>>).

min()

Get top trigrams.

Returns

A promise resolving to an object that maps UDHR in Unicode codes to arrays containing the top 300 trigrams, sorted from least occurring to most occurring (Promise<Record<string, Array<string>>>).
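To illustrate how such lists could feed language detection, here is a rough sketch. It is an assumption-laden toy, not the method of this package or of any particular detection library: `toTrigrams`, `score`, and `guess` are hypothetical helpers, the scoring is a simple weighted overlap, and `profiles` contains only the six trigrams per language shown in the example output above instead of the full 300.

```javascript
// Excerpts shaped like `await min()`: sorted from least to most occurring.
// Only the trigrams shown in the example output above are included.
const profiles = {
  nld: [' ar', 'eer', 'tij', 'de ', 'an ', 'en '],
  pam: ['isa', 'upa', 'i k', 'ang', 'ing', 'ng ']
}

// Hypothetical helper: split already cleaned text into trigrams.
function toTrigrams(text) {
  const result = []
  for (let i = 0; i + 3 <= text.length; i++) {
    result.push(text.slice(i, i + 3))
  }
  return result
}

// Hypothetical scoring: each matching trigram adds its position in the
// profile plus one, so more common trigrams weigh more.
function score(text, profile) {
  let total = 0
  for (const trigram of toTrigrams(text)) {
    total += profile.indexOf(trigram) + 1 // `indexOf` is -1 when missing, adding 0.
  }
  return total
}

// Pick the code whose profile scores highest.
function guess(text) {
  return Object.keys(profiles)
    .map((code) => [code, score(text, profiles[code])])
    .sort((a, b) => b[1] - a[1])[0][0]
}

console.log(guess(' een man en de tijd ')) // nld
```

With the full data, `profiles` would be the resolved value of `min()`. A realistic detector would also need to handle ties and very short input.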

Data

The trigrams are based on the Unicode versions of the Universal Declaration of Human Rights.

The files are created from all paragraphs made available by wooorm/udhr and do not include headings and such.

Before creating trigrams,

  • the Unicode characters from \u0021 to \u0040 (inclusive) are removed
  • one or more white space characters (\s+) are replaced with a single space
  • uppercase alphabetic characters ([A-Z]) are lowercased

Additionally, the input is padded with two spaces on both sides.
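The steps above can be sketched as follows. This is an illustration under stated assumptions, not the code used to generate the files: `clean` and `trigrams` are hypothetical names, and `toLowerCase()` is used for brevity even though it also lowercases letters outside `[A-Z]`.

```javascript
// Hypothetical helper mirroring the cleaning steps listed above.
function clean(value) {
  return value
    .replace(/[\u0021-\u0040]/g, '') // Remove `!` through `@` (this includes digits).
    .replace(/\s+/g, ' ') // Collapse runs of white space into one space.
    .toLowerCase() // Assumption: lowercases all letters, not only A-Z.
}

// Hypothetical helper: pad with two spaces on both sides, then slide a
// window of three characters over the result.
function trigrams(value) {
  const padded = '  ' + clean(value) + '  '
  const result = []
  for (let i = 0; i + 3 <= padded.length; i++) {
    result.push(padded.slice(i, i + 3))
  }
  return result
}

console.log(trigrams('Hi!')) // [ '  h', ' hi', 'hi ', 'i  ' ]
```

Input passed to a detector would need the same cleaning for its trigrams to line up with the ones in this package.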

Code Name
007 Sãotomense
008 Crioulo, Upper Guinea (008)
009 Mbundu (009)
010 Tetun Dili
011 Umbundu (011)
013 (Mijisa)
014 (Maiunan)
016 (Minjiang, spoken)
017 (Minjiang, written)
020 Drung
021 (Muzzi)
022 (Klau)
025 (Bizisa)
026 (Yeonbyeon)
027 Gumuz
028 Kafa
029 Sidamo
030 Kituba (2)
032 South Azerbaijani
041 Latvian (2)
042 Spanish (resolution)
043 Zarma
aar Afar
abk Abkhaz
ace Aceh
acu Achuar-Shiwiar
acu_1 Achuar-Shiwiar (1)
ada Dangme
ady Adyghe
afr Afrikaans
agr Aguaruna
aii Assyrian Neo-Aramaic
ajg Aja
aka_akuapem Twi (Akuapem)
aka_asante Twi (Asante)
aka_fante Fante
als Albanian, Tosk
alt Altai, Southern
amc Amahuaca
ame Yaneshaʼ
amh Amharic
ami Amis
amr Amarakaeri
arb Arabic, Standard
arl Arabela
arn Mapudungun
ast Asturian
auc Waorani
auv Occitan (Auvergnat)
ayr Aymara, Central
azj_cyrl Azerbaijani, North (Cyrillic)
azj_latn Azerbaijani, North (Latin)
bam Bamanankan
ban Bali
bax Bamun
bba Baatonum
bci Baoulé
bcl Bicolano, Central
bel Belarusan
bem Bemba
ben Bengali
bfa Bari
bho Bhojpuri
bin Edo
bis Bislama
blt Tai Dam
blu Hmong Njua
boa Bora
bod Tibetan, Central
bos_cyrl Bosnian (Cyrillic)
bos_latn Bosnian (Latin)
bre Breton
btb Bulu
buc Bushi
bug Bugis
bul Bulgarian
cab Garifuna
cak Kaqchikel, Central
cat Catalan-Valencian-Balear
cbi Chachi
cbr Cashibo-Cacataibo
cbs Cashinahua
cbt Chayahuita
cbu Candoshi-Shapra
ccx Zhuang, Yongbei
ceb Cebuano
ces Czech
cha Chamorro
chj Chinantec, Ojitlán
chk Chuukese
chr_cased Cherokee (cased)
chr_uppercase Cherokee (uppercase)
chv Chuvash
cic Chickasaw
cjk Chokwe
cjk_AO Chokwe (Angola)
cjs Shor
ckb Kurdish, Central
cnh Chin, Haka
cni Asháninka
cnr Montenegrin
cof Colorado
cos Corsican
cot Caquinte
cpu Ashéninka, Pichis
crh Crimean Tatar
crs Seselwa Creole French
csa Chinantec, Chiltepec
csw Cree, Swampy
ctd Chin, Tedim
cym Welsh
dag Dagbani
dan Danish
ddn Dendi
deu_1901 German, Standard (1901)
deu_1996 German, Standard (1996)
dga Dagaare, Southern
dip Dinka, Northeastern
div Maldivian
dyo Jola-Fonyi
dyu Jula
dzo Dzongkha
ell_monotonic Greek (monotonic)
ell_polytonic Greek (polytonic)
emk Maninkakan, Eastern
eml Romagnolo
eng English
epo Esperanto
ese Ese Ejja
est Estonian
eus Basque
eve Even
evn Evenki
ewe Éwé
fao Faroese
fij Fijian
fin Finnish
fkv Finnish, Kven
flm Chin, Falam
fon Fon
fra French
fri Frisian, Western
fuf Pular
fur Friulian
fuv Fulfulde, Nigerian
fuv2 Fulfulde, Nigerian (2)
fvr Fur
gaa Ga
gag Gagauz
gax Oromo, Borana-Arsi-Guji
gjn Gonja
gkp Kpelle, Guinea
gla Gaelic, Scottish
gld Nanai
gle Gaelic, Irish
glg Galician
glv Manx
gsw1 Alemannisch (Elsassisch)
guc Wayuu
gug Guaraní, Paraguayan
guj Gujarati
guu Yanomamö
gyr Guarayu
hat_kreyol Haitian Creole French (Kreyol)
hat_popular Haitian Creole French (Popular)
hau_NE Hausa (Niger)
hau_NG Hausa (Nigeria)
hau_3 Hausa
haw Hawaiian
hea Hmong, Northern Qiandong
heb Hebrew
hil Hiligaynon
hin Hindi
hlt Chin, Matu
hms Hmong, Southern Qiandong
hna Gen
hni Hani
hns Hindustani, Sarnami
hrv Croatian
hsb Sorbian, Upper
hsf Huastec (Sierra de Otontepec)
hun Hungarian
hus Huastec (Veracruz)
huu Huitoto, Murui
hva Huastec (San Luís Potosí)
hye Armenian
ibb Ibibio
ibo Igbo
ido Ido
idu Idoma
ijs Ijo, Southeast
ike Inuktitut, Eastern Canadian
ilo Ilocano
ina Interlingua
ind Indonesian
isl Icelandic
ita Italian
jav Javanese (Latin)
jav_java Javanese (Javanese)
jiv Shuar
jpn Japanese
jpn_osaka Japanese (Osaka)
jpn_tokyo Japanese (Tokyo)
kaa Karakalpak
kal Inuktitut, Greenlandic
kan Kannada
kat Georgian
kaz Kazakh
kbd Kabardian
kbp Kabiyé
kde Makonde
kdh Tem
kea Kabuverdianu
kek Q'eqchi'
kha Khasi
khk Mongolian, Halh (Cyrillic)
khm Khmer, Central
kin Rwanda
kir Kirghiz
kjh Khakas
kkh_lana Khün
kmb Mbundu
kmr Kurdish, Northern
knc Kanuri, Central
kng Koongo
kng_AO Koongo (Angola)
koi Komi-Permyak
koo Konjo
kor Korean
kqn Kaonde
kqs Kissi, Northern
kri Krio
krl Karelian
ktu Kituba
kwi Awa-Cuaiquer
lad Ladino
lao Lao
lat Latin
lat_1 Latin (1)
lav Latvian
lia Limba, West-Central
lij Ligurian
lin Lingala
lin_tones Lingala (tones)
lit Lithuanian
lld Ladin
lnc Occitan (Languedocien)
lns Lamnso'
lob Lobi
lot Otuho
loz Lozi
ltz Luxembourgeois
lua Luba-Kasai
lue Luvale
lug Ganda
lun Lunda
lus Mizo
mad Madura
mag Magahi
mah Marshallese
mai Maithili
mal Malayalam
mal_chillus Malayalam
mam Mam, Northern
mar Marathi
maz Mazahua Central
mcd Sharanahua
mcf Matsés
men Mende
mfq Moba
mic Micmac
min Minangkabau
miq Mískito
mkd Macedonian
mlt Maltese
mly_arab Malay (Arabic)
mly_latn Malay (Latin)
mnw Mon
mor Moro
mos Mòoré
mri Maori
mto Mixe, Totontepec
mxi Mozarabic
mxv Mixtec, Metlatónoc
mya Burmese
mzi Mazatec, Ixcatlán
nav Navajo
nba Nyemba
nbl Ndebele
ndo Ndonga
nds Saxon, Low
nep Nepali
nhn Nahuatl, Central
nio Nganasan
niu Niue
niv Gilyak
njo Naga, Ao
nku Kulango, Bouna
nld Dutch
nno Norwegian, Nynorsk
nob Norwegian, Bokmål
not Nomatsiguenga
nso Sotho, Northern
nya_chechewa Nyanja (Chechewa)
nya_chinyanja Nyanja (Chinyanja)
nym Nyamwezi
nyn Nyankore
nzi Nzema
oaa Orok
oci_1 Occitan (Francoprovençal, Fribourg)
oci_2 Occitan (Francoprovençal, Savoie)
oci_3 Occitan (Francoprovençal, Vaud)
oci_4 Occitan (Francoprovençal, Valais)
ojb Ojibwa, Northwestern
oki Okiek
orh Oroqen
oss Osetin
ote Otomi, Mezquital
pam Pampangan
pan Panjabi, Eastern
pap Papiamentu
pau Palauan
pbb Páez
pbu Pashto, Northern
pcd Picard
pcm Pidgin, Nigerian
pes_1 Farsi, Western
pes_2 Dari
pis Pijin
piu Pintupi-Luritja
plt Malagasy, Plateau
pnb Panjabi, Western
pol Polish
pon Pohnpeian
por_BR Portuguese (Brazil)
por_PT Portuguese (Portugal)
pov Crioulo, Upper Guinea
ppl Pipil
prv Occitan
quc K'iche', Central
qud Quechua (Unified Quichua, old Hispanic orthography)
qug Quichua, Chimborazo Highland
quy Quechua, Ayacucho
quz Quechua, Cusco
qva Quechua, Ambo-Pasco
qvc Quechua, Cajamarca
qvh Quechua, Huamalíes-Dos de Mayo Huánuco
qvm Quechua, Margos-Yarowilca-Lauricocha
qvn Quechua, North Junín
qwh Quechua, Huaylas Ancash
qxa Quechua, South Bolivian
qxn Quechua, Northern Conchucos Ancash
qxu Quechua, Arequipa-La Unión
rar Rarotongan
rmn Romani, Balkan
rmn_1 Romani, Balkan (1)
rmy Aromanian
roh Romansch
roh_puter Romansch (Puter)
roh_rumgr Romansch (Grischun)
roh_surmiran Romansch (Surmiran)
roh_sursilv Romansch (Sursilvan)
roh_sutsilv Romansch (Sutsilvan)
roh_vallader Romansch (Vallader)
ron_1953 Romanian (1953)
ron_1993 Romanian (1993)
ron_2006 Romanian (2006)
run Rundi
rus Russian
sag Sango
sah Yakut
san Sanskrit
sco Scots
sey Secoya
shk Shilluk
shn Shan
shp Shipibo-Conibo
sin Sinhala
skr Seraiki
slk Slovak
slr Salar
slv Slovenian
sme Saami, North
smo Samoan
sna Shona
snk Soninke
snn Siona
som Somali
sot Sotho, Southern
spa Spanish
src Sardinian, Logudorese
srp_cyrl Serbian (Cyrillic)
srp_latn Serbian (Latin)
srr Serer-Sine
ssw Swati
suk Sukuma
sun Sunda
sus Susu
swb Comorian, Maore
swe Swedish
swh Swahili
tah Tahitian
tam Tamil
tam_LK Tamil (Sri Lanka)
tat Tatar
tbz Ditammari
tca Ticuna
tel Telugu
tem Themne
tet Tetun
tgk Tajiki
tgl Tagalog
tha Thai
tha2 Thai (2)
tir Tigrigna
tiv Tiv
tly Talysh
tob Toba
toi Tonga
toj Tojolabal
ton Tongan
top Totonac, Papantla
tpi Tok Pisin
tsn Tswana
tso_MZ Tsonga (Mozambique)
tso_ZW Tsonga (Zimbabwe)
tsz Purepecha
tuk_cyrl Turkmen (Cyrillic)
tuk_latn Turkmen (Latin)
tur Turkish
tyv Tuva
tzc Tzotzil (Chamula)
tzh Tzeltal, Oxchuc
tzm Tamazight, Central Atlas
udu Uduk
uig_arab Uyghur (Arabic)
uig_latn Uyghur (Latin)
ukr Ukrainian
umb Umbundu
ura Urarina
urd Urdu
urd_2 Urdu (2)
uzn_cyrl Uzbek, Northern (Cyrillic)
uzn_latn Uzbek, Northern (Latin)
vai Vai
vec Venetian
ven Venda
ven2 Venda
vep Veps
vie Vietnamese
vmw Makhuwa
war Waray-Waray
wln Walloon
wol Wolof
wwa Waama
xho Xhosa
xsm Kasem
yad Yagua
yao Yao
yap Yapese
ydd Yiddish, Eastern
ykg Yukaghir, Northern
yor Yoruba
yrk Nenets
yua Maya, Yucatán
zam Zapotec, Miahuatlán
zdj Comorian, Ngazidja
zgh Tamazight, Standard Moroccan
zro Záparo
ztu Zapotec, Güilá
zul Zulu
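Several languages in the table appear under more than one code, often with a suffix after an underscore (for example `por_BR` and `por_PT`). A small sketch of collecting such variants is shown below. `variants` is a hypothetical helper, and `codes` is a short excerpt from the table; in real use it could be the keys of the object that `top()` or `min()` resolves to.

```javascript
// Excerpt of codes from the table above.
const codes = ['pol', 'pon', 'por_BR', 'por_PT', 'pov']

// Hypothetical helper: match the bare code or the code followed by `_`.
// Codes that add a digit without an underscore (such as `fuv2`) are not matched.
function variants(base, all) {
  return all.filter((code) => code === base || code.startsWith(base + '_'))
}

console.log(variants('por', codes)) // [ 'por_BR', 'por_PT' ]
```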

Types

This package is fully typed with TypeScript. It exports no additional types.

Compatibility

This package is at least compatible with all maintained versions of Node.js. As of now, that is Node.js 14.14+ and 16.0+. It also works in Deno and modern browsers.

Contribute

Yes please! See How to Contribute to Open Source.

Security

This package is safe.

License

MIT © Titus Wormer