Standards used in Language Technology and Linguistics

Language-related ISO standards

Language and Language Family Identification

  • ISO 639-1

  • ISO 639-2

  • ISO 639-3

  • ISO 639-4

  • ISO 639-5

  • ISO 639-6

  • Language tags as defined by the Internet Engineering Task Force (IETF)

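  • BCP 47: Best Current Practice 47, which currently comprises RFC 5646 (Tags for Identifying Languages) and RFC 4647 (Matching of Language Tags).

  • RFC 5646, which superseded RFC 4646, which in turn superseded RFC 3066. RFC 5646 added ISO 639-3 codes to the language subtag registry, so standards that reference BCP 47 rather than one specific RFC can now use ISO 639-3 codes.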
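To make the structure of these language tags concrete, here is a minimal Python sketch that splits a tag into its language, script, and region subtags. It is deliberately simplified and is not a full RFC 5646 parser: it ignores variants, extensions, private-use subtags, and grandfathered tags, and it does not check subtags against the IANA registry. The function name split_language_tag is hypothetical, chosen only for this illustration.

```python
import re

# Simplified shape only: language[-script][-region].
# Assumption: variants, extensions, private-use and grandfathered tags
# are out of scope, so such tags are reported as unparsed (None).
_SIMPLE_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"                # ISO 639 code, 2 or 3 letters
    r"(?:-(?P<script>[A-Za-z]{4}))?"              # ISO 15924 script, 4 letters
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"     # ISO 3166-1 or UN M.49 region
)


def split_language_tag(tag):
    """Return a dict of subtags, or None if the tag is outside the simple shape."""
    match = _SIMPLE_TAG.fullmatch(tag)
    if match is None:
        return None
    script = match.group("script")
    region = match.group("region")
    return {
        # Conventional casing: lowercase language, Titlecase script, UPPERCASE region.
        "language": match.group("language").lower(),
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
    }


if __name__ == "__main__":
    for example in ("en", "zh-Hant-TW", "es-419", "yue"):
        print(example, split_language_tag(example))
```

Tags are case-insensitive, so the sketch normalises casing to the conventional form. For real applications, a library that validates against the registry is a safer choice than a regular expression like this one.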

Character Encoding

  • Unicode
    • UTF-8
    • UTF-16
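As a small illustration of how the two encodings differ, the following Python sketch reports how many bytes the same text occupies in UTF-8 and in UTF-16. It uses only the standard str.encode method; the helper name encoded_sizes is hypothetical, and the little-endian variant "utf-16-le" is chosen here so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Count code points and encoded bytes for the given text."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # "utf-16-le" omits the byte order mark that plain "utf-16" would prepend.
        "utf16_bytes": len(text.encode("utf-16-le")),
    }


if __name__ == "__main__":
    # ASCII letter, CJK character, and a character outside the Basic Multilingual Plane.
    for sample in ("a", "\u8a9e", "\U0001D11E"):
        print(repr(sample), encoded_sizes(sample))
```

ASCII text is smaller in UTF-8, many CJK characters are smaller in UTF-16, and characters outside the Basic Multilingual Plane take four bytes in both, since UTF-16 represents them as a surrogate pair.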

Script Identification Standards

Metadata Standards

i18n / Locale data

  • Unicode's CLDR (Common Locale Data Repository): uses several hundred codes from ISO 639-3 that are not included in ISO 639-2.

Text Markup Formats

Documents

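  • HTML5: identifies the language of content through the lang attribute, whose values are IETF BCP 47 language tags.
  • Text Encoding Initiative (TEI): identifies the language of content through the xml:lang attribute, whose values are IETF BCP 47 language tags.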

Corpora

Lexicons

  • Lexical Markup Framework (LMF, ISO 24613): ISO standard for representing machine-readable dictionaries.