Add ISO 15924 codes to orthographies #171

Open
justinpenner opened this issue May 21, 2024 · 5 comments
Labels: data (Issues in the language data), help wanted (Extra attention is needed), needs more information

@justinpenner
Contributor

I was hoping the script names in each orthography would be machine-readable, as README_database.md mentions that script names should follow ISO 15924. However, when I match them up with the ISO standard, the following don't match:

Ancient South Arabian
Bamun
Bengali
Burmese
Chinese
Coptic/Nubian
Cree
Devanagari
Egyptian Hieroglyphs
Ge'ez
Ge'ez/Fidel
Georgian
Hangul
Hanja
Hanunoo
Inuktitut Syllabics
Kanji
Modern Yi
Mon-Burmese
N'Ko
Ojibwe Syllabics
Oriya
Sumero-Akkadian Cuneiforms
Tai viet
Tham
Tifinagh

I'm not sure of the best way to solve this within Hyperglot, but it would be nice if the API exposed ISO codes for all scripts, or standard names that allow ISO codes to be looked up easily.

Since some of the non-ISO names listed above contain clarifying information (Ge'ez/Fidel, Inuktitut, Ojibwe, Modern Yi) that doesn't exist in the ISO names, maybe some could be moved to a script_preferred_name field, similar to what is already done for language names? Others that almost match the ISO names (Devanagari, Hangul, Ge'ez, etc.) could be corrected to match ISO 15924.
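
To make the idea concrete, here is a minimal sketch of how such a field could behave. The `script_preferred_name` key is only the field proposed above, not something in Hyperglot's schema, and the records and the `display_script` helper are hypothetical.

```python
# Hypothetical orthography records. `script` holds the ISO 15924 name;
# `script_preferred_name` is the proposed optional field, not an existing one.
orthographies = [
    {"script": "Yi", "script_preferred_name": "Modern Yi"},
    {"script": "Unified Canadian Aboriginal Syllabics",
     "script_preferred_name": "Inuktitut Syllabics"},
    {"script": "Latin"},
]


def display_script(orthography: dict) -> str:
    """Return the preferred name if one is set, otherwise the ISO name."""
    return orthography.get("script_preferred_name", orthography["script"])


for ort in orthographies:
    print(ort["script"], "->", display_script(ort))
```

With this shape the ISO name stays machine-readable while the clarifying name is kept for display.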

@kontur
Contributor

kontur commented May 28, 2024

Thanks, this is a good issue. I've actually run into this when making programmatic comparisons to UDHR/CLDR/gflangs data.

Basically Hyperglot has taken some liberties, but your list shows the need for some kind of reverse matching back to 15924, because this would increase the usefulness of the database. Also, looking at this list of ISO and Unicode script names, the Unicode name often seems less convoluted, and there are cases where Hyperglot uses that.

Using your compiled list, I've gone through and categorized the differences to ISO 15924 like so:

Hyperglot: 15924

# "Simpler" name chosen in Hyperglot
# These match more or less 1:1 to 15924, but with less specificity in the name. It would be trivial to create a mapping:
- Bengali: Bengali (Bangla)
- Devanagari: Devanagari (Nagari)
- Georgian: Khutsuri (Asomtavruli and Nuskhuri)
- Hangul: Hangul (Hangŭl, Hangeul)
- Hanunoo: Hanunoo (Hanunóo)
- Oriya: Oriya (Odia)
- Tham: Tai Tham (Lanna)
- Tifinagh: Tifinagh (Berber)

# Different name chosen in Hyperglot
# These seem like deliberate choices in Hyperglot, but can be mapped 1:1 for a mapping.
- Burmese: Myanmar (Burmese)
- Cree: Unified Canadian Aboriginal Syllabics
- Inuktitut Syllabics: Unified Canadian Aboriginal Syllabics
- Modern Yi: Yi — _This seems like a distinction to Traditional Yi which 15924 doesn't make_
- Mon-Burmese: ? — _Not sure if this is included in Myanmar (Burmese) in 15924 or if the two can be equated, in which case Hyperglot should do so as well_
- Ojibwe Syllabics: Unified Canadian Aboriginal Syllabics (?) — _Also see https://github.com/rosettatype/hyperglot/issues/150 re Ojibwe/Ojibwa disambiguation_
- Sumero-Akkadian Cuneiforms: Cuneiform, Sumero-Akkadian — _one way or the other, could be simply Cuneiform, imo_
- Ancient South Arabian: Old South Arabian — _seems a trivial difference, could be corrected_

# Unify/Correct
# These I think we simply could fix in Hyperglot.
- Ge'ez: _use also where Ge'ez/Fidel is used_
- Bamun: _use Bamum spelling instead, seems more prevalent_
- Coptic/Nubian: Coptic — _I think we could use Coptic for simplicity_

# Spelling
# These map 1:1, just tiny string matching differences
- Egyptian Hieroglyphs: Egyptian hieroglyphs
- N'Ko: N’Ko
- Tai viet: Tai Viet — _To follow other Title Case spellings this should be spelled Tai Viet in Hyperglot_

# Different approach for Chinese 
# These form their own group stemming from what I think was a deliberate disambiguation of "Han" used in 15924; I suppose these could be 1:1 mapped like so:
- Hanja: Han (Hanzi, Kanji, Hanja)
- Chinese: Han (Simplified/Traditional)
- Kanji: Han (Hanzi, Kanji, Hanja)

(Also, I don't know "Unified Canadian Aboriginal Syllabics" too well; Hyperglot seemingly disambiguates these, but I don't know to what extent there truly are specific differences for those languages/scripts in terms of characters used etc.)

If we look at that list there really aren't any controversial differences or things that could not be mapped back to ISO. There clearly are a few instances where Hyperglot could just use the 15924 name, but most other differences actually seem to fall into two categories: "simpler"/"user friendly" name, or deliberate choice.
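
As a rough illustration of how small such a reverse mapping could be, the sketch below copies a few of the name pairs from the list above into a dictionary. The names `HYPERGLOT_TO_ISO_NAME` and `iso_name` are made up for this example and are not part of Hyperglot.

```python
# Hyperglot script name -> ISO 15924 English name, copied from a few of the
# entries categorized above. Illustrative subset only.
HYPERGLOT_TO_ISO_NAME = {
    "Bengali": "Bengali (Bangla)",
    "Devanagari": "Devanagari (Nagari)",
    "Hangul": "Hangul (Hangŭl, Hangeul)",
    "Tham": "Tai Tham (Lanna)",
    "Burmese": "Myanmar (Burmese)",
    "Cree": "Unified Canadian Aboriginal Syllabics",
    "Inuktitut Syllabics": "Unified Canadian Aboriginal Syllabics",
    "Modern Yi": "Yi",
}


def iso_name(hyperglot_name: str) -> str:
    """Map a Hyperglot script name to its ISO 15924 name.

    Names that already match ISO (e.g. "Latin") pass through unchanged.
    """
    return HYPERGLOT_TO_ISO_NAME.get(hyperglot_name, hyperglot_name)


print(iso_name("Burmese"))
print(iso_name("Latin"))
```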

In terms of a practical solution to your use case: I suppose we could maintain a 1:1 mapping in a file, and with a particular flag/parameter the parsed yamls would use 15924 script names instead of the default Hyperglot script names. In fact, I don't think we even need a flag or parameter, at least not for programmatic use of the library; orthographies could just always include the 15924 script code from this 1:1 mapping file, and I suppose it is in programmatic use that this difference in script name/code really matters.
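
A minimal sketch of that second variant, assuming the mapping file has already been loaded into a plain dictionary. `SCRIPT_ISO` and `with_script_iso` are hypothetical names, and the orthography dicts only stand in for parsed YAML, so this is not how the library actually does it.

```python
# Stand-in for the contents of a 1:1 mapping file (illustrative subset).
SCRIPT_ISO = {
    "Latin": "Latn",
    "Cyrillic": "Cyrl",
    "Burmese": "Mymr",
}


def with_script_iso(orthography: dict, mapping: dict = SCRIPT_ISO) -> dict:
    """Return a copy of a parsed orthography with a `script_iso` key added.

    The Hyperglot script name is left untouched; unmapped names yield None.
    """
    enriched = dict(orthography)
    enriched["script_iso"] = mapping.get(orthography.get("script"))
    return enriched


parsed = {"script": "Burmese", "base": "က ခ ဂ"}
print(with_script_iso(parsed))
```

Keeping the code in one central mapping means individual language YAML files need no extra field.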

Or what do you think would be a simple and usable approach? I am reluctant to clutter the yaml files with an additional script_preferred_name everywhere, since that is prone to errors from not being applied globally when authoring a single yaml file for a language.

@kontur
Contributor

kontur commented May 30, 2024

@justinpenner I've added some of the script name fixes I proposed above, and implemented a way to map all Hyperglot script names to ISO 15924 in the branch script-names.

You'd get the code like this:

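from hyperglot.language import Language
from hyperglot.orthography import Orthography

# Load a language by its ISO 639-3 code and take its orthography
eng = Language("eng")
ort = eng.get_orthography()
# On the script-names branch, the orthography exposes the mapped ISO 15924 code
print(Orthography(ort)["script_iso"])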

If you check out the branch and test this, does it serve your needs for mapping back to ISO 15924? Happy to hear any suggestions you may have.

@kontur
Contributor

kontur commented May 30, 2024

@MrBrezina FYI see the above commits with script name changes; I think those are mostly minor and make sense from what I could gather from Wikipedia/Omniglot. Happy to revise. I think the case we may want to consider some more is the couple of languages for which ISO has the script "Unified Canadian Aboriginal Syllabics" and Hyperglot has unique Syllabics for each language.

@justinpenner
Contributor Author

Thanks for implementing this so quickly! It works for me, although I just ran a quick script to list all script names and ISO codes and noticed a few errors:

Abs Inuktitut Syllabics # ISO should be Cans
Arab Arabic
Armn Armenian
Avst Avestan
Bamu Bamum
Beng Bengali
Cakm Chakma
Cans Cree
Cans Ojibwe Syllabics
Cher Cherokee
Copt Coptic
Cyrl Cyrillic
Deva Devanagari
Egyp Egyptian Hieroglyphs
Ethi Ge'ez # use reverse comma modifier, not vertical single quote
Geok Georgian
Grek Greek
Gujr Gujarati
Guru Gurmukhi
Hang Hangul
Hano Hanunoo
Hans Chinese # Hans and Hant have equal status as Chinese writing systems, so maybe change to Hani which encompasses both?
Hebrew Hebrew # ISO should be Hebr
Hira Hiragana
Hrkt Katakana # ISO should be Kana
Kali Kayah Li
Khmr Khmer
Knda Kannada
Lana Tham
Laoo Lao
Latn Latin
Lina Linear A
Linb Linear B
Mlym Malayalam
Mymr Burmese
Orya Oriya
Sarb Ancient South Arabian
Sinh Sinhala
Syrc Syriac
Taml Tamil
Tavt Tai Viet
Telu Telugu
Tfng Tifinagh
Thaa Thaana
Thai Thai
Tibt Tibetan
Vaii Vai
Xsux Sumero-Akkadian Cuneiforms
Yiii Modern Yi
hani Kanji # ISO should be Hani
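
Some of these slips can be caught mechanically, because ISO 15924 alphabetic codes are always four letters with only the first capitalized. The sketch below runs that shape check over a handful of rows copied from the listing; `rows` and `malformed` are illustrative names only.

```python
import re

# A few (code, name) pairs copied from the listing above.
rows = [
    ("Abs", "Inuktitut Syllabics"),
    ("Cans", "Cree"),
    ("Hebrew", "Hebrew"),
    ("Hrkt", "Katakana"),
    ("hani", "Kanji"),
]

# ISO 15924 alpha-4 shape: one uppercase letter followed by three lowercase.
CODE_SHAPE = re.compile(r"[A-Z][a-z]{3}")


def malformed(pairs):
    """Return the pairs whose code does not have the alpha-4 shape."""
    return [(code, name) for code, name in pairs
            if not CODE_SHAPE.fullmatch(code)]


for code, name in malformed(rows):
    print(f"bad code {code!r} for {name}")
```

This flags `Abs`, `Hebrew`, and `hani`. It cannot flag `Hrkt` for Katakana, which is well-formed but is the wrong code, so a shape check complements manual review rather than replacing it.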

And a few more minor comments:

I think the case we may want to consider some more is the couple of languages for which ISO has the script "Unified Canadian Aboriginal Syllabics" and Hyperglot has unique Syllabics for each language.

Wikipedia describes UCAS as a "family of writing systems", so I agree with Hyperglot's approach of distinguishing the individual scripts but mapping them all to the same ISO. ISO sometimes groups similar/closely related scripts together. The Yi scripts are another example.
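
A many-to-one mapping like this is easy to inspect by inverting it. The following sketch groups script names by code, using entries taken from the listing above; the dictionary and function names are made up for illustration.

```python
from collections import defaultdict

# Hyperglot script name -> ISO 15924 code (subset of the listing above).
script_to_iso = {
    "Cree": "Cans",
    "Inuktitut Syllabics": "Cans",
    "Ojibwe Syllabics": "Cans",
    "Modern Yi": "Yiii",
    "Latin": "Latn",
}


def scripts_by_code(mapping: dict) -> dict:
    """Group script names by the ISO 15924 code they map to."""
    grouped = defaultdict(list)
    for name, code in mapping.items():
        grouped[code].append(name)
    return {code: sorted(names) for code, names in grouped.items()}


for code, names in scripts_by_code(script_to_iso).items():
    print(code, names)
```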

# Unify/Correct
# These I think we simply could fix in Hyperglot.
- Ge'ez: _use also where Ge'ez/Fidel is used_

Unless we're sticking to ASCII for some reason, Geʽez (the language and the script) should always be written with ʽ U+02BD MODIFIER LETTER REVERSED COMMA. The reversed comma modifier is used in the ISO name and on Wikipedia.
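
For anyone checking which character a data file actually contains, Python's standard `unicodedata` module can tell the two apart. The `fix_geez` helper below is a hypothetical one-off that swaps the ASCII apostrophe for U+02BD in this one name only.

```python
import unicodedata

APOSTROPHE = "\u0027"       # ASCII vertical single quote
REVERSED_COMMA = "\u02bd"   # the modifier letter recommended above


def fix_geez(name: str) -> str:
    """Replace the ASCII apostrophe in "Ge'ez" with U+02BD."""
    return name.replace("Ge" + APOSTROPHE + "ez",
                        "Ge" + REVERSED_COMMA + "ez")


for char in (APOSTROPHE, REVERSED_COMMA):
    print(f"U+{ord(char):04X}", unicodedata.name(char))

print(fix_geez("Ge'ez/Fidel"))
```

The replacement is deliberately limited to the literal string so that names such as N'Ko, which use a different character, are left alone.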

@kontur
Contributor

kontur commented May 31, 2024

Perfect, thanks for checking, and correcting my typos and reviewing the mappings I made.

I'll leave this issue open for further comments, and this will be merged into the library with the next update 👍

@kontur added this to the 0.7.0 milestone May 31, 2024