Add ISO 15924 codes to orthographies #171
Comments
Thanks, this is a good issue. I've actually run into this when making programmatic comparisons to UDHR/CLDR/gflangs data. Basically, Hyperglot has taken some liberties, but your list shows the need for some kind of reverse matching back to 15924, because this would increase the usefulness of the database. Also, looking at this list of ISO and Unicode script names, the Unicode name often seems less convoluted, and there are cases where Hyperglot uses it. Using your compiled list, I've gone through and categorized the differences to ISO 15924 like so:
(Also, I don't know "Unified Canadian Aboriginal Syllabics" too well; Hyperglot seemingly disambiguates these, but I don't know to what extent there truly are specific differences between those languages/scripts in terms of characters used, etc.) Looking at that list, there really aren't any controversial differences or things that could not be mapped back to ISO. There clearly are a few instances where Hyperglot could just use the 15924 name, but most other differences seem to fall into two categories: a "simpler"/"user friendly" name, or a deliberate choice. In terms of a practical solution to your use case: I suppose we could maintain a 1:1 mapping in a file, and with a particular flag/parameter the parsed yamls would use 15924 script names instead of the default Hyperglot script names. In fact, I don't even think we need a flag or parameter, at least not for programmatic use of the library; orthographies could just always include the 15924 script code from this 1:1 mapping file, and I suppose it is in programmatic use that this difference in script name/code really matters. Or what do you think would be a simple and usable approach? I am reluctant to clutter the yaml files with an additional
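To make the mapping-file idea above concrete, here is a minimal sketch, not the actual Hyperglot implementation. It assumes each parsed orthography is a dict with a "script" key; the `SCRIPT_TO_ISO_15924` table, the `add_iso_codes` helper and the "script_iso" key are hypothetical names chosen for illustration. The few entries shown are taken from the name/code list posted later in this thread.

```python
# Hypothetical mapping from Hyperglot script names to ISO 15924 codes.
# Only a handful of entries are shown; a real mapping file would need
# one entry for every script name used in the database.
SCRIPT_TO_ISO_15924 = {
    "Latin": "Latn",
    "Cyrillic": "Cyrl",
    "Arabic": "Arab",
    # Several Hyperglot names can share one ISO code.
    "Cree": "Cans",
    "Ojibwe Syllabics": "Cans",
    "Inuktitut Syllabics": "Cans",
}


def add_iso_codes(orthographies, mapping=SCRIPT_TO_ISO_15924):
    """Return copies of the orthography dicts with a 'script_iso' key added.

    'script_iso' is a made-up key name for this sketch. Scripts missing
    from the mapping get None instead of raising an error.
    """
    result = []
    for orthography in orthographies:
        enriched = dict(orthography)  # shallow copy; the input is left untouched
        enriched["script_iso"] = mapping.get(orthography.get("script"))
        result.append(enriched)
    return result
```

With this approach the yaml files keep the Hyperglot script name, and the code is only attached when the data is parsed, so nothing extra has to be maintained per orthography.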
@justinpenner I've added some of the script name fixes I proposed above, and implemented a way to map all Hyperglot script names to ISO 15924 in the branch You'd get the code like this:
If you check out the branch and test this, does it serve your needs for mapping back to ISO 15924? Happy to hear any suggestions you may have.
@MrBrezina FYI see the above commits with script name changes; I think those are mostly minor and make sense from what I could gather from Wikipedia/Omniglot. Happy to revise. The case we may want to consider further is the couple of languages for which ISO has the script "Unified Canadian Aboriginal Syllabics" and Hyperglot has a unique Syllabics script for each language.
Thanks for implementing this so quickly! It works for me. However, I just ran a quick script to list all script names and ISO codes, and noticed a few errors:
Abs Inuktitut Syllabics # ISO should be Cans
Arab Arabic
Armn Armenian
Avst Avestan
Bamu Bamum
Beng Bengali
Cakm Chakma
Cans Cree
Cans Ojibwe Syllabics
Cher Cherokee
Copt Coptic
Cyrl Cyrillic
Deva Devanagari
Egyp Egyptian Hieroglyphs
Ethi Ge'ez # use reverse comma modifier, not vertical single quote
Geok Georgian
Grek Greek
Gujr Gujarati
Guru Gurmukhi
Hang Hangul
Hano Hanunoo
Hans Chinese # Hans and Hant have equal status as Chinese writing systems, so maybe change to Hani which encompasses both?
Hebrew Hebrew # ISO should be Hebr
Hira Hiragana
Hrkt Katakana # ISO should be Kana
Kali Kayah Li
Khmr Khmer
Knda Kannada
Lana Tham
Laoo Lao
Latn Latin
Lina Linear A
Linb Linear B
Mlym Malayalam
Mymr Burmese
Orya Oriya
Sarb Ancient South Arabian
Sinh Sinhala
Syrc Syriac
Taml Tamil
Tavt Tai Viet
Telu Telugu
Tfng Tifinagh
Thaa Thaana
Thai Thai
Tibt Tibetan
Vaii Vai
Xsux Sumero-Akkadian Cuneiforms
Yiii Modern Yi
hani Kanji # ISO should be Hani
And a few more minor comments:
Wikipedia describes UCAS as a "family of writing systems", so I agree with Hyperglot's approach of distinguishing the individual scripts but mapping them all to the same ISO. ISO sometimes groups similar/closely related scripts together. The Yi scripts are another example.
Unless we're sticking to ASCII for some reason, Geʽez (the language and the script) should always be written with
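Some of the errors flagged in the list above (Abs, Hebrew, hani) can be caught mechanically, because ISO 15924 alphabetic codes are always four letters with only the first one capitalized. The following is a rough sketch of such a check; `malformed_codes` is a hypothetical helper, not part of Hyperglot, and the sample pairs are copied from the list above.

```python
import re

# ISO 15924 alphabetic codes: one uppercase letter followed by three lowercase.
ISO_15924_CODE = re.compile(r"[A-Z][a-z]{3}")


def malformed_codes(pairs):
    """Return the (code, name) pairs whose code is not a well-formed code."""
    return [
        (code, name)
        for code, name in pairs
        if not ISO_15924_CODE.fullmatch(code)
    ]


pairs = [
    ("Abs", "Inuktitut Syllabics"),
    ("Arab", "Arabic"),
    ("Hebrew", "Hebrew"),
    ("Hrkt", "Katakana"),
    ("hani", "Kanji"),
]

print(malformed_codes(pairs))
# → [('Abs', 'Inuktitut Syllabics'), ('Hebrew', 'Hebrew'), ('hani', 'Kanji')]
```

This only checks the shape of the code. A well-formed but wrongly assigned code, such as Hrkt for Katakana, passes the check and still needs a comparison against the published ISO 15924 code list.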
Perfect, thanks for checking, correcting my typos, and reviewing the mappings I made. I'll leave this issue open for further comments, and this will be merged into the library with the next update 👍
I was hoping the script names in each orthography would be machine-readable, as README_database.md mentions that script names should follow ISO 15924. However, when I match them up with the ISO standard, the following don't match:
I'm not sure of the best way to solve this within Hyperglot, but it would be nice if the API exposed ISO codes for all scripts, or standard names that allow ISO codes to be looked up easily.
Since some of the non-ISO names listed above contain clarifying information (Ge'ez/Fidel, Inuktitut, Ojibwe, Modern Yi) that doesn't exist in the ISO names, maybe some could be moved to a script_preferred_name field, similar to what is already done for language names? Others that almost match the ISO (Devanagari, Hangul, Ge'ez, etc) could be corrected to match ISO 15924.