normalize_ligature not having the right format #74

m-mehdi-git · 2023-11-27T15:18:22Z

I'm trying the example below, but I'm getting the same result as the input text:

from pyarabic.araby import normalize_ligature
text = u"لانها لالء الاسلام"
normalize_ligature(text)

I'm getting the output لانها لالء الاسلام instead of "لانها لالئ الاسلام".

And thanks for your help, this is a very helpful library.

linuxscout · 2023-11-27T16:46:33Z

Thank you for your comment.
It's just a typo in the documentation;
the output is "لانها لالء الاسلام".
I have fixed it in the documentation.
Thanks

m-mehdi-git · 2023-11-27T17:17:02Z

Thank you for your response, but I didn't really understand the function's role. In the documentation, it is stated as 'Normalize Lam Alef ligatures into two letters.' Does this mean it is supposed to separate them?
I've tried this:
text = u"جاء سؤال الأئمة عن الإسلام آجلا"
test1 = normalize_ligature(text)
output = 'جاء سؤال الأئمة عن الإسلام آجلا'

It seems that the input and output are always the same.

linuxscout · 2023-11-27T17:22:07Z

Hello,

It's important to note that this function addresses the encoding of ligatures of Lam Alif in certain contexts and software. In these cases, Lam Alif ligatures may be represented as a single character, potentially causing confusion during word processing. The function is designed to convert such ligatures, defined by char codes like:

# Ligatures
LAM_ALEF = u'\ufefb'
LAM_ALEF_HAMZA_ABOVE = u'\ufef7'
LAM_ALEF_HAMZA_BELOW = u'\ufef9'
LAM_ALEF_MADDA_ABOVE = u'\ufef5'

into two separate letters, Lam and Alif, represented by char codes like:

# Two-letter replacements
SIMPLE_LAM_ALEF = u'\u0644\u0627'
SIMPLE_LAM_ALEF_HAMZA_ABOVE = u'\u0644\u0623'
SIMPLE_LAM_ALEF_HAMZA_BELOW = u'\u0644\u0625'
SIMPLE_LAM_ALEF_MADDA_ABOVE = u'\u0644\u0622'

This conversion ensures proper handling of Lam Alif ligatures in contexts where individual letters are required.
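
To make the mapping concrete, here is a minimal sketch of the replacement described above in plain Python. It is not pyarabic's actual implementation: the names LAM_ALEF_MAP and split_lam_alef are made up for illustration, and only the four code points listed above are handled.

```python
# Hypothetical stand-in for the conversion described above (not pyarabic's code).
LAM_ALEF_MAP = {
    "\ufefb": "\u0644\u0627",  # Lam Alef             -> Lam + Alef
    "\ufef7": "\u0644\u0623",  # Lam Alef Hamza above -> Lam + Alef with Hamza above
    "\ufef9": "\u0644\u0625",  # Lam Alef Hamza below -> Lam + Alef with Hamza below
    "\ufef5": "\u0644\u0622",  # Lam Alef Madda above -> Lam + Alef with Madda above
}


def split_lam_alef(text):
    """Replace each single-character ligature with its two letters."""
    for ligature, letters in LAM_ALEF_MAP.items():
        text = text.replace(ligature, letters)
    return text


ligated = "\ufefb"        # one code point that renders as Lam Alef
plain = "\u0644\u0627"    # the same word typed as two letters

print(split_lam_alef(ligated) == plain)  # True
print(split_lam_alef(plain) == plain)    # True: already-plain text is unchanged
```

The second check mirrors the earlier question in this thread: text typed with ordinary letters contains none of these code points, so it comes back unchanged.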

m-mehdi-git · 2023-11-27T18:39:18Z

I see. Just perfect.
I wrote a small script to see the differences:

from pyarabic.araby import normalize_ligature

line = u'\ufefb'
bytes_data = line.encode("utf-8", errors="strict")
# Render each code point of the string as a \uXXXX escape
unicode_string = "u'" + ''.join(f'\\u{ord(char):04x}' for char in line) + "'"
normalized = normalize_ligature(line)
normalized_bytes_data = normalized.encode("utf-8", errors="strict")
normalized_unicode_string = "u'" + ''.join(f'\\u{ord(char):04x}' for char in normalized) + "'"

print('original:', line)
print('bytes:', bytes_data)
print('unicode:', unicode_string)
print('normalized:', normalized)
print('new bytes:', normalized_bytes_data)
print('new unicode:', normalized_unicode_string)

Output:

original: ﻻ
bytes: b'\xef\xbb\xbb'
unicode: u'\ufefb'
normalized: لا
new bytes: b'\xd9\x84\xd8\xa7'
new unicode: u'\u0644\u0627'

My question is: are these the only ligatures, or can I add my own?
For example:
u'\ufefc' : "ﻼ"
or any others.

linuxscout · 2023-11-27T21:45:21Z

My question is: are these the only ligatures, or can I add my own? For example u'\ufefc' : "ﻼ" or any others.
There are many ligatures, but in some software or tools, such as GNOME on Linux, Lam Alif is represented as a single character.
Other ligatures are not used in recent texts; they were used for compatibility with old encoding systems.
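
For reference, the Lam Alef ligatures can be listed from Python's standard unicodedata module. This is only an inspection sketch: the range U+FEF5 to U+FEFC is taken from the Unicode Arabic Presentation Forms-B block, and the variable name lam_alef_names is made up here. It shows that each of the four ligatures has an isolated and a final form, and that the four constants quoted earlier are the isolated ones.

```python
import unicodedata

# Map each code point in U+FEF5..U+FEFC to its official Unicode name.
lam_alef_names = {
    code_point: unicodedata.name(chr(code_point))
    for code_point in range(0xFEF5, 0xFEFC + 1)
}

for code_point, name in lam_alef_names.items():
    print(f"U+{code_point:04X}", name)
```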

m-mehdi-git · 2023-11-28T00:08:45Z

I'm using pypdf to extract Arabic text, and there are some ligatures that are not handled very well, such as:

اﻹعﻼنات
لﻸطباء
مسجﻼ

So I'm trying to find a way to add them to LIGUATURES without touching the library. Is there a way to extend the list of constants?
Thanks a lot for your time.
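
One way to get this effect without modifying the library would be a separate translation table applied to the extracted text. The sketch below is an assumption on my side, not a pyarabic feature: ALL_LAM_ALEF and split_all_lam_alef are hypothetical names, and the table simply covers the isolated and final forms of the four Lam Alef ligatures.

```python
# Hypothetical helper that lives outside pyarabic.
ALL_LAM_ALEF = str.maketrans({
    "\ufefb": "\u0644\u0627", "\ufefc": "\u0644\u0627",  # Lam Alef (isolated, final)
    "\ufef7": "\u0644\u0623", "\ufef8": "\u0644\u0623",  # with Hamza above
    "\ufef9": "\u0644\u0625", "\ufefa": "\u0644\u0625",  # with Hamza below
    "\ufef5": "\u0644\u0622", "\ufef6": "\u0644\u0622",  # with Madda above
})


def split_all_lam_alef(text):
    """Expand every Lam Alef presentation form into two ordinary letters."""
    return text.translate(ALL_LAM_ALEF)


# The three extracted words above, written with escapes to keep the ligatures visible.
samples = [
    "\u0627\ufef9\u0639\ufefc\u0646\u0627\u062a",
    "\u0644\ufef8\u0637\u0628\u0627\u0621",
    "\u0645\u0633\u062c\ufefc",
]
cleaned = [split_all_lam_alef(word) for word in samples]
for before, after in zip(samples, cleaned):
    print(before, "->", after)
```

The function can be called on the text before or after normalize_ligature, since code points that are already expanded are left untouched.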

otakar-smrz · 2023-11-28T08:31:27Z

Hi, let me suggest the https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize implementation to solve the problem in general. Some Arabic-specific functionality seems to be provided by https://camel-tools.readthedocs.io/en/latest/api/utils/normalize.html on top of that.

linuxscout · 2023-11-28T13:46:54Z

@otakar-smrz
Thank you.

otakar-smrz · 2023-11-28T16:01:20Z

The ligatures actually need the NFKC or NFKD normalization mode to be broken down to the standard letters:
https://icu4c-demos.unicode.org/icu-bin/nbrowser?t=%D8%A7%EF%BB%B9%D8%B9%EF%BB%BC%D9%86%D8%A7%D8%AA+%D9%84%EF%BB%B8%D8%B7%D8%A8%D8%A7%D8%A1+%D9%85%D8%B3%D8%AC%EF%BB%BC
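
A short sketch of that normalization with the standard library, using the three words from the earlier comment (written with escapes so the ligature code points stay visible). The variable names are illustrative. Note that NFKC is not limited to Lam Alef: it also rewrites other compatibility characters, so it may change more of the text than normalize_ligature does.

```python
import unicodedata

# The three extracted words, containing U+FEF9, U+FEFC and U+FEF8.
samples = [
    "\u0627\ufef9\u0639\ufefc\u0646\u0627\u062a",
    "\u0644\ufef8\u0637\u0628\u0627\u0621",
    "\u0645\u0633\u062c\ufefc",
]

# NFKC applies the compatibility decomposition, then recomposes canonically.
nfkc = [unicodedata.normalize("NFKC", word) for word in samples]

for before, after in zip(samples, nfkc):
    code_points = " ".join(f"U+{ord(char):04X}" for char in after)
    print(before, "->", after, "|", code_points)
```

With NFKD, the Hamza and Madda variants are decomposed further into Alef plus a combining mark, so NFKC is usually the more convenient choice when the goal is ordinary precomposed letters.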

m-mehdi-git · 2023-11-30T17:46:54Z

@otakar-smrz Thank you.
@linuxscout. I am grateful for your excellent library.
