Message.Entities returns Length of UTF16 encoded string, not UTF8 supported by Golang #231

Open
Fef0 opened this issue May 3, 2019 · 3 comments
Labels
documentation help wanted Extra attention is needed

Comments

Fef0 commented May 3, 2019

How I discovered it

I wanted to get the text (including emoji) that carried a particular link, but slicing the message with the entity's Offset and Length gave the wrong result: the values are correct for the UTF-16 encoding of the text, but not for my original UTF-8 string.
Telegram calculates Offset and Length in UTF-16 code units, while Go strings are UTF-8 and are indexed by byte. With plain ASCII text there is no problem at all, since every ASCII character is exactly one byte in UTF-8 and one code unit in UTF-16. Once an emoji (or any other non-ASCII character) is used, the two counts diverge and byte-based slicing goes wrong.
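
To make the mismatch concrete, here is a minimal sketch that compares the byte count, rune count, and UTF-16 code unit count of a few strings. The helper name utf16Len is hypothetical and only used for illustration; the emoji are written as escapes so the exact code points are unambiguous.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// utf16Len (hypothetical helper) reports how many UTF-16 code units
// are needed to encode s.
func utf16Len(s string) int {
	return len(utf16.Encode([]rune(s)))
}

func main() {
	samples := []string{
		"Click Me", // ASCII only
		"\u27a1\ufe0fClick Me\u2b05\ufe0f", // arrows, each followed by a variation selector
		"\U0001F600", // one emoji outside the Basic Multilingual Plane
	}
	for _, s := range samples {
		// len(s) counts UTF-8 bytes, which is what Go string slicing uses.
		fmt.Printf("bytes=%d runes=%d utf16=%d\n",
			len(s), utf8.RuneCountInString(s), utf16Len(s))
	}
}
```

For these samples the three counts are 8/8/8, 20/12/12 and 4/1/2. The last one shows that counting runes is not a substitute either, because characters outside the Basic Multilingual Plane take two UTF-16 code units.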

How I solved this particular problem

I used the unicode/utf16 library in order to encode the original text, extract the text I wanted and then convert it to a UTF8 string again.

The Code

Given update, a value of type Update, I wanted to extract each piece of text with an embedded link by using the Entities field.
The original message was "➡️Click Me⬅️ or ➡️Click Me⬅️" with "https://www.example.com/" embedded on both (just as a test).

Not Working Code

Using the following code (not using unicode/utf16):

fmt.Println(*update.ChannelPost.Entities)
for _, e := range *update.ChannelPost.Entities {
	// Get the whole update Text
	str := update.ChannelPost.Text
	// Get the text I need
	str = str[e.Offset : e.Offset+e.Length]
	fmt.Println(str)
}

Output

[{text_link 0 12 https://www.example.com/ <nil>} {text_link 16 12 https://www.example.com/ <nil>}]
➡️Click 
�️ or ➡�

As you can see, the first element is cut off before "Me" and the closing emoji, while the second element starts and ends in the middle of a multi-byte character and is printed with replacement characters.

Working Code

The following piece of code works (using unicode/utf16):

fmt.Println(*update.ChannelPost.Entities)
// For each entity
for _, e := range *update.ChannelPost.Entities {
	// Get the whole update Text
	str := update.ChannelPost.Text
	// Encode it into utf16
	utfEncodedString := utf16.Encode([]rune(str))
	// Decode just the piece of string I need
	runeString := utf16.Decode(utfEncodedString[e.Offset : e.Offset+e.Length])
	// Transform []rune into string
	str = string(runeString)
	fmt.Println(str)
}

Output

[{text_link 0 12 https://www.example.com/ <nil>} {text_link 16 12 https://www.example.com/ <nil>}]
➡️Click Me⬅️
➡️Click Me⬅️

Elements are just as they should be.
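
The same approach can be wrapped in a small reusable function. This is only a sketch under stated assumptions: entityText is a hypothetical name, not part of the library, and the offset and length arguments stand in for an entity's Offset and Length. It adds a range check so that a bad entity does not cause a slice panic.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// entityText (hypothetical helper) returns the part of text covered by
// offset and length, both measured in UTF-16 code units.
// ok is false when the range does not fit inside the text.
func entityText(text string, offset, length int) (s string, ok bool) {
	units := utf16.Encode([]rune(text))
	if offset < 0 || length < 0 || offset+length > len(units) {
		return "", false
	}
	return string(utf16.Decode(units[offset : offset+length])), true
}

func main() {
	// The test message from this issue, written with escapes.
	text := "\u27a1\ufe0fClick Me\u2b05\ufe0f or \u27a1\ufe0fClick Me\u2b05\ufe0f"
	// Offset/Length pairs as printed in the output above.
	spans := [][2]int{{0, 12}, {16, 12}}
	for _, sp := range spans {
		if s, ok := entityText(text, sp[0], sp[1]); ok {
			fmt.Println(s)
		}
	}
}
```

The sketch re-encodes the text on every call, which keeps it simple; when a message has many entities it is probably worth encoding once and reusing the []uint16 slice.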

Conclusion

As you can see, the Offset and Length values are the same in both runs, and they are correct as long as they are applied to the UTF-16 encoding of the text.
Hope it will help anyone having the same issue!
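
An alternative that avoids re-encoding the whole text is to translate a UTF-16 offset into a byte index and then slice the original string. The following is a sketch only; byteIndex is a hypothetical helper, and it assumes that a rune at or above U+10000 takes two UTF-16 code units and any other rune takes one.

```go
package main

import "fmt"

// byteIndex (hypothetical helper) converts an offset measured in UTF-16
// code units into a byte index into s. It returns -1 when the offset is
// negative, past the end of s, or falls inside a surrogate pair.
func byteIndex(s string, unitOffset int) int {
	units := 0
	for i, r := range s {
		if units == unitOffset {
			return i
		}
		if units > unitOffset {
			return -1
		}
		if r >= 0x10000 {
			units += 2 // encoded as a surrogate pair in UTF-16
		} else {
			units++
		}
	}
	if units == unitOffset {
		return len(s)
	}
	return -1
}

func main() {
	text := "\u27a1\ufe0fClick Me\u2b05\ufe0f or \u27a1\ufe0fClick Me\u2b05\ufe0f"
	// Second entity from the output above: Offset 16, Length 12.
	start := byteIndex(text, 16)
	end := byteIndex(text, 16+12)
	if start >= 0 && end >= start {
		fmt.Println(text[start:end])
	}
}
```

For the test message, offset 16 maps to byte 24 and offset 28 maps to byte 44, the end of the string, so the slice is the second linked text.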

42wim added a commit to 42wim/matterbridge that referenced this issue Jul 6, 2019
42wim added a commit to 42wim/matterbridge that referenced this issue Jul 8, 2019
zeridon pushed a commit to zeridon/matterbridge that referenced this issue Feb 12, 2020
@Syfaro Syfaro added documentation help wanted Extra attention is needed labels Jul 21, 2020
@Syfaro Syfaro pinned this issue Jul 21, 2020

shtrih commented Nov 25, 2021

If you want to convert entities ([]tgbotapi.MessageEntity) to Markdown text, here is an example for the Telebot library.
And here is an example for the telegram-bot-api that converts entities to Discord markdown (with test).


kingUFU commented Nov 9, 2022

Discord


Pato05 commented Jan 29, 2023

The issue is also present internally, in the CommandWithAt() function at types.go:681.
