make x3 interger parser less dependent on fundamental character types #762

wanghan02 · 2023-05-05T10:49:34Z

Many of the spirit x3 parsers are natualy able to work with user defined char types. It would be very helpful if x3's integer parser could be one of those parsers. And it could be done by more canonical implementation of radix_traits::is_valid and radix_traits::digit.

In radix_traits::is_valid, input character ch of type Char is compared with several ascii characters of type char. The static_cast destination type should still be ascii character type char, not the input character type Char. Comparing input character ch with ascii type char should be the more canonical way to do the range check just like the comparison between ch with other ascii char literals ('0', '9', 'a', 'A'). If we have a user defined character type which doesn't support implicit conversion to/from the promoted integer type, the range check still works!

In the old implementation of radix_traits::digit, the Radix check is not necessary and incomplete.

            return (Radix <= 10 || (ch >= '0' && ch <= '9'))
                ? ch - '0'
                : char_encoding::ascii::tolower(ch) - 'a' + 10;

Since radix_traits::digit is only called after the range check radix_traits::is_valid, we don't have to do any range check at all in radix_traits::digit. The input character ch is guaranteed to be in one of the 3 ranges: '0'-'9', 'a' - 'a' + Radix -10 -1 and 'A' - 'A' + Radix -10 -1. We also don't have to call char_encoding::ascii::tolower(ch) which adds another layer of unnecessary dependency. Given the valid range is right above, a simpler implementation as below is clear and efficient. And it doesn't require implicit conversion to integer type from input character type as well!

            return ch < 'A'
                ? ch - '0'
                : ch < 'a'
                    ? ch - 'A' + 10
                    : ch - 'a' + 10;

Both old and new implementations assume char literal uses ascii table.

Kojoley

user defined char types.

I don't think I ever seen one. It would help me in understanding the intent of the change if you show one.

Please also notice that when a patch doesn't add tests there is no guarantee that your use case will not be broken in a future by some unrelated change.

Kojoley · 2023-05-06T13:15:49Z

include/boost/spirit/home/x3/support/numeric_utils/detail/extract_int.hpp

+ return (ch >= '0' && ch <= (Radix > 10 ? '9' : static_cast<char>('0' + Radix -1)))
+ || (Radix > 10 && ch >= 'a' && ch <= static_cast<char>('a' + Radix -10 -1))
+ || (Radix > 10 && ch >= 'A' && ch <= static_cast<char>('A' + Radix -10 -1));


I agree that at least for consistency it is better, though the original code is relying without a static assert or a comment on a fact that all these constants always are positive numbers and radix addition to a constant would never overflow or will be caught by the compiler.

Kojoley · 2023-05-06T13:16:37Z

include/boost/spirit/home/x3/support/numeric_utils/detail/extract_int.hpp

 }

 template <typename Char>
 inline static unsigned digit(Char ch)
 {
- return (Radix <= 10 || (ch >= '0' && ch <= '9'))


It is a compile time branch, an if constexpr that doesn't produce a range of compiler warnings 26356b5 think-cell@7771d20. You basically replaced a compile time branch with a runtime one.

I'm not sure if this minor optimization is better than simplicity. But If it is really important that the commit won't change the performance in any branch, we could add this compile time optimization back in. Then we could keep this optimization and don't rely on char_encoding::ascii::tolower.

return (Radix <= 10 || ch < 'A') ? ch - '0' : ch < 'a' ? ch - 'A' + 10 : ch - 'a' + 10;

We have a char_ascii type that checks ascii range during construction. It is implicitly convertible and comparable to fundamental char types but not integer types.

IMO the two changes (with Radix constexpr optimization for the 2nd one?) are better implementation than the original code. They provide better semantics and less dependency. If you @Kojoley agree with both changes I could add a simplified char_ascii type and test cases to cover this change.

Does your 'user defined char type' support bitwise operators? Will replacing char_encoding::ascii::tolowe with (ch | 0x20) - 'a' + 10 work for you?

For fundamental char types, integral promotion is performed before doing bitwise operations. Basically it is an integer operation and relies on implicit conversion to integer type. For our char_ascii type, we focus more on character side of the type and don't support any integer operations other than getting the distance between 2 char_asciis.

But I do see there are some performance improvement with (ch | 0x20) - 'a' + 10 than ternary operator in case of fundamental character types. Can we take std::is_convertible<Char const&, int>::value into the path to make user-defined char types similar to our char_ascii work without downgrade the performance of fundamental char types?

return (Radix <= 10 || ch < 'A') ? ch - '0' : std::is_convertible<Char const&, int>::value ? (ch | 0x20) - 'a' + 10 : ch < 'a' ? ch - 'A' + 10 : ch - 'a' + 10;

The performance is the same as below with fundamental char types.

return (Radix <= 10 || ch < 'A') ? ch - '0' : (ch | 0x20) - 'a' + 10;

I think your char type design is just broken. Is there tolower associated with it?

Can we take std::is_convertible<Char const&, int>::value into the path to make user-defined char types similar to our char_ascii work without downgrade the performance of fundamental char types?

I don't want to add a code path that literally nobody else is using and also is not tested by the test suite.

We simply don't use tolower. It doesn't matter how we design our character type. The user has freedom to define any character type. It's just if some day the community wants to add constraints to every spirit x3 parser, what is the minimum request on the Char type for this parser? I don't see depending on tolower or implicit cast to int is necessary in this function performance-wise and semantics-wise. We can split the function into overloads for better semantics. The Radix<=10 compile time optimization mixed in the code path is already weird.

template <typename Char, unsigned Radix_=Radix, std::enable_if_t<Radix_<=10>* = nullptr> inline static unsigned digit(Char ch) { return ch - '0'; } template <typename Char, unsigned Radix_=Radix, std::enable_if_t<10<Radix_ && std::is_convertible<Char, int>::value>* = nullptr> inline static unsigned digit(Char ch) { return ch < 'A' ? ch - '0' : (ch | 0x20) - 'a' + 10; } template <typename Char, unsigned Radix_=Radix, std::enable_if_t<10<Radix_ && !std::is_convertible<Char, int>::value>* = nullptr> inline static unsigned digit(Char ch) { return ch < 'A' ? ch - '0' : ch < 'a' ? ch - 'A' + 10 : ch - 'a' + 10; }

You can take this pull request as a refactoring suggestion. The first change static_cast<Char> -> static_cast<char> for me definitely makes the code cleaner. The second change makes digit more aligned with the implementation of is_valid which doesn't use tolower either. Since digit is only called if is_valid is true, the implementation of these 2 functions should be aligned with each other. But it's only a refactoring suggestion. As I said if you agree, I could add test cases to cover the change. Otherwise we have to specialize this struct to workaround the dependency.

less dependent on fundamental character types

a333e74

Kojoley reviewed May 6, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

make x3 interger parser less dependent on fundamental character types #762

make x3 interger parser less dependent on fundamental character types #762

wanghan02 commented May 5, 2023 •

edited

Kojoley left a comment

Kojoley May 6, 2023

Kojoley May 6, 2023

wanghan02 May 8, 2023

wanghan02 May 8, 2023

Kojoley May 8, 2023

wanghan02 May 9, 2023 •

edited

Kojoley May 9, 2023

wanghan02 May 10, 2023 •

edited

make x3 interger parser less dependent on fundamental character types #762

Are you sure you want to change the base?

make x3 interger parser less dependent on fundamental character types #762

Conversation

wanghan02 commented May 5, 2023 • edited

Kojoley left a comment

Choose a reason for hiding this comment

Kojoley May 6, 2023

Choose a reason for hiding this comment

Kojoley May 6, 2023

Choose a reason for hiding this comment

wanghan02 May 8, 2023

Choose a reason for hiding this comment

wanghan02 May 8, 2023

Choose a reason for hiding this comment

Kojoley May 8, 2023

Choose a reason for hiding this comment

wanghan02 May 9, 2023 • edited

Choose a reason for hiding this comment

Kojoley May 9, 2023

Choose a reason for hiding this comment

wanghan02 May 10, 2023 • edited

Choose a reason for hiding this comment

wanghan02 commented May 5, 2023 •

edited

wanghan02 May 9, 2023 •

edited

wanghan02 May 10, 2023 •

edited