Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overhaul wide-string support #942

Open
trueqbit opened this issue Mar 8, 2022 · 9 comments
Open

Overhaul wide-string support #942

trueqbit opened this issue Mar 8, 2022 · 9 comments

Comments

@trueqbit
Copy link
Collaborator

trueqbit commented Mar 8, 2022

As already mentioned in Slack, the support for wide-character strings needs a rather complete overhaul and/or an explicit documentation of its features. As far as I understand it is broken on macos/Linux.

  • First of all, usage of sqlite_*_text16() vs. wstring_convert<codecvt_utf8_utf16<wchar_t>> is kind of intermixed.
  • There are only unit tests for a single code path: binding to a statement and extracting from a result set via codecvt_utf8_utf16<wchar_t>, but neither conversion from a column value nor calling a function or returning from it. [see test case]
  • Returning a string from a function is broken [statement_binder<>::result()]
    • sqlite3_result_text16() expects the number of bytes, not characters. [see 3rd parameter]
    • sqlite3_result_text16() should be instructed to copy the string using SQLITE_TRANSIENT), otherwise the resulting memory goes out of scope. [see 4th parameter]
  • Expecting UTF-16 encoded strings is correct on Windows, but not on other operating systems like macos/Linux:
    • On Windows, everything's working fine: sizeof(wchar_t) == 2 (16-bit), and encoding is UTF-16.
    • On macos/linux: sizeof(wchar_t) == 4 (32-bit):
      • Using sqlite3_*_16() functions is outrightly wrong.
      • Using codecvt_utf8_utf16<> is bad:
        • While it's not prohibited to use wchar_t for UTF-16, it easily leads to subtly unexpected behaviour: Because wchar_t is 32-bit, it usually carries UTF-32, not UTF-16.
          • Passed UTF-32 strings are suddenly treated as UTF-16 by sqlite_orm/sqlite.
          • Returned UTF-16 strings are suddenly treated as UTF-32 by the program.
          • In any case, [codecvt_utf8_utf16<>](https://en.cppreference.com/w/cpp/locale/codecvt_utf8_utf16) expects UTF-16, no matter the sizeof wchar_t: "If Elem is a 32-bit type, one UTF-16 code unit will be stored in each 32-bit character of the output sequence.". I emphasize again that this isn't the regular expectation on macos/Linux.
    • While we are at it, I'd like to see a separation of wide-string support from SQLITE_ORM_OMITS_CODECVT, if possible: One might want to be able to pass or return wide-strings from Windows API functions, even if not being able to serialize the statement.

One way of fixing the UTF-16 issue on macos/Linux quickly is by disabling UTF-16 unicode when not on Windows altogether. This might not even have any impact, given that UTF-8 is prevalent on those systems.

@fnc12
Copy link
Owner

fnc12 commented Mar 8, 2022

Yeah what you are saying is right. The only disadvantage related to your work was that we've broken UT in dev by my oversight.

@trueqbit
Copy link
Collaborator Author

trueqbit commented Mar 8, 2022

No worries, when I touch code, I always find those corner cases, no matter the library or framework :)

@fnc12
Copy link
Owner

fnc12 commented Mar 8, 2022

@trueqbit you're good!

@fnc12
Copy link
Owner

fnc12 commented Mar 27, 2022

@trueqbit can we close it?

@trueqbit
Copy link
Collaborator Author

No! Still needs to be addressed.

@fnc12
Copy link
Owner

fnc12 commented Mar 27, 2022

looks like we are facing universal text encoding issue. There is no universal text encoding in standard C++ library so we are unable to fix it now. But when universal encoding appears in standard C++ lib then we can add it. Also users can add customizations for wstring using external libraries within their projects right now

@trueqbit
Copy link
Collaborator Author

What if we enable wchar_t support only on Windows (which is UTF-16)? This is incredibly useful on Windows and I wouldn't be happy if it went away. SQLite itself supports UTF-16, and we could use SQLite UTF-16 APIs directly.
UTF-16 on other platforms is a delicate subject and would only be doable with a sane opt-in and only worth it if someone absolutely requests it.

Otherwise there are universal C++ conversion functions from UTF-32 [std::convert_utf8<char32_t>] and there is even a native UTF-32 character type char32_t. However, I would assign UTF-32 support to a different ticket.

@fnc12
Copy link
Owner

fnc12 commented Mar 27, 2022

  1. convert_utf8 is deprecated in C++17. That is why it is used in sqlite_orm under dedicated compile flag
  2. how do you know that Windows uses wchar_t for UTF-16? Is it just a practical finding or there is a doc with it?

@trueqbit
Copy link
Collaborator Author

  1. Oh yes you are right, I thought that only UTF-16 conversion using codecvt_utf8_utf16 is deprecated.
  2. Programming on Windows is not undefined behaviour 🤣 UTF-16 is used since Windows 2000 as a super-set of UCS-2 (see wikipedia article), the whole Win32 API is using UTF-16 as per Microsoft's documentation (see article about Unicode), and based on wchar_t.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants