stdlib: Improve Unicode support and consistency in string comparison functions.

SDL_strcasecmp (even when calling into a C runtime) does not work with
Unicode chars, and depending on the user's locale, might not work with
even basic ASCII strings.

This implements the function from scratch, using "case-folding,"
which is a more robust method that deals with various languages. It
involves a hashtable of a few hundred codepoints that are "uppercase" and
how to map them to lowercase equivalents (possibly increasing the size of
the string in the process). The vast majority of human languages (and
Unicode) do not have letters with different cases, but still, this static
table takes about 10 kilobytes on a 64-bit machine.

Even this will fail in one known case: the Turkish 'i' folds differently
if you're writing in Turkish vs other languages. Generally this is seen as
unfortunate collateral damage in cases where you can't specify the language
in use.

In addition to case-folding the codepoints, the new functions also know how
to decode the various formats to turn them into codepoints in the first
place, instead of blindly stepping by one byte (or one wchar_t) per
character.

Also included is casefolding.txt from the Unicode Consortium and a perl
script to generate the hashtable from that text file, so we can trivially
update this if new languages are added in the future.

A simple test using the new function:

```c
 #include <SDL3/SDL.h>

 int main(void)
 {
     const char *a = "α ε η";
     const char *b = "Α Ε Η";
     SDL_Log("    strcasecmp(\"%s\", \"%s\") == %d\n", a, b, strcasecmp(a, b));
     SDL_Log("SDL_strcasecmp(\"%s\", \"%s\") == %d\n", a, b, SDL_strcasecmp(a, b));
     return 0;
 }
```

Produces:

```
INFO:     strcasecmp("α ε η", "Α Ε Η") == 32
INFO: SDL_strcasecmp("α ε η", "Α Ε Η") == 0
```

glibc strcasecmp() fails to compare a Greek lowercase string to its uppercase
equivalent, even with a UTF-8 locale, but SDL_strcasecmp() works.

Other SDL_stdinc.h functions are changed to be more consistent, which is to
say they now ignore any C runtime and often dictate that only English-based
low-ASCII works with them.

Fixes Issue #9313.
This commit is contained in:
Ryan C. Gordon
2024-03-26 13:22:38 -04:00
parent 4659a84bd1
commit a5c892d2c3
17 changed files with 4971 additions and 210 deletions

View File

@@ -1401,6 +1401,18 @@ This header has been removed and a simplified version of this API has been added
The standard C headers like stdio.h and stdlib.h are no longer included, you should include them directly in your project if you use non-SDL C runtime functions.
M_PI is no longer defined in SDL_stdinc.h, you can use the new symbols SDL_PI_D (double) and SDL_PI_F (float) instead.
SDL3 attempts to apply consistency to case-insensitive string functions. In SDL2, things like SDL_strcasecmp() would usually only work on English letters, and depending on the user's locale, possibly not even those. In SDL3, consistency is applied:
- Many things that don't care about case-insensitivity, like SDL_strcmp(), continue to work with any null-terminated string of bytes, even if it happens to be malformed UTF-8.
- SDL_strcasecmp() expects valid UTF-8 strings, and will attempt to support _most_ Unicode characters with a technique known as "case-folding," which is to say it can match 'A' and 'a', and also 'Η' and 'η', but ALSO 'ß' and "ss". This is _probably_ how most apps assumed it worked in SDL2 and won't need any changes.
- SDL_strncasecmp() works the same, but the third parameter takes _bytes_, as before, so SDL_strlen() can continue to be used with it. If a string hits the limit in the middle of a codepoint, the half-processed bytes of the codepoint will be treated as a collection of U+0xFFFD (REPLACEMENT CHARACTER) codepoints, which you probably don't want.
- SDL_wcscasecmp() and SDL_wcsncasecmp() work the same way but operate on UTF-16 or UTF-32 encoded strings, depending on what the platform considers "wchar_t" to be. SDL_wcsncasecmp's third parameter is number of wchar_t values, not bytes, but UTF-16 has the same concerns as UTF-8 for variable-length codepoints.
- SDL_strcasestr() expects valid UTF-8 strings, and will compare codepoints using case-folding.
- SDL_tolower() and SDL_toupper() continue to only work on single bytes (even though the parameter is an `int`) and _only_ converts low-ASCII English A through Z.
- SDL_strlwr() and SDL_strupr() operates on individual bytes (not UTF-8 codepoints) and only change low-ASCII English 'A' through 'Z'. These functions do not check the string for valid UTF-8 encoding.
- The ctype.h replacement SDL_is*() functions (SDL_isalpha, SDL_isdigit, etc) only work on low-ASCII characters and ignore user locale, assuming English. This makes these functions consistent in SDL3, but applications need to be careful to understand their limits.
Please note that the case-folding technique used by SDL3 will not produce correct results for the "Turkish 'I'"; this one letter is a surprisingly hard problem in the Unicode world, and since these functions do not specify the human language in use, we have chosen to ignore this problem.
The following functions have been renamed:
* SDL_strtokr() => SDL_strtok_r()