The only encoding which is compatible with "every operating system in the world"...

csande17 · on Nov 27, 2023

> The only encoding which is compatible with "every operating system in the world" is no enforced encoding at all, and you can do very little "string-like" operations with such a type.

People who like "list of Unicode code points" string types in languages like Rust and Python 3 always say this, but I'm never sure what operations they think are enabled by them.

In the "bag of bytes that's probably UTF-8" world, you can safely concatenate strings, compare them for equality, search for substrings, evaluate regular expressions, and so on. In the "list of code points" world, you can do... what else exactly?

Many things that users think of as single characters are composed of multiple code points, so the "list of code points" representation does not allow you to truncate strings, reverse them, count their length, or do really anything else that involves the user-facing idea of a "character". You can iterate over each of the code points in a string, but... that's almost circular? Maybe the bytes representation is better because it makes it easier to iterate over all the bytes in a string. Neither of those is an especially useful operation on its own.

coryrc · on Nov 27, 2023

> In the "bag of bytes that's probably UTF-8" world, you can safely concatenate strings, compare them for equality, search for substrings, evaluate regular expressions,

No you can't (except byte -level-equality). If your regex is "abc" and the last byte of an emoji is the same as 'a' and the emoji is followed by "bc" it does the wrong thing.

ynik · on Nov 27, 2023

You can.

The last byte of an emoji is never the same as 'a'. UTF-8 is self-synchronizing, a trailing byte can never be misinterpreted as the start of a new codepoint.

This makes `memmem()` a valid substring search on UTF-8! With most legacy multi-byte encodings this would fail, but with UTF-8 it works!

pezezin · on Nov 28, 2023

Assuming that your strings are normalized, otherwise precomposed characters will not match decomposed characters.

csande17 · on Nov 27, 2023

You still have this problem in the "list of Unicode code points" world, since many multi-code-point emoji sequences appear to users as a single character, but start and end with code points that are valid emojis on their own.

Python 3 believes that the string "[Saint Lucia flag emoji][Andorra flag emoji]" contains a Canada flag emoji.

steveklabnik · on Nov 27, 2023

> People who like "list of Unicode code points" string types in languages like Rust and Python 3

Rust and Python 3 have very different string representations. Rust's String is a bunch of UTF-8 encoded bytes. Python 3's String is a sequence of codepoints, and the size changes based on the encoding.

csande17 · on Nov 27, 2023

Yeah, implementation-wise, Rust's version of this idea is a little better, since at least you can convert their strings to a bag-of-bytes/OsStr-style representation almost for free. (And to their credit, Rust's docs for the chars() string method discuss why it's not very useful.)

I do think the basic motivation of Unicode codepoints somehow being a better / more correct way to interact with strings is the same in both languages, though. Certainly a lot of people, including the grandparent comment, defend Rust strings using exactly the same arguments as Python 3 strings.