The only encoding which is compatible with "every operating system in the world" is no enforced encoding at all, and you can do very few "string-like" operations with such a type.
Even Python, well-known for being a very usable language, distinguishes between strings (which are Unicode, but not necessarily UTF-8) and bytes, which you need to use if you're interacting directly with the OS.
The only real difference between the two is the looseness with which Python lets you work with them, by virtue of being dynamically typed and having a large standard library that papers over some of the details.
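As a quick illustration of that split (my own sketch, not part of the original comment): Python hands you str or bytes depending on which type you use when talking to the filesystem.

```python
import os

# Passing a str path gets filenames decoded with the filesystem encoding;
# passing a bytes path gets the raw, undecoded names the OS actually reported.
print(os.listdir("."))   # e.g. ['notes.txt', ...] as str
print(os.listdir(b"."))  # e.g. [b'notes.txt', ...] as bytes
```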
> The only encoding which is compatible with "every operating system in the world" is no enforced encoding at all, and you can do very few "string-like" operations with such a type.
People who like "list of Unicode code points" string types in languages like Rust and Python 3 always say this, but I'm never sure what operations they think are enabled by them.
In the "bag of bytes that's probably UTF-8" world, you can safely concatenate strings, compare them for equality, search for substrings, evaluate regular expressions, and so on. In the "list of code points" world, you can do... what else exactly?
Many things that users think of as single characters are composed of multiple code points, so the "list of code points" representation does not allow you to truncate strings, reverse them, count their length, or really do anything else that involves the user-facing idea of a "character". You can iterate over each of the code points in a string, but... that's almost circular? Maybe the bytes representation is better because it makes it easier to iterate over all the bytes in a string. Neither of those is an especially useful operation on its own.
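As a quick sketch of that (my own example, not the commenter's): a single user-perceived character can be several code points, and all the code-point-level operations mangle it.

```python
# "é" written as 'e' followed by a combining acute accent: one character to the
# user, two code points to Python 3.
s = "e\u0301"
print(len(s))   # 2, not 1
print(s[::-1])  # '\u0301' + 'e': the accent now combines with nothing
print(s[:1])    # 'e': truncating by code points silently drops the accent
```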
> In the "bag of bytes that's probably UTF-8" world, you can safely concatenate strings, compare them for equality, search for substrings, evaluate regular expressions,
No you can't (except byte-level equality). If your regex is "abc" and the last byte of an emoji is the same as 'a' and the emoji is followed by "bc", it does the wrong thing.
The last byte of an emoji is never the same as 'a'. UTF-8 is self-synchronizing: a trailing byte can never be misinterpreted as the start of a new code point.
This makes `memmem()` a valid substring search on UTF-8! With most legacy multi-byte encodings this would fail, but with UTF-8 it works!
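A small sketch of why that works (my example): every byte of a multi-byte UTF-8 sequence has its high bit set, so none of them can ever be mistaken for an ASCII byte like 'a' (0x61).

```python
emoji = "\U0001F642".encode("utf-8")   # 🙂 -> b'\xf0\x9f\x99\x82'
print(all(b >= 0x80 for b in emoji))   # True: no byte of it collides with ASCII
haystack = emoji + b"bc"
print(b"abc" in haystack)              # False: byte-level search cannot match across the emoji
```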
You still have this problem in the "list of Unicode code points" world, since many multi-code-point emoji sequences appear to users as a single character, but start and end with code points that are valid emojis on their own.
Python 3 believes that the string "[Saint Lucia flag emoji][Andorra flag emoji]" contains a Canada flag emoji.
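Concretely (my sketch of that example): flag emoji are pairs of Regional Indicator code points, so a code-point-level substring check happily matches across the boundary between two adjacent flags.

```python
saint_lucia = "\U0001F1F1\U0001F1E8"  # Regional Indicators L + C
andorra     = "\U0001F1E6\U0001F1E9"  # Regional Indicators A + D
canada      = "\U0001F1E8\U0001F1E6"  # Regional Indicators C + A

both = saint_lucia + andorra          # code points: L, C, A, D
print(canada in both)                 # True: the C + A pair straddles the two flags
```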
> People who like "list of Unicode code points" string types in languages like Rust and Python 3
Rust and Python 3 have very different string representations. Rust's String is a bunch of UTF-8 encoded bytes. Python 3's str is a sequence of code points, and its in-memory size depends on the widest code point it contains.
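For instance (my sketch, and CPython-specific behaviour from PEP 393): CPython stores 1, 2, or 4 bytes per code point depending on the widest code point in the string.

```python
import sys

print(sys.getsizeof("a" * 1000))           # roughly 1 byte per code point
print(sys.getsizeof("\u0394" * 1000))      # roughly 2 bytes per code point
print(sys.getsizeof("\U0001F642" * 1000))  # roughly 4 bytes per code point
```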
Yeah, implementation-wise, Rust's version of this idea is a little better, since at least you can convert their strings to a bag-of-bytes/OsStr-style representation almost for free. (And to their credit, Rust's docs for the chars() string method discuss why it's not very useful.)
I do think the basic motivation of Unicode codepoints somehow being a better / more correct way to interact with strings is the same in both languages, though. Certainly a lot of people, including the grandparent comment, defend Rust strings using exactly the same arguments as Python 3 strings.