Hacker News

Working with strings is one of the most common complaints about Rust, though. Unless you're only talking about the implementation of it?


Would love to hear where you think the complaints come from. They seem fine to me, and I have voiced plenty of criticism about the other parts of Rust. They work more or less how I expect—you have an array of bytes, which can either be a reference (&str) or owned / mutable / growable (String).

The only unusual thing about Rust is that it validates that the bytes are UTF-8.
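To illustrate what that validation means in practice, here's a rough Python analogue: Rust's String::from_utf8 rejects byte sequences that aren't valid UTF-8, and Python's bytes.decode behaves similarly (this is just an illustrative sketch, not Rust's actual implementation).

```python
# Valid UTF-8 round-trips cleanly.
valid = "héllo".encode("utf-8")          # b'h\xc3\xa9llo'
assert valid.decode("utf-8") == "héllo"

# 0xC3 opens a 2-byte sequence, but 'l' (0x6C) is not a valid
# continuation byte, so decoding fails -- as Rust's validation would.
invalid = b"h\xc3llo"
try:
    invalid.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised
```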


Mostly around usability and learning curve. I wasn't sure if the post was meant as a total endorsement of Rust's strings or just the encoding aspect of them


What people complain about with Rust strings is that there are so many different types, like &str vs String, and OsString / OsStr. The encoding of the strings isn't the issue.


Encoding might not be the whole issue, but "Rust mandates that the 'string' type must only contain valid UTF-8, which is incompatible with every operating system in the world" is the reason why OsString is a separate type.


The only encoding which is compatible with "every operating system in the world" is no enforced encoding at all, and you can do very little "string-like" operations with such a type.

Even Python, well-known for being a very usable language, distinguishes between strings (which are unicode, but not utf-8 necessarily) and bytes, which you need to use if you're interacting directly with the OS.

The only real difference between the two is the looseness with which Python lets you work with them, by virtue of being dynamically typed and having a large standard library that papers over some of the details.
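The str/bytes split described above is easy to see directly (a minimal sketch):

```python
s = "café"               # str: a sequence of Unicode code points
b = s.encode("utf-8")    # bytes: what you hand to the OS or the network

assert isinstance(s, str) and isinstance(b, bytes)
assert len(s) == 4       # four code points
assert len(b) == 5       # 'é' encodes to two bytes in UTF-8
assert b.decode("utf-8") == s
```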


> The only encoding which is compatible with "every operating system in the world" is no enforced encoding at all, and you can do very little "string-like" operations with such a type.

People who like "list of Unicode code points" string types in languages like Rust and Python 3 always say this, but I'm never sure what operations they think are enabled by them.

In the "bag of bytes that's probably UTF-8" world, you can safely concatenate strings, compare them for equality, search for substrings, evaluate regular expressions, and so on. In the "list of code points" world, you can do... what else exactly?

Many things that users think of as single characters are composed of multiple code points, so the "list of code points" representation does not allow you to truncate strings, reverse them, count their length, or do really anything else that involves the user-facing idea of a "character". You can iterate over each of the code points in a string, but... that's almost circular? Maybe the bytes representation is better because it makes it easier to iterate over all the bytes in a string. Neither of those is an especially useful operation on its own.
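A concrete demonstration of that point, using a decomposed accent in Python (where strings are sequences of code points):

```python
# 'é' written as a decomposed pair: 'e' + U+0301 COMBINING ACUTE ACCENT
s = "caf" + "e\u0301"
assert len(s) == 5        # 5 code points, though users see 4 "characters"

# Reversing by code point detaches the combining accent from its base
# letter: the accent ends up before 'e' instead of after it.
reversed_s = s[::-1]
assert reversed_s == "\u0301efac"
assert reversed_s != "e\u0301fac"   # not what a user would call "reversed"
```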


> In the "bag of bytes that's probably UTF-8" world, you can safely concatenate strings, compare them for equality, search for substrings, evaluate regular expressions,

No you can't (except byte-level equality). If your regex is "abc", and the last byte of an emoji is the same as 'a', and the emoji is followed by "bc", it does the wrong thing.


You can.

The last byte of an emoji is never the same as 'a'. UTF-8 is self-synchronizing: a trailing byte can never be misinterpreted as the start of a new codepoint.

This makes `memmem()` a valid substring search on UTF-8! With most legacy multi-byte encodings this would fail, but with UTF-8 it works!
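The self-synchronizing property is easy to check: UTF-8 continuation bytes are always in 0x80–0xBF, while ASCII bytes are always 0x00–0x7F, so a naive byte search for an ASCII needle (which is what `memmem()` does) can never match inside a multi-byte sequence. A quick Python sketch:

```python
haystack = "😀abc".encode("utf-8")   # b'\xf0\x9f\x98\x80abc'

# Every byte of the emoji is >= 0x80, so none of them can collide
# with an ASCII byte like 'a' (0x61).
assert all(b >= 0x80 for b in "😀".encode("utf-8"))

# A plain byte-level search (the memmem() idea) finds only the real
# "abc", which starts after the emoji's 4 bytes.
assert haystack.find(b"abc") == 4
```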


Assuming that your strings are normalized, otherwise precomposed characters will not match decomposed characters.
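The normalization caveat, shown with Python's unicodedata module:

```python
import unicodedata

precomposed = "caf\u00e9"    # é as a single code point (U+00E9)
decomposed = "cafe\u0301"    # 'e' + U+0301 COMBINING ACUTE ACCENT

# Byte-for-byte (or code-point-for-code-point) comparison fails...
assert precomposed != decomposed

# ...until both sides are normalized to the same form.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```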


You still have this problem in the "list of Unicode code points" world, since many multi-code-point emoji sequences appear to users as a single character, but start and end with code points that are valid emojis on their own.

Python 3 believes that the string "[Saint Lucia flag emoji][Andorra flag emoji]" contains a Canada flag emoji.
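You can verify this directly: flag emoji are pairs of regional indicator code points, so concatenating two flags can create a new valid pair at the seam.

```python
saint_lucia = "\U0001F1F1\U0001F1E8"   # regional indicators L, C
andorra = "\U0001F1E6\U0001F1E9"       # regional indicators A, D
canada = "\U0001F1E8\U0001F1E6"        # regional indicators C, A

# The concatenation is ...L C A D..., so a code-point substring
# search finds C A -- the Canada flag -- spanning the two flags.
assert canada in (saint_lucia + andorra)
```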


> People who like "list of Unicode code points" string types in languages like Rust and Python 3

Rust and Python 3 have very different string representations. Rust's String is a bunch of UTF-8 encoded bytes. Python 3's str is a sequence of codepoints, and its in-memory size depends on the widest codepoint it contains.
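The representational difference can be sketched in Python (the getsizeof comparison is a CPython-specific detail of its flexible string representation, PEP 393):

```python
import sys

s = "héllo"
assert len(s) == 5                  # Python counts code points
assert len(s.encode("utf-8")) == 6  # a UTF-8 representation (like Rust's
                                    # String) needs 6 bytes for the same text

# CPython picks a per-string element width based on the widest code
# point, so non-Latin-1 text costs more memory per character.
assert sys.getsizeof("€" * 8) > sys.getsizeof("e" * 8)
```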


Yeah, implementation-wise, Rust's version of this idea is a little better, since at least you can convert their strings to a bag-of-bytes/OsStr-style representation almost for free. (And to their credit, Rust's docs for the chars() string method discuss why it's not very useful.)

I do think the basic motivation of Unicode codepoints somehow being a better / more correct way to interact with strings is the same in both languages, though. Certainly a lot of people, including the grandparent comment, defend Rust strings using exactly the same arguments as Python 3 strings.


Quote: which is incompatible with every operating system in the world

Should be some, not every, since there are OSes where string types are UTF-8, e.g. BeOS and Haiku



