I think the main alternative design is to treat strings like in Rust or Go.
The problem with the “array of code points” idea is that you end up with the most general implementation, which is a UTF-32 string, plus the most compact (and usually fastest) implementation, which stores one byte per code point for ASCII/Latin-1 text, and maybe a UCS-2 variant thrown in for good measure. These all have the same asymptotic performance characteristics, but they let ASCII strings (which are extremely common) be stored in less memory. The cost is that you now have two or three different string representations floating around. This approach is used by Python and Java, for example.
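A minimal sketch of that idea, using a made-up FlexibleString type (this is not how CPython or the JVM actually lay strings out, just the shape of the tradeoff):

```rust
// Hypothetical "flexible representation": one logical string type, several
// physical layouts chosen by the widest code point present.
enum FlexibleString {
    Latin1(Vec<u8>),  // every code point <= U+00FF, 1 byte each
    Ucs2(Vec<u16>),   // every code point <= U+FFFF, 2 bytes each
    Ucs4(Vec<u32>),   // anything else, 4 bytes each
}

impl FlexibleString {
    fn new(s: &str) -> Self {
        let max = s.chars().map(|c| c as u32).max().unwrap_or(0);
        match max {
            0..=0xFF => FlexibleString::Latin1(s.chars().map(|c| c as u8).collect()),
            0x100..=0xFFFF => FlexibleString::Ucs2(s.chars().map(|c| c as u16).collect()),
            _ => FlexibleString::Ucs4(s.chars().map(|c| c as u32).collect()),
        }
    }

    // O(1) access to the nth code point: the property this design is buying.
    fn code_point(&self, n: usize) -> Option<u32> {
        match self {
            FlexibleString::Latin1(v) => v.get(n).map(|&b| b as u32),
            FlexibleString::Ucs2(v) => v.get(n).map(|&u| u as u32),
            FlexibleString::Ucs4(v) => v.get(n).copied(),
        }
    }
}

fn main() {
    assert_eq!(FlexibleString::new("abc").code_point(1), Some('b' as u32));
    assert_eq!(FlexibleString::new("a😀").code_point(1), Some(0x1F600));
}
```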
The Rust / Go approach is to assume that you don’t need O(1) access to the Nth code point in a string, which is probably reasonable, since that’s rarely necessary or even useful. You get a lot of complexity savings from only using one encoding, and the main tradeoff is that text in some scripts (CJK, for example) takes about 50% more space in memory than it would in UTF-16.
Python and Java both date back to an era where fixed-width string encodings were the norm.
I'd love to hear where you think the complaints come from. They seem fine to me, and I have voiced plenty of criticism about other parts of Rust. They work more or less how I expect: you have an array of bytes, which can either be a reference (&str) or owned / mutable / growable (String).
The only unusual thing about Rust is that it validates that the bytes are UTF-8.
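A quick sketch of what that validation looks like in practice (nothing here beyond the standard library):

```rust
fn main() {
    let bytes = vec![0x66, 0x6f, 0x6f, 0xFF]; // "foo" followed by a byte that can't appear in UTF-8
    // String::from_utf8 validates the bytes instead of trusting the caller:
    assert!(String::from_utf8(bytes.clone()).is_err());
    // A lossy conversion replaces the invalid sequence with U+FFFD:
    assert_eq!(String::from_utf8_lossy(&bytes), "foo\u{FFFD}");
}
```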
Mostly around usability and learning curve. I wasn't sure if the post was meant as a total endorsement of Rust's strings or just the encoding aspect of them.
What people complain about with Rust strings is that there are so many different types: &str vs. String, OsString / OsStr, and so on. The encoding of the strings isn't the issue.
Encoding might not be the whole issue, but "Rust mandates that the 'string' type must only contain valid UTF-8, which is incompatible with every operating system in the world" is the reason why OsString is a separate type.
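A Unix-only sketch (the byte values are made up) of the kind of data that forces the separate type:

```rust
#[cfg(unix)]
fn main() {
    use std::ffi::OsString;
    use std::os::unix::ffi::OsStringExt;

    // A file name the OS is perfectly happy to hand you, but which is not valid UTF-8:
    let name = OsString::from_vec(vec![b'f', b'o', b'o', 0xFF]);
    assert!(name.to_str().is_none());                  // can't be viewed as a &str
    assert_eq!(name.to_string_lossy(), "foo\u{FFFD}"); // but can still be displayed lossily
}

#[cfg(not(unix))]
fn main() {}
```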
The only encoding which is compatible with "every operating system in the world" is no enforced encoding at all, and there are very few "string-like" operations you can do with such a type.
Even Python, well-known for being a very usable language, distinguishes between strings (which are Unicode, but not necessarily UTF-8) and bytes, which you need to use if you're interacting directly with the OS.
The only real difference between the two is the looseness with which Python lets you work with them, by virtue of being dynamically typed and having a large standard library that papers over some of the details.
> The only encoding which is compatible with "every operating system in the world" is no enforced encoding at all, and there are very few "string-like" operations you can do with such a type.
People who like "list of Unicode code points" string types in languages like Rust and Python 3 always say this, but I'm never sure what operations they think are enabled by them.
In the "bag of bytes that's probably UTF-8" world, you can safely concatenate strings, compare them for equality, search for substrings, evaluate regular expressions, and so on. In the "list of code points" world, you can do... what else exactly?
Many things that users think of as single characters are composed of multiple code points, so the "list of code points" representation does not allow you to truncate strings, reverse them, count their length, or really do anything else that involves the user-facing idea of a "character". You can iterate over each of the code points in a string, but... that's almost circular? Maybe the bytes representation is better because it makes it easier to iterate over all the bytes in a string. Neither of those is an especially useful operation on its own.
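A small illustration of that mismatch in Rust, using a combining accent as the multi-code-point "character":

```rust
fn main() {
    let s = "e\u{301}"; // "é" written as 'e' + COMBINING ACUTE ACCENT: one user-perceived character
    assert_eq!(s.chars().count(), 2); // ...but two code points
    assert_eq!(s.len(), 3);           // ...and three UTF-8 bytes
    // Reversing by code points detaches the combining mark from its base letter:
    let reversed: String = s.chars().rev().collect();
    assert_eq!(reversed, "\u{301}e"); // no longer renders as "é"
}
```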
> In the "bag of bytes that's probably UTF-8" world, you can safely concatenate strings, compare them for equality, search for substrings, evaluate regular expressions,
No, you can't (except byte-level equality). If your regex is "abc", the last byte of an emoji is the same as 'a', and the emoji is followed by "bc", it does the wrong thing.
The last byte of an emoji is never the same as 'a'. UTF-8 is self-synchronizing: a trailing byte can never be misinterpreted as the start of a new codepoint.
This makes `memmem()` a valid substring search on UTF-8! With most legacy multi-byte encodings this would fail, but with UTF-8 it works!
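A quick sketch of why, using plain byte windows as a stand-in for `memmem()`:

```rust
fn main() {
    // '😀' is F0 9F 98 80 in UTF-8; its continuation bytes are all >= 0x80,
    // so none of them can be mistaken for ASCII 'a' (0x61).
    let haystack = "😀bc";
    // Byte-level substring search finds no false "abc":
    assert!(!haystack.as_bytes().windows(3).any(|w| w == b"abc"));
    // Rust's str::contains is exactly this kind of byte-oriented search:
    assert!(!haystack.contains("abc"));
    assert!("xabcx".contains("abc"));
}
```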
You still have this problem in the "list of Unicode code points" world, since many multi-code-point emoji sequences appear to users as a single character, but start and end with code points that are valid emojis on their own.
Python 3 believes that the string "[Saint Lucia flag emoji][Andorra flag emoji]" contains a Canada flag emoji.
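The code points (and therefore the bytes) of the Canada flag really do appear contiguously inside the Saint Lucia + Andorra sequence, so plain substring search reports a match in either representation. A small Rust sketch:

```rust
fn main() {
    // Flags are pairs of regional-indicator code points:
    // Saint Lucia = (L, C), Andorra = (A, D), Canada = (C, A).
    let st_lucia_andorra = "\u{1F1F1}\u{1F1E8}\u{1F1E6}\u{1F1E9}";
    let canada = "\u{1F1E8}\u{1F1E6}";
    // The (C, A) pair sits contiguously in the middle of (L, C, A, D),
    // so both code-point-level and byte-level substring search "find" Canada:
    assert!(st_lucia_andorra.contains(canada));
}
```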
> People who like "list of Unicode code points" string types in languages like Rust and Python 3
Rust and Python 3 have very different string representations. Rust's String is a bunch of UTF-8-encoded bytes. Python 3's str is a sequence of code points, and the per-code-point storage width (1, 2, or 4 bytes) is chosen based on the string's contents.
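Concretely, in Rust the length is a byte count and code points have to be counted by iterating:

```rust
fn main() {
    let s = "héllo";
    assert_eq!(s.len(), 6);           // str::len is the number of UTF-8 bytes
    assert_eq!(s.chars().count(), 5); // code points are counted by walking the string
}
```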
Yeah, implementation-wise, Rust's version of this idea is a little better, since at least you can convert its strings to a bag-of-bytes/OsStr-style representation almost for free. (And to their credit, Rust's docs for the chars() string method discuss why it's not very useful.)
I do think the basic motivation of Unicode codepoints somehow being a better / more correct way to interact with strings is the same in both languages, though. Certainly a lot of people, including the grandparent comment, defend Rust strings using exactly the same arguments as Python 3 strings.
Rust and Go string implementations are very different.
Rust strings are safe, have a rich API, and prevent you from corrupting their contents unless you go out of your way to do so. Go strings are a joke: they're extremely bare-bones and won't do any validation if you slice them incorrectly.
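For instance, Rust refuses to slice through the middle of a code point; a small sketch:

```rust
fn main() {
    let s = "héllo"; // 'é' occupies bytes 1..3 in UTF-8
    assert_eq!(&s[0..1], "h");       // slicing on a char boundary works
    assert!(!s.is_char_boundary(2)); // byte 2 is inside 'é'...
    assert_eq!(s.get(0..2), None);   // ...so a checked slice is rejected,
    // and the indexing form &s[0..2] would panic rather than silently
    // produce an invalid string.
}
```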
Thanks for your response. Personally I fall into the "strings are arrays of bytes" camp (which is also shared by Go). A difference between my view and that of the Go designers is that I don't feel that it is important to support Unicode by default and am perfectly happy to assume that every character corresponds to a single byte.

Obviously that makes internationalization harder, but the advantage is that strings are much simpler to reason about. For example, the number of characters in the string is simply the length of the string. I would be fine having a separate Unicode string type in the standard library for those instances when you really need Unicode; this design makes the common case much simpler at the expense of making the rare case harder.

I also don't see that mutability is such a huge deal unless you absolutely insist that your language support string interning.
>I would be fine having a separate Unicode string type in the standard library for those instances when you really need Unicode; this design makes the common case much simpler at the expense of making the rare case harder.
Even as a native English speaker, I'm extremely uncomfortable with the idea that we're going to make software even more difficult to internationalize than it already is by using completely separate types for ASCII/Latin1-only text and Unicode.
And it's a whole different level of Anglocentric to portray non-English languages as the "rare" case.
So much this. Thinking that only America and the UK matter is something that was forgivable 40 years ago but not today. It’s even more bizarre because of what you point out: emoji simply can’t be represented if you treat every character as a single byte. And lastly, even if you only consider input boxes that don’t accept emoji, like names or addresses, you have to remember that America is a nation of immigrants. A lot of folks have names that aren’t going to fit in ASCII.
And this stuff actually matters! In a legal, this-will-cost-us-money kind of way! In 2019 a bank in the EU was penalised because they wouldn’t update a customer’s name to include diacritics (like á, è, ô, ü, ç). Their systems couldn’t support the diacritics because they were built in the 90s with an encoding invented in the 60s. Not their fault, but they were still penalised. (https://shkspr.mobi/blog/2021/10/ebcdic-is-incompatible-with...)
It is far more important that strings be UTF-8 encoded than that they be indexable like arrays. Rust gets this right and I hope future languages will too.
On paper you're not wrong, but strings used for localized text are a special case you can deal with separately. Most strings that will cause you problems are, you know, technical: logs, names of subsystems, client IDs, client-sourced API-provided values whose format changes from client to client, etc. Those, in my experience, are always ASCII even in China, exactly because nobody wants to deal with too much crap.
Display strings are simpler to manipulate in most cases: load a string from a file or form, store it back verbatim in the DB or in memory; you barely do anything other than straight copying, right?
The way I do it in Java is that I always assume and enforce that my strings are single-byte ASCII, and if I want to display something localized, it never really goes through any complex logic where I need to know the encoding: I copy the content along with encoding metadata, and the other side just displays it.
"strings are arrays of bytes" combined with the assumption that "characters are a single byte" sounds basically the same as the "array of code points" that the parent comment is disagreeing with
Sure, but if you're insisting that the string be represented as one byte per character, you end up with the exact same properties with "array of code points" and "array of bytes"
No, it's impossible to do random access to retrieve a character if you are dealing with code points, because code points do not have a fixed byte size. I thought this was a good intro: <https://tonsky.me/blog/unicode/>.
> For example, the number of characters in the string is simply the length of the string.
For I/O you need the number of bytes it occupies in memory, and that's always known.
For text processing, you don't actually need to know the length of the text. What you actually need is the ability to determine the byte boundaries between each code point and most importantly each grapheme cluster.
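A sketch of that in Rust: code-point boundaries come from the standard library, while grapheme-cluster boundaries need a Unicode segmentation library (the third-party `unicode-segmentation` crate is assumed here):

```rust
use unicode_segmentation::UnicodeSegmentation; // external crate, assumed available

fn main() {
    let s = "a\u{301}bc"; // "ábc" written as 'a' + combining acute accent
    // Byte offsets of each code point:
    let offsets: Vec<usize> = s.char_indices().map(|(i, _)| i).collect();
    assert_eq!(offsets, [0, 1, 3, 4]);
    // Grapheme clusters: the base letter plus its combining mark is one "character":
    let graphemes: Vec<&str> = s.graphemes(true).collect();
    assert_eq!(graphemes, ["a\u{301}", "b", "c"]);
}
```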
> when you really need Unicode
You always need Unicode. Sorry but it's almost 2024 and I shouldn't even have to justify this.
For I/O, you don't need "strings" at all, you need byte buffers. For text, you need Unicode and everything else is just fundamentally wrong.
> Obviously that makes internationalization harder, but the advantage is that strings are much simpler to reason about.
Internationalization relative to what? Anyway, just pick any language in the world, i.e. an arbitrary one: can you represent it using just ASCII? If so, I would like to know what language that is. It seems that Rotokas can be.[1] That’s about 5K speakers. So you can make computer programs for them.
Out of interest, would you also say that "images are arrays of bytes"?
If not, what's the semantic difference?
For me, strings represent text, which is fundamentally linked to language (and all of its weird vagueness and edge-cases). I feel like there's a "Fallacies programmers believe about text" list that should exist somewhere, containing items like "text has a defined length" and "two identical pieces of text mean the same thing".
So whilst it's nice to have an implementation that lets you easily "seek to the 5th character", it's not always the case that this is a well-defined thing.
I love when the writing gets visibly more unhinged and frustrated with each invalidated assumption. It's like the person's mind is desperately trying to find some small sliver of truth to hold onto but it can't because the rug is getting constantly pulled out from under it.
The short answer is somewhere between Go and Rust strings; both are newer languages that use UTF-8 for the interior representation and also favor it for exterior encoding.
Roughly speaking, Java and JavaScript are in the UTF-16 camp, and Python 2 and 3 are in the code points camp. C and C++ have unique problems, but you could also put them in the code points camp.
So there are at least 3 different camps, and a whole bunch of weird variations, like string width being compile-time selectable in interpreters.
One main design issue is that string APIs shouldn't depend on a mutable global variable -- the default encoding, or default file system encoding. That's an idea that's a disaster in C, and also a disaster in Python.
It leads to buggy programs. Go and Rust differ in their philosophies, but neither of them has that design problem.
Raku introduced the concept of NFG - Normal Form Grapheme - as a way to represent any Unicode string in its logical ‘visual character’ grapheme form. Sequences of combining characters that don’t have a canonical single codepoint form are given a synthetic codepoint so that string methods including regexes can operate on grapheme characters without ever causing splitting side effects.
Of course there are methods for manipulating at the codepoint level as well.
Unfortunately, strings cut across at least 3 different problems:
* charset encoding. Cases worth supporting include Ascii, Latin1, Xascii, WeirdLegacyStatefulEncoding, WhateverMyLocaleSaysExceptNotReally, UTF8, UTF16, UCS2, UTF32, and sloppy variants thereof. Note that not supporting sloppiness means it is impossible to access a lot of old data (for example, `git` committers and commit messages). Note that it is impossible to make a 1-to-1 mapping between sloppy UTF-8 and sloppy UTF-16, so if all strings have a single representation (unless it is some weird representation not yet mentioned), it is either impossible to support all strings encountered on Windows platforms, or impossible to support all strings encountered on non-Windows platforms. I am a strong proponent of APIs supporting multiple compile-time-known string representations, with transparent conversion where safe.
* ownership. Cases worth supporting: Value (but finite size), Refcounted (yes, this is important, the problem with std::string is that it was mutable), Tail or full Slice thereof, borrowed Zero-terminated or X(not) terminated, Literal (known statically allocated), and Alternating (the one that does SSO, switching between Value and Refcounted; IME it is important for Refcounted to efficiently support Literal). Note that there is no need for an immutable string to support Unique ownership. That's 8 different ownership policies, and if your Z/X implementation doesn't support recovering an optional owner, you also need to support at least one "maybe owned, maybe borrowed" (needed for things like efficient map insertion if the key might already exist; making the insert function a template does not suffice to handle all cases). It is important that, to the extent possible, all these (immutable) strings offer the same API, and can be converted implicitly where safe (exception: legacy code might make implicit conversion to V useful, despite being technically wrong).
(there should be some Mutable string-like thing, but it need not provide the full string API, only push/pop off the end followed by conversion to R; consider particularly the implementation of comma-separated list stringification)
* domain meaning. The language itself should support at least Format strings for printf/scanf/strftime etc. (these "should" be separate types, but if relying on the C compiler to check them for you, they don't actually have to be). Common library-supported additions include XML, JSON, and SQL strings, to make injection attacks impossible at the type level (see the sketch after this list). Thinking of compilers (but not limited to them), there also need to be dedicated types for "string representing a filepath" vs "string representing a file's contents" (the web's concept of "blob URL" is informative, but suffers from overloading the string type in the first place). Importantly, it must be easy to write literals of the appropriate type and convert explicitly as needed, so there should not be any file-related APIs that take strings.
(related, it's often useful to have the concept of "single-line string" and "word" (which, among other cases, makes it possible to ); the exact definition thereof depending on context. So it may be useful to be able to tag strings as "all characters (or, separately, the first character or the last character (though "last character" is far less useful)) are one of [abc...]"; reasonable granularity being: NUL, whitespace characters individually, other C0 controls, ASCII symbols individually, digit 0, digit 1, digits 2-7, digits 8-9, letters a-f, letters A-F, letters g-z, letters G-Z, DEL, C1 controls, other latin1, U+0100 through U+07FF, U+0800 through U+FFFF excluding surrogates, low surrogates, high surrogates, U+10000 through U+10FFFF, and illegal values U+110000 and higher (maybe splitting 31-bit from 32-bit?) (several of these are impossible under certain encodings and strictness levels). Actually supporting all of this in the compiler proper is complicated and likely to be deferred, but thinking about it informs both language and library design. Particularly, consider "partial template casting")
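As a sketch of the "domain meaning" point above (these newtypes are hypothetical, not taken from any real library):

```rust
// Hypothetical newtype: the type system distinguishes raw user text from
// already-escaped SQL, so raw text can't be spliced into a query by accident.
struct SqlFragment(String);

fn sql_string_literal(raw: &str) -> SqlFragment {
    // Naive single-quote doubling, purely for illustration.
    SqlFragment(format!("'{}'", raw.replace('\'', "''")))
}

fn where_name_equals(name: SqlFragment) -> String {
    format!("SELECT * FROM users WHERE name = {}", name.0)
}

fn main() {
    let user_input = "O'Brien'; DROP TABLE users; --";
    // Passing user_input directly to where_name_equals would not compile;
    // you are forced to go through the escaping constructor first:
    let query = where_name_equals(sql_string_literal(user_input));
    println!("{query}");
}
```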
I agree with all of this. I remember way back when I was doing CORBA programming (argh!) thinking "can these stupid bastards not specify a simple string class??" To have the most commonly used data type be so complicated makes me think we have got things deeply wrong somewhere.
The Tower of Babel story seems to describe where we went wrong.
I’m only partially kidding, because I think this is a fundamental and ancient issue with writing (information) technology. As soon as different groups went to encode their language, this problem was born.