I dunno about most languages - both the JVM and the CLR use UTF-16 for their internal representation, for example, hence every language that primarily targets those does the same; and that's a huge slice of the market right there.

Regarding Windows, the page you've linked to doesn't recommend using CP65001 and the -A functions over the -W ones. It just says that if you already have code written with UTF-8 in mind, then this is an easy way to port it to modern Windows, because now you actually fully control the codepage for your app (whereas previously it was a user setting, exposed in the UI even, so you couldn't rely on it). But internally everything is still UTF-16, so far as I know, and all the -A functions basically just convert to that and call the -W variant in turn. Indeed, that very page states that "Windows operates natively in UTF-16"!
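
Conceptually, an -A wrapper boils down to something like this (an illustrative Rust sketch with made-up function names, not the actual Windows shim code):

    // Purely illustrative: what an -A wrapper conceptually does once the
    // process codepage is CP65001 (UTF-8). Names are hypothetical stand-ins.
    fn some_api_w(wide: &[u16]) {
        // stand-in for a real -W function, e.g. CreateFileW(wide.as_ptr(), ...)
        let _ = wide;
    }

    fn some_api_a(narrow: &str) {
        // convert the "ANSI" (here: UTF-8) argument to UTF-16, NUL-terminate
        // it, and delegate to the -W variant
        let wide: Vec<u16> = narrow.encode_utf16().chain(std::iter::once(0)).collect();
        some_api_w(&wide);
    }

    fn main() {
        some_api_a("C:\\temp\\файл.txt");
    }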

FWIW I personally hate UTF-16 with a passion and want to see it die sooner rather than later - not only is it an ugly hack, it's a hack whose whole point is to keep doing the Wrong Thing easy. I just don't think that it'll happen all that fast, so for now, accommodations must be made. IMO Python has the right idea in principle by allowing multiple internal encodings for strings, but not exposing them in the public API, even for native code.



Neither the JVM nor the CLR guarantees a UTF-16 representation.

From Java 9 onwards, the JVM defaults to compact strings, which means a mixed ISO-8859-1/UTF-16 representation chosen per string. The command-line argument -XX:-CompactStrings disables that.

The CLR I don’t know about. But even presuming it’s still pure UTF-16, it could change that, as it’s an implementation detail.

(As for UTF-16, not only is it an ugly hack, it’s a hack that ruined Unicode for all the other transformation formats.)
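
To make that concrete, a small illustrative snippet (Rust, purely because it makes the code-unit views easy to show):

    fn main() {
        let s = "🙂"; // U+1F642, outside the Basic Multilingual Plane
        // UTF-16 needs a surrogate pair for it, so it's variable-width anyway:
        let utf16: Vec<u16> = s.encode_utf16().collect();
        println!("{:x?}", utf16); // [d83d, de42]
        // ...and the surrogate range it relies on is carved out of the code
        // space for everyone: U+D800..=U+DFFF are not valid scalar values in
        // any Unicode encoding form.
        assert_eq!(char::from_u32(0xD83D), None);
        // UTF-8 is variable-width too, but reserves no code points for itself:
        println!("{:x?}", s.as_bytes()); // [f0, 9f, 99, 82]
    }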

I don’t think Python’s approach was at all sane. The root problem is that they made strings sequences of Unicode code points rather than of Unicode scalar values, or even UTF-16 code units. (I have a vague recollection of reading some years back that during the py3k endeavour they didn’t have or consult with any Unicode experts, and realise with hindsight that what they went with is terrible.) This bad foundation just breaks everything, so that they couldn’t switch to a sane internal representation.

I described the current internal representation as mixed ASCII/UTF-16/UTF-32, but having gone back and read PEP 393 (implemented in Python 3.3) now, I’d forgotten just how hideous it is: mixed Latin-1/UCS-2/UCS-4, plus extra bits and data assigned to things like whether it’s ASCII, and its UTF-8 length… sometimes. It ends up fiendishly complex in their endeavour to make it consistent across narrow and wide builds while using less memory, and it’s typically a fair bit slower than what they had before.
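
Stripped of all the flags and caching, the kind-selection idea is roughly this (an illustrative Rust sketch, not CPython’s actual structures; it’s the same pick-the-narrowest-unit trick as Java’s compact strings, with one extra width):

    // Illustrative only: PEP 393 stores each string in the narrowest
    // fixed-width unit that can hold its largest code point.
    enum StrKind {
        Latin1(Vec<u8>),  // every code point <= U+00FF
        Ucs2(Vec<u16>),   // every code point <= U+FFFF, no surrogate pairs
        Ucs4(Vec<u32>),   // anything beyond the BMP
    }

    fn narrowest(s: &str) -> StrKind {
        let max = s.chars().map(|c| c as u32).max().unwrap_or(0);
        if max <= 0xFF {
            StrKind::Latin1(s.chars().map(|c| c as u8).collect())
        } else if max <= 0xFFFF {
            StrKind::Ucs2(s.chars().map(|c| c as u16).collect())
        } else {
            StrKind::Ucs4(s.chars().map(|c| c as u32).collect())
        }
    }

    fn main() {
        assert!(matches!(narrowest("héllo"), StrKind::Latin1(_)));
        assert!(matches!(narrowest("héllo🙂"), StrKind::Ucs4(_)));
    }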

Many languages have had an undefined internal representation, and it’s fairly consistently caused them grief when they later want to change it, because people too often inadvertently come to depend on at least the performance characteristics of whatever the internal representation happens to be.

By comparison, Rust strings have been transparent UTF-8 from the start (admittedly with the benefit of starting later than most, by which time UTF-8 being the sane choice was clear), which appropriately guides people away from doing bad things by API. The exception is code-unit-wise indexing via string[i..j] and string.len(), which I’m not overly enamoured of: such indexing is essentially discontinuous in the presence of multibyte characters, panicking on a slice into the middle of a scalar value, and it’s too easy to mistake for code point or scalar value indexing. Still, you know what you’re dealing with, it’s roughly the simplest sane thing possible, and you can optimise for it. And Rust can’t change its string representation, because it’s public rather than an implementation detail.
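
A tiny illustrative example of that discontinuity (nothing beyond the standard library):

    fn main() {
        let s = "héllo";
        assert_eq!(s.len(), 6);           // UTF-8 code units (bytes), not characters
        assert_eq!(s.chars().count(), 5); // scalar values
        assert_eq!(&s[0..1], "h");        // fine: the range ends on a char boundary
        // Byte 2 falls in the middle of the two-byte 'é', so this would panic:
        // let _ = &s[1..2];
        // Going by scalar value means iterating instead:
        assert_eq!(s.chars().nth(1), Some('é'));
    }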



