I dunno about most languages - both the JVM and the CLR use UTF-16 for their internal representation, for example, hence every language that primarily targets those does the same; and that's a huge slice of the market right there.

Regarding Windows, the page you've linked to doesn't recommend using CP65001 and the -A functions over the -W ones. It just says that if you already have code written with UTF-8 in mind, then this is an easy way to port it to modern Windows, because now you actually fully control the codepage for your app (whereas previously it was a user setting, exposed in the UI even, so you couldn't rely on it). But internally everything is still UTF-16, so far as I know, and all the -A functions basically just convert to that and call the -W variant in turn. Indeed, that very page states that "Windows operates natively in UTF-16"!
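
Conceptually, an -A wrapper boils down to something like this (an illustrative Rust sketch with made-up function names, not the actual Windows shim code):

    // Purely illustrative: what an -A wrapper conceptually does once the
    // process codepage is CP65001 (UTF-8). Names are hypothetical stand-ins.
    fn some_api_w(wide: &[u16]) {
        // stand-in for a real -W function, e.g. CreateFileW(wide.as_ptr(), ...)
        let _ = wide;
    }

    fn some_api_a(narrow: &str) {
        // convert the "ANSI" (here: UTF-8) argument to UTF-16, NUL-terminate
        // it, and delegate to the -W variant
        let wide: Vec<u16> = narrow.encode_utf16().chain(std::iter::once(0)).collect();
        some_api_w(&wide);
    }

    fn main() {
        some_api_a("C:\\temp\\файл.txt");
    }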

FWIW I personally hate UTF-16 with a passion and want to see it die sooner rather than later - not only is it an ugly hack, it's a hack whose whole point is to keep doing the Wrong Thing easy. I just don't think that it'll happen all that fast, so for now, accommodations must be made. IMO Python has the right idea in principle by allowing multiple internal encodings for strings, but not exposing them in the public API, even for native code.



Neither the JVM nor the CLR guarantees a UTF-16 representation.

From Java 9 onwards, the JVM defaults to compact strings, which means a mixed ISO-8859-1/UTF-16 representation chosen per string. The command-line argument -XX:-CompactStrings disables that.

The CLR I don’t know about. But even presuming it’s still pure UTF-16, it could change that, as it’s an implementation detail.

(As for UTF-16, not only is it an ugly hack, it’s a hack that ruined Unicode for all the other transformation formats.)
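
To make that concrete, a small illustrative snippet (Rust, purely because it makes the code-unit views easy to show):

    fn main() {
        let s = "🙂"; // U+1F642, outside the Basic Multilingual Plane
        // UTF-16 needs a surrogate pair for it, so it's variable-width anyway:
        let utf16: Vec<u16> = s.encode_utf16().collect();
        println!("{:x?}", utf16); // [d83d, de42]
        // ...and the surrogate range it relies on is carved out of the code
        // space for everyone: U+D800..=U+DFFF are not valid scalar values in
        // any Unicode encoding form.
        assert_eq!(char::from_u32(0xD83D), None);
        // UTF-8 is variable-width too, but reserves no code points for itself:
        println!("{:x?}", s.as_bytes()); // [f0, 9f, 99, 82]
    }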

I don’t think Python’s approach was at all sane. The root problem is that they made strings sequences of Unicode code points rather than of Unicode scalar values, or even UTF-16 code units. (I have a vague recollection of reading some years back that during the py3k endeavour they didn’t have or consult with any Unicode experts, and realise with hindsight that what they went with is terrible.) This bad foundation just breaks everything, so that they couldn’t switch to a sane internal representation.

I described the current internal representation as mixed ASCII/UTF-16/UTF-32, but having gone back and read PEP 393 (implemented in Python 3.3) now, I’d forgotten just how hideous it is: mixed Latin-1/UCS-2/UCS-4, plus extra bits and data assigned to things like whether it’s ASCII, and its UTF-8 length… sometimes. It ends up fiendishly complex in their endeavour to make it consistent across narrow and wide builds while using less memory, and it’s typically a fair bit slower than what they had before.
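
Stripped of all the flags and caching, the kind-selection idea is roughly this (an illustrative Rust sketch, not CPython’s actual structures; it’s the same pick-the-narrowest-unit trick as Java’s compact strings, with one extra width):

    // Illustrative only: PEP 393 stores each string in the narrowest
    // fixed-width unit that can hold its largest code point.
    enum StrKind {
        Latin1(Vec<u8>),  // every code point <= U+00FF
        Ucs2(Vec<u16>),   // every code point <= U+FFFF, no surrogate pairs
        Ucs4(Vec<u32>),   // anything beyond the BMP
    }

    fn narrowest(s: &str) -> StrKind {
        let max = s.chars().map(|c| c as u32).max().unwrap_or(0);
        if max <= 0xFF {
            StrKind::Latin1(s.chars().map(|c| c as u8).collect())
        } else if max <= 0xFFFF {
            StrKind::Ucs2(s.chars().map(|c| c as u16).collect())
        } else {
            StrKind::Ucs4(s.chars().map(|c| c as u32).collect())
        }
    }

    fn main() {
        assert!(matches!(narrowest("héllo"), StrKind::Latin1(_)));
        assert!(matches!(narrowest("héllo🙂"), StrKind::Ucs4(_)));
    }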

Many languages have had an undefined internal representation, and it’s fairly consistently caused them grief when they later want to change it, because people too often inadvertently come to depend on at least the performance characteristics of whatever the internal representation happens to be.

By comparison, Rust strings have been transparent UTF-8 from the start (admittedly with the benefit of starting later than most, by which time UTF-8 being the sane choice was clear), which appropriately guides people away from doing bad things by API. The exception is code-unit-wise indexing via string[i..j] and string.len(), which I’m not overly enamoured of: such indexing is essentially discontinuous in the presence of multibyte characters, panicking on a slice into the middle of a scalar value, and it’s too easy to mistake for code point or scalar value indexing. Still, you know what you’re dealing with, it’s roughly the simplest sane thing possible, and you can optimise for it. And Rust can’t change its string representation, because it’s public rather than an implementation detail.
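
A tiny illustrative example of that discontinuity (nothing beyond the standard library):

    fn main() {
        let s = "héllo";
        assert_eq!(s.len(), 6);           // UTF-8 code units (bytes), not characters
        assert_eq!(s.chars().count(), 5); // scalar values
        assert_eq!(&s[0..1], "h");        // fine: the range ends on a char boundary
        // Byte 2 falls in the middle of the two-byte 'é', so this would panic:
        // let _ = &s[1..2];
        // Going by scalar value means iterating instead:
        assert_eq!(s.chars().nth(1), Some('é'));
    }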



