
This is an unreasonable expectation to have for Unicode anyway.

1. Assume that some letter X has only a lower-case version. It's represented with two bytes in UTF-8.
2. A capitalized version is added much later.
3. There are no two-byte code points left.
4. So it has to use three bytes or more.
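This isn't hypothetical, either. If I recall correctly, U+023F (ȿ) predates its uppercase U+2C7E (Ȿ), which was added later and had to land in a higher block, so the case pair straddles the two-byte/three-byte UTF-8 boundary. A quick Python sketch:

```python
# U+023F "ȿ" encodes as two bytes in UTF-8; its uppercase
# counterpart U+2C7E "Ȿ" was added to Unicode later, in a
# different block, and needs three bytes.
lower = "\u023f"
upper = lower.upper()

print(hex(ord(upper)))             # 0x2c7e
print(len(lower.encode("utf-8")))  # 2
print(len(upper.encode("utf-8")))  # 3
```

So uppercasing a string can legitimately change its byte length.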

I see people jumping on the "oh, the wicked complexity" bandwagon here, but I don't see what the big deal is.



Presumably string length corresponds to something like “number of glyphs” rather than byte count.


Are æ, œ, and fi one character or two? They're one glyph, obviously, but when you ask 'how long is "fish"?', it would seem odd to answer "3".
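For what it's worth, here's how the fi ligature behaves at the code-point level (a Python sketch; "length" here means code points, not glyphs):

```python
import unicodedata

# "fish" written with U+FB01 LATIN SMALL LIGATURE FI
word = "\ufb01sh"
print(len(word))  # 3 code points

# NFKC compatibility normalization splits the ligature back out:
expanded = unicodedata.normalize("NFKC", word)
print(expanded, len(expanded))  # fish 4
```

So the same visible word can be 3 or 4 code points long depending on how it was typed.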


Interestingly, and I can't decide if I hate it or not, when searching for "æ" to get back to this comment after stupidly scrolling away from it, Chrome matched on "ae" in other comments, not just on the letter itself.

In Norwegian, "ae" is an accepted transliteration of "æ", but "æ" is very clearly considered its own character, and we'd expect the answer to how many characters "æ" is to be 1 (which creates fun sorting issues too: "ae" is expected to sort after "z"). Likewise, "aa" is a transliteration of "å", except when it isn't: outside of names it's always a transliteration, but within names all bets are off. Some names traditionally use "aa", some use "å", and for those, "aa" represents a transliteration; to make it worse, some names exist in both forms, e.g. Haakon vs. Håkon.

Now the interesting question to ask Norwegians is "how many characters are 'ae'?" On one hand the obvious answer is 2, but I might pause to think "hold on, you meant to write "æ" but transliterated it, and the answer should be 1". Except it might occur to me it's a trick question - but it could be a trick question expecting either answer. Argh.

[I just now realised a search for "å" matches both "a" and "å", but "aa" is treated as two matches on "a", and that I definitely hate, though I understand it's a usability issue that it matches on "a", and that matching on "aa" makes no sense if the matched term is in another language.]

[EDIT2: I've also done some genealogy recently, and to be honest, the spelling of priests makes it quite hard to hold on to any desire for precision in find/search]


To make things worse (or better), æ, œ, and ß have evolved from ligatures into letters in their own right (a choice that may depend on the language), while fi stayed a ligature. So 'fieß', if there were such a word in German, would count as 4 letters.

Also, talking about 'glyphs', one has to clearly distinguish between what Unicode encodes and what a given font uses; in the context of fonts, 'glyphs' are also called 'glyfs' or 'outlines', and any OTF font may choose to render any given string with any number of glyfs/outlines, including using a visual ligature to display 'f+i' = fi but doing so by nudging the outlines for f and i closely together.


I have no expectations anymore with Unicode, so I’m not surprised at all.


I would be surprised if most languages implement it that way.

Counting code points would be a start and would solve this particular problem. But that's not glyph count. You really need to count grapheme clusters. For example, two strings that are visually identical and contain "é" might have a different number of code points, since one might use the combining acute accent code point while the other might use the precomposed "e with acute accent".
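A quick Python sketch of the two visually identical spellings:

```python
import unicodedata

precomposed = "\u00e9"  # é as a single code point, U+00E9
combining = "e\u0301"   # e + U+0301 COMBINING ACUTE ACCENT

print(precomposed == combining)          # False: different code points
print(len(precomposed), len(combining))  # 1 2

# Normalizing both to NFC makes them compare equal again:
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
```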

Even a modern language like Rust doesn't have standard library support for counting grapheme clusters; you have to use a third-party crate.

And if you do count grapheme clusters then people will eventually complain about that as well (“TIL that getting the length of a String in X takes linear time”).



