
This is an unreasonable expectation to have for Unicode anyway.

1. Assume that some letter X has only a lower-case version. It's represented with two bytes in UTF-8.
2. A capitalized version is added much later.
3. There are no two-byte code points left.
4. So it has to use three bytes or more.
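This isn't hypothetical, either. If I recall correctly, U+023F (ȿ) predates its uppercase U+2C7E (Ȿ), which was added later and had to land in a higher block, so the case pair straddles the two-byte/three-byte UTF-8 boundary. A quick Python sketch:

```python
# U+023F "ȿ" encodes as two bytes in UTF-8; its uppercase
# counterpart U+2C7E "Ȿ" was added to Unicode later, in a
# different block, and needs three bytes.
lower = "\u023f"
upper = lower.upper()

print(hex(ord(upper)))             # 0x2c7e
print(len(lower.encode("utf-8")))  # 2
print(len(upper.encode("utf-8")))  # 3
```

So uppercasing a string can legitimately change its byte length.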

I see people jumping on the "oh, the wicked complexity" bandwagon here, but I don't see what the big deal is.



Presumably string length corresponds to something like “number of glyphs” rather than byte count.


Are æ, œ, and fi one character or two? They're one glyph, obviously, but when you ask 'how long is "fish"?', it would seem odd to answer "3".
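For what it's worth, here's how the fi ligature behaves at the code-point level (a Python sketch; "length" here means code points, not glyphs):

```python
import unicodedata

# "fish" written with U+FB01 LATIN SMALL LIGATURE FI
word = "\ufb01sh"
print(len(word))  # 3 code points

# NFKC compatibility normalization splits the ligature back out:
expanded = unicodedata.normalize("NFKC", word)
print(expanded, len(expanded))  # fish 4
```

So the same visible word can be 3 or 4 code points long depending on how it was typed.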


Interestingly, and I can't decide if I hate it or not, when searching for "æ" to get back to this comment after stupidly scrolling away from it, Chrome matched on "ae" in other comments, not just on the letter itself.

In Norwegian, "ae" is an accepted transliteration of "æ", but "æ" is very clearly considered its own character, and we'd expect the answer to how many characters "æ" is to be 1 (which creates fun sorting issues too: "ae" is expected to sort after "z"). Likewise, "aa" is a transliteration of "å", except when it isn't: outside of names it's always a transliteration, but within names all bets are off. Some names traditionally use "aa", some use "å", and for those, "aa" represents a transliteration; to make it worse, some names exist in both forms, e.g. Haakon vs. Håkon.

Now the interesting question to ask Norwegians is "how many characters are 'ae'?" On one hand the obvious answer is 2, but I might pause to think "hold on, you meant to write "æ" but transliterated it, and the answer should be 1". Except it might occur to me it's a trick question - but it could be a trick question expecting either answer. Argh.

[I just now realised a search for "å" matches both "a" and "å", but "aa" is treated as two matches on "a", and that I definitely hate, though I understand it's a usability issue that it matches on "a", and that matching on "aa" makes no sense if the matched term is in another language.]

[EDIT2: I've also done some genealogy recently, and to be honest, the spelling of priests makes it quite hard to hold on to any desire for precision in find/search]


To make things worse (or better), æ, œ, and ß have evolved from ligatures into letters in their own right (a choice that may depend on the language), while fi stayed a ligature. So 'fieß', if there were such a word in German, would count as 4 letters.

Also, talking about 'glyphs', one has to clearly distinguish between what Unicode encodes and what a given font uses; in the context of fonts, 'glyphs' are also called 'glyfs' or 'outlines', and any OTF font may choose to render any given string with any number of glyfs/outlines, including using a visual ligature to display 'f+i' = fi but doing so by nudging the outlines for f and i closely together.


I have no expectations anymore with Unicode, so I’m not surprised at all.


I would be surprised if most languages implement it that way.

Counting code points would be a start and would solve this particular problem. But that's not glyph count. You really need to count grapheme clusters. For example, two strings that are visually identical and contain "é" might have a different number of code points, since one might use the combining acute accent code point while the other might use the precomposed "e with acute accent".
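A quick Python sketch of the two visually identical spellings:

```python
import unicodedata

precomposed = "\u00e9"  # é as a single code point, U+00E9
combining = "e\u0301"   # e + U+0301 COMBINING ACUTE ACCENT

print(precomposed == combining)          # False: different code points
print(len(precomposed), len(combining))  # 1 2

# Normalizing both to NFC makes them compare equal again:
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
```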

Even a modern language like Rust doesn't have standard library support for counting grapheme clusters; you have to use a third-party crate.

And if you do count grapheme clusters then people will eventually complain about that as well (“TIL that getting the length of a String in X takes linear time”).



