
> So you mean grapheme clusters or code points?

I mean character count. The unicode standard defines that as separate from and meaningfully different than grapheme clusters or codepoints.

The confusion you're relying on isn't real.

.

> Do you want to count zero-width characters?

Are they characters? Yes? Then I want to count them.

.

> More specifically what are you trying to do?

(narrows eyes)

I want to count characters. You're trying to make that sound confusing, but it really isn't.

I don't care if you think it's a "zero width" character. Zero width space is often not actually zero width in programmers' fonts, and almost every font has at least a dozen of these wrong.

I don't care about whatever other special moves you think you have, either.

This is actually very simple.

The unicode standard has something called a "character count." It is more work than the grapheme cluster count. The grapheme cluster count doesn't honor removals, replacements, substitutions, and it does something different in case folding.

The people trying super hard to show how technically apt they are at the difficulties in Unicode are just repeating things they've heard other people say.

The actual unicode standard makes this straightforward, and has these two terms separated already.

If you genuinely want to understand this, read Unicode 14 chapter 2, "general structure." It's about 50 pages. You can probably get away with just reading 2.2.3.

It is three pages long.

It's called "characters, not glyphs," because even the authors of the Unicode standard want people to stop pretending that "character" means anything other than a fully aggregated series.

The word "character" is well defined in Unicode. If you think it means anything other than point 2, you are simply incorrect, not plying a deep understanding of character encoding topics.

All you need is on pages 15 and 16. Go ahead.

.

I want every choice to be made in accord with the Unicode standard. Every technicality you guys are trying to bring up was handled 20 years ago.

These words are actually well defined in the context of Unicode, and they're non-confusing in any other context. If you struggle with this, it is by choice.

Size means byte count. Length means character count. No, it doesn't matter if you incorrectly pull technical terminology like "grapheme clusters" and "code points," because I don't mean either of those. I mean character count.

If you have the sequence `capital a`, `zero width joiner`, `joining umlaut`, `special character`, `emoji face`, `skin color modifier`, you have:

1. Six codepoints

2. Three grapheme clusters

3. Four characters
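For anyone who wants to check the two counts nobody here disputes, this is a minimal Python sketch using only built-ins. The concrete code points are assumptions standing in for the loosely named items above (a snowman for the "special character", a waving hand for the emoji), so treat them as hypothetical. It does not compute grapheme clusters, which would need a UAX #29 segmentation implementation, and it does not attempt the character count described here.

```python
# Hypothetical stand-ins for the sequence described above; the exact
# code points are assumptions chosen for illustration.
sequence = [
    ("capital a", "\u0041"),            # LATIN CAPITAL LETTER A
    ("zero width joiner", "\u200d"),    # ZERO WIDTH JOINER
    ("joining umlaut", "\u0308"),       # COMBINING DIAERESIS
    ("special character", "\u2603"),    # SNOWMAN (assumed)
    ("emoji", "\U0001F44B"),            # WAVING HAND SIGN (assumed)
    ("skin color modifier", "\U0001F3FD"),  # EMOJI MODIFIER FITZPATRICK TYPE-4
]

text = "".join(ch for _, ch in sequence)

# Python's len() on a str counts code points.
code_points = len(text)

# "Size" depends on the encoding that is chosen.
utf8_bytes = len(text.encode("utf-8"))
utf16_units = len(text.encode("utf-16-le")) // 2

print(code_points)   # 6
print(utf8_bytes)    # 17
print(utf16_units)   # 8
```

The code point count matches item 1; the byte size changes with the encoding, which is why "size" and "length" are worth keeping apart.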

Please wait until you can explain why before continuing to attempt to teach technicalities, friend. It doesn't work the way you claim.

This might help.

https://www.unicode.org/versions/Unicode14.0.0/UnicodeStanda...

Here, let's save you some time on some other technicalities that aren't.

1. If you write a unicode modifying character after something that cannot be modified - like adding an umlaut to a zero width joiner - then the umlaut will be considered a full character, but also discarded. Should it be counted? According to the Unicode standard, yes.

2. If you write a zero width anything, and it's a character, should it be counted? Yes.

3. Suppose you write something that is a grapheme cluster, but your font or renderer doesn't support it, so it renders as two characters. (For example, they added couple emoji in Unicode 10 and gay couples in Unicode 11, so phones that were released in between would render a two-man couple as two individual men.) Should that be counted as a single character, as it's written, or a double character, as it's rendered? Single.

4. If you're in a font that includes language variants - for example, Swiss German fancy fonts sometimes separate the Eszett (ß) into two distinct S characters that are flowed separately - should that be counted as one character or two? Two, it turns out.

5. If a character pair is kerned into a single symbol, like an English cursive capital F to lower case E, should that be counted as one character or two? Two.
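A couple of these can at least be poked at from the code point side. A minimal Python sketch, standard library only; the sample strings are illustrative picks rather than anything from the standard, and `len()` here counts code points, which is not necessarily the same thing as the count being argued about. The Eszett line only shows what case folding does to it, which is a different mechanism from the font behavior in point 4.

```python
# Sample strings are illustrative picks, not taken from the standard.
orphan_umlaut = "\u200d\u0308"   # ZERO WIDTH JOINER + COMBINING DIAERESIS
with_zwsp = "a\u200bb"           # 'a', ZERO WIDTH SPACE, 'b'
eszett_word = "Stra\u00dfe"      # "Straße"

# A combining mark with nothing sensible to attach to is still a code point.
orphan_len = len(orphan_umlaut)

# Zero-width code points are counted like any other.
zwsp_len = len(with_zwsp)

# Case folding maps the Eszett to "ss", so the count grows by one.
folded = eszett_word.casefold()

print(orphan_len)                     # 2
print(zwsp_len)                       # 3
print(len(eszett_word), len(folded))  # 6 7
print(folded)                         # strasse
```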

There's a whole chapter of these. If you really enjoy going through them, stop quizzing me and just read it.

These questions have been answered for literal decades. Just go read the standard already.



You just said there was one canonical length for strings then gave me 4: bytes, code points, grapheme clusters and characters, each of which applies in a different context.

I'm not 100% sure what context characters even apply in on a computer, other than for interest's sake.

Invisible/zero-width characters are not interesting when editing text, and character count doesn't correlate with size, therefore there's no canonical length.


> You just said there was one canonical length for strings then gave me 4: bytes, code points, grapheme clusters and characters, each of which applies in a different context.

The other things have different labels than "length," which you've already been told.

.

> Invisible/zero-width characters are not interesting when editing text

Well, tell Unicode they're wrong, then, I guess.

Have a good day.


To be clear I appreciate you sharing the official characters concept definition, and I do think it's valuable.


  > These words are actually well defined in the context of Unicode, and
  > they're non-confusing in any other context.
John, you've set right so many misconceptions here, and taught much. I appreciate that. However, unfortunately, this sentence I must disagree with. Software developers are not reading the spec, and the terms therefore are confusing them.

As with security issues, I see no simple solution to get developers to familiarize themselves with even the basics before writing code, nor a way to get PMs to vet and hire such developers. Until the software development profession becomes accredited like engineering or law, Unicode and security issues (and accessibility, and robustness, and data loss, and performance, and maintainability issues) will continue to plague the industry.


Something isn't confusing merely because people fail to try.

It's pretty easy to swap your spark plugs but most people never learn how. That doesn't mean it's secretly hard, though.

Any half-competent programmer who sat down and tried to learn it would succeed immediately.


Just wanted to say thnx.

This was informative.


No problem, boblem.



