The crazy world of stripping diacritics (msdn.com)
17 points by gus_massa on Nov 26, 2014 | 9 comments


The article mentions one rationale for stripping diacritics and I won't deny there are others. That being said:

Stripping diacritics from text will annoy people whose language uses those diacritics. For us the difference between an e, an è and an é is significant. A ü is something entirely different from a u. Calling me Muller when my name is Müller is like calling someone Jan whose name is Jon.

Just as the article says: "But then again, removing diacritics is already linguistically nonsensical. Nonsensical operation is nonsensical."


It's kind of strange how a huge number of monolingual anglophones don't seem to grasp this at all. Witness nearly every English newspaper talking about German football players who are apparently called Muller, Ozil, etc. Ü and ö are actually different vowels from u and o, not just decoration.

Mispronunciations are understandable, but this is as simple as copy/paste.


For some languages (French) it can make sense to transliterate when sorting, while it would be a terrible mistake in others (Danish).


Mm, it actually never makes sense to transliterate for sorting. Sorting should be based on collation rules.

In French, sorting takes diacritics into account, and how it does so additionally depends on whether you're doing French French or Canadian French:

French:

cote, côte, coté, côté

Canadian:

cote, coté, côte, côté

http://userguide.icu-project.org/collation/concepts
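To make the difference between those two orders concrete, here is a rough Python sketch. It is not ICU and not a real collator: it only compares base letters first and accents second, and the hypothetical `backwards_accents` flag decides whether accent differences are weighed from the start or from the end of the word. The names `sort_key` and `_split` are made up for illustration.

```python
import unicodedata

def _split(word):
    """Return (base letters, list of accent marks per base letter)."""
    bases, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and accents:
            accents[-1] += ch          # attach the mark to the preceding letter
        else:
            bases.append(ch.lower())
            accents.append("")
    return "".join(bases), accents

def sort_key(word, backwards_accents=False):
    """Primary level: base letters. Secondary level: accents,
    optionally compared from the end of the word."""
    base, accents = _split(word)
    if backwards_accents:
        accents = accents[::-1]
    return (base, accents)

words = ["côté", "coté", "côte", "cote"]
forward = sorted(words, key=sort_key)
backward = sorted(words, key=lambda w: sort_key(w, backwards_accents=True))
print(forward)   # accents compared left to right
print(backward)  # accents compared right to left
```

Left-to-right accent comparison gives the second list quoted above, right-to-left gives the first. For real data you would want an actual collation library rather than this toy key.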


If you needed to do this in PHP (or any other language with the ICU transliterator):

    transliterator_transliterate('Any-Latin; Latin-ASCII', 'Input string');

It's not exactly the same thing - it will convert letters into ASCII characters that sort of sound right, not just strip diacritics.

It's possible to simply strip diacritics too, probably something like:

    'NFD; [:Nonspacing Mark:] Remove; NFC'
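If you don't have ICU handy, roughly the same decompose-then-drop-the-marks idea can be sketched with Python's standard `unicodedata` module. The function name `strip_diacritics` is just an illustration, and this only removes combining marks; letters such as ø or ß that have no decomposition are left alone.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose to NFD, drop combining marks, recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)

print(strip_diacritics("Müller"))
print(strip_diacritics("país"))
```

Note that this is exactly the lossy operation the rest of the thread complains about, so it is best kept to internal matching rather than anything shown to users.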


This is a pretty bad idea (and English-centric). Stripping diacritics can fundamentally alter a word's meaning.

A small example (Portuguese):

  país  ("country")
  pais  ("fathers")


The goal of the exercise is to use the stripped text to check for spam. That doesn't mean that the stripped text is what ends up in the user's inbox. The idea is to see if the text contains things like VᎥÄgԻa, not whether a given word means "country" or "father". This would only be a problem if some set of characters that looked like "viagra" was actually a valid non-spammy word in some language.
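A minimal sketch of that kind of check in Python, assuming a tiny hand-made lookalike table (the two entries in `LOOKALIKES` are only there to cover this one example; a real filter would need a much larger mapping). The folded string is used only for matching and is never shown to anyone.

```python
import unicodedata

# Hypothetical, deliberately tiny lookalike table for this example only.
LOOKALIKES = str.maketrans({
    "Ꭵ": "i",   # Cherokee letter that resembles a Latin i
    "Ի": "r",   # Armenian capital letter that resembles a Latin r
})

def fold_for_matching(text):
    """Strip combining marks, map known lookalikes, then lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Map lookalikes before lowercasing so the table keys still match.
    return no_marks.translate(LOOKALIKES).lower()

def contains_term(text, term):
    """True if the folded text contains the (plain ASCII) term."""
    return term in fold_for_matching(text)

print(fold_for_matching("VᎥÄgԻa"))
```

With this, `contains_term("Buy VᎥÄgԻa now", "viagra")` is true, while the original text stays untouched for delivery.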

There was also an article here a few weeks back about Russian government officials securing fat contracts for their friends in private industry by intentionally replacing one or two letters in the common search terms of their bid requests with Latin lookalike characters. This was impossible to detect while reading the bid request, but also made it impossible for other contractors to find the bid request through the website. As a result only their buddy would submit a bid, and at much higher than market rate. Something similar to this, but converting down to Cyrillic characters instead of ASCII, could be used to check for hinky bid requests on upload.
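A rough sketch of what such an upload check could look like in Python. The `LATIN_TO_CYRILLIC` table is a small assumed set of lowercase lookalikes, not a complete list, and `suspicious_words` is a made-up name: it flags any word that contains Cyrillic letters but changes when Latin lookalikes are folded to Cyrillic.

```python
# Assumed, incomplete table: lowercase Latin letters that look like Cyrillic ones.
LATIN_TO_CYRILLIC = str.maketrans({
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
})

def has_cyrillic(word):
    """True if any character is in the basic Cyrillic block."""
    return any("\u0400" <= ch <= "\u04ff" for ch in word)

def suspicious_words(text):
    """Words that mix Cyrillic with Latin lookalike letters."""
    flagged = []
    for word in text.split():
        if has_cyrillic(word) and word.translate(LATIN_TO_CYRILLIC) != word:
            flagged.append(word)
    return flagged

# Second letter is a Latin "a" hidden inside an otherwise Cyrillic word.
sample = "\u0437a\u043a\u0443\u043f\u043a\u0430 tender"
print(suspicious_words(sample))
```

Pure Latin words like "tender" are left alone, since only words that already contain Cyrillic are considered.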


Or in Spanish:

Trabajé ("I worked.")

Trabaje ("Work!" (formal command))


I'll just leave this here... https://pypi.python.org/pypi/Unidecode



