The crazy world of stripping diacritics

weinzierl · on Nov 26, 2014

The article mentions one rationale for stripping diacritics and I won't deny there are others. That being said:

Stripping diacritics from text will annoy people whose language uses those diacritics. For us, the difference between an e, an è and an é is significant. An ü is something entirely different from a u. Calling me Muller when my name is Müller is like calling someone Jan whose name is Jon.

Just as the article says: "But then again, removing diacritics is already linguistically nonsensical. Nonsensical operation is nonsensical."

TillE · on Nov 26, 2014

It's kind of strange how a huge number of monolingual anglophones don't seem to grasp this at all. Witness nearly every English newspaper talking about German football players who are apparently called Muller, Ozil, etc. They are actually different vowels, not just decoration.

Mispronunciations are understandable, but this is as simple as copy/paste.

mercurial · on Nov 26, 2014

For some languages (French) it can make sense to transliterate when sorting, while it would be a terrible mistake in others (Danish).

ddebernardy · on Nov 26, 2014

Mm, it actually never makes sense to transliterate for sorting. Sorting should be based on collation rules.

In French, sorting takes diacritics into account, and how it does so additionally depends on whether you're doing French French or Canadian French:

French:

cote, côte, coté, côté

Canadian:

cote, coté, côte, côté

http://userguide.icu-project.org/collation/concepts
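
To make the difference between those two orderings concrete, here is a toy sketch in Python. It is not real collation (use ICU for that); it only assumes that base letters are compared first and that accents act as a tie-breaker, read either left to right or right to left. The function name `accent_sort_key` and the `backwards` flag are made up for this illustration.

```python
import unicodedata

def accent_sort_key(word, backwards=False):
    """Toy two-level key: base letters first, accents only as a tie-breaker."""
    base_letters = []
    accents = []  # one tuple of combining-mark code points per base letter
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if accents:  # attach the mark to the preceding base letter
                accents[-1] += (ord(ch),)
        else:
            base_letters.append(ch.lower())
            accents.append(())
    if backwards:
        # Compare accent differences starting from the end of the word.
        accents.reverse()
    return ("".join(base_letters), tuple(accents))

words = ["côté", "coté", "cote", "côte"]
forward = sorted(words, key=accent_sort_key)
backward = sorted(words, key=lambda w: accent_sort_key(w, backwards=True))
print(forward)
print(backward)
```

If you strip the accents before sorting, all four words collapse to the same key and neither ordering can be recovered, which is why this belongs in the collation step and not in a transliteration step.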

ars · on Nov 26, 2014

If you needed to do this in PHP (or any other language with the ICU transliterator):

    transliterator_transliterate('Any-Latin; Latin-ASCII', 'Input string');

It's not exactly the same thing - it will convert letters into ASCII characters that sort of sound right, not just strip diacritics.

It's possible to simply strip diacritics too, probably something like:

    'NFD; [:Nonspacing Mark:] Remove; NFC'
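
For anyone without ICU at hand, roughly the same mark-stripping idea can be sketched with Python's standard `unicodedata` module: decompose, drop the nonspacing marks (general category `Mn`), recompose. The helper name `strip_marks` is my own; treat this as a sketch, not a drop-in equivalent of the ICU transform.

```python
import unicodedata

def strip_marks(text):
    """Remove combining diacritical marks, keeping the base letters."""
    # NFD splits e.g. "ü" into "u" + U+0308 COMBINING DIAERESIS.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every nonspacing mark (general category Mn).
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left.
    return unicodedata.normalize("NFC", kept)

print(strip_marks("Müller"))
print(strip_marks("país"))
```

Letters that have no canonical decomposition, such as ø, pass through unchanged; mapping those to ASCII is what a transliteration like Latin-ASCII is for.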

_ondq · on Nov 26, 2014

This is a pretty bad idea (and English-centric). Stripping diacritics can fundamentally alter a word's meaning.

A small example (Portuguese):

  país  ("country")
  pais  ("fathers")

itsybitsycoder · on Nov 27, 2014

The goal of the exercise is to use the stripped text to check for spam. That doesn't mean that the stripped text is what ends up in the user's inbox. The idea is to see if the text contains things like ＶᎥÄｇԻａ, not whether a given word means "country" or "father". This would only be a problem if some set of characters that looked like "viagra" was actually a valid non-spammy word in some language.
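
A minimal sketch of that kind of check in Python, under some loud assumptions: compatibility normalization (NFKD) handles the fullwidth letters, stripping nonspacing marks handles the Ä, and the two remaining lookalikes go through a tiny hand-written table. That table and the names `LOOKALIKES`, `fold_for_matching` and `looks_like` are hypothetical; a real filter would want a proper confusables list.

```python
import unicodedata

# Hypothetical, hand-picked lookalike table, for illustration only.
LOOKALIKES = str.maketrans({
    "\u13A5": "i",  # CHEROKEE LETTER V, resembles a Latin "i"
    "\u053B": "r",  # ARMENIAN CAPITAL LETTER INI, resembles a Latin "r"
})

def fold_for_matching(text):
    """Fold text to a rough ASCII skeleton, used only for matching."""
    # NFKD turns fullwidth letters into plain ones and splits off accents.
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Map lookalikes before lowercasing, so the table keys still match.
    return no_marks.translate(LOOKALIKES).lower()

def looks_like(text, needle):
    """True if the folded text contains the (already lowercase) needle."""
    return needle in fold_for_matching(text)

disguised = "\uFF36\u13A5\u00C4\uFF47\u053B\uFF41"
print(fold_for_matching(disguised))
```

The folded string is only ever compared against the blocklist; the original message is what gets delivered.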

There was also an article here a few weeks back about Russian government officials securing fat contracts for their friends in private industry by intentionally replacing one or two letters in the common search terms of their bid requests with Latin lookalike characters. This was impossible to detect while reading the bid request, but also made it impossible for other contractors to find the bid request through the website. As a result only their buddy would submit a bid, and at much higher than market rate. Something similar to this, but converting down to Cyrillic characters instead of ASCII, could be used to check for hinky bid requests on upload.
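
One cheap way to flag that trick on upload is to look for single words that mix scripts. The sketch below is a heuristic, not a real implementation: Python's standard library does not expose the Unicode Script property, so it assumes the first word of each character's Unicode name ("LATIN", "CYRILLIC", ...) is a good enough stand-in. The function names are made up for the example.

```python
import unicodedata

def scripts_in(word):
    """Rough set of scripts in a word, taken from Unicode character names."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # e.g. "CYRILLIC SMALL LETTER O" -> "CYRILLIC"
                scripts.add(name.split(" ")[0])
    return scripts

def mixed_script_words(text):
    """Return the whitespace-separated words that mix more than one script."""
    return [word for word in text.split() if len(scripts_in(word)) > 1]

clean = "\u043a\u043e\u043d\u0442\u0440\u0430\u043a\u0442"  # all Cyrillic
spoofed = "\u043ao\u043d\u0442\u0440\u0430\u043a\u0442"     # Latin "o" swapped in
print(mixed_script_words(clean + " " + spoofed))
```

Legitimate text does mix scripts sometimes (brand names, part numbers), so a hit here is a reason to look closer, not proof of anything.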

diacritics · on Nov 26, 2014

Or in Spanish:

Trabajé ("I worked.")

Trabaje ("Work!" (formal command; also the present subjunctive))

BerislavLopac · on Nov 27, 2014

I'll just leave this here... https://pypi.python.org/pypi/Unidecode