Numeronyms

Lately, one of the strange concepts that interest me is that of numeronyms: those funny shortenings of long words in which the middle letters are replaced with a count of those letters. For example, "example" would become "e5e".

My first exposure to them came while I worked on the localization of the now-defunct Nike+ Connect desktop app, which supported 13 different languages. I had a lot of fun on that project and learned a lot about world languages and orthography, and about the technologies behind their computer representation (Unicode, other historical text encodings, input methods, GNU gettext, C locales, etc.).

I also learned that the words "localization" and "internationalization" are too long for many people to write out in full. They are commonly shortened to "l10n" and "i18n", respectively. I understand that those are long English words, but I found the practice to be strangely Anglo-centric and therefore a little arrogant.

The point of localization is to enable users to interact with applications in their native language, and that requires us to realize that not everyone speaks the same language we do. Proper software localization is a difficult and somewhat tedious task that could be seen as boring, especially by monoglots. But I see it as a sign of respect. Shortening "localization" and "internationalization" to numeronyms seems to belittle that important task. But perhaps this is just my own opinion, and a polyglot would be just as apt to shorten them.

But there are other problems with numeronyms.

Language understanding has a large statistical component. The expansion of "l10n" requires an in-depth knowledge of English (assuming that the numeronym shortens an English word in the first place) and what words might possibly fit, which of course depends on the context. For example, "l10n" can be expanded to the following English words:

legalization
liquefaction
localization
longshoreman
longshoremen

Depending on who you are, any one of these expansions might be more important than the others.

To help illustrate this point I wrote the following bash script around a relatively simple grep search through /usr/share/dict/words:

~/scripts $ cat ./annoyances.sh
#!/bin/bash

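# Parse the argument as first letter, count, last letter (e.g. k8s).
if [[ $1 =~ ^([a-z])([0-9]+)([a-z])$ ]] ; then
  # List dictionary words with exactly that many non-apostrophe characters in between.
  grep "^${BASH_REMATCH[1]}[^']\{${BASH_REMATCH[2]}\}${BASH_REMATCH[3]}$" /usr/share/dict/words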
else
  echo Poor form.
fi

Prior to this I didn't know that bash had a =~ operator similar to Perl's. I use it here to parse the input as first letter, count, and last letter. Unfortunately, the BASH_REMATCH array variable it stores its results in can't be easily renamed, so the one-line grep command is a little uglier than it could be. It simply matches dictionary words that have the given first and last letters with the given number of non-apostrophe characters between them.
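One way to tame the BASH_REMATCH noise is to copy the captures into named variables before building the pattern. The following is only a sketch of that idea, not the script above: the function name expand_numeronym, the optional second argument for the dictionary path, and the switch to stderr with a non-zero return on bad input are all assumptions made for illustration.

```bash
#!/bin/bash
# Sketch only: expand_numeronym is a hypothetical name, and the optional
# second argument (dictionary path) is an assumption for illustration.
expand_numeronym() {
  local dict=${2:-/usr/share/dict/words}
  if [[ $1 =~ ^([a-z])([0-9]+)([a-z])$ ]]; then
    # Copy the captures right away; the next =~ would overwrite BASH_REMATCH.
    local first=${BASH_REMATCH[1]} count=${BASH_REMATCH[2]} last=${BASH_REMATCH[3]}
    # Same grep as before: first letter, exactly $count non-apostrophe
    # characters, last letter.
    grep "^${first}[^']\{${count}\}${last}\$" "$dict"
  else
    echo "Poor form." >&2
    return 1
  fi
}
```

Calling expand_numeronym k8s should then list the same words as ./annoyances.sh k8s, assuming the same dictionary file.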

Let's run it on "k8s", another recent numeronym I've seen:

$ ./annoyances.sh k8s
kerchieves
keypunches
keystrokes
kickstands
kidnappers
kilocycles
kilometers
kindliness
kindnesses
kinematics
kohlrabies

Hmm, it doesn't seem like the intended expansion is in here. That's because the intended word is "kubernetes", which comes from the Greek for "helmsman" and is not an English dictionary word at all.

So in a way, I realize that a numeronym used often enough and with enough context might develop a well-understood meaning across languages. Acronyms, of course, have similar problems and possibilities.

Then again, you might wrongly assume that the "RUN K8S" shirts out there mean "Run Kilometers". And if you wanted to shorten the name of my script to "a8s" you'd have 171 possible expansions (at least with my system's dictionary).
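Counts like that can be reproduced with grep -c, which prints the number of matching lines instead of the lines themselves. The sketch below is an illustration under assumptions: count_expansions is a made-up name, the dictionary path is an optional second argument, and the result depends entirely on which word list is installed.

```bash
#!/bin/bash
# Sketch only: count_expansions is a hypothetical helper, not part of the
# original script. Usage: count_expansions NUMERONYM [DICTIONARY]
count_expansions() {
  local dict=${2:-/usr/share/dict/words} n
  [[ $1 =~ ^([a-z])([0-9]+)([a-z])$ ]] || return 1
  # grep -c exits with status 1 when the count is 0, so only treat the
  # call as failed when no count was produced (e.g. missing dictionary).
  n=$(grep -c "^${BASH_REMATCH[1]}[^']\{${BASH_REMATCH[2]}\}${BASH_REMATCH[3]}\$" "$dict") \
    || [[ $n == 0 ]] || return 2
  printf '%s\n' "$n"
}
```

Running it over a handful of numeronyms, for example with for n in a8s k8s l10n i18n; do count_expansions "$n"; done, gives a rough feel for how ambiguous each one is on a given system.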

To explore those problems and possibilities further, I wrote the inverse of the above script. This one shortens each group of letters into a numeronym:

$ cat numeronym.awk
#!/usr/bin/awk -f

BEGIN { savings = 0 }

{
  nwords = split($0, a, /[^[:alpha:]]+/, seps)
  for (i=1; i<=nwords; i++) {
    if(match(a[i], /^([[:alpha:]])([[:alpha:]]{2,})([[:alpha:]])$/, b)) {
      printf "%s", b[1]length(b[2])b[3]
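      # Counts the number as one character, so a two-digit count
      # (e.g. l10n) is credited with one more saved character than it really saves.
      savings += length(b[2]) - 1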
    }
    else {
      printf "%s", a[i]
    }

    printf "%s", seps[i]
  }
  printf "\n"
}

END {
  printf "You saved %d chars today!\n", savings
}

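The only real complication here is that awk discards the field separators when it splits a line into fields, so we're forced to split() each line ourselves in order to print both the (modified) fields and the separators. Note that the fourth argument to split() and the third argument to match() are gawk extensions, so the script needs awk to be gawk.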
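For comparison, the same keep-the-separators idea can be sketched in bash, the language of the first script. Instead of split(), this hypothetical shorten_line function peels off one run of non-letters and one run of letters at a time with =~, shortening any word of four or more letters. It is an illustration rather than a replacement for the awk script, and it does not track savings.

```bash
#!/bin/bash
# Sketch only: shorten_line is a hypothetical bash counterpart to the awk
# script's main block. It rewrites one line and keeps every separator.
shorten_line() {
  local rest=$1 out='' word
  # Group 1: leading non-letters, group 2: the next word, group 3: the rest.
  local re='^([^[:alpha:]]*)([[:alpha:]]+)(.*)$'
  while [[ $rest =~ $re ]]; do
    out+=${BASH_REMATCH[1]}
    word=${BASH_REMATCH[2]}
    rest=${BASH_REMATCH[3]}
    if (( ${#word} >= 4 )); then
      # First letter, count of the inner letters, last letter.
      out+=${word:0:1}$(( ${#word} - 2 ))${word: -1}
    else
      out+=$word
    fi
  done
  # Whatever is left contains no letters; print it unchanged.
  printf '%s\n' "$out$rest"
}
```

For instance, shorten_line 'For example' prints "For e5e", with the space preserved.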

Changing the "at least 2" clause from {2,} to {0,} (aka *) leads to the seemingly valid reduction of "it" to "i0t", but since that adds a character I haven't done that here. Likewise, replacing a single letter with the number "1" saves nothing and seems less than helpful.
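The character arithmetic behind that choice can be checked with a small bash sketch. The function name numeronym_savings is made up for illustration, and it assumes a word of at least two plain letters; it builds the numeronym and reports how many characters it saves, with a negative result meaning the "shortening" got longer.

```bash
#!/bin/bash
# Sketch only: numeronym_savings is a hypothetical helper.
# Prints (length of word) - (length of its numeronym).
numeronym_savings() {
  local word=$1 short
  (( ${#word} >= 2 )) || return 1
  # First letter, count of the inner letters, last letter.
  short=${word:0:1}$(( ${#word} - 2 ))${word: -1}
  echo $(( ${#word} - ${#short} ))
}
```

Here numeronym_savings it prints -1, numeronym_savings cat prints 0, numeronym_savings example prints 4, and numeronym_savings localization prints 8, since the two-digit count in "l10n" takes up two characters.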

There is a popular meme showing that words with jumbled inner letters (where, as in numeronyms, the first and last letters stay in place) can be fairly easy to recognize. Despite the parallel, numeronyms do not seem easy to read.

To provide an example of the resulting reading difficulty, here is the preface to Tom Sawyer:

P5E

M2t of the a8s r6d in t2s b2k r4y o6d; one or two w2e e9s of my own, the r2t t3e of b2s who w2e s9s of m2e. H2k F2n is d3n f2m l2e; Tom S4r a2o, but not f2m an i8l–he is a c9n of the c13s of t3e b2s w2m I k2w, and t7e b5s to the c7e o3r of a10e.

The odd s11s t5d u2n w2e all p7t a3g c6n and s4s in the W2t at the p4d of t2s s3y–t2t is to say, t4y or f3y y3s ago.

A6h my b2k is i6d m4y for the e11t of b2s and g3s, I h2e it w2l not be s5d by men and w3n on t2t a5t, for p2t of my p2n has b2n to try to p8y r4d a4s of w2t t2y o2e w2e t8s, and of how t2y f2t and t5t and t4d, and w2t q3r e9s t2y s7s e5d in.

THE A4R.

H6D, 1876.

You saved 273 chars today!

Now go forth and save the world from the hassle of reading all those extraneous letters!