Finding Anagrams with Persistent Memory Gawk (pm-gawk)

This was a fun diversion while I was in a hotel room for a work trip. I realized that I could efficiently determine if two words were anagrams if I assigned each letter a prime number value and treated each word as a prime factorization of an integer. The commutative property of multiplication means that every word using the same letters (the anagrams) will multiply to the same number, and that number will be unique because each letter value is prime.

While I was prototyping it with Awk I remembed about the Persistent Memory feature in Gawk which could be used to cache the precomputed table of integer values for an entire dictionary in a hash table. Subsequent calls to the script would use the cached hash table for efficient lookups. This was perfect, because I've been looking for a fun practice project for this feature since I first read about it in ACM's Queue last year.

First I counted the letter frequencies in /usr/share/dict/word, and then assigned the lowest prime numbers to the most frequently occuring letters. This isn't necessary, but was fun and easy and minimizes the numeric value of each word.

Here are the assigned letter values:

BEGIN {
    letter_val["e"] = 2
    letter_val["s"] = 3
    letter_val["i"] = 5
    letter_val["a"] = 7
    letter_val["r"] = 11
    letter_val["n"] = 13
    letter_val["t"] = 17
    letter_val["o"] = 19
    letter_val["l"] = 23
    letter_val["c"] = 29
    letter_val["d"] = 31
    letter_val["u"] = 37
    letter_val["g"] = 41
    letter_val["p"] = 43
    letter_val["m"] = 47
    letter_val["h"] = 53
    letter_val["b"] = 59
    letter_val["y"] = 61
    letter_val["f"] = 67
    letter_val["v"] = 71
    letter_val["k"] = 73
    letter_val["w"] = 79
    letter_val["z"] = 83
    letter_val["x"] = 89
    letter_val["j"] = 97
    letter_val["q"] = 101
}

An example:

\[t \cdot a \cdot x \cdot e \cdot s = t \cdot e \cdot x \cdot a \cdot s =\] \[17 \cdot 7 \cdot 89 \cdot 2 \cdot 3 = 17 \cdot 2 \cdot 89 \cdot 7 \cdot 3 = 63546\]

Here's the code that builds the lookup table. The persistent memory allocator (PMA) will automatically save both the hash table words and the function word_val() to a file when the GAWK_PERSIST_FILE environment variable is set to a file name. Generating this table takes about 0.54s on my machine.

# Get the prime values for each letter.
@include "letter_values.awk"

# Multiply the prime values for each letter in a word.
function word_val(word) {
    len = split(word, letters, "")
    val = 1
    for (i=1; i <= len; i++) {
        val *= letter_val[letters[i]]
    }
    return val
}

# Store the word values for each input word in a hash table.
{
    val = word_val($0)
    words[val] = words[val] ? words[val] " " $0 : $0
    print val, words[val]
}

And here's the code that performs the lookup. It will automatically have access to the words hash table and word_val() function when GAWK_PERSIST_FILE is set. Therefore the lookup is a single line of code. Running this takes about 0.005 seconds on my machine. Storing the pre-computed hash table in the PMA file saves 0.54s on each run.

# Look up anagrams for words based on their prime factorization value.
# This requires a .pma file with words and word_val defined.
{ print $0 " -> " words[word_val($0)] }

One usage warning: if you run the script interactively (i.e. ./anagram.awk) then you should close it nicely with ctrl-d (EOF). If you interrupt it with ctrl-c it will corrupt the PMA heap and the next run will segfault. You have to delete the PMA heap file and rebuild the lookup table after that.

Here's a tarball with the source and a Makefile:

anagrams.tar.gz