Running Stats with Awk

Here's some Awk code to implement the Welford algorithm as described by John D. Cook. He cites Knuth's TAOCP Vol. 2, 3rd ed., p. 232, which is recommended reading. I added a root-mean-square (RMS) calculation as well, since that is useful for electrical engineering work.

stats.awk:

#!/usr/bin/awk -f

# Copyright (C) 2024 Remington Furman
#
# Permission is hereby granted, free of charge, to any person
# obtaining a copy of this software and associated documentation files
# (the "Software"), to deal in the Software without restriction,
# including without limitation the rights to use, copy, modify, merge,
# publish, distribute, sublicense, and/or sell copies of the Software,
# and to permit persons to whom the Software is furnished to do so,
# subject to the following conditions:
#
# The above copyright notice and this permission notice shall be
# included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
# BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
# ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

# This prints running statistics for a single input variable.
#
# It implements the Welford algorithm as described by:
# https://www.johndcook.com/blog/standard_deviation/
#
# It also calculates the root-mean-square (RMS) of the input.

BEGIN {OFS=", ";        print  "n, x, sum, mean, var, stddev, rms"}
function print_stats() {print NR, $1, sum, currM, var, sqrt(var), sqrt(sum2/NR)}

NR==1 {
    sum = prevM = currM = $1
    sum2 = $1 * $1
    var = prevS = 0.0
    print_stats()
}

NR!=1 {
    sum  += $1
    sum2 += $1 * $1
    currM = prevM + ($1 - prevM)/NR
    currS = prevS + ($1 - prevM)*($1 - currM)
    prevM = currM
    prevS = currS
    var   = currS/(NR-1)
    print_stats()
}

As a quick test, I found that the output has zero error (except an occasional round-trip floating-point parse error) compared to the SUM(), AVERAGE(), VAR(), and STDEV() functions in LibreOffice Calc, when printed with OFMT="%.16g". Both gawk and LibreOffice Calc use double-precision floating point math by default.

for i in {1..100} ; do echo $RANDOM ; done > rand.txt
gawk -v OFMT="%.16g" -f ./stats.awk rand.txt > rand_stats.csv

It also outputs expected values for Awk Generated White Noise (mean approximately zero; variariance, standard deviation, and RMS approximately one), and expected values for a sine wave (mean of zero and RMS of \(\sqrt{2}\)).

Here's a spreadsheet which compares script output and LibreOffice Calc results on random integers, white noise, and a sine wave: awk_stats.ods

Fun fact: RMS is the same as standard deviation when the mean is zero. Check the equations and these two posts for more on that:

John D. Cook also wrote a version of his C++ class that includes skewness and kurtosis. I don't use them, so adding them to this script is left as an exercise for the reader.

© Copyright 2024, Remington Furman

blog@remcycles.net

@remcycles@subdued.social