Combining progressive variances
So, i ran across a math problem that i don’t know how to solve.
“The sequences were eating through my 4G of RAM and larger swap and still demanding more.”
1 I tend to think of building categorizers in four phases: tokenization, feature selection, model building, and model testing.
Recently, i was building a text-based categorizer for a client. During the feature selection1 phase, i needed to calculate the average and standard deviation for each potential feature. In my early runs, i implemented this straightforwardly by keeping the per-document counts in per-feature sequences. This worked fine for a couple dozen documents, but on the actual training set (which holds a few thousand documents), this expanded predictably rapidly. The sequences were proportionally longer, but the number of features (and thus, the number of sequences) also increased (sublinearly, but enough). Long story short, despite my beefy workstation, the sequences were eating through my 4G of RAM and larger swap and still demanding more.
The solution is obvious: use progressive mean and variance formulae, then transform the variance into the standard deviation at the end. For those of you unfamiliar with this technique, i’ll go over the progressive mean.
First, i’m talking about the arithmetic mean, which, in non-math terms, is just called “the average”. But i’m feeling all mathy, so it’s the mean for now. Remember how to calculate a mean (aka μ, the Greek letter “mu”, which is m-like and is used to spell μean):
μ = (x 1 + x 2 + ... + x n) / n
Calculating that requires access to the entire sequence x. Let’s say we want to calculate the mean of everything up to a given point (t) in the sequence.
μ t = (x 1 + ... + x t) / t
Not much difference so far. Let’s compare the mean of everything up to t + 1:
μ t+1 = (x 1 + ... + x t + x t+1) / (t + 1)
We can factor:
μ t+1 = ((x 1 + ... + x t) / t) * (t / (t + 1)) + (x t+1 / (t + 1))
That first part is μ t, which we can turn around and plug in:
μ t+1 = (t * μ t) / (t + 1) + (x t+1 / (t + 1))
Which no longer needs access to x 1 through x t-1. Which means we don’t need the whole sequence as long as we have access to a running mean (μ t), the next item in the sequence (x t+1), and the number of items so far (t).
That’s called the progressive mean. Or the cumulative mean. Or running mean. Or recursive mean.
2 Which i suspected, and saw many references to, but didn’t understand enough to implement until finding this tutorial.
It turns out you can do the same thing with variance2. Which is nice, because i now had the ability to get rid of my pesky sequences.
Except for the wrinkle. You knew there would be a wrinkle, right?
“If i know of two sequences which share a length, and have their average and variance, can i find the average and variance of the sequence which is their sum?”
The sequences i’m working with are not wholly independent from eachother. There are superficial similarities (e.g. they’re all the same length, equal to the number of documents in my training set). But, for reasons i won’t get into, i sometimes needed to combine these sequences. That is, if i had:
x = [x 1, x 2, ... x n ]
y = [y 1, y 2, ... y n ]
Sometimes what i really wanted was:
z = [(x 1 + y 1), (x 2 + y 2), ... (x n + y n)]
The kicker’s that i don’t know which x‘s and y‘s to combine until after the tokenizing. Which means i can’t combine them proactively and take the progressive stats as i go.
So here’s the question: If i know of two sequences which share a length, and have their average and variance, can i find the average and variance of the sequence which is their sum?
The average is easy. Just add μ x and μ y to get μ z. The variance is a different story. Any help out there?
By john on October 3, 2007
Comments
By RepIRet @ 5:47 p.m.
Thank you! I knew it would be something obvious like that.
Not only is there less rounding error, it’s also all-around less computationally intensive. Thank you very much.
By john @ 3:13 p.m.
Comments currently disabled due to impressive comment-spam efforts.
You should store a running sum and a running sum of squares rather than a running mean and variance, you’ll get less rounding errors that way.
Plus, its really obvious how compute the sum and sum of squares for z that way.
And if you don’t see how to compute the variance from the sum, sum of squares and count: look at the definition of variance, and do some binomial expansion. Remember that the summation distributes, and since μ is constant over the sum, you can move it outside of the summation.