Category - math

Combining progressive variances

So, i ran across a math problem that i don’t know how to solve.

Recently, i was building a text-based categorizer for a client. During the feature selection phase, i needed to calculate the average and standard deviation for each potential feature. In my early runs, i implemented this straightforwardly by keeping the per-document counts in per-feature sequences. This worked fine for a couple dozen documents, but on the actual training set (which holds a few thousand documents), memory use grew predictably fast. Each sequence was proportionally longer, and the number of features (and thus the number of sequences) also increased (sublinearly, but enough). Long story short, despite my beefy workstation, the sequences were eating through my 4 GB of RAM and larger swap and still demanding more…
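
To make the memory problem concrete, here is a rough sketch of what that straightforward approach might look like. It is not the original code: the function name `naive_feature_stats`, the token-list input format, and the choice of population standard deviation (`pstdev`) are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def naive_feature_stats(documents):
    """Return {feature: (mean, stdev)} of per-document counts.

    Hypothetical helper: `documents` is assumed to be an iterable of
    token lists, and every distinct token is treated as a feature.
    """
    counts = {}  # feature -> list holding one count per document seen so far
    for n_seen, tokens in enumerate(documents):
        doc_counts = Counter(tokens)
        for feature in doc_counts:
            if feature not in counts:
                # New feature: backfill zeros for the documents already seen.
                counts[feature] = [0] * n_seen
        for feature, seq in counts.items():
            # Counter returns 0 for features absent from this document.
            seq.append(doc_counts[feature])
    # Population standard deviation is an assumption; the sample version
    # (statistics.stdev) would be a one-word change.
    return {f: (mean(seq), pstdev(seq)) for f, seq in counts.items()}


if __name__ == "__main__":
    docs = [["a", "b", "a"], ["b"], ["a"]]
    for feature, (avg, sd) in sorted(naive_feature_stats(docs).items()):
        print(feature, avg, sd)
```

Every sequence ends up with one entry per document, so storage grows as (number of features) × (number of documents), which is exactly the blow-up described above.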

Completed on October 3, 2007