Arkadiusz Kondas 3ba35918a3
Implement VarianceThreshold - simple baseline approach to feature selection. (#228)
* Add sum of squares deviations

* Calculate population variance

* Add VarianceThreshold - feature selection transformer

* Add docs about VarianceThreshold

* Add missing code for pipeline usage
2018-02-10 18:07:09 +01:00

Variance Threshold

VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet a given threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
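To make the idea concrete, here is a minimal, self-contained sketch of variance-based selection. It is not the library's implementation: `populationVariance()` and `selectByVariance()` are hypothetical helper names, and the sketch assumes a feature is kept only when its population variance is strictly greater than the threshold, which is what the default behaviour described above implies.

```php
<?php

declare(strict_types=1);

/**
 * Population variance of a list of numbers (divides by n, not n - 1).
 */
function populationVariance(array $values): float
{
    $n = count($values);
    $mean = array_sum($values) / $n;

    $sumOfSquares = 0.0;
    foreach ($values as $value) {
        $sumOfSquares += ($value - $mean) ** 2;
    }

    return $sumOfSquares / $n;
}

/**
 * Hypothetical helper: keeps only the columns whose variance exceeds $threshold.
 */
function selectByVariance(array $samples, float $threshold = 0.0): array
{
    // Find the indexes of the columns that pass the threshold.
    $keep = [];
    foreach (array_keys($samples[0]) as $column) {
        if (populationVariance(array_column($samples, $column)) > $threshold) {
            $keep[] = $column;
        }
    }

    // Rebuild every row from the kept columns only.
    return array_map(
        static function (array $row) use ($keep): array {
            $selected = [];
            foreach ($keep as $column) {
                $selected[] = $row[$column];
            }

            return $selected;
        },
        $samples
    );
}

// The middle column is constant, so the default threshold of 0.0 drops it.
$filtered = selectByVariance([[1, 5, 0], [2, 5, 1], [3, 5, 0]]);
```

With this input, `$filtered` holds only the first and last columns, because the constant middle column has zero variance.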

Constructor Parameters

  • $threshold (float) - features whose variance does not exceed this threshold are removed (default 0.0)

use Phpml\FeatureSelection\VarianceThreshold;

$transformer = new VarianceThreshold(0.15);

Example of use

As an example, suppose that we have a dataset with boolean features and we want to remove all features that are either one or zero (on or off) in more than 80% of the samples. Boolean features are Bernoulli random variables, and the variance of such variables is given by

Var[X] = p(1 - p)

so we can select using the threshold 0.8 * (1 - 0.8) = 0.16:

use Phpml\FeatureSelection\VarianceThreshold;

$samples = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]];
$transformer = new VarianceThreshold(0.8 * (1 - 0.8));

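// Compute the variance of every feature (column)
$transformer->fit($samples);
// Drop the low-variance features; $samples is modified in place
$transformer->transform($samples);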

/*
$samples = [[0, 1], [1, 0], [0, 0], [1, 1], [1, 0], [1, 1]];
*/
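The result above can be checked by hand with the Bernoulli formula. The following sketch is independent of the library and only illustrates the arithmetic: for each column it computes p (the share of ones) and p(1 - p), then compares the result with the threshold. The variable names are chosen for illustration.

```php
<?php

declare(strict_types=1);

$samples = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]];
$threshold = 0.8 * (1 - 0.8); // roughly 0.16

$kept = [];
foreach (array_keys($samples[0]) as $column) {
    $values = array_column($samples, $column);
    $p = array_sum($values) / count($values); // share of ones in the column
    $variance = $p * (1 - $p);                // Bernoulli variance

    printf("column %d: p = %.3f, variance = %.3f\n", $column, $p, $variance);

    // Assumption: a column survives only if its variance exceeds the threshold.
    if ($variance > $threshold) {
        $kept[] = $column;
    }
}
```

The first column contains a single one in six samples, so p = 1/6 and its variance is 5/36 (about 0.139), which is below 0.16. The other two columns have variances of 2/9 and 1/4, so they are the ones that remain.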

Pipeline

VarianceThreshold implements the Transformer interface, so it can be used as part of a pipeline:

use Phpml\FeatureSelection\VarianceThreshold;
use Phpml\Classification\SVC;
use Phpml\FeatureExtraction\TfIdfTransformer;
use Phpml\Pipeline;

$transformers = [
    new TfIdfTransformer(),
    new VarianceThreshold(0.1)
];
$estimator = new SVC();

$pipeline = new Pipeline($transformers, $estimator);