Arkadiusz Kondas 3ba35918a3
Implement VarianceThreshold - simple baseline approach to feature selection. (#228)
* Add sum of squares deviations

* Calculate population variance

* Add VarianceThreshold - feature selection transformer

* Add docs about VarianceThreshold

* Add missing code for pipeline usage
2018-02-10 18:07:09 +01:00

61 lines
1.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Variance Threshold
`VarianceThreshold` is a simple baseline approach to feature selection.
It removes all features whose variance doesnt meet some threshold.
By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
## Constructor Parameters
* $threshold (float) - features with a variance lower than this threshold will be removed (default 0.0)
```php
use Phpml\FeatureSelection\VarianceThreshold;
$transformer = new VarianceThreshold(0.15);
```
## Example of use
As an example, suppose that we have a dataset with boolean features and
we want to remove all features that are either one or zero (on or off)
in more than 80% of the samples.
Boolean features are Bernoulli random variables, and the variance of such
variables is given by
```
Var[X] = p(1 - p)
```
so we can select using the threshold .8 * (1 - .8):
```php
use Phpml\FeatureSelection\VarianceThreshold;
$samples = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]];
$transformer = new VarianceThreshold(0.8 * (1 - 0.8));
$transformer->fit($samples);
$transformer->transform($samples);
/*
$samples = [[0, 1], [1, 0], [0, 0], [1, 1], [1, 0], [1, 1]];
*/
```
## Pipeline
`VarianceThreshold` implements `Transformer` interface so it can be used as part of pipeline:
```php
use Phpml\FeatureSelection\VarianceThreshold;
use Phpml\Classification\SVC;
use Phpml\FeatureExtraction\TfIdfTransformer;
use Phpml\Pipeline;
$transformers = [
new TfIdfTransformer(),
new VarianceThreshold(0.1)
];
$estimator = new SVC();
$pipeline = new Pipeline($transformers, $estimator);
```