php-ml/docs/machine-learning/feature-selection/selectkbest.md
2018-02-14 20:34:53 +01:00

2.5 KiB

SelectKBest

SelectKBest - select features according to the k highest scores.

Constructor Parameters

  • $k (int) - number of top features to select, rest will be removed (default: 10)
  • $scoringFunction (ScoringFunction) - function that take samples and targets and return array with scores (default: ANOVAFValue)
use Phpml\FeatureSelection\SelectKBest;

$transformer = new SelectKBest(2);

Example of use

As an example we can perform feature selection on Iris dataset to retrieve only the two best features as follows:

use Phpml\FeatureSelection\SelectKBest;
use Phpml\Dataset\Demo\IrisDataset;

$dataset = new IrisDataset();
$selector = new SelectKBest(2);
$selector->fit($samples = $dataset->getSamples(), $dataset->getTargets());
$selector->transform($samples);

/*
$samples[0] = [1.4, 0.2]; 
*/

Scores

You can get a array with the calculated score for each feature. A higher value means that a given feature is better suited for learning. Of course, the rating depends on the scoring function used.

use Phpml\FeatureSelection\SelectKBest;
use Phpml\Dataset\Demo\IrisDataset;

$dataset = new IrisDataset();
$selector = new SelectKBest(2);
$selector->fit($samples = $dataset->getSamples(), $dataset->getTargets());
$selector->scores();

/*
..array(4) {
  [0]=>
  float(119.26450218451)
  [1]=>
  float(47.364461402997)
  [2]=>
  float(1179.0343277002)
  [3]=>
  float(959.32440572573)
} 
*/

Scoring function

Available scoring functions:

For classification:

  • ANOVAFValue The one-way ANOVA tests the null hypothesis that 2 or more groups have the same population mean. The test is applied to samples from two or more groups, possibly with differing sizes.

For regression:

  • UnivariateLinearRegression
    Quick linear model for testing the effect of a single regressor, sequentially for many regressors. This is done in 2 steps:
      1. The cross correlation between each regressor and the target is computed, that is, ((X[:, i] - mean(X[:, i])) * (y - mean_y)) / (std(X[:, i]) *std(y)).
      1. It is converted to an F score

Pipeline

SelectKBest implements Transformer interface so it can be used as part of pipeline:

use Phpml\FeatureSelection\SelectKBest;
use Phpml\Classification\SVC;
use Phpml\FeatureExtraction\TfIdfTransformer;
use Phpml\Pipeline;

$transformers = [
    new TfIdfTransformer(),
    new SelectKBest(3)
];
$estimator = new SVC();

$pipeline = new Pipeline($transformers, $estimator);