Token Count Vectorizer

Transforms a collection of text samples into vectors of token counts.

Constructor Parameters

  • $tokenizer (Tokenizer) - tokenizer object (see below)
  • $minDF (float) - ignore tokens whose sample frequency is strictly lower than the given threshold. This value is also called the cut-off in the literature. (default 0)

use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WhitespaceTokenizer;

$vectorizer = new TokenCountVectorizer(new WhitespaceTokenizer());
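
The effect of the $minDF cut-off can be pictured with a small sketch in plain PHP. This is not the library's implementation: the function filterByMinDf is hypothetical, and it assumes that the sample frequency of a token is the fraction of samples in which the token occurs at least once.

```php
<?php
// Hypothetical helper, not part of php-ml: returns the tokens that survive the cut-off.
// Assumption: frequency = (samples containing the token) / (total number of samples).
function filterByMinDf(array $tokenizedSamples, float $minDf): array
{
    $sampleCount = count($tokenizedSamples);
    if ($sampleCount === 0) {
        return [];
    }

    // Count each token at most once per sample.
    $documentCounts = [];
    foreach ($tokenizedSamples as $tokens) {
        foreach (array_unique($tokens) as $token) {
            $documentCounts[$token] = ($documentCounts[$token] ?? 0) + 1;
        }
    }

    // Drop tokens whose frequency is strictly lower than the threshold.
    $kept = [];
    foreach ($documentCounts as $token => $count) {
        if ($count / $sampleCount >= $minDf) {
            $kept[] = (string) $token;
        }
    }

    return $kept;
}

$tokenized = [
    ['Lorem', 'ipsum', 'dolor', 'sit', 'amet', 'dolor'],
    ['Mauris', 'placerat', 'ipsum', 'dolor'],
    ['Mauris', 'diam', 'eros', 'fringilla', 'diam'],
];

// With a threshold of 0.5, only tokens found in at least 2 of the 3 samples remain.
$frequentTokens = filterByMinDf($tokenized, 0.5);
// $frequentTokens is ['ipsum', 'dolor', 'Mauris']
```

With the default threshold of 0, no token falls strictly below the cut-off, so nothing is filtered out.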

Transformation

To transform a collection of text samples, use the transform method. Example:

$samples = [
    'Lorem ipsum dolor sit amet dolor',
    'Mauris placerat ipsum dolor',
    'Mauris diam eros fringilla diam',
];

$vectorizer = new TokenCountVectorizer(new WhitespaceTokenizer());

// Build the dictionary.
$vectorizer->fit($samples);

// Transform the text samples into token-count vectors. The array is modified in place.
$vectorizer->transform($samples);
// $samples = [
//    [0 => 1, 1 => 1, 2 => 2, 3 => 1, 4 => 1],
//    [5 => 1, 6 => 1, 1 => 1, 2 => 1],
//    [5 => 1, 7 => 2, 8 => 1, 9 => 1],
//];
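
To make the output above concrete, here is a rough sketch of the fit and transform idea in plain PHP. It is not php-ml's implementation: buildVocabulary and countTokens are hypothetical functions, tokens are split on whitespace only, and tokens that do not occur in a sample are left out of its vector, as in the example output.

```php
<?php
// Hypothetical helpers for illustration; not part of php-ml.

// Assign each distinct token an index in order of first appearance.
function buildVocabulary(array $samples): array
{
    $vocabulary = [];
    foreach ($samples as $sample) {
        foreach (preg_split('/\s+/', $sample, -1, PREG_SPLIT_NO_EMPTY) as $token) {
            if (!in_array($token, $vocabulary, true)) {
                $vocabulary[] = $token;
            }
        }
    }

    return $vocabulary;
}

// Map one sample to [vocabulary index => number of occurrences].
function countTokens(string $sample, array $vocabulary): array
{
    $indexByToken = array_flip($vocabulary);
    $counts = [];
    foreach (preg_split('/\s+/', $sample, -1, PREG_SPLIT_NO_EMPTY) as $token) {
        if (!isset($indexByToken[$token])) {
            continue; // token was not seen while building the vocabulary
        }
        $index = $indexByToken[$token];
        $counts[$index] = ($counts[$index] ?? 0) + 1;
    }

    return $counts;
}

$samples = [
    'Lorem ipsum dolor sit amet dolor',
    'Mauris placerat ipsum dolor',
    'Mauris diam eros fringilla diam',
];

$vocabulary = buildVocabulary($samples);

$vectors = [];
foreach ($samples as $sample) {
    $vectors[] = countTokens($sample, $vocabulary);
}
// $vectors[0] is [0 => 1, 1 => 1, 2 => 2, 3 => 1, 4 => 1]
```

Building the vocabulary first and counting afterwards mirrors the separate fit and transform calls: the indexes stay stable no matter which samples are transformed later.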

Vocabulary

You can retrieve the vocabulary with the getVocabulary() method. Example:

$vectorizer->getVocabulary();
// return $vocabulary = ['Lorem', 'ipsum', 'dolor', 'sit', 'amet', 'Mauris', 'placerat', 'diam', 'eros', 'fringilla'];
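
Judging from the two example outputs, the keys of each transformed sample are positions in the vocabulary array. Under that assumption, a count vector can be made readable by swapping indexes for tokens. The helper labelCounts is hypothetical, and the arrays are copied from the examples rather than produced by the library.

```php
<?php
// Hypothetical helper, not part of php-ml: replace vocabulary indexes with tokens.
function labelCounts(array $counts, array $vocabulary): array
{
    $labelled = [];
    foreach ($counts as $index => $count) {
        $labelled[$vocabulary[$index]] = $count;
    }

    return $labelled;
}

$vocabulary = ['Lorem', 'ipsum', 'dolor', 'sit', 'amet', 'Mauris', 'placerat', 'diam', 'eros', 'fringilla'];

// Third sample from the Transformation example.
$labelled = labelCounts([5 => 1, 7 => 2, 8 => 1, 9 => 1], $vocabulary);
// $labelled is ['Mauris' => 1, 'diam' => 2, 'eros' => 1, 'fringilla' => 1]
```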

Tokenizers

  • WhitespaceTokenizer - splits text into tokens on whitespace.
  • WordTokenizer - selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
  • NGramTokenizer - produces contiguous sequences of characters (n-grams) of the specified lengths. N-grams are useful for querying languages that don't use spaces or that have long compound words, like German.
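
The difference between the first two tokenizers can be sketched with regular expressions. The patterns below are approximations chosen for illustration, not necessarily the ones php-ml uses, and the function names are hypothetical. Note that \w also matches the underscore, so it is slightly broader than "alphanumeric".

```php
<?php
// Hypothetical stand-ins for illustration; not the php-ml classes.

// Split on runs of whitespace, keeping punctuation attached to the tokens.
function splitOnWhitespace(string $text): array
{
    return preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Keep runs of two or more word characters; everything else separates tokens.
function extractWords(string $text): array
{
    preg_match_all('/\w{2,}/u', $text, $matches);

    return $matches[0];
}

$text = "Hello, world! It's a test.";

$byWhitespace = splitOnWhitespace($text);
// $byWhitespace is ['Hello,', 'world!', "It's", 'a', 'test.']

$words = extractWords($text);
// $words is ['Hello', 'world', 'It', 'test']
```

The whitespace split keeps "Hello," and "test." with their punctuation, while the word pattern drops punctuation and the single-character tokens "a" and "s".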

NGramTokenizer

The NGramTokenizer accepts the following parameters:

  • $minGram - minimum number of characters in a gram. Defaults to 1.
  • $maxGram - maximum number of characters in a gram. Defaults to 2.

use Phpml\Tokenization\NGramTokenizer;

$tokenizer = new NGramTokenizer(1, 2);

$tokenizer->tokenize('Quick Fox');

// returns ['Q', 'u', 'i', 'c', 'k', 'Qu', 'ui', 'ic', 'ck', 'F', 'o', 'x', 'Fo', 'ox']
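
The output above is consistent with generating grams word by word, shortest length first. A minimal sketch under that assumption follows. The function characterNGrams is hypothetical and is not php-ml's implementation; in particular, it splits words on whitespace only.

```php
<?php
// Hypothetical helper, not part of php-ml.
function characterNGrams(string $text, int $minGram = 1, int $maxGram = 2): array
{
    $grams = [];
    foreach (preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        // Split the word into UTF-8 characters so multibyte text is handled.
        $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
        $length = count($chars);

        // For each gram size, slide a window of that size across the word.
        for ($size = max(1, $minGram); $size <= $maxGram; $size++) {
            for ($start = 0; $start + $size <= $length; $start++) {
                $grams[] = implode('', array_slice($chars, $start, $size));
            }
        }
    }

    return $grams;
}

$grams = characterNGrams('Quick Fox', 1, 2);
// $grams is ['Q', 'u', 'i', 'c', 'k', 'Qu', 'ui', 'ic', 'ck', 'F', 'o', 'x', 'Fo', 'ox']
```

Raising both bounds changes which window sizes are emitted: in this sketch, characterNGrams('Quick Fox', 2, 3) yields ['Qu', 'ui', 'ic', 'ck', 'Qui', 'uic', 'ick', 'Fo', 'ox', 'Fox'].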