php-ml/docs/machine-learning/feature-extraction/token-count-vectorizer.md
Attila Bakos 7d5c6b15a4 Updates to the documentation (linguistic corrections) (#414)
* Fix typo in Features list

* Update distance.md documentation

* Fix grammatical mistakes in documentation

* Fix grammatical mistakes in documentation

* Fix grammatical mistakes in documentation

* Fix grammatical mistakes in documentation

* Fix grammatical mistakes in documentation

* Fix grammatical mistakes in documentation

* Fix grammatical mistakes in documentation

* Fix grammatical mistakes in documentation

* Fix grammatical mistakes in documentation
2019-11-02 11:41:34 +01:00

2.6 KiB
Raw Blame History

Token Count Vectorizer

Transform a collection of text samples to a vector of token counts.

Constructor Parameters

  • $tokenizer (Tokenizer) - tokenizer object (see below)
  • $minDF (float) - ignore tokens that have a samples frequency strictly lower than the given threshold. This value is also called cut-off in the literature. (default 0)
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WhitespaceTokenizer;

$vectorizer = new TokenCountVectorizer(new WhitespaceTokenizer());

Transformation

To transform a collection of text samples, use the transform method. Example:

$samples = [
    'Lorem ipsum dolor sit amet dolor',
    'Mauris placerat ipsum dolor',
    'Mauris diam eros fringilla diam',
];

$vectorizer = new TokenCountVectorizer(new WhitespaceTokenizer());

// Build the dictionary.
$vectorizer->fit($samples);

// Transform the provided text samples into a vectorized list.
$vectorizer->transform($samples);
// return $samples = [
//    [0 => 1, 1 => 1, 2 => 2, 3 => 1, 4 => 1],
//    [5 => 1, 6 => 1, 1 => 1, 2 => 1],
//    [5 => 1, 7 => 2, 8 => 1, 9 => 1],
//];

Vocabulary

You can extract vocabulary using the getVocabulary() method. Example:

$vectorizer->getVocabulary();
// return $vocabulary = ['Lorem', 'ipsum', 'dolor', 'sit', 'amet', 'Mauris', 'placerat', 'diam', 'eros', 'fringilla'];

Tokenizers

  • WhitespaceTokenizer - select tokens by whitespace.
  • WordTokenizer - select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
  • NGramTokenizer - continuous sequence of characters of the specified length. They are useful for querying languages that dont use spaces or that have long compound words, like German.

NGramTokenizer

The NGramTokenizer tokenizer accepts the following parameters:

$minGram - minimum length of characters in a gram. Defaults to 1. $maxGram - maximum length of characters in a gram. Defaults to 2.

use Phpml\Tokenization\NGramTokenizer;

$tokenizer = new NGramTokenizer(1, 2);

$tokenizer->tokenize('Quick Fox');

// returns ['Q', 'u', 'i', 'c', 'k', 'Qu', 'ui', 'ic', 'ck', 'F', 'o', 'x', 'Fo', 'ox']

NGramWordTokenizer

The NGramWordTokenizer tokenizer accepts the following parameters:

$minGram - minimum length of characters in a gram. Defaults to 1. $maxGram - maximum length of characters in a gram. Defaults to 2.

use Phpml\Tokenization\NGramWordTokenizer;

$tokenizer = new NGramWordTokenizer(1, 2);

$tokenizer->tokenize('very quick fox');

// returns ['very', 'quick', 'fox', 'very quick', 'quick fox']