php-ml/docs/machine-learning/feature-extraction/token-count-vectorizer.md
KenorFR cefb4fc7a7 Ngram word (#370)
* Add NGramWordTokenizer

* Update doc
Add test
Check coding standards
2019-04-05 21:23:09 +02:00

91 lines
2.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Token Count Vectorizer
Transform a collection of text samples to a vector of token counts.
### Constructor Parameters
* $tokenizer (Tokenizer) - tokenizer object (see below)
* $minDF (float) - ignore tokens that have a samples frequency strictly lower than the given threshold. This value is also called cut-off in the literature. (default 0)
```
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WhitespaceTokenizer;
$vectorizer = new TokenCountVectorizer(new WhitespaceTokenizer());
```
### Transformation
To transform a collection of text samples use `transform` method. Example:
```
$samples = [
'Lorem ipsum dolor sit amet dolor',
'Mauris placerat ipsum dolor',
'Mauris diam eros fringilla diam',
];
$vectorizer = new TokenCountVectorizer(new WhitespaceTokenizer());
// Build the dictionary.
$vectorizer->fit($samples);
// Transform the provided text samples into a vectorized list.
$vectorizer->transform($samples);
// return $samples = [
// [0 => 1, 1 => 1, 2 => 2, 3 => 1, 4 => 1],
// [5 => 1, 6 => 1, 1 => 1, 2 => 1],
// [5 => 1, 7 => 2, 8 => 1, 9 => 1],
//];
```
### Vocabulary
You can extract vocabulary using `getVocabulary()` method. Example:
```
$vectorizer->getVocabulary();
// return $vocabulary = ['Lorem', 'ipsum', 'dolor', 'sit', 'amet', 'Mauris', 'placerat', 'diam', 'eros', 'fringilla'];
```
### Tokenizers
* WhitespaceTokenizer - select tokens by whitespace.
* WordTokenizer - select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
* NGramTokenizer - continuous sequence of characters of the specified length. They are useful for querying languages that dont use spaces or that have long compound words, like German.
**NGramTokenizer**
The NGramTokenizer tokenizer accepts the following parameters:
`$minGram` - minimum length of characters in a gram. Defaults to 1.
`$maxGram` - maximum length of characters in a gram. Defaults to 2.
```php
use Phpml\Tokenization\NGramTokenizer;
$tokenizer = new NGramTokenizer(1, 2);
$tokenizer->tokenize('Quick Fox');
// returns ['Q', 'u', 'i', 'c', 'k', 'Qu', 'ui', 'ic', 'ck', 'F', 'o', 'x', 'Fo', 'ox']
```
**NGramWordTokenizer**
The NGramWordTokenizer tokenizer accepts the following parameters:
`$minGram` - minimum length of characters in a gram. Defaults to 1.
`$maxGram` - maximum length of characters in a gram. Defaults to 2.
```php
use Phpml\Tokenization\NGramWordTokenizer;
$tokenizer = new NGramWordTokenizer(1, 2);
$tokenizer->tokenize('very quick fox');
// returns ['very', 'quick', 'fox', 'very quick', 'quick fox']
```