mirror of
https://github.com/Llewellynvdm/php-ml.git
synced 2024-11-14 09:24:06 +00:00
02dab41830
* Issue #349: Provide a new NGramTokenizer. * Issue #349: Add tests. * Fixes from code review. * Implement NGramTokenizer with min and max gram support * Add missing tests for ngram * Add info about NGramTokenizer to docs and readme * Add performance test for tokenization
2.2 KiB
2.2 KiB
Token Count Vectorizer
Transform a collection of text samples to a vector of token counts.
Constructor Parameters
- $tokenizer (Tokenizer) - tokenizer object (see below)
- $minDF (float) - ignore tokens that have a samples frequency strictly lower than the given threshold. This value is also called cut-off in the literature. (default 0)
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WhitespaceTokenizer;
$vectorizer = new TokenCountVectorizer(new WhitespaceTokenizer());
Transformation
To transform a collection of text samples use transform
method. Example:
$samples = [
'Lorem ipsum dolor sit amet dolor',
'Mauris placerat ipsum dolor',
'Mauris diam eros fringilla diam',
];
$vectorizer = new TokenCountVectorizer(new WhitespaceTokenizer());
// Build the dictionary.
$vectorizer->fit($samples);
// Transform the provided text samples into a vectorized list.
$vectorizer->transform($samples);
// return $samples = [
// [0 => 1, 1 => 1, 2 => 2, 3 => 1, 4 => 1],
// [5 => 1, 6 => 1, 1 => 1, 2 => 1],
// [5 => 1, 7 => 2, 8 => 1, 9 => 1],
//];
Vocabulary
You can extract vocabulary using getVocabulary()
method. Example:
$vectorizer->getVocabulary();
// return $vocabulary = ['Lorem', 'ipsum', 'dolor', 'sit', 'amet', 'Mauris', 'placerat', 'diam', 'eros', 'fringilla'];
Tokenizers
- WhitespaceTokenizer - select tokens by whitespace.
- WordTokenizer - select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
- NGramTokenizer - continuous sequence of characters of the specified length. They are useful for querying languages that don’t use spaces or that have long compound words, like German.
NGramTokenizer
The NGramTokenizer tokenizer accepts the following parameters:
$minGram
- minimum length of characters in a gram. Defaults to 1.
$maxGram
- maximum length of characters in a gram. Defaults to 2.
use Phpml\Tokenization\NGramTokenizer;
$tokenizer = new NGramTokenizer(1, 2);
$tokenizer->tokenize('Quick Fox');
// returns ['Q', 'u', 'i', 'c', 'k', 'Qu', 'ui', 'ic', 'ck', 'F', 'o', 'x', 'Fo', 'ox']