mirror of
https://github.com/Llewellynvdm/php-ml.git
synced 2024-11-16 10:15:13 +00:00
7d5c6b15a4
* Fix typo in Features list * Update distance.md documentation * Fix grammatical mistakes in documentation * Fix grammatical mistakes in documentation * Fix grammatical mistakes in documentation * Fix grammatical mistakes in documentation * Fix grammatical mistakes in documentation * Fix grammatical mistakes in documentation * Fix grammatical mistakes in documentation * Fix grammatical mistakes in documentation * Fix grammatical mistakes in documentation
2.6 KiB
2.6 KiB
Token Count Vectorizer
Transform a collection of text samples to a vector of token counts.
Constructor Parameters
- $tokenizer (Tokenizer) - tokenizer object (see below)
- $minDF (float) - ignore tokens that have a samples frequency strictly lower than the given threshold. This value is also called cut-off in the literature. (default 0)
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WhitespaceTokenizer;
$vectorizer = new TokenCountVectorizer(new WhitespaceTokenizer());
Transformation
To transform a collection of text samples, use the transform
method. Example:
$samples = [
'Lorem ipsum dolor sit amet dolor',
'Mauris placerat ipsum dolor',
'Mauris diam eros fringilla diam',
];
$vectorizer = new TokenCountVectorizer(new WhitespaceTokenizer());
// Build the dictionary.
$vectorizer->fit($samples);
// Transform the provided text samples into a vectorized list.
$vectorizer->transform($samples);
// return $samples = [
// [0 => 1, 1 => 1, 2 => 2, 3 => 1, 4 => 1],
// [5 => 1, 6 => 1, 1 => 1, 2 => 1],
// [5 => 1, 7 => 2, 8 => 1, 9 => 1],
//];
Vocabulary
You can extract vocabulary using the getVocabulary()
method. Example:
$vectorizer->getVocabulary();
// return $vocabulary = ['Lorem', 'ipsum', 'dolor', 'sit', 'amet', 'Mauris', 'placerat', 'diam', 'eros', 'fringilla'];
Tokenizers
- WhitespaceTokenizer - select tokens by whitespace.
- WordTokenizer - select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
- NGramTokenizer - continuous sequence of characters of the specified length. They are useful for querying languages that don’t use spaces or that have long compound words, like German.
NGramTokenizer
The NGramTokenizer tokenizer accepts the following parameters:
$minGram
- minimum length of characters in a gram. Defaults to 1.
$maxGram
- maximum length of characters in a gram. Defaults to 2.
use Phpml\Tokenization\NGramTokenizer;
$tokenizer = new NGramTokenizer(1, 2);
$tokenizer->tokenize('Quick Fox');
// returns ['Q', 'u', 'i', 'c', 'k', 'Qu', 'ui', 'ic', 'ck', 'F', 'o', 'x', 'Fo', 'ox']
NGramWordTokenizer
The NGramWordTokenizer tokenizer accepts the following parameters:
$minGram
- minimum length of characters in a gram. Defaults to 1.
$maxGram
- maximum length of characters in a gram. Defaults to 2.
use Phpml\Tokenization\NGramWordTokenizer;
$tokenizer = new NGramWordTokenizer(1, 2);
$tokenizer->tokenize('very quick fox');
// returns ['very', 'quick', 'fox', 'very quick', 'quick fox']