php-ml/tests/Tokenization/WhitespaceTokenizerTest.php

<?php

declare(strict_types=1);

namespace Phpml\Tests\Tokenization;

use Phpml\Tokenization\WhitespaceTokenizer;

class WhitespaceTokenizerTest extends TokenizerTest
{
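    // Splitting happens on whitespace only, so punctuation stays attached
    // to its token (e.g. 'amet,' and 'ipsum-dolor' are single tokens).
    public function testTokenizationOnAscii(): void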
    {
        $tokenizer = new WhitespaceTokenizer();

        $tokens = ['Lorem', 'ipsum-dolor', 'sit', 'amet,', 'consectetur/adipiscing', 'elit.',
            'Cras', 'consectetur,', 'dui', 'et', 'lobortis;auctor.',
            'Nulla', 'vitae', ',.,/', 'congue', 'lorem.', ];

        self::assertEquals($tokens, $tokenizer->tokenize($this->getSimpleText()));
    }

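    // Multibyte UTF-8 input must be split on whitespace without breaking
    // characters apart; trailing commas remain part of their token.
    public function testTokenizationOnUtf8(): void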
    {
        $tokenizer = new WhitespaceTokenizer();

        $tokens = ['鋍鞎', '鳼', '鞮鞢騉', '袟袘觕,', '炟砏', '蒮', '謺貙蹖', '偢偣唲', '蒛', '箷箯緷', '鑴鱱爧', '覮轀,',
            '剆坲', '煘煓瑐', '鬐鶤鶐', '飹勫嫢', '銪', '餀', '枲柊氠', '鍎鞚韕', '焲犈,',
            '殍涾烰', '齞齝囃', '蹅輶', '鄜,', '孻憵', '擙樲橚', '藒襓謥', '岯岪弨', '蒮', '廞徲', '孻憵懥', '趡趛踠', '槏', ];

        self::assertEquals($tokens, $tokenizer->tokenize($this->getUtf8Text()));
    }
}