mirror of https://github.com/Llewellynvdm/php-ml.git synced 2024-09-23 20:49:03 +00:00

Attila Bakos 7d5c6b15a4 Updates to the documentation (linguistic corrections) (#414 )

* Fix typo in Features list

* Update distance.md documentation

* Fix grammatical mistakes in documentation

2019-11-02 11:41:34 +01:00

K-means clustering

The K-Means algorithm clusters data by trying to separate the samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum of squares. The number of clusters must be specified in advance.
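To make the criterion concrete, the following sketch computes the inertia of a given grouping as the sum of squared Euclidean distances from each sample to the mean of its cluster. The function names `centroid` and `inertia` are illustrative assumptions and are not part of the library's API.

```php
<?php
/** Mean point of a non-empty list of equal-length numeric vectors. */
function centroid(array $points): array
{
    $dims = count(reset($points));
    $sum = array_fill(0, $dims, 0.0);
    foreach ($points as $p) {
        foreach (array_values($p) as $i => $v) {
            $sum[$i] += $v;
        }
    }
    $n = count($points);
    return array_map(fn (float $s): float => $s / $n, $sum);
}

/** Within-cluster sum of squares, summed over all clusters. */
function inertia(array $clusters): float
{
    $total = 0.0;
    foreach ($clusters as $points) {
        $c = centroid($points);
        foreach ($points as $p) {
            foreach (array_values($p) as $i => $v) {
                $total += ($v - $c[$i]) ** 2;
            }
        }
    }
    return $total;
}

// Two hand-made groupings of the same six points.
$good = [[[1, 1], [1, 2], [2, 1]], [[8, 7], [7, 8], [8, 9]]];
$bad  = [[[1, 1], [8, 7], [1, 2]], [[7, 8], [2, 1], [8, 9]]];

printf("%.2f vs %.2f\n", inertia($good), inertia($bad));
```

The grouping that keeps nearby points together yields the smaller value, which is the quantity the algorithm tries to minimize.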

Constructor Parameters

$clustersNumber - number of clusters to find
$initialization - initialization method, default kmeans++ (see below)

use Phpml\Clustering\KMeans;

$kmeans = new KMeans(2);                      // kmeans++ initialization (default)
$kmeans = new KMeans(4, KMeans::INIT_RANDOM); // random initialization

Clustering

To divide the samples into clusters, use the cluster method. It returns an array of clusters, each containing the samples assigned to it.

$samples = [[1, 1], [8, 7], [1, 2], [7, 8], [2, 1], [8, 9]];

Or, if you need to keep identifiers with your samples, use array keys as labels:

$samples = ['Label1' => [1, 1], 'Label2' => [8, 7], 'Label3' => [1, 2]];

$kmeans = new KMeans(2);
$kmeans->cluster($samples);
// returns [0 => [[1, 1], ...], 1 => [[8, 7], ...]]
// or, with labels, [0 => ['Label1' => [1, 1], 'Label3' => [1, 2], ...], 1 => ['Label2' => [8, 7], ...]]
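When labels are used, it is often convenient to look up which cluster a given label ended up in. The sketch below inverts a result of the documented shape into a label-to-cluster map. The `$clusters` array is a hand-written stand-in for the return value of cluster, and `clusterIndexByLabel` is a hypothetical helper, not a library function.

```php
<?php
// Hand-written stand-in for the value returned by cluster();
// the real cluster indices may differ from run to run.
$clusters = [
    0 => ['Label1' => [1, 1], 'Label3' => [1, 2]],
    1 => ['Label2' => [8, 7]],
];

/** Invert the clustering result into a label => cluster index map. */
function clusterIndexByLabel(array $clusters): array
{
    $assignment = [];
    foreach ($clusters as $index => $samples) {
        foreach (array_keys($samples) as $label) {
            $assignment[$label] = $index;
        }
    }
    return $assignment;
}

$assignment = clusterIndexByLabel($clusters);
```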

Initialization methods

kmeans++ (default)

The k-means++ method selects the initial cluster centers for k-means clustering in a smart way to speed up convergence. It uses the DASV seeding method, which consists of finding good initial centroids for the clusters.
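As an illustration of the idea, here is a minimal sketch of textbook k-means++ seeding: the first center is picked uniformly at random, and each further center is drawn with probability proportional to its squared distance from the nearest center chosen so far. This is not necessarily how the library implements it; `squaredDistance` and `seedCenters` are names assumed for this example.

```php
<?php
/** Squared Euclidean distance between two equal-length vectors. */
function squaredDistance(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $v) {
        $sum += ($v - $b[$i]) ** 2;
    }
    return $sum;
}

/** Pick $k initial centers from $samples by distance-weighted sampling. */
function seedCenters(array $samples, int $k): array
{
    $samples = array_values($samples);
    $last = count($samples) - 1;
    $centers = [$samples[random_int(0, $last)]];

    while (count($centers) < $k) {
        // Squared distance from each sample to its nearest chosen center.
        $weights = [];
        foreach ($samples as $s) {
            $weights[] = min(array_map(
                fn (array $c): float => squaredDistance($s, $c),
                $centers
            ));
        }

        $total = array_sum($weights);
        if ($total <= 0.0) {
            // Every sample coincides with a center; fall back to a uniform pick.
            $centers[] = $samples[random_int(0, $last)];
            continue;
        }

        // Walk the cumulative weights until the random threshold is reached.
        $threshold = $total * mt_rand() / mt_getrandmax();
        $running = 0.0;
        $chosen = 0;
        foreach ($weights as $i => $w) {
            if ($w <= 0.0) {
                continue; // already a center, never pick it again
            }
            $chosen = $i;
            $running += $w;
            if ($running >= $threshold) {
                break;
            }
        }
        $centers[] = $samples[$chosen];
    }

    return $centers;
}

$samples = [[1, 1], [8, 7], [1, 2], [7, 8], [2, 1], [8, 9]];
$centers = seedCenters($samples, 2);
```

Because samples far from every existing center carry the largest weights, the seeds tend to be spread across the data, which is what speeds up convergence.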

random

The random initialization method chooses completely random centroids. It first determines the boundaries of the sample space so that centroids are not placed too far from the sample data.
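The sketch below shows one way this could work: compute the per-dimension minimum and maximum of the samples, then draw each centroid coordinate uniformly within those bounds. It is an assumption about the approach, not the library's actual code, and `boundingBox` and `randomCentroids` are hypothetical names.

```php
<?php
/** Per-dimension [min, max] bounds of the samples. */
function boundingBox(array $samples): array
{
    $bounds = [];
    foreach ($samples as $s) {
        foreach (array_values($s) as $i => $v) {
            $bounds[$i] = isset($bounds[$i])
                ? [min($bounds[$i][0], $v), max($bounds[$i][1], $v)]
                : [$v, $v];
        }
    }
    return $bounds;
}

/** Draw $k centroids uniformly at random inside the bounding box. */
function randomCentroids(array $samples, int $k): array
{
    $bounds = boundingBox($samples);
    $centroids = [];
    for ($n = 0; $n < $k; $n++) {
        $centroids[] = array_map(
            fn (array $b): float => $b[0] + ($b[1] - $b[0]) * mt_rand() / mt_getrandmax(),
            $bounds
        );
    }
    return $centroids;
}

$samples = [[1, 1], [8, 7], [1, 2], [7, 8], [2, 1], [8, 9]];
$centroids = randomCentroids($samples, 4);
```

Unlike k-means++ seeding, the centroids produced this way are generally not sample points; they are only guaranteed to lie within the range covered by the data.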
