K-means

K-means is a classical method for clustering or vector quantization. It produces a fixed number of clusters, each associated with a center (also known as a prototype), and each data point is assigned to the cluster with the nearest center.
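
As an illustration of the basic idea (a minimal sketch, not this package's implementation), the assignment step of plain K-means can be written in a few lines of Julia: each point is assigned to the center at the smallest squared Euclidean distance.

# Illustrative sketch of the plain K-means assignment step (not this package's code).
# Each column of X is a data point; each column of C is a current cluster center.
function assign_to_nearest(X::AbstractMatrix, C::AbstractMatrix)
    labels = Vector{Int}(undef, size(X, 2))
    for j in axes(X, 2)
        # squared Euclidean distance from point j to every center
        dists = [sum(abs2, X[:, j] .- C[:, k]) for k in axes(C, 2)]
        labels[j] = argmin(dists)
    end
    return labels
end

X = rand(2, 10)          # 10 two-dimensional points
C = X[:, [1, 6]]         # two provisional centers (here: two of the points)
assign_to_nearest(X, C)  # cluster index (1 or 2) for each point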

Our package uses Yinyang K-means clustering, which reduces runtime and memory usage on large datasets. Traditional K-means computes the distance from every data point to every centroid in each iteration. Yinyang K-means instead uses the triangle inequality to construct and maintain upper and lower bounds on the distances between data points and centroids, applying global and local filtering to avoid unnecessary distance calculations.
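
As a rough illustration of the filtering idea (a simplified sketch, not the actual ParallelKMeans.jl code): when a center moves, the bounds from the previous iteration can be loosened by the distance it moved instead of being recomputed; if a point's upper bound to its assigned center is still below its lower bound to every other center, its assignment cannot have changed, so the distance computations for that point are skipped.

# Simplified sketch of the triangle-inequality filter (not ParallelKMeans.jl's code).
# By the triangle inequality, if a center moves by `drift`, then
#   d(x, c_new) <= d(x, c_old) + drift   and   d(x, c_new) >= d(x, c_old) - drift,
# so last iteration's bounds can be loosened rather than recomputed from scratch.
function can_skip_point(ub, lb_other, drift_assigned, drift_other_max)
    ub_new = ub + drift_assigned          # loosened upper bound to the assigned center
    lb_new = lb_other - drift_other_max   # loosened lower bound to all other centers
    return ub_new <= lb_new               # assignment provably unchanged: skip this point
end

can_skip_point(0.8, 2.5, 0.1, 0.3)        # true: no distances need recomputing for this point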

PhyloClustering.kmeans_label — Function
kmeans_label(tree::AbstractMatrix{<:Real}, n::Int64; init::String="k-means++", rng::AbstractRNG=Random.GLOBAL_RNG)

Get predicted labels from Yinyang K-means clustering for a group of phylogenetic trees.

Arguments

  • tree: a B × N tree matrix (each column is a B-dimensional tree in bipartition format).
  • n: the number of clusters.
  • init (defaults to k-means++): the initialization algorithm. Alternatively, use rand to choose random points for initialization.
  • rng: RNG object.

Output

A Vector of length N containing the predicted label for each tree (the cluster it belongs to).


Example

using PhyloClustering, PhyloNetworks, StableRNGs
# use RNG for stable result
rng = StableRNG(1)

# read 4-taxon trees in Newick format using PhyloNetworks
trees = readMultiTopology("../data/data.trees");

# convert trees to bipartition format and embed them via split-weight embedding
trees = split_weight(trees, 4)
200×7 Matrix{Float64}:
 2.073  2.073  2.492  7.764  0.419  0.0   0.0
 1.084  1.084  2.035  6.046  0.95   0.0   0.0
 1.234  1.234  2.1    6.221  0.865  0.0   0.0
 2.457  2.064  2.064  6.042  0.0    0.0   0.393
 1.692  1.692  2.836  5.67   1.144  0.0   0.0
 2.664  2.774  2.664  6.598  0.0    0.11  0.0
 1.085  1.085  2.172  5.879  1.087  0.0   0.0
 1.78   1.78   2.054  6.221  0.273  0.0   0.0
 2.123  2.083  2.083  8.856  0.0    0.0   0.04
 1.155  1.155  2.113  7.139  0.959  0.0   0.0
 ⋮                                  ⋮     
 0.663  0.663  3.492  3.492  3.853  0.0   0.0
 1.607  1.607  3.59   3.59   3.143  0.0   0.0
 0.887  0.887  3.826  3.826  3.513  0.0   0.0
 1.053  1.053  3.206  3.206  4.783  0.0   0.0
 0.592  0.592  3.322  3.322  7.53   0.0   0.0
 0.718  0.718  3.335  3.335  5.665  0.0   0.0
 1.46   1.46   3.241  3.241  3.417  0.0   0.0
 0.709  0.709  3.71   3.71   4.774  0.0   0.0
 1.113  1.113  3.402  3.402  5.019  0.0   0.0

Standardize the data and input them into Yinyang K-means clustering.

tree = standardize_tree(trees);
label = kmeans_label(tree, 2, rng=rng)
200-element Vector{Int64}:
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 ⋮
 1
 1
 1
 1
 1
 1
 1
 1
 1

We can visualize the result using the built-in function plot_clusters.

plot_clusters(trees', label)
[Plot of the clustering result produced by plot_clusters]

Reference

The implementation of Yinyang K-means is provided by ParallelKMeans.jl.

Yufei Ding et al. 2015. Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Lille, France, 6-11 July 2015.