K-means

K-means is a classical method for clustering or vector quantization. It produces a fixed number of clusters, each associated with a center (also known as a prototype), and each data point is assigned to the cluster with the nearest center.
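
As an illustration of the basic idea (a minimal sketch, not this package's implementation), the assignment step of plain K-means can be written in a few lines of Julia: each point is assigned to the center at the smallest squared Euclidean distance.

# Illustrative sketch of the plain K-means assignment step (not this package's code).
# Each column of X is a data point; each column of C is a current cluster center.
function assign_to_nearest(X::AbstractMatrix, C::AbstractMatrix)
    labels = Vector{Int}(undef, size(X, 2))
    for j in axes(X, 2)
        # squared Euclidean distance from point j to every center
        dists = [sum(abs2, X[:, j] .- C[:, k]) for k in axes(C, 2)]
        labels[j] = argmin(dists)
    end
    return labels
end

X = rand(2, 10)          # 10 two-dimensional points
C = X[:, [1, 6]]         # two provisional centers (here: two of the points)
assign_to_nearest(X, C)  # cluster index (1 or 2) for each point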

Our package uses Yinyang K-means clustering, which reduces runtime and memory usage on large datasets. Traditional K-means computes the distance from every data point to every centroid in each iteration. Yinyang K-means instead uses the triangle inequality to construct and maintain upper and lower bounds on the distances between data points and centroids, applying global and local filtering to avoid unnecessary distance calculations.
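
As a rough illustration of the filtering idea (a simplified sketch, not the actual ParallelKMeans.jl code): when a center moves, the bounds from the previous iteration can be loosened by the distance it moved instead of being recomputed; if a point's upper bound to its assigned center is still below its lower bound to every other center, its assignment cannot have changed, so the distance computations for that point are skipped.

# Simplified sketch of the triangle-inequality filter (not ParallelKMeans.jl's code).
# By the triangle inequality, if a center moves by `drift`, then
#   d(x, c_new) <= d(x, c_old) + drift   and   d(x, c_new) >= d(x, c_old) - drift,
# so last iteration's bounds can be loosened rather than recomputed from scratch.
function can_skip_point(ub, lb_other, drift_assigned, drift_other_max)
    ub_new = ub + drift_assigned          # loosened upper bound to the assigned center
    lb_new = lb_other - drift_other_max   # loosened lower bound to all other centers
    return ub_new <= lb_new               # assignment provably unchanged: skip this point
end

can_skip_point(0.8, 2.5, 0.1, 0.3)        # true: no distances need recomputing for this point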

PhyloClustering.kmeans_label — Function
kmeans_label(tree::AbstractMatrix{<:Real}, n::Int64; init::String="k-means++", rng::AbstractRNG=Random.GLOBAL_RNG)

Get predicted labels from Yinyang K-means clustering for a group of phylogenetic trees.

Arguments

  • tree: a B × N tree matrix (each column is a B-dimensional tree in bipartition format).
  • n: the number of clusters.
  • init (defaults to k-means++): the initialization algorithm. Alternatively, use rand to choose random points for initialization.
  • rng: RNG object.

Output

A Vector of length N containing the predicted label for each tree (the cluster it belongs to).


Example

using PhyloClustering, PhyloNetworks, StableRNGs
# use RNG for stable result
rng = StableRNG(1)

# read 4-taxon trees in Newick format using PhyloNetworks
trees = readMultiTopology("../data/data.trees");

# convert trees to bipartition format and embed them via split-weight embedding
trees = split_weight(trees, 4)
200×7 Matrix{Float64}:
 2.073  2.073  2.492  7.764  0.419  0.0   0.0
 1.084  1.084  2.035  6.046  0.95   0.0   0.0
 1.234  1.234  2.1    6.221  0.865  0.0   0.0
 2.457  2.064  2.064  6.042  0.0    0.0   0.393
 1.692  1.692  2.836  5.67   1.144  0.0   0.0
 2.664  2.774  2.664  6.598  0.0    0.11  0.0
 1.085  1.085  2.172  5.879  1.087  0.0   0.0
 1.78   1.78   2.054  6.221  0.273  0.0   0.0
 2.123  2.083  2.083  8.856  0.0    0.0   0.04
 1.155  1.155  2.113  7.139  0.959  0.0   0.0
 ⋮                                  ⋮     
 0.663  0.663  3.492  3.492  3.853  0.0   0.0
 1.607  1.607  3.59   3.59   3.143  0.0   0.0
 0.887  0.887  3.826  3.826  3.513  0.0   0.0
 1.053  1.053  3.206  3.206  4.783  0.0   0.0
 0.592  0.592  3.322  3.322  7.53   0.0   0.0
 0.718  0.718  3.335  3.335  5.665  0.0   0.0
 1.46   1.46   3.241  3.241  3.417  0.0   0.0
 0.709  0.709  3.71   3.71   4.774  0.0   0.0
 1.113  1.113  3.402  3.402  5.019  0.0   0.0

Standardize the data and input them into Yinyang K-means clustering.

tree = standardize_tree(trees);
label = kmeans_label(tree, 2, rng=rng)
200-element Vector{Int64}:
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 ⋮
 1
 1
 1
 1
 1
 1
 1
 1
 1

We can visualize the result using the built-in function plot_clusters.

plot_clusters(trees', label)
[Plot of the clustering result produced by plot_clusters]

Reference

The implementation of Yinyang K-means is provided by ParallelKMeans.jl.

Yufei Ding et al. 2015. Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Lille, France, 6-11 July 2015.