MLLIB.BICLUSTER(imputer, n_clusters, seed, columns)

Bisecting K-means is a hierarchical clustering algorithm that takes a divisive (top-down) approach: all observations start in one cluster, and splits are performed recursively down the hierarchy. Each split applies regular K-means with K = 2 to the cluster with the highest SSE (sum of squared errors). The algorithm uses 20 iterations to split clusters.

Bisecting K-means can often be much faster than regular K-means, but it will generally produce a different clustering.
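
The procedure described above can be sketched in plain Python. This is a minimal illustration, not the implementation behind MLLIB.BICLUSTER: the function names (bisecting_kmeans, two_means, sse), the random choice of initial centers, and the tie-breaking rule are assumptions made for the example. Only the K = 2 splits, the highest-SSE selection, and the 20 iterations come from the description.

```python
import random

def centroid(rows):
    # Column-wise mean of a list of numeric rows.
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sse(rows):
    # Sum of squared distances from each row to the cluster centroid.
    c = centroid(rows)
    return sum(sq_dist(r, c) for r in rows)

def two_means(rows, rng, iterations=20):
    # Regular K-means with K = 2; returns a 0/1 label per row.
    # Assumption: initial centers are two randomly sampled rows.
    centers = rng.sample(rows, 2)
    labels = [0] * len(rows)
    for _ in range(iterations):
        labels = [0 if sq_dist(r, centers[0]) <= sq_dist(r, centers[1]) else 1
                  for r in rows]
        for k in (0, 1):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:  # keep the previous center if one side is empty
                centers[k] = centroid(members)
    return labels

def bisecting_kmeans(rows, n_clusters, seed, iterations=20):
    rng = random.Random(seed)
    clusters = [list(range(len(rows)))]  # start with one cluster of all row indices
    while len(clusters) < n_clusters:
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:
            break
        # Split the cluster with the highest SSE.
        target = max(candidates, key=lambda c: sse([rows[i] for i in c]))
        halves = two_means([rows[i] for i in target], rng, iterations)
        left = [i for i, h in zip(target, halves) if h == 0]
        right = [i for i, h in zip(target, halves) if h == 1]
        if not left or not right:
            break  # degenerate split, e.g. identical rows
        clusters.remove(target)
        clusters += [left, right]
    labels = [0] * len(rows)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels

# Made-up sample data: nine distinct two-column rows.
rows = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10],
        [20, 0], [20, 1], [21, 0]]
labels = bisecting_kmeans(rows, n_clusters=3, seed=555)
```

Because every split works on a single cluster with only two centers, each step handles less data than a full K-means pass over all clusters, which is one reason the method can be faster.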

Parameters
  • imputer – strategy for dealing with null values:

    • 0 – Replace null values with ‘0’

    • 1 – Assign null values to a designated ‘-1’ cluster

  • n_clusters – Number of clusters the algorithm should find, integer.

  • seed – Random seed, integer.

  • columns – Dataset columns or custom calculations.

Example: MLLIB.BICLUSTER(0, 3, 555, sum([Gross Sales]), sum([No of customers])) used as a calculation for the Color field of the Scatterplot visualization.
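
The two imputer strategies can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the product's code: cluster_with_imputer and toy_cluster are hypothetical names, toy_cluster is only a stand-in for the real algorithm, and the sketch assumes that with imputer = 1 the rows containing nulls are left out of the clustering itself.

```python
def cluster_with_imputer(rows, imputer, cluster_fn):
    if imputer == 0:
        # Strategy 0: replace every null with 0, then cluster all rows.
        filled = [[0 if v is None else v for v in row] for row in rows]
        return cluster_fn(filled)
    if imputer == 1:
        # Strategy 1: rows with a null get the designated -1 cluster;
        # only complete rows are clustered (assumption).
        complete = [i for i, row in enumerate(rows) if None not in row]
        labels = [-1] * len(rows)
        clustered = cluster_fn([rows[i] for i in complete])
        for i, label in zip(complete, clustered):
            labels[i] = label
        return labels
    raise ValueError("imputer must be 0 or 1")

def toy_cluster(rows):
    # Hypothetical stand-in: two clusters split on the first column.
    return [1 if row[0] > 5 else 0 for row in rows]

rows = [[1.0, 2.0], [None, 3.0], [9.0, 8.0]]
labels_zero_fill = cluster_with_imputer(rows, 0, toy_cluster)   # [0, 0, 1]
labels_minus_one = cluster_with_imputer(rows, 1, toy_cluster)   # [0, -1, 1]
```

With strategy 0 the row containing a null still receives a regular cluster; with strategy 1 it is easy to spot in the result because its label is -1.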

Input data
  • Size of input data is not limited.
  • No missing values (nulls are handled according to the imputer parameter).
  • Character variables are transformed to numeric with label encoding.
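
Label encoding of character variables can be sketched as follows. The function name label_encode is hypothetical, and numbering the codes in order of first appearance is an assumption; the product may assign codes in a different order.

```python
def label_encode(values):
    # Map each distinct string to an integer code, in order of first appearance.
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes

encoded, codes = label_encode(["north", "south", "north", "east"])
# encoded is [0, 1, 0, 2]; codes is {"north": 0, "south": 1, "east": 2}
```

Note that the codes impose a numeric distance between categories that have no natural order, which affects the squared-error calculation used by the algorithm.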
Result
  • Column of integer values starting with 0, where each number corresponds to a cluster assigned to each record (row) by the Bisecting K-means algorithm.
Key usage points
  • Less sensitive to initialization than regular K-means.
  • Tends to produce clusters of similar sizes, whereas K-means often produces empty clusters when K is large.
  • Lower computation time than regular K-means.
  • Use it when you want to reduce the risk of converging to a local minimum.
Drawbacks
  • If the number of clusters is not chosen properly, the results can deviate substantially from the ideal clustering.

For the whole list of algorithms, see Data science built-in algorithms.