MLLIB.CLUSTER(imputer, n_clusters, n_iter, columns)
K-means is one of the most commonly used clustering algorithms: it partitions the data points into a predefined number of clusters. The implementation includes a parallelized variant of the k-means++ method for cluster initialization.
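The core loop can be sketched in a few lines of NumPy. This is an illustration of the general technique, not the platform's actual implementation: k-means++ seeding followed by the standard assign-and-recompute iterations.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    # k-means++ seeding: each new centroid is drawn with probability
    # proportional to its squared distance from the nearest chosen centroid
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

def kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = kmeans_pp_init(X, k, rng)
    for _ in range(n_iter):
        # assign every point to its nearest centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break  # converged before hitting the iteration cap
        centroids = new
    return labels, centroids
```

Run on two well-separated groups of points, `kmeans(X, 2)` returns one 0-based integer label per row, matching the Result described below.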
Parameters

imputer – Strategy for dealing with null values:

0 – Replace null values with '0'

1 – Assign null values to a designated '1' cluster


n_clusters – Number of clusters the algorithm should find, integer.

n_iter – Maximum number of iterations (recalculations of centroids) in a single run, integer.

columns – Dataset columns or custom calculations.
Example: MLLIB.CLUSTER(0, 3, 20, sum([Gross Sales]), sum([No of customers])) used as a calculation for the Color field of the Scatterplot visualization.
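The two imputer strategies can be sketched in plain Python. This reflects one plausible reading of the parameter description; the function names are hypothetical and not part of MLLIB:

```python
def impute_zero(rows):
    # imputer = 0: replace every null with 0 before clustering
    return [[0.0 if v is None else v for v in row] for row in rows]

def split_nulls(rows):
    # imputer = 1: rows containing nulls are set aside for a
    # designated cluster of their own; the rest are clustered normally
    complete = [row for row in rows if None not in row]
    with_nulls = [row for row in rows if None in row]
    return complete, with_nulls
```

With strategy 0 a null contributes a coordinate of 0 to the distance calculation; with strategy 1 incomplete rows never influence the centroids.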
Input data
- Size of input data is not limited.
- Must not contain missing values.
- Character variables are transformed to numeric values with label encoding.
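Label encoding as described above can be sketched in plain Python. The first-seen ordering used here is an assumption; the engine's actual code assignment may differ (e.g. alphabetical):

```python
def label_encode(values):
    # map each distinct value to an integer code, in order of first appearance
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]
```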
Result
- Column of integer values starting at 0, where each value identifies the cluster assigned to that record (row) by the K-means algorithm.
Key usage points
- Fast and computationally efficient, with very high scalability.
- Works well in practice, even when some of its assumptions are broken.
- General-purpose clustering algorithm.
- Use when the approximate number of clusters is known.
- Use when there is a low number of outliers.
- Use when the clusters are spherical, with approximately the same number of observations, density, and variance.
- Euclidean distances tend to become inflated as the number of variables grows (curse of dimensionality).
- Because it computes Euclidean distances, the algorithm assumes numeric input variables only. One-hot encoding of categorical variables is a workaround suitable for a relatively low number of categories.
- K-means assumes spherical clusters with roughly equal numbers of observations, density, and variance; otherwise the results may be misleading.
- It always finds clusters in the data, even if no natural clusters are present.
- All data points are assigned to a cluster, even though some of them might be just random noise.
- Sensitive to outliers.
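The one-hot workaround mentioned above can be sketched in plain Python. Sorting the category names to fix the column order is an assumption for this illustration:

```python
def one_hot(values):
    # one binary 0/1 column per category; Euclidean distance is then
    # well defined over these columns, so k-means can use them directly
    cats = sorted(set(values))
    return [[1 if v == c else 0 for c in cats] for v in values]
```

Note that each extra category adds a dimension, which is why this is only suitable for a relatively low number of categories.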
For the whole list of algorithms, see Data science built-in algorithms.