MLLIB.CLUSTER(imputer, n_clusters, n_iter, columns)
K-means is one of the most commonly used clustering algorithms; it partitions data points into a predefined number of clusters. This implementation includes a parallelized variant of k-means++ for cluster initialization.
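As a rough illustration of what the function computes, the sketch below uses scikit-learn's KMeans as a stand-in (an assumption; the function's actual implementation is not exposed here). It shows how n_clusters, n_iter and k-means++-style initialization fit together:

```python
# Illustrative sketch only: scikit-learn stands in for the MLLIB implementation.
import numpy as np
from sklearn.cluster import KMeans

# Two numeric columns, e.g. aggregated gross sales and customer counts.
X = np.array([[120.0, 15.0], [130.0, 18.0], [900.0, 90.0],
              [950.0, 95.0], [40.0, 5.0], [45.0, 6.0]])

# n_clusters and max_iter play the roles of the n_clusters and n_iter
# parameters; init="k-means++" mirrors the k-means++-style initialization.
model = KMeans(n_clusters=3, init="k-means++", max_iter=20,
               n_init=10, random_state=0)
labels = model.fit_predict(X)
print(labels)  # one integer cluster ID per record, starting at 0
```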
Parameters
- imputer – Strategy for dealing with null values:
  - 0 – Replace null values with '0'.
  - 1 – Assign null values to a designated '-1' cluster.
- n_clusters – Number of clusters the algorithm should find, integer.
- n_iter – Maximum number of iterations (recalculations of centroids) in a single run, integer.
- columns – Dataset columns or custom calculations.

Example: MLLIB.CLUSTER(0, 3, 20, sum([Gross Sales]), sum([No of customers])), used as a calculation for the Color field of the Scatterplot visualization.
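One plausible reading of the two imputer strategies, sketched in Python with pandas and scikit-learn (both assumptions; the tool performs this step internally before clustering):

```python
# Illustrative sketch of the two imputer strategies; not the tool's own code.
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({"gross_sales": [120.0, None, 900.0, 950.0, 40.0],
                   "customers":   [15.0, 18.0, None, 95.0, 5.0]})

# imputer = 0: replace nulls with 0 and cluster every record.
labels0 = KMeans(n_clusters=2, n_init=10,
                 random_state=0).fit_predict(df.fillna(0.0))

# imputer = 1: rows containing nulls get the designated -1 cluster;
# only complete rows are actually clustered.
complete = df.dropna()
labels1 = pd.Series(-1, index=df.index)
labels1[complete.index] = KMeans(n_clusters=2, n_init=10,
                                 random_state=0).fit_predict(complete)
print(labels0.tolist(), labels1.tolist())
```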
Input data
- Size of the input data is not limited.
- Missing values are not supported by the algorithm itself; nulls are handled according to the imputer parameter.
- Character variables are transformed to numeric values with label encoding.
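A sketch of label encoding as described above, assuming pandas (pd.factorize stands in for whatever encoding scheme the tool applies internally):

```python
# Label encoding: each distinct string is replaced by an integer code.
import pandas as pd

regions = pd.Series(["North", "South", "North", "East"])
codes, categories = pd.factorize(regions)
print(codes)             # [0 1 0 2]
print(list(categories))  # ['North', 'South', 'East']
```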
 
Result
- Column of integer values starting from 0, where each number corresponds to the cluster assigned to a record (row) by the K-means algorithm (with imputer = 1, records containing nulls receive the value -1).
 
Key usage points
- Fast and computationally efficient, with very high scalability.
- Works well in practice, even when some of its assumptions are violated.
- A general-purpose clustering algorithm.
- Suitable when the approximate number of clusters is known.
- Suitable when there are few outliers.
- Suitable when the clusters are spherical, with approximately the same number of observations, density and variance.
- Euclidean distances tend to become inflated as the number of variables grows (curse of dimensionality).
- Because it computes Euclidean distances, the algorithm assumes that all input variables are numeric. One-hot encoding of categorical variables is a workaround suitable when there are relatively few categories to encode (see the sketch after this list).
- K-means assumes spherical clusters, each with roughly equal numbers of observations, density and variance; otherwise the results might be misleading.
- It always finds clusters in the data, even if no natural clusters are present (illustrated after this list).
- All data points are assigned to a cluster, even though some of them might be just random noise.
- Sensitive to outliers.
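A sketch of the one-hot-encoding workaround referenced above, assuming pandas and scikit-learn with hypothetical column names:

```python
# One-hot encoding a low-cardinality categorical column before k-means.
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({"gross_sales": [120.0, 900.0, 45.0, 950.0],
                   "region": ["North", "South", "North", "East"]})

# Replace the categorical column with 0/1 indicator columns before clustering.
encoded = pd.get_dummies(df, columns=["region"], dtype=float)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(encoded)
print(labels)
```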
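And a minimal illustration of the last caveats: k-means partitions even pure random noise into the requested number of clusters (assuming NumPy and scikit-learn):

```python
# K-means assigns every point to a cluster even when the data is pure noise.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
noise = rng.uniform(size=(200, 2))  # uniform random points, no real structure

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(noise)
print(np.bincount(labels))  # three non-empty "clusters" are still reported
```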
 
For the whole list of algorithms, see Data science built-in algorithms.