DCPY.AUTOCLUST(max_clusters, columns)
Automated clustering uses the K-means algorithm to cluster the data. It starts by creating a model with two clusters and continues up to a specified maximum number of clusters, evaluating the clustering quality of each model with the Calinski-Harabasz index. If the value of this index is lower than that of the preceding model, the preceding model is used as the optimal clustering (the first local maximum of the Calinski-Harabasz index).
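The search described above can be sketched in Python with scikit-learn's KMeans and calinski_harabasz_score. This is only an illustration under stated assumptions, not the code behind DCPY.AUTOCLUST: the function name auto_cluster, the n_init and random_state values, and the synthetic data are all hypothetical, and the input is assumed to be already numeric and complete.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score


def auto_cluster(X, max_clusters=10, random_state=0):
    """Return 1-based labels from the first local maximum of the index."""
    if max_clusters < 2:
        raise ValueError("max_clusters must be at least 2")
    best_labels, best_score = None, -np.inf
    for k in range(2, max_clusters + 1):
        # n_init and random_state are illustrative choices, not documented ones.
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        score = calinski_harabasz_score(X, labels)
        if score < best_score:
            break  # the index dropped, so keep the preceding model
        best_labels, best_score = labels, score
    return best_labels + 1  # cluster numbers start with 1


# Synthetic data: three well-separated groups of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
labels = auto_cluster(X, max_clusters=6)
```

With groups this far apart, the index rises from two to three clusters and falls at four, so the loop stops early and never fits the models with five or six clusters.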
Parameters
- max_clusters – The maximum number of allowed clusters, integer (default 10).
- columns – Columns to be used for clustering.
 
Example: DCPY.AUTOCLUST(10, sum([Gross Sales]), sum([No of customers])) used as a calculation for the Color field of the Scatterplot visualization.
Input data
- Numeric variables are automatically scaled to zero mean and unit variance.
- Character variables are transformed to numeric values using one-hot encoding.
- Dates are treated as character variables, so they are also one-hot encoded.
- The size of the input data is not limited, but character or date variables with many categories rapidly increase the dimensionality.
- Rows that contain missing values in any of their columns are dropped.
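The input handling listed above can be approximated with pandas as follows. This is a sketch under assumptions, not the product's preprocessing code: the helper name prepare_features and the sample columns are hypothetical, and the use of the population standard deviation (ddof=0) is an assumption, since the text only says "unit variance".

```python
import pandas as pd


def prepare_features(df):
    """Drop incomplete rows, scale numeric columns, one-hot encode the rest."""
    complete = df.dropna()  # rows with any missing value are dropped
    numeric = complete.select_dtypes(include="number")
    other = complete.drop(columns=numeric.columns)
    # Zero mean and unit variance; a constant column would divide by zero here.
    scaled = (numeric - numeric.mean()) / numeric.std(ddof=0)
    # Dates are treated like character values, so cast them to strings first.
    encoded = pd.get_dummies(other.astype(str), dtype=float)
    return pd.concat([scaled, encoded], axis=1)


# Hypothetical sample data: the last row has a missing value.
df = pd.DataFrame({
    "sales": [10.0, 20.0, 30.0, None],
    "region": ["north", "south", "north", "south"],
    "day": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02"]
    ),
})
features = prepare_features(df)
```

Each distinct region and each distinct date becomes its own column, which is why variables with many categories inflate the dimensionality so quickly.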
 
Result
- A column of integer values starting with 1, where each number identifies the cluster assigned to a record (row) by the algorithm.
- Rows that were dropped (due to missing values) are not assigned to any cluster.
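One way to picture the result column is to align the cluster numbers back to the original rows, leaving dropped rows empty. The sketch below is an assumption about a possible representation, not the product's output format: the helper attach_clusters, the sample indexes, and the use of a missing value for "not assigned" are hypothetical.

```python
import pandas as pd


def attach_clusters(original_index, kept_index, labels):
    """Align 1-based cluster numbers to all rows; dropped rows stay empty."""
    clusters = pd.Series(labels, index=kept_index, dtype="Int64")
    return clusters.reindex(original_index)


# Hypothetical case: row 2 was dropped because of a missing value.
clusters = attach_clusters(
    original_index=[0, 1, 2, 3],
    kept_index=[0, 1, 3],
    labels=[1, 2, 1],
)
```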
 
Key usage points
- Use it when you want a quick clustering without specifying the number of clusters, or without any prior knowledge of the underlying data.
- The same assumptions and advantages as for the K-means algorithm apply. For details, see DCPY.KMEANSCLUST(n_clusters, random_state, init, n_init, max_iter, columns).
- The first local maximum of the Calinski-Harabasz index may produce clusters that are not optimal.
- Depending on the data distribution, and in particular when the data has no natural clusters, the Calinski-Harabasz index may often be highest for the model with the maximum specified number of clusters, which also leads to sub-optimal results.
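The last two points can be illustrated with the stopping rule alone. In the sketch below, the function first_local_maximum is a hypothetical helper and the index values are made up for illustration; they were not measured on real data.

```python
def first_local_maximum(scores):
    """Position of the first score that is followed by a lower one."""
    for i in range(1, len(scores)):
        if scores[i] < scores[i - 1]:
            return i - 1  # the preceding model is kept
    return len(scores) - 1  # never dropped: the last model is kept


# Made-up index values for models with 2 to 6 clusters.
ks = [2, 3, 4, 5, 6]
early_peak = [120.0, 150.0, 140.0, 300.0, 280.0]
always_rising = [50.0, 60.0, 70.0, 80.0, 90.0]

# Stops at 3 clusters, although 5 clusters would score higher.
chosen_early = ks[first_local_maximum(early_peak)]
# Never drops, so the maximum allowed number of clusters (6) is chosen.
chosen_rising = ks[first_local_maximum(always_rising)]
```

In the first case the search stops before reaching the best-scoring model; in the second the result is decided by max_clusters rather than by the data.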
 
For the full list of algorithms, see Data science built-in algorithms.