DCPY.AUTOCLUST(max_clusters, columns)
Automated clustering uses the K-means algorithm to cluster the data. It starts by fitting a model with two clusters and continues up to a specified maximum number of clusters, evaluating clustering quality after each model with the Calinski-Harabasz index. If the index for a model is lower than for the preceding model, the preceding model is selected as the optimal clustering (the first local maximum of the Calinski-Harabasz index).
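The stopping rule described above can be sketched in a few lines of Python. This is only an illustration of the first-local-maximum logic, not the product's implementation: the function name `pick_cluster_count` and the `score_for_k` callable (which stands in for "fit K-means with k clusters and return its Calinski-Harabasz index") are hypothetical.

```python
from typing import Callable


def pick_cluster_count(score_for_k: Callable[[int], float], max_clusters: int = 10) -> int:
    """Return the cluster count at the first local maximum of the score.

    score_for_k is a hypothetical stand-in for fitting a model with k
    clusters and returning its Calinski-Harabasz index.
    """
    if max_clusters < 2:
        raise ValueError("max_clusters must be at least 2")
    best_k = 2
    best_score = score_for_k(best_k)
    for k in range(3, max_clusters + 1):
        score = score_for_k(k)
        if score < best_score:
            # The index dropped, so the preceding model is kept.
            break
        best_k, best_score = k, score
    return best_k


# Made-up index values for k = 2..6, for illustration only.
scores = {2: 10.0, 3: 25.0, 4: 40.0, 5: 35.0, 6: 50.0}
chosen = pick_cluster_count(scores.__getitem__, max_clusters=6)
print(chosen)
```

With these made-up values the search stops at k = 5 and keeps k = 4, even though k = 6 would have scored higher. Models beyond the first drop are never fitted.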
Parameters
- max_clusters – The maximum number of allowed clusters, integer (default 10).
- columns – Columns to be used for clustering.
Example: DCPY.AUTOCLUST(10, sum([Gross Sales]), sum([No of customers])) used as a calculation for the Color field of the Scatterplot visualization.
Input data
- Numeric variables are automatically scaled to zero mean and unit variance.
- Character variables are transformed to numeric values using one-hot encoding.
- Dates are treated as character variables, so they are also one-hot encoded.
- The size of the input data is not limited, but many categories in character or date variables rapidly increase the dimensionality.
- Rows that contain missing values in any of their columns are dropped.
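The preprocessing rules above can be illustrated with a small standard-library sketch. It is not the product's code: the function name `prepare`, the use of `None` for missing values, and the choice of the population standard deviation for scaling are assumptions made for this example.

```python
from datetime import date
from statistics import fmean, pstdev


def prepare(rows):
    """Drop incomplete rows, scale numeric columns, one-hot encode the rest.

    rows is a list of dicts sharing the same keys.
    Returns (kept_indices, feature_names, matrix).
    """
    # Keep only rows with no missing value in any column.
    kept = [i for i, row in enumerate(rows) if all(v is not None for v in row.values())]
    complete = [rows[i] for i in kept]
    if not complete:
        return kept, [], []
    names, columns = [], []
    for col in complete[0]:
        values = [row[col] for row in complete]
        if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
            # Numeric column: zero mean, unit variance (population std assumed).
            mean, std = fmean(values), pstdev(values)
            std = std or 1.0  # constant column: avoid division by zero
            names.append(col)
            columns.append([(v - mean) / std for v in values])
        else:
            # Strings and dates alike are treated as categories.
            labels = [str(v) for v in values]
            for category in sorted(set(labels)):
                names.append(f"{col}={category}")
                columns.append([1.0 if lab == category else 0.0 for lab in labels])
    matrix = [list(r) for r in zip(*columns)]
    return kept, names, matrix


rows = [
    {"sales": 100.0, "region": "North", "day": date(2024, 1, 1)},
    {"sales": 300.0, "region": "South", "day": date(2024, 1, 2)},
    {"sales": None, "region": "North", "day": date(2024, 1, 1)},
]
kept, names, matrix = prepare(rows)
print(names)
```

Three input columns become five features here, because each distinct region and each distinct date gets its own column. This is why high-cardinality character or date columns inflate the dimensionality so quickly.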
Result
- Column of integer values starting with 1, where each number corresponds to a cluster assigned to each record (row) by the algorithm.
- Rows that were dropped (due to missing values) are not assigned to any cluster.
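As a rough illustration of the result shape, the sketch below maps cluster labels back onto the original row positions. It assumes, hypothetically, that the clustering step returns zero-based labels for the retained rows only, and it uses `None` to represent "not assigned"; the function name `assign_labels` is invented for this example.

```python
def assign_labels(n_rows, kept_indices, zero_based_labels):
    """Build the result column: 1-based cluster numbers, None for dropped rows."""
    result = [None] * n_rows
    for idx, label in zip(kept_indices, zero_based_labels):
        result[idx] = label + 1  # shift so cluster numbering starts at 1
    return result


# Five input rows; rows 1 and 3 were dropped because of missing values.
column = assign_labels(5, kept_indices=[0, 2, 4], zero_based_labels=[0, 1, 0])
print(column)
```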
Key usage points
- Use it when you want a quick clustering without specifying the number of clusters, or when you have no prior knowledge of the underlying data.
- The same assumptions and advantages as for the K-means algorithm apply. For details, see DCPY.KMEANSCLUST(n_clusters, random_state, init, n_init, max_iter, columns).
- The first local maximum of the Calinski-Harabasz index may yield clusters that are not optimal, because a model with more clusters could still score higher.
- Depending on the data distribution, and especially when the data has no natural clusters, the Calinski-Harabasz index may often be highest for the model with the maximum number of clusters specified, which also leads to suboptimal results.
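To check whether either caveat applies to a particular dataset, it can help to compute the index for several candidate clusterings and look at the whole curve rather than only the selected model. The sketch below uses the standard definition of the Calinski-Harabasz index (between-cluster dispersion divided by within-cluster dispersion, each normalized by its degrees of freedom). The function name and the handling of zero within-cluster dispersion are choices made for this example, not documented product behavior.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index: (B / (k - 1)) / (W / (n - k))."""
    n = len(points)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if not 1 < k < n:
        raise ValueError("the number of clusters must be between 2 and n - 1")

    def centroid(members):
        return [sum(coord) / len(members) for coord in zip(*members)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    between = 0.0  # B: dispersion of cluster centroids around the overall centroid
    within = 0.0   # W: dispersion of points around their own cluster centroid
    for members in clusters.values():
        center = centroid(members)
        between += len(members) * sq_dist(center, overall)
        within += sum(sq_dist(p, center) for p in members)
    if within == 0.0:
        # Ratio is undefined; returning infinity is this sketch's choice.
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = calinski_harabasz(points, [1, 1, 2, 2])  # groups nearby points
poor = calinski_harabasz(points, [1, 2, 1, 2])  # groups distant points
print(good, poor)
```

Higher values indicate compact, well-separated clusters, so the first labeling scores far above the second.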
For the whole list of algorithms, see Data science built-in algorithms.