DCPY.ENVELOPE(contamination, columns)
Elliptic Envelope is a multivariate outlier detection technique that strongly assumes the underlying data follow a Gaussian distribution. Under this assumption, it fits a robust covariance estimate to the data and identifies samples that lie far from the fitted distribution as outliers.
Parameters
- contamination – Approximate proportion of outliers in the dataset, used as the threshold for the decision function. Float in the open interval (0, 1); default 0.1.
- columns – Dataset columns or custom calculations.
Example: DCPY.ENVELOPE(0.1, sum([Gross Sales]), sum([No of customers])) used as a calculation for the Color field of the Scatterplot visualization.
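To illustrate how the contamination value can act as a threshold, the following Python sketch scores each row of a two-column dataset by its squared Mahalanobis distance from the mean and flags the farthest fraction as outliers. It is not the implementation behind DCPY.ENVELOPE: the function names are hypothetical, and for brevity it uses the ordinary sample covariance instead of a robust covariance estimate, so it only approximates the idea.

```python
def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (x, y) point from the sample mean.

    Assumption: plain sample covariance, not a robust estimate.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, written out for a 2x2 matrix
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


def label_outliers(points, contamination=0.1):
    """Return 1 (inlier) or -1 (outlier) for each point."""
    distances = mahalanobis_sq_2d(points)
    n_outliers = round(contamination * len(points))
    # Rank rows from farthest to nearest and flag the top fraction.
    ranked = sorted(range(len(points)), key=lambda i: distances[i], reverse=True)
    flagged = set(ranked[:n_outliers])
    return [-1 if i in flagged else 1 for i in range(len(points))]


# Nine tightly grouped points plus one distant point.
grid = [(float(x), float(y)) for x in (-1, 0, 1) for y in (-1, 0, 1)]
labels = label_outliers(grid + [(10.0, 10.0)], contamination=0.1)
```

With ten rows and a contamination of 0.1, exactly one row is flagged, and here it is the distant point. A larger contamination value would flag more rows even if they sit close to the rest of the data.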
Input data
- Numeric variables are automatically scaled to zero mean and unit variance.
- Character variables are transformed to numeric values using one-hot encoding.
- Dates are treated as character variables, so they are also one-hot encoded.
- The size of the input data is not limited, but character or date variables with many categories rapidly increase the dimensionality.
- Rows that contain a missing value in any column are dropped.
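The preparation steps listed above can be sketched in plain Python. This is an illustration only, not the product's code: the function name `prepare`, the use of `None` to mark a missing value, and the choice of the population standard deviation for scaling are all assumptions.

```python
from statistics import mean, pstdev


def prepare(rows):
    """Return (kept_row_indices, feature_rows) for a list of equal-length tuples."""
    # Drop every row that has a missing value in any column.
    kept = [i for i, row in enumerate(rows) if None not in row]
    complete = [rows[i] for i in kept]
    features = [[] for _ in complete]
    for column in zip(*complete):
        if all(isinstance(v, (int, float)) for v in column):
            # Numeric column: scale to zero mean and unit variance.
            mu, sigma = mean(column), pstdev(column)
            for feature_row, v in zip(features, column):
                feature_row.append((v - mu) / sigma if sigma else 0.0)
        else:
            # Character or date column: one indicator column per distinct value.
            categories = sorted({str(v) for v in column})
            for feature_row, v in zip(features, column):
                feature_row.extend(1.0 if str(v) == c else 0.0 for c in categories)
    return kept, features


rows = [
    (1.0, "a", "2024-01-01"),
    (3.0, "b", "2024-01-02"),
    (None, "a", "2024-01-03"),  # dropped: missing numeric value
    (5.0, "a", "2024-01-03"),
]
kept, features = prepare(rows)
```

In this small example the three input columns become six feature columns (one numeric, two for the character variable, three for the dates), which shows how variables with many distinct values inflate the dimensionality.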
Result
- A column of values: 1 for an inlier and -1 for an outlier.
- Rows dropped from the input data because of missing values receive a missing value instead of an inlier/outlier label.
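How the result column lines up with the original rows can be sketched in Python as follows. The helper name `align_labels` and the `kept` index list are hypothetical; the sketch assumes the caller recorded which row positions survived the missing-value filter and uses `None` to stand for a missing value.

```python
def align_labels(n_rows, kept, labels):
    """Place each label at its original row position; dropped rows stay None."""
    result = [None] * n_rows
    for row_index, label in zip(kept, labels):
        result[row_index] = label
    return result


# Four input rows, of which row 2 was dropped for containing a missing value.
column = align_labels(4, kept=[0, 1, 3], labels=[1, -1, 1])
outlier_rows = [i for i, label in enumerate(column) if label == -1]
```

Keeping the result the same length as the input means it can be used directly alongside the original rows, for example to color points in a chart.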
Key usage points
- The data need to be Gaussian distributed; otherwise the method loses reliability.
- Works well when the dataset does not contain many variables.
For the whole list of algorithms, see Data science built-in algorithms.