Outlier detection

Note: This functionality is still in beta.

An outlier is a data point that differs significantly from the rest of the data. The presence of outliers in training data can degrade the generalization performance of machine learning models. For this reason, we aim to remove outliers from signals.

Usage

signal.process_outliers(strategy='online', threshold=3, above_threshold=False, output_scores=False, impute=False)

This method processes a signal and removes or imputes outliers.

Note that the ‘online’ strategy is deprecated and will be removed in the future. This strategy also doesn’t support the arguments above_threshold, output_scores and impute.

Parameters:
  • signal – The signal to process.

  • strategy – Two strategies are currently available: (1) ‘online’, which looks at sudden changes in a signal to classify a data point as an outlier; (2) ‘residual’, which looks at residuals from a regression model to detect an outlier.

  • threshold – The cutoff on the outlier decision scores: a data point whose score is above the threshold is flagged as an outlier. The scores are z-scores, that is, the number of standard deviations from the expected value.

  • above_threshold – If set to True, outputs the data points whose scores are above the threshold instead of those below it. This means only the outliers are kept and all other values are removed.

  • output_scores – If set to True, outputs the z-scores instead of the data values.

  • impute – If set to True, outliers are replaced with the expected value instead of being removed. This setting cannot be combined with above_threshold or output_scores.
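
The way threshold, above_threshold, output_scores and impute interact can be easier to follow in code. The following is a minimal sketch in plain Python, not the library's implementation. The helper name apply_outlier_options is hypothetical, and the z-scores and expected values are assumed to be already computed. It also assumes that the absolute z-score is compared against the threshold, and it uses None to mark a removed point.

```python
from typing import List, Optional, Sequence


def apply_outlier_options(
    values: Sequence[float],
    expected: Sequence[float],
    scores: Sequence[float],
    threshold: float = 3.0,
    above_threshold: bool = False,
    output_scores: bool = False,
    impute: bool = False,
) -> List[Optional[float]]:
    """Hypothetical helper that shapes the output as the parameters describe."""
    if impute and (above_threshold or output_scores):
        raise ValueError(
            "impute cannot be combined with above_threshold or output_scores"
        )

    result: List[Optional[float]] = []
    for value, exp, score in zip(values, expected, scores):
        # Assumption: the absolute z-score is compared against the threshold.
        is_outlier = abs(score) > threshold
        # By default the non-outliers are kept; above_threshold flips this.
        keep = is_outlier if above_threshold else not is_outlier
        if keep:
            result.append(score if output_scores else value)
        elif impute:
            result.append(exp)  # replace the outlier with its expected value
        else:
            result.append(None)  # None marks a removed point
    return result


# Made-up illustration data: the third point is far from its expected value.
values = [10.0, 11.0, 50.0, 12.0]
expected = [10.5, 11.0, 11.5, 12.0]
scores = [-0.2, 0.0, 4.1, 0.0]

print(apply_outlier_options(values, expected, scores))
print(apply_outlier_options(values, expected, scores, above_threshold=True))
print(apply_outlier_options(values, expected, scores, impute=True))
```

With the default flags the third point is dropped, with above_threshold=True it is the only point kept, and with impute=True it is replaced by its expected value of 11.5.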

To process outliers in the sales signal, call:

actual('sales').process_outliers()

To specify a particular strategy and threshold, use:

actual('sales').process_outliers(strategy='online', threshold=3)
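
For intuition about the ‘residual’ strategy listed under Parameters, the sketch below computes z-scores from the residuals of a regression model. It is an illustration only and makes several assumptions: the regression model is a straight line fitted by ordinary least squares over the point index, the spread is the population standard deviation of the residuals, and the function name residual_z_scores is hypothetical. The model actually used by process_outliers is not specified here and may differ.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def residual_z_scores(values: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (expected, z_scores) from a straight-line fit over the index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a line")

    xs = list(range(n))
    x_mean, y_mean = mean(xs), mean(values)

    # Ordinary least-squares slope and intercept.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    # The fitted line gives the expected value at each point.
    expected = [intercept + slope * x for x in xs]
    residuals = [y - e for y, e in zip(values, expected)]

    # Least-squares residuals have zero mean, so dividing by their
    # standard deviation yields z-scores.
    spread = pstdev(residuals)
    if math.isclose(spread, 0.0, abs_tol=1e-12):
        return expected, [0.0] * n  # the data lies on the line: no outliers
    return expected, [r / spread for r in residuals]


# Made-up sales figures: a steady upward trend with one spike at index 6.
sales = [10, 11, 12, 13, 14, 15, 60, 17, 18, 19, 20, 21]
expected, scores = residual_z_scores(sales)
flagged = [i for i, s in enumerate(scores) if abs(s) > 3]
print(flagged)
```

With a threshold of 3, only the spike at index 6 is flagged in this example. Note that in very short series a single point cannot reach a large z-score, because the outlier itself inflates the standard deviation.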