ICS for multivariate outlier detection with application to quality control

Highlights

Automatically detecting multivariate outliers in fields with high reliability standards.

Combining the advantages of Mahalanobis distance and Principal Component Analysis.

Simple and efficient procedure in the context of a small proportion of outliers.

Reducing the number of false positives compared to competitors.

An R package is available: ICSOutlier.

Abstract

In fields with high reliability standards, such as automotive, avionics or aerospace, the detection of anomalies is crucial. An efficient methodology for automatically detecting multivariate outliers is introduced. It takes advantage of the remarkable properties of the Invariant Coordinate Selection (ICS) method, which leads to an affine invariant coordinate system in which the Euclidean distance corresponds to a Mahalanobis Distance (MD) in the original coordinates. The limitations of MD are highlighted using theoretical arguments in a context where the dimension of the data is large. Owing to the resulting dimension reduction, ICS is expected to improve the power of outlier detection rules such as MD-based criteria. The paper includes practical guidelines for using ICS in the context of a small proportion of outliers. The use of the regular covariance matrix and the so-called matrix of fourth moments as the scatter pair is recommended. This choice combines simplicity of implementation with the possibility of deriving theoretical results. The selection of relevant invariant components through parallel analysis and normality tests is addressed. A simulation study confirms the good properties of the proposal and provides a comparison with Principal Component Analysis and MD. The performance of the proposal is also evaluated on two real data sets using a user-friendly R package accompanying the paper.
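
The link between ICS and MD stated in the abstract can be made concrete with a short derivation. The notation below ($x_i$, $\bar{x}$, $B$, $D$, $z_i$, $k$) is assumed here for illustration and is not taken from the abstract. The normalisation of the fourth-moment matrix is the usual convention and may differ from the one used in the full text. This is a sketch of the standard ICS construction, not a restatement of the paper's results.

```latex
% Scatter pair: regular covariance and matrix of fourth moments
% (n observations x_i in R^p, sample mean \bar{x}; notation assumed).
\[
  \mathrm{COV} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i-\bar{x})(x_i-\bar{x})^{\top},
  \qquad
  \mathrm{COV}_4 = \frac{1}{(p+2)\,n}\sum_{i=1}^{n} d_i^{2}\,(x_i-\bar{x})(x_i-\bar{x})^{\top},
\]
\[
  d_i^{2} = (x_i-\bar{x})^{\top}\,\mathrm{COV}^{-1}\,(x_i-\bar{x})
  \quad\text{(squared MD of observation } i\text{)}.
\]

% ICS: simultaneous diagonalisation of the two scatter matrices.
\[
  B\,\mathrm{COV}\,B^{\top} = I_p,
  \qquad
  B\,\mathrm{COV}_4\,B^{\top} = D = \mathrm{diag}(\rho_1,\dots,\rho_p),
  \quad \rho_1 \ge \dots \ge \rho_p .
\]

% Invariant coordinates and their Euclidean norm.
\[
  z_i = B\,(x_i-\bar{x}),
  \qquad
  \lVert z_i\rVert^{2}
  = (x_i-\bar{x})^{\top} B^{\top}B\,(x_i-\bar{x})
  = (x_i-\bar{x})^{\top}\,\mathrm{COV}^{-1}\,(x_i-\bar{x})
  = d_i^{2},
\]
% since B COV B^T = I_p implies COV^{-1} = B^T B.

% Distance restricted to k selected components (k < p), written here for
% the first k components as an illustration:
\[
  \mathrm{ICSD}_i^{2}(k) = \sum_{j=1}^{k} z_{ij}^{2} \;\le\; d_i^{2}.
\]
```

With all $p$ components retained, the squared Euclidean norm in the invariant coordinate system equals the squared MD. Keeping only $k$ selected components gives a distance computed in a lower-dimensional space, which is the dimension reduction the abstract refers to.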

Keywords

Affine invariance
Mahalanobis distance
Principal component analysis
Scatter estimators
Unsupervised outlier identification

Supplementary material is provided. It contains some scatterplot matrices to visualize the six simulated data sets and the R code to generate these data sets. It also includes the R code to reproduce the results of Table 4 for the Reliability data and the HTP data sets.
