KDD-2001 Tutorial

Data Mining for Outliers with Robust Statistics

R. Douglas Martin, University of Washington and Insightful Corp.


Abstract

Outliers are atypical observations that are clearly separated from the bulk of the data in various metrics to be described in this tutorial. Outliers in your data may be due to recording errors or system noise of various kinds, and as such need to be cleaned as part of the extract, transform, clean and load (ETCL) phase of the data mining/KDD process. On the other hand, an outlier or small group of outliers may be error-free recordings that represent the most important part of your data and deserve further careful inspection; e.g., an outlier might represent an unusually high response to a particular advertising campaign, or an unusually effective dose-response combination in a drug therapy. Either way, it is important in data mining to detect outliers in large amounts of highly multidimensional data. The multidimensional aspect of the data makes this task particularly challenging, because highly important and influential outliers can be completely hidden in one-dimensional views of the data, which renders ineffective any one-dimensional outlier detection based on scanning one field (variable, attribute) at a time. Furthermore, classical statistical methods and most "traditional" data mining methods lack robustness toward outliers and have very little power to detect them. Indeed, the topic of outlier detection has received relatively little attention in the data mining and KDD literature. Yet there is a very large body of statistical literature on robust methods that are aimed at fitting models in a way that is not much influenced by outliers, and that as a consequence provide very reliable methods of detecting outliers in multidimensional data. There is a large opportunity for data mining and KDD to benefit by borrowing and extending many of the well-established robust methods in the statistical literature. The goal of this tutorial is to stimulate efforts in this direction by data mining and KDD researchers and practitioners. We do so by providing: (a) a few striking and motivating examples of outlier detection with robust methods, (b) a comprehensive survey of robust statistical methods of detecting outliers, including key literature references, and (c) an outline of the open computational and scalability problems in data mining for outliers with robust methods, with a few suggestions of methods that are expected to work in scalable data mining applications.

Presenter

R. Douglas Martin is Professor of Statistics at the University of Washington, and Chief Scientist of Insightful Corporation, both in Seattle, WA. He is an author of many publications in the areas of time series and robust statistical outlier detection and modeling methods, including two invited Royal Statistical Society Discussion papers and one invited Annals of Statistics Discussion paper. Martin’s recent research focus has been on applications of statistical methods in finance and financial engineering, particularly the application of robust methods in finance, and on visual data mining. Martin was a Professor of Electrical Engineering at the University of Washington prior to becoming a Professor of Statistics in 1981. He was a consultant in the Mathematics and Statistics Research Center at Bell Laboratories from 1975 to 1985, and he was Chair of the Statistics Department at the University of Washington from 1983 to 1986. In 1987 Martin founded StatSci, Inc., to develop and market the S-PLUS system for data analysis, based on the S language from Bell Laboratories. In 1993 Martin sold StatSci to MathSoft, Inc., located in Cambridge, Mass., and in 2001 the MathSoft business was consolidated in Seattle as Insightful Corporation, an S-PLUS data analysis and data mining solutions and services business. Martin holds the B.S.E. and Ph.D. degrees in Electrical Engineering from Princeton University.