|
Outliers are atypical observations that are clearly
separated from the bulk of the data according to various metrics described in this
tutorial. Outliers in your data may be due to recording errors or system
noise of various kinds, and as such need to be cleaned as part of the extract,
transform, clean and load (ETCL) phase of the data mining/KDD process. On the
other hand, an outlier or small group of outliers may be entirely error-free
recordings that represent the most important part of your data and deserve
further careful inspection, e.g., an outlier might represent an unusually high
response to a particular advertising campaign, or an unusually effective
dose-response combination in a drug therapy. Either way, it is quite
important in data mining to detect outliers in large amounts of highly
multi-dimensional data. The multidimensional aspect of the data makes this
task particularly challenging. This is because highly important and
influential outliers can be completely hidden in one-dimensional views of the
data, which renders ineffective one-dimensional outlier detection based on
scanning one field (variable, attribute) at a time. Furthermore, classical
statistical methods and most "traditional" data mining methods lack
robustness toward outliers, and have very little power to detect outliers.
Indeed, the topic of outlier detection has received relatively little
attention in the data mining and KDD literature. Yet there is a very large
body of statistical literature on robust methods that are aimed at fitting
models in a way that is not much influenced by outliers, and as a consequence
provide very reliable methods of detecting outliers in multidimensional data.
There is a large opportunity for data mining and KDD to benefit by borrowing
and extending many of the well-established robust methods in the statistical
literature. The goal of this tutorial is to stimulate efforts in this
direction by data mining and KDD researchers and practitioners. We do so by
providing: (a) a few striking and motivating examples of outlier detection
with robust methods, (b) a comprehensive survey of robust statistical methods
of detecting outliers, including key literature references, and (c) an outline
of the open computational and scalability problems in data mining for outliers
with robust methods, with a few suggestions of methods that are expected to work
in scalable data mining applications.
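
As a small illustration of how an outlier can be hidden in every one-dimensional view of the data, the following sketch builds two strongly correlated fields and adds one point that is unremarkable in each field taken alone but lies far from the joint pattern. The data set, the function names and the one-step trimming scheme are assumptions made for this illustration only; the trimming is a crude stand-in for an established robust covariance estimator such as the minimum covariance determinant, and is not a method taken from the tutorial.

```python
import numpy as np

def mahalanobis_sq(X, center, cov):
    """Squared Mahalanobis distance of each row of X from center."""
    diff = X - center
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)

def trimmed_distances(X, keep=0.9):
    """Crude one-step robust distances (illustrative only).

    Rank the points by classical distance, re-estimate the center and
    covariance from the closest `keep` fraction, then recompute the
    distances of all points from that trimmed fit.
    """
    d0 = mahalanobis_sq(X, X.mean(axis=0), np.cov(X, rowvar=False))
    h = int(np.ceil(keep * len(X)))
    core = np.argsort(d0)[:h]
    return mahalanobis_sq(X, X[core].mean(axis=0), np.cov(X[core], rowvar=False))

# Hypothetical data: 61 points close to the line y = x, plus one point
# (2, -2) whose coordinates are each well inside the observed range.
x = np.linspace(-3.0, 3.0, 61)
clean = np.column_stack([x, x + 0.1 * np.sin(7.0 * x)])
X = np.vstack([clean, [2.0, -2.0]])

# One field at a time: absolute z-scores of every point in each field.
z = np.abs(X - X.mean(axis=0)) / X.std(axis=0)

# Both fields jointly: distances from the trimmed fit, compared with the
# 97.5% chi-square quantile, which for 2 degrees of freedom is -2*ln(0.025).
d2 = trimmed_distances(X)
cutoff = -2.0 * np.log(0.025)
flagged = np.flatnonzero(d2 > cutoff)

print("z-scores of the added point:", z[-1])
print("its squared distance:", d2[-1], "cutoff:", cutoff)
print("flagged rows:", flagged)
```

Scanning either field on its own with a conventional rule such as |z| > 3 would not flag the added point, whereas its distance from the trimmed joint fit is far beyond the cutoff.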
|
|
R. Douglas Martin is Professor of Statistics at the University of
Washington, and Chief Scientist of Insightful Corporation, both in
Seattle, WA. He is an author of many publications in the areas of
time series and robust statistical outlier detection and modeling
methods, including two invited Royal Statistical Society Discussion
papers and one invited Annals of Statistics Discussion paper.
Martin’s recent research focus has been on applications of statistical
methods in finance and financial engineering, particularly the
application of robust methods in finance, and on visual data mining.
Martin was a Professor of Electrical Engineering at the University of
Washington prior to becoming a Professor of Statistics in 1981. He
was a consultant in the Mathematics and Statistics Research Center at
Bell Laboratories from 1975 to 1985, and he was Chair of the
Statistics Department at the University of Washington from 1983 to
1986. In 1987 Martin founded StatSci, Inc., to develop and market the
S-PLUS system for data analysis, based on the S language from Bell
Laboratories. In 1993 Martin sold StatSci to MathSoft, Inc., located
in Cambridge, Mass., and in 2001 the MathSoft business was
consolidated in Seattle as Insightful Corporation, an S-PLUS data
analysis and data mining solutions and services business. Martin
holds the B.S.E. and Ph.D. degrees in Electrical
Engineering from Princeton University.
|