Ramakrishnan Srikant
IBM Almaden Research Center
650 Harry Road, San Jose, CA 95120
A fruitful direction for future data
mining research will be the development of techniques that incorporate privacy
concerns. Specifically, we address the following question: can we develop
accurate models over aggregated data while preserving privacy at the level of
individual data records?
To illustrate the idea of
privacy-preserving data mining, we consider the concrete case of building a
decision-tree classifier from training data in which the values of individual
records have been perturbed. The
perturbed records look very different from the original records, and the
distribution of perturbed values also differs markedly from the original
distribution. While it is not
possible to accurately estimate original values in individual data records, we
propose a novel Bayesian reconstruction procedure to accurately estimate the
distribution of original data values. By
using these reconstructed distributions, we are able to build classifiers whose
accuracy is close to the accuracy of classifiers built with the original data.
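
As a rough illustration of the perturb-then-reconstruct idea, the following Python sketch adds random noise to each value and then iteratively applies Bayes' rule to estimate the distribution of the original values over fixed-width bins. The abstract does not specify the noise distribution or the details of the procedure, so the Gaussian noise, the binning, the uniform starting estimate, and the names perturb and reconstruct are all assumptions made for this example; it should be read as a minimal sketch of one possible reconstruction, not as the procedure itself.

```python
import math
import random


def perturb(values, sigma, rng):
    """Return each value plus independent zero-mean Gaussian noise (assumed noise model)."""
    return [v + rng.gauss(0.0, sigma) for v in values]


def gaussian_pdf(x, sigma):
    """Density of the assumed zero-mean Gaussian noise at x."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def reconstruct(perturbed, sigma, lo, hi, bins, iterations=50):
    """Estimate the original distribution as bin probabilities over [lo, hi].

    Only the perturbed values and the noise distribution are used;
    the original values are never seen by this function.
    """
    width = (hi - lo) / bins
    mids = [lo + (k + 0.5) * width for k in range(bins)]
    # like[i][k]: noise density that would carry bin midpoint k to perturbed value i
    like = [[gaussian_pdf(w - m, sigma) for m in mids] for w in perturbed]
    probs = [1.0 / bins] * bins  # start from a uniform estimate
    for _ in range(iterations):
        new = [0.0] * bins
        for row in like:
            denom = sum(l * p for l, p in zip(row, probs))
            if denom == 0.0:
                continue  # this value is not explained by any bin; skip it
            for k in range(bins):
                # posterior probability that this record's original value lies in bin k
                new[k] += row[k] * probs[k] / denom
        total = sum(new)
        probs = [v / total for v in new]
    return mids, probs


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical data: original values cluster in two separate ranges.
    original = [rng.uniform(20, 30) for _ in range(1000)]
    original += [rng.uniform(60, 70) for _ in range(1000)]
    noisy = perturb(original, 10.0, rng)
    mids, probs = reconstruct(noisy, 10.0, 0.0, 100.0, 20)
    for m, p in zip(mids, probs):
        print(f"{m:6.1f}  {p:.3f}")
```

The sketch runs a fixed number of iterations for simplicity; a stopping criterion based on the change between successive estimates would be a natural refinement. A classifier could then be trained on the reconstructed bin probabilities in place of the original values.
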
Next, we focus on two temporal mining
problems: mining sequential patterns and discovering trends over time.
We discuss the applicability of the above techniques to these problems,
and show that protecting privacy at the individual level while still discovering
sequential patterns and trends is a challenging open research problem.