Sequential Patterns, Trends, and Privacy

Ramakrishnan Srikant

 IBM Almaden Research Center

650 Harry Road, San Jose, CA 95120

  

Invited Talk Abstract

 

A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question: can we develop accurate models over aggregated data while preserving privacy at the level of individual data records?

 

To illustrate the idea of privacy-preserving data mining, we consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have been perturbed. The perturbed records look very different from the original records, and the distribution of their values is also very different from the original distribution. While it is not possible to accurately estimate the original values in individual data records, we propose a novel Bayesian reconstruction procedure that accurately estimates the distribution of the original data values. Using these reconstructed distributions, we are able to build classifiers whose accuracy is close to that of classifiers built with the original data.
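
The reconstruction idea can be illustrated with a small Python sketch. The abstract does not specify the noise model or the update rule, so everything below is an assumption made for illustration: additive uniform noise of known half-width, equal-width bins, an iterative Bayes update over bin probabilities, and the hypothetical function names `perturb`, `histogram`, and `reconstruct_distribution`. It is not the procedure presented in the talk.

```python
import random


def perturb(values, halfwidth, rng):
    """Add independent uniform noise from [-halfwidth, +halfwidth] to each value."""
    return [v + rng.uniform(-halfwidth, halfwidth) for v in values]


def histogram(values, lo, hi, bins):
    """Fraction of values per equal-width bin; out-of-range values go to the edge bins."""
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / (hi - lo) * bins)
        counts[min(bins - 1, max(0, idx))] += 1
    return [c / len(values) for c in counts]


def reconstruct_distribution(perturbed, lo, hi, bins, halfwidth, iterations=100):
    """Estimate original bin probabilities from perturbed values (assumed update rule)."""
    width = (hi - lo) / bins
    edges = [lo + j * width for j in range(bins + 1)]
    # Likelihood of observing w if the original value lay uniformly in bin j:
    # overlap of the noise window [w - halfwidth, w + halfwidth] with that bin.
    likelihood = [
        [max(0.0, min(w + halfwidth, edges[j + 1]) - max(w - halfwidth, edges[j]))
         / (width * 2 * halfwidth) for j in range(bins)]
        for w in perturbed
    ]
    estimate = [1.0 / bins] * bins  # start from a uniform prior
    for _ in range(iterations):
        updated = [0.0] * bins
        used = 0
        for row in likelihood:
            evidence = sum(l * p for l, p in zip(row, estimate))
            if evidence <= 0.0:
                continue  # record is incompatible with the current estimate
            used += 1
            # Posterior over bins for this record, accumulated across records.
            for j in range(bins):
                updated[j] += row[j] * estimate[j] / evidence
        if used == 0:
            break
        estimate = [u / used for u in updated]
    return estimate


# Synthetic example: two clusters of original values, then perturbation.
rng = random.Random(0)
original = ([rng.uniform(20, 30) for _ in range(500)]
            + [rng.uniform(60, 70) for _ in range(500)])
noisy = perturb(original, 12.0, rng)
print("original     ", histogram(original, 0, 100, 10))
print("perturbed    ", histogram(noisy, 0, 100, 10))
print("reconstructed", reconstruct_distribution(noisy, 0, 100, 10, 12.0))
```

In this sketch only the perturbed values and the noise half-width are passed to `reconstruct_distribution`; the individual original values are never used, which mirrors the setting described above where the distribution is recovered but individual records are not.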

 

Next, we turn to two temporal mining problems: mining sequential patterns and discovering trends over time. We discuss the applicability of the above techniques to these problems and show that protecting privacy at the individual level while still discovering sequential patterns and trends is a challenging open research problem.