KDD-2001 Tutorial

Advances in Decision Tree Construction

Johannes Gehrke, Cornell University
Wei-Yin Loh, University of Wisconsin


Abstract
In this tutorial, we survey recent developments in learning decision tree models for classification and regression.

In the first part of the tutorial, the audience will learn about the alternatives in classification tree construction and the trade-offs involved. Topics include split selection methods, tree pruning, limitations of current decision tree construction tools (such as bias in split selection), and recent advances that address these limitations. Although we will survey the most popular methods, including work from all KDD sub-communities, we will emphasize recent work that spans the statistics, machine learning, and database literature.

The second part of the tutorial will focus on regression trees, where the dependent or response variable takes ordered numerical values. We will discuss the conceptual and computational problems associated with extending the classification tree ideas to the regression context. The problems include (i) selection biases due to missing data values and to competition between categorical and numerically ordered predictor variables, (ii) piecewise-constant versus piecewise-linear models, and (iii) least-squares versus non-least-squares regression models. Some new algorithms designed to solve these problems will be introduced.

Presenters
Johannes Gehrke is an Assistant Professor in the Department of Computer Science at Cornell University. Gehrke's research interests are in data mining and database systems, and he leads the HIMALAYA Data Mining Project and the COUGAR Sensor Database System Project at Cornell. The recipient of an IBM Faculty Award and the James and Mary Tien Excellence in Teaching Award, Gehrke is the author of numerous publications on data mining and database systems. He is the co-author of the textbook "Database Management Systems (Second Edition)", published by McGraw-Hill in 1999, and the co-author of two patents in the area of data mining.

Wei-Yin Loh is a Professor of Statistics at the University of Wisconsin, Madison. He holds a PhD in Statistics from the University of California, Berkeley and is a Fellow of the American Statistical Association and the Institute of Mathematical Statistics. He is a past recipient of an IBM Junior Faculty Research Fellowship and a Benjamin Smith Reynolds Teaching Excellence Award. Loh's research interests are in statistical inference and methodology. His current focus is on the development of decision-tree algorithms and computer programs for statistical classification, function estimation, and data exploration. He is the developer or co-developer of several decision-tree methods, including the GUIDE regression tree algorithm and the FACT, CRUISE, and QUEST classification tree algorithms.