The tutorials will be held in the Pacific Ballroom at the Newport Beach Marriott Hotel and Tennis Club, Newport Beach, California, Thursday, August 14, 1997, from 8.00am to 6.00pm. The first day of the conference (August 14) will consist entirely of tutorials.


Attendance at the tutorials is included in the regular conference registration fee for all attendees: there are no extra fees required for tutorial attendance.


Data Mining and KDD: An Overview
Usama Fayyad, Microsoft Research and Evangelos Simoudis, IBM.
8.00am - 10.00am

We present a basic tutorial of this new and emerging area and emphasize relations to constituent communities including statistics, databases, pattern recognition, learning, and visualization. The tutorial provides a basic overview of the KDD process for extracting knowledge from databases and covers the basics of each step in the process including: data warehousing, selection and cleaning, data transformation, data mining, evaluation, and visualization. We also cover a sampling of successful applications and outline challenges and issues to be addressed.


Modelling Data and Discovering Knowledge
David Hand, Open University, UK.
10.30am - 12.30pm

Our aim is to extract knowledge from large bodies of data. The size of these bodies means that we cannot do it unaided, but must use fast computers, applying sophisticated statistical tools. Attempts to automate the process of knowledge extraction date from at least the early 1980s, with the work on statistical expert systems. We examine this work, noting its successes and failures and, especially, what researchers in data mining and knowledge discovery can learn from those efforts. We examine what data are, what information is, and what knowledge is. We contrast modelling with discovery, especially in the context of large data sets. We examine high level modelling issues, such as overfitting, generalisability, overmodelling, and model evaluation. And we examine high level exploration issues such as the discovery of accidental artefacts. The confluence of computing and statistics in some areas provides a nice backdrop against which to examine these issues, and we briefly discuss neural networks and classification trees from these two perspectives.


Text Mining - Theory and Practice
Ronen Feldman, Bar-Ilan University, Israel.
10.30am - 12.30pm

Knowledge Discovery in Databases (KDD) focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. In this tutorial we will present the general theory of Text Mining and will demonstrate several systems that use these principles to enable interactive exploration of large textual collections. We will describe generic techniques for text categorization and information extraction that are used by these systems. The systems that will be presented are KDT, which is a system for Knowledge Discovery in Texts; FACT, which discovers associations amongst keywords labeling the items in a collection of textual documents; and the Text Explorer, which is a system that provides a high level language for interactive exploration of textual collections.

We will present a general architecture for text mining and will outline the algorithms and data structures behind the systems. We will give special emphasis to incremental algorithms and to efficient data structures.


Exploratory Data Analysis using Interactive Dynamic Graphics
Deborah Swayne, Bell Communications Research and Dianne Cook, Iowa State University.
1.30pm - 3.30pm

Researchers and software designers in the field of data mining are just beginning to make extensive use of graphical methods. Interactive dynamic data visualization has been explored in the field of statistics for over twenty years, and we propose that much of what has been learned in statistics is relevant for data mining.

This class is an introduction to interactive data visualization as it is practiced as part of exploratory data analysis. The XGobi software, publicly available dynamic visualization software, will be used in the analysis of examples from biology, business, physics, engineering, and telecommunications.

The examples will illustrate a set of general visualization principles which are embodied in specific methods such as brushing and identification of points in simple scatterplots, three dimensional rotations, rotations in higher dimensions such as the grand tour, and directed searches in higher dimensions for interesting two dimensional views using projection pursuit and manual control.


OLAP and Data Warehousing
Surajit Chaudhuri, Microsoft Research and Umesh Dayal, Hewlett Packard Labs.
1.30pm - 3.30pm

On-Line Analytical Processing (OLAP) and Data Warehousing technologies enable enterprises to gain competitive advantage by exploiting the ever-growing amount of data that is collected and stored in corporate databases and files for better and faster decision making. Over the past few years, these technologies have experienced explosive growth, both in the number of products and services offered, and in the extent of coverage in the trade press. Vendors (including all database companies) are paying increasing attention to all aspects of decision support. The area opens up interesting research directions, with ties to past work in database systems, but with different assumptions and requirements. Only very recently, however, has the database research community started to understand and address some of these issues. This tutorial presents an overview of OLAP and data warehousing, and an in-depth study of selected aspects. An outline of the tutorial follows:

1. Introduction: definitions, evolution, differences from OLTP, architectures.
2. Models and Tools: conceptual model for OLAP, front-end tools (e.g., multidimensional spreadsheets), database design (e.g., star and snowflake schema).
3. Database Server technologies for Decision Support Queries: specialized indexing techniques, specialized join and scan methods, data partitioning and use of parallelism, intelligent processing of aggregates, complex query processing, extensions to SQL, ROLAP vs. MOLAP.
4. Other Services for OLAP/Data warehousing: data cleaning, loading and refresh, tools for warehouse, system and process management, metadata management and the role of repository.
5. State of Commercial Practice.
6. Research Issues.

The target audience is researchers and developers interested in learning about the concepts, products and the technical innovations in the area of decision support technologies.


Visual Techniques for Exploring Databases
Daniel Keim, University of Munich.
4.00pm - 6.00pm

For data exploration to be effective, it is important to include the human in the exploration process and combine the flexibility, creativity, and general knowledge of the human with the enormous storage capacity and the computational power of today's computers. Visual database exploration aims at integrating the human in the exploration process, applying human perceptual abilities to the large data sets available in today's computer systems. The basic idea of visual data exploration is to present the data in some visual form, allowing the human to get insight into the data and draw conclusions. Visual data exploration techniques have proven to be of high value in exploratory data analysis and they also have a high potential for exploring large databases. Visual database exploration is especially powerful for the first steps of the data mining process, namely understanding the data and generating hypotheses about the data, but it may also significantly contribute to the actual knowledge discovery by guiding the search using visual feedback.

The goal of the tutorial is to show the potential of visualization technology for exploring large databases. The tutorial provides an overview of the state of the art in data visualization and provides a classification of the existing data visualization techniques. Besides describing each of the classes, the tutorial focuses on new developments in data visualization which are relevant to the area of knowledge discovery, and describes a wide range of recently developed techniques for visualizing large amounts of arbitrary multi-attribute data which does not have any two- or three-dimensional semantics and therefore does not lend itself to an easy display. A detailed comparison shows the strengths and weaknesses of the existing techniques and reveals potential for further improvement. Several examples demonstrate the benefits of visualization techniques for exploring databases. The tutorial concludes with an overview of existing database exploration and visualization systems, including research prototypes as well as commercial products.


Statistical Models for Categorical Response Data
William DuMouchel, AT&T Research.
4.00pm - 6.00pm

This tutorial will survey the most common models and methods statisticians use to fit and test relationships among categorical (discrete) data. Most of these techniques are described in statistics texts such as Categorical Data Analysis, by Alan Agresti (Wiley 1990), and are widely available in popular computer packages such as SAS and Splus. Therefore it is almost de rigueur for someone with a new classification technique to compare the proposal to one or more of these standard methods. The tutorial will focus on loglinear and logistic regression models, and related models such as probit, Poisson regression, and survival models. In the short time available, priority will be given to explaining why these techniques are so popular among statisticians, and how the basic models have been extended to handle variables having more than two categories and cases where some of the variables have continuous or ordinal scales. Examples of model fitting, model search and model comparison using SAS and Splus will be presented and discussed.


Dr. Usama Fayyad is a Senior Researcher at Microsoft Research, in the Decision Theory & Adaptive Systems Group. His research interests include knowledge discovery in large databases, data mining, machine learning, statistical pattern recognition, and clustering. After receiving the Ph.D. degree in 1991, he joined the Jet Propulsion Laboratory (JPL), California Institute of Technology, where he worked until 1996. At JPL, he headed the Machine Learning Systems Group where he developed data mining systems for analysis of large scientific databases. He remains affiliated with JPL as a Distinguished Visiting Scientist. Fayyad received the JPL 1993 Lew Allen Award for Excellence in Research, and the 1994 NASA Exceptional Achievement Medal. He was program co-chair of KDD-94 and KDD-95 (the First International Conference on Knowledge Discovery and Data Mining). He is general chair of KDD-96, an editor-in-chief of the journal Data Mining and Knowledge Discovery, and co-editor of the new MIT Press book Advances in Knowledge Discovery and Data Mining (1996). He serves on the Advisory Board of the Communications of the ACM (CACM) and has recently guest-edited a special issue of CACM on Data Mining.

Dr. Evangelos Simoudis is Vice President, Global Business Intelligence Solutions - IBM North America, where he is responsible for the development and deployment of data mining and decision support solutions to IBM's customers worldwide. Prior to joining IBM, Evangelos worked at Lockheed Corporation where he led the company's data mining research, and was responsible for the design and commercial introduction of the Recon data mining system, as well as its application to the financial and retail markets. Dr. Simoudis received a B.A. in Physics from Grinnell College, a B.S. in Electrical Engineering from California Institute of Technology, an M.S. in Computer Science from the University of Oregon, and a Ph.D. in Computer Science from Brandeis University. Before Lockheed, Dr. Simoudis worked as a Principal Research Staff member at Digital Equipment Corporation's Artificial Intelligence Center where he conducted research on machine learning and pattern recognition, knowledge-based systems, and distributed artificial intelligence. His research work at DEC has been incorporated in products for engineering design and diagnostic tasks. Dr. Simoudis has written extensively on data mining and machine learning, and is the Editor in Chief of the Artificial Intelligence Review.

Dr. David Hand is Professor of Statistics at the Open University. His research interests include the foundations of statistics, statistical computing, and multivariate statistics, the latter especially as applied to classification problems. His applications interests include medicine, finance, and psychology. He is Editor-in-Chief of 'Statistics and Computing' and has published fourteen books, the most recent of which is 'Construction and Assessment of Classification Rules', Wiley, January 1997.

Dr. Ronen Feldman is a lecturer at the Mathematics and Computer Science Department of Bar-Ilan University in Israel. He received his B.Sc. in Math, Physics and Computer Science from the Hebrew University, and his Ph.D. in Computer Science from Cornell University. His main research is in the area of Machine Learning and Data Mining. In particular, he is now pioneering the application of data mining techniques to textual collections. He is currently coordinating several research projects for developing dedicated text mining systems. These systems work on plain text collections and on the Internet. He has authored numerous papers on scheduling, theory revision, text mining, and association generation.

Deborah Swayne has worked at Bellcore since that company's inception in 1985, and is currently a member of the Statistics and Data Mining Research Group. Her research focusses on software methods for visualizing data. She is one of the authors of the XGobi software, originally developed at Bellcore. She has a Bachelor's degree in African Linguistics from the University of Wisconsin at Madison, and a Master's degree in Statistics from Rutgers University.

Dr. Dianne Cook is an Assistant Professor in the Department of Statistics, Iowa State University. She received her PhD from Rutgers University in May 1993, and has conducted research into dynamic statistical graphics. Her interests include using these methods for understanding high-dimensional data, and adapting them for analyzing geographically referenced data with multiple measurements at each site.

Dr. Daniel Keim is one of the leading experts in the field of visual database exploration. Dr. Keim developed several new techniques which use visualization technology for the purpose of exploring large databases, and he was the chief engineer in designing the VisDB system - a visual database exploration system. Dr. Keim has published extensively on visualization and data mining, and he has given presentations on related issues at other large conferences. Dr. Keim received his diploma (equivalent to an MS degree) in Computer Science from the University of Dortmund in 1990 and his Ph.D. in Computer Science from the University of Munich in 1994. Currently, he is a teaching and research assistant (approximately equivalent to an assistant professor) at the Institute for Computer Science of the University of Munich, Germany.

Dr. Surajit Chaudhuri is a researcher in the Database Research Group of Microsoft Research. From 1992 to 1995, he was a Member of the Technical Staff at Hewlett-Packard Laboratories, Palo Alto. He did his B.Tech at the Indian Institute of Technology, Kharagpur and his Ph.D. at Stanford University. Surajit has published in SIGMOD, VLDB and PODS in the area of optimization of queries for decision-support and multimedia systems. He served on the program committees for VLDB 1996 and the International Conference on Database Theory (ICDT), 1997. He is a vice-chair of the Program Committee for the International Conference on Data Engineering (ICDE), 1997. In addition to query processing and optimization, Surajit is interested in the areas of data mining, database design and uses of databases for nontraditional applications.

Dr. Umesh Dayal is a senior researcher at Hewlett-Packard Labs., Palo Alto, California. His current research interests are in distributed information systems, workflow management, data mining, and information management issues related to the emerging global information infrastructure. Previously, he was at Digital Equipment Corporation's Cambridge Research Laboratory; Chief Scientist at Xerox Advanced Information Technology and Computer Corporation of America; and an Assistant Professor at the University of Texas-Austin. His earlier research was on object-oriented and active database systems, temporal databases, heterogeneous distributed database systems, query processing, and view mapping problems. He received the Ph.D. and S.M. degrees from Harvard University, the M.E. and B.E. degrees from the Indian Institute of Science, and the B.Sc. degree from Osmania University, India. Umesh has served as an Associate Editor of ACM Transactions on Database Systems. He is a member of the Board of Trustees of the VLDB Endowment, and the Board of the International Foundation on Cooperative Information Systems. He has chaired and served on the Program Committees for numerous conferences. Most recently, he was Industry Track Program Chair for the International Conference on Data Engineering, 1996, and the American Program Chair for the International Conference on Very Large Data Bases, 1995. He is a co-editor of the books: On Object-Oriented Database Systems (Springer Verlag, 1991) and Distributed Object Management (Morgan Kaufmann, 1993), and has published over 75 research papers. He is a member of the ACM and IEEE.

Dr. William DuMouchel is a member of the technical staff at AT&T Labs - Research. He received the Ph.D. in Statistics from Yale University in 1971 and has taught the analysis of categorical data and other topics in theoretical and applied statistics at UC Berkeley, U. of Michigan, U. of London, MIT, and Columbia U. From 1987 to 1992 he served as the Chief Statistical Scientist at BBN Software Products Corporation, helping to design and develop their software advisory systems for data analysis and experimental design, RS/Explore and RS/Discover. He is an elected fellow of the American Statistical Association and the Institute of Mathematical Statistics, and has published in many areas of Statistics, including estimation for Stable Laws, environmental and insurance risk analysis, Bayesian meta-analysis, statistical computing, and design of experiments.


For further information, contact Padhraic Smyth, University of California, Irvine (KDD-97 Tutorials Chair).
