The tutorials will be held in the Pacific Ballroom at the
Newport Beach Marriott Hotel and Tennis Club, Newport Beach, California,
Thursday, August 14, 1997, from 8.00am to 6.00pm.
The first day of the conference (August 14) will consist entirely
of tutorials.
|


Attendance at the tutorials is included in the regular conference registration
fee for all attendees: there are no extra fees required for tutorial
attendance.
|


Data Mining and KDD: An Overview
Usama Fayyad, Microsoft Research and
Evangelos Simoudis, IBM.
8.00am - 10.00am
We present a basic tutorial of this new and emerging area and
emphasize relations to constituent communities including statistics,
databases, pattern recognition, learning, and visualization. The
tutorial provides a basic overview of the KDD process for extracting
knowledge from databases and covers the basics of each step in the
process including: data warehousing, selection and cleaning,
data transformation, data mining, evaluation, and visualization.
We also cover a sampling of successful applications and outline
challenges and issues to be addressed.
|


Modelling Data and Discovering Knowledge
David Hand, Open University, UK.
10.30am - 12.30pm
Our aim is to extract knowledge from large bodies of data. The size of
these bodies means that we cannot do it unaided, but must use fast computers,
applying sophisticated statistical tools. Attempts to automate the process
of knowledge extraction date from at least the early 1980s, with the work on
statistical expert systems. We examine this work, noting its successes and
failures and, especially, what researchers in data mining and knowledge
discovery can learn from those efforts. We examine what data are, what
information is, and what knowledge is. We contrast modelling with
discovery, especially in the context of large data sets. We examine high
level modelling issues, such as overfitting, generalisability,
overmodelling, and model evaluation. And we examine high level exploration
issues such as the discovery of accidental artefacts. The confluence of
computing and statistics in some areas provides a nice backdrop against
which to examine these issues, and we briefly discuss neural networks and
classification trees from these two perspectives.
|


Text Mining - Theory and Practice
Ronen Feldman, Bar-Ilan University, Israel.
10.30am - 12.30pm
Knowledge Discovery in Databases (KDD) focuses on the computerized
exploration of large amounts of data and on the discovery of interesting
patterns within them. While most work on KDD has been concerned with
structured databases, there has been little work on handling the huge
amount of information that is available only in unstructured textual form.
In this tutorial we will present the general theory of Text Mining and will
demonstrate several systems that use these principles to enable interactive
exploration of large textual collections. We will describe generic
techniques for text categorization and information extraction that are used
by these systems. The systems that will be presented are KDT, a
system for Knowledge Discovery in Texts; FACT, which discovers associations
amongst keywords labeling the items in a collection of textual documents;
and the Text Explorer, a system that provides a high-level language
for interactive exploration of textual collections.
We will present a general architecture for text mining and will outline the
algorithms and data structures behind the systems. We will give special
emphasis to incremental algorithms and to efficient data structures.
|


Exploratory Data Analysis using Interactive Dynamic Graphics
Deborah Swayne, Bell Communications Research
and
Dianne Cook, Iowa State University.
1.30pm - 3.30pm
Researchers and software designers in the field of data mining
are just beginning to make extensive use of graphical methods.
Interactive dynamic data visualization has been explored
in the field of statistics for over twenty years, and we
propose that much of what has been learned in statistics is
relevant for data mining.
This class is an introduction to interactive data visualization as
it is practiced as part of exploratory data analysis. The XGobi
software, publicly available dynamic visualization software, will
be used in the analysis of examples from biology, business,
physics, engineering, and telecommunications.
The examples will illustrate a set of general visualization principles
which are embodied in specific methods such as brushing and
identification of points in simple scatterplots, three dimensional
rotations, rotations in higher dimensions such as the grand tour, and
directed searches in higher dimensions for interesting two dimensional
views using projection pursuit and manual control.
|


OLAP and Data Warehousing
Surajit Chaudhuri,
Microsoft Research and
Umesh Dayal, Hewlett Packard Labs.
1.30pm - 3.30pm
On-Line Analytical Processing (OLAP) and Data Warehousing technologies
enable enterprises to gain competitive advantage by exploiting the
ever-growing amount of data that is collected and stored in corporate
databases and files for better and faster decision making. Over the
past few years, these technologies have experienced explosive growth,
both in the number of products and services offered, and in the extent
of coverage in the trade press. Vendors (including all database companies)
are paying increasing attention to all aspects of decision support.
The area opens up interesting research directions, with ties to past
work in database systems, but with different assumptions and
requirements. Only very recently, however, has the database research
community started to understand and address some of these issues.
This tutorial presents an overview of OLAP and data warehousing, and an
in-depth study of selected aspects. An outline of the tutorial follows:
1. Introduction: definitions, evolution, differences from OLTP, architectures.
2. Models and Tools:
   conceptual model for OLAP,
   front-end tools (e.g., multidimensional spreadsheets),
   database design (e.g., star and snowflake schema).
3. Database Server Technologies for Decision Support Queries:
   specialized indexing techniques,
   specialized join and scan methods,
   data partitioning and use of parallelism,
   intelligent processing of aggregates,
   complex query processing,
   extensions to SQL,
   ROLAP vs. MOLAP.
4. Other Services for OLAP/Data Warehousing:
   data cleaning, loading and refresh,
   tools for warehouse, system and process management,
   metadata management and the role of repository.
5. State of Commercial Practice.
6. Research Issues.
The target audience is
researchers and developers interested in learning about the concepts,
products and the technical innovations in the area of decision support
technologies.
|


Visual Techniques for Exploring Databases
Daniel Keim, University of Munich.
4.00pm - 6.00pm
For data exploration to be effective, it is important to include the human in
the exploration process and combine the flexibility, creativity, and general
knowledge of the human with the enormous storage capacity and the
computational power of today's computers. Visual database exploration aims
at integrating the human in the exploration process, applying human perceptual
abilities to the large data sets available in today's computer systems. The
basic idea of visual data exploration is to present the data in some visual
form, allowing the human to gain insight into the data and draw conclusions.
Visual data exploration techniques have proven to be of high value in
exploratory data analysis and they also have a high potential for exploring
large databases. Visual database exploration is especially powerful for the
first steps of the data mining process, namely understanding the data and
generating hypotheses about the data, but it may also significantly
contribute to the actual knowledge discovery by guiding the search using
visual feedback.
The goal of the tutorial is to show the potential of visualization technology
for exploring large databases. The tutorial provides an overview of the
state-of-the-art in data visualization and provides a classification of the
existing data visualization techniques. Besides describing each of the
classes, the tutorial focuses on new developments in data visualization,
which are relevant to the area of knowledge discovery, and describes a wide
range of recently developed techniques for visualizing large amounts of
arbitrary multi-attribute data which does not have any two- or
three-dimensional semantics and therefore does not lend itself to an easy
display. A detailed comparison shows the strengths and weaknesses of the
existing techniques and reveals potential for further improvement. Several
examples demonstrate the benefits of visualization techniques for exploring
databases. The tutorial concludes with an overview of existing database
exploration and visualization systems, including research prototypes as well
as commercial products.
|


Statistical Models for Categorical Response Data
William DuMouchel, AT&T Research.
4.00pm - 6.00pm
This tutorial will survey the most common models and methods statisticians
use to fit and test relationships among categorical (discrete) data. Most
of these techniques are described in statistics texts such as
Categorical Data Analysis, by Alan Agresti (Wiley, 1990), and are widely available in
popular computer packages such as SAS and Splus. Therefore it is almost de
rigueur for someone with a new classification technique to compare the
proposal to one or more of these standard methods. The tutorial will focus
on loglinear and logistic regression models, and related models such as
probit, Poisson regression, and survival models. In the short time
available, priority will be given to explaining why these techniques are so
popular among statisticians, and to how the basic models have been extended
to handle variables having more than two categories, or cases in which some of the
variables have continuous or ordinal scales. Examples of model fitting,
model search and model comparison using SAS and Splus will be presented and
discussed.
|


Dr. Usama Fayyad is a Senior Researcher in the Decision Theory & Adaptive
Systems Group at Microsoft Research. His research interests include
knowledge discovery in large databases, data mining, machine
learning, statistical pattern recognition, and clustering. After
receiving his Ph.D. degree in 1991, he joined the Jet Propulsion Laboratory (JPL),
California Institute of Technology, where he worked until 1996. At JPL, he headed
the Machine Learning Systems Group, where he developed
data mining systems for analysis of large scientific databases. He
remains affiliated with JPL as a Distinguished Visiting Scientist.
Fayyad received the JPL 1993 Lew Allen Award for Excellence in Research
and the 1994 NASA Exceptional Achievement Medal. He was program co-chair
of KDD-94 and KDD-95 (the First International Conference
on Knowledge Discovery and Data Mining). He is general chair of KDD-96, an editor-in-chief
of the journal Data Mining and Knowledge Discovery, and co-editor of the new MIT Press book
Advances in Knowledge Discovery and Data Mining (1996). He serves on the
Advisory Board of the Communications of the ACM (CACM) and has recently
guest-edited a special issue of CACM on Data Mining.
Dr. Evangelos Simoudis
is Vice President, Global Business Intelligence
Solutions - IBM North America, where he is responsible for the
development and deployment of data mining and decision support
solutions to IBM's customers worldwide.
Prior to joining IBM, Evangelos worked at Lockheed Corporation where
he led the company's data mining research, and was responsible for the
design and commercial introduction of the Recon data mining system,
as well as its application to the financial and retail
markets.
Dr. Simoudis received a B.A. in Physics from Grinnell College, a B.S. in
Electrical Engineering from California Institute of Technology, an
M.S. in Computer Science from the University of Oregon, and a Ph.D. in
Computer Science from Brandeis University.
Before Lockheed, Dr. Simoudis worked as a Principal
Research Staff member at Digital Equipment Corporation's Artificial
Intelligence Center where he conducted research on machine learning
and pattern recognition, knowledge-based systems, and distributed
artificial intelligence. His research work at DEC has been
incorporated in products for engineering design and diagnostic tasks.
Dr. Simoudis has written extensively on data mining and machine
learning, and is the Editor in Chief of the Artificial
Intelligence Review.
Dr. David Hand is Professor of Statistics at the Open University.
His research interests include the foundations of statistics, statistical
computing, and multivariate statistics, the latter especially as applied to
classification problems. His applications interests include medicine,
finance, and psychology. He is Editor-in-Chief of 'Statistics and
Computing' and has published fourteen books, the most recent of which is
'Construction and Assessment of Classification Rules', Wiley, January 1997.
Dr. Ronen Feldman
is a lecturer at the Mathematics and Computer Science
Department of Bar-Ilan University in Israel. He received his B.Sc. in Math,
Physics and Computer Science from the Hebrew University, and his Ph.D. in
Computer Science from Cornell University. His main research is in the area
of Machine Learning and Data Mining. In particular, he is now pioneering
the application of data mining techniques to textual collections. He is
currently coordinating several research projects for developing dedicated
text mining systems. These systems work on plain text collections and on
the Internet. He has authored numerous papers on scheduling, theory revision,
text mining, and association generation.
Deborah Swayne
has worked at Bellcore since that company's
inception in 1985, and is currently a member of the Statistics
and Data Mining Research Group. Her research focusses on
software methods for visualizing data. She is one of the
authors of the XGobi software, originally developed at Bellcore.
She has a Bachelor's degree in African Linguistics from
the University of Wisconsin at Madison, and a Master's degree in
Statistics from Rutgers University.
Dr. Dianne Cook
is an Assistant Professor in the Department of Statistics,
Iowa State University. She received her PhD from Rutgers University in
May 1993, and has conducted research into dynamic statistical
graphics. Her interests include using these methods for understanding
high-dimensional data, and adapting them for analyzing geographically
referenced data with multiple measurements at each site.
Dr. Daniel Keim
is one of the leading experts in the field of visual database exploration.
Dr. Keim developed several new techniques which use visualization technology
for the purpose of exploring large databases, and he was the chief engineer in
designing the VisDB system - a visual database exploration system. Dr. Keim
has published extensively on visualization and data mining, and he has given
presentations on related issues at other large conferences.
Dr. Keim received his diploma (equivalent to an MS degree) in Computer
Science from the University of Dortmund in 1990 and his Ph.D. in Computer
Science from the University of Munich in 1994. Currently, he is a teaching
and research assistant (approximately equivalent to an assistant professor)
at the Institute for Computer Science of the University of Munich, Germany.
Dr. Surajit Chaudhuri
is a researcher in the Database Research Group of
Microsoft Research. From 1992 to 1995, he was a Member of the Technical
Staff at Hewlett-Packard Laboratories, Palo Alto. He did his B.Tech
at the Indian Institute of Technology, Kharagpur and his Ph.D. at
Stanford University. Surajit has published in SIGMOD, VLDB and PODS in
the area of optimization of queries for decision-support and multimedia
systems. He served on the program committees for VLDB 1996 and
the International Conference on Database Theory (ICDT), 1997. He is a
vice-chair of the Program Committee for the International Conference on
Data Engineering (ICDE), 1997. In addition to query processing and
optimization, Surajit is interested in the areas of data mining,
database design and uses of databases for nontraditional applications.
Dr. Umesh Dayal is a
senior researcher at Hewlett-Packard Labs., Palo
Alto, California. His current research interests are in distributed
information systems, workflow management, data mining, and information
management issues related to the emerging global information
infrastructure. Previously, he was at Digital Equipment Corporation's
Cambridge Research Laboratory; Chief Scientist at Xerox Advanced
Information Technology and Computer Corporation of America; and an
Assistant Professor at the University of Texas-Austin. His earlier
research was on object-oriented and active database systems, temporal
databases, heterogeneous distributed database systems, query processing,
and view mapping problems. He received the Ph.D. and S.M. degrees from
Harvard University, the M.E. and B.E. degrees from the Indian Institute
of Science, and the B.Sc. degree from Osmania University, India. Umesh
has served as an Associate Editor of ACM Transactions on Database
Systems. He is a member of the Board of Trustees of the VLDB Endowment,
and the Board of the International Foundation on Cooperative Information
Systems. He has chaired and served on the Program Committees for
numerous conferences. Most recently, he was Industry Track Program Chair
for the International Conference on Data Engineering, 1996, and the
American Program Chair for the International Conference on Very Large
Data Bases, 1995. He is a co-editor of the books: On Object-Oriented
Database Systems (Springer Verlag, 1991) and Distributed Object
Management (Morgan Kaufmann, 1993), and has published over 75 research
papers. He is a member of the ACM and IEEE.
Dr. William DuMouchel is a member of the technical staff at AT&T Labs - Research. He
received the Ph.D. in Statistics from Yale University in 1971 and has
taught the analysis of categorical data and other topics in theoretical and
applied statistics at UC Berkeley, U. of Michigan, U. of London, MIT, and
Columbia U. From 1987 to 1992 he served as the Chief Statistical Scientist
at BBN Software Products Corporation, helping to design and develop their
software advisory systems for data analysis and experimental design,
RS/Explore and RS/Discover. He is an elected fellow of the American
Statistical Association and the Institute of Mathematical Statistics, and
has published in many areas of Statistics, including estimation for Stable
Laws, environmental and insurance risk analysis, Bayesian meta-analysis,
statistical computing, and design of experiments.
|


For further information, contact Padhraic Smyth, University of California,
Irvine (KDD-97 Tutorials Chair).