KDD-2003 |
The Ninth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining |
| Washington, DC, USA | August 24 - 27, 2003 |
KDD 2003 - Tutorials | |
I. Information Extraction from the World Wide Web (William Cohen and Andrew McCallum)
II. Multi-Relational Data Mining (Saso Dzeroski and Luc De Raedt)
III. Data Mining for Computer Security (Carla Brodley and Philip Chan)
IV. Data Mining for Machine Learners (Johannes Gehrke and Jiawei Han)
V. Privacy-Preserving Data Mining (Chris Clifton)
VI. Sequence Data Mining Techniques and Applications (Sunita Sarawagi and Mark Craven)
VII. The Top 10 Data Mining Mistakes -- and How to Avoid Them (John F. Elder) |
| I. Information Extraction from the World Wide Web |
|
Title: Information Extraction from the Web
Presenters:
William W. Cohen
Andrew McCallum
Abstract: The Web is the world's largest knowledge base. However, its data is in a form intended for human reading, not manipulation, data mining and reasoning by computers. Information extraction is the process of filling fields in a database by automatically extracting sub-sequences of human readable text. Today's search engines return web pages. Tomorrow's search engines will use information extraction to return "things" (like people, jobs, companies, events), their relations, facts and trends. This tutorial will survey many of the sub-problems and methods of information extraction, including use of sliding-window and finite state machines, language and formatting features, generative and conditional models, rule-learning and Bayesian techniques. We will also discuss some related issues, such as association of data fields into records, reference matching and de-duplication.
Intended audience: This tutorial is intended to review recent research results, with an emphasis on techniques which are practical for difficult, large-scale, information extraction problems. A familiarity with statistical learning techniques, such as maximum likelihood estimation, Bayesian networks, and hidden Markov models, would be useful, but is not required.
Biographies: William Cohen received his bachelor's degree in Computer Science from Duke University in 1984 and his PhD in Computer Science from Rutgers University in 1990. From 1990 to 2000, Cohen worked at AT&T Labs-Research, then later at Whizbang Labs, a company specializing in information extraction from the World Wide Web. Cohen is now a Senior Research Scientist at Carnegie Mellon University, and also consults with Intelliseek, a company that uses learning techniques to extract and aggregate public statements of consumer sentiment. Cohen is currently an action editor for the Journal of Machine Learning Research, and has served as an editor for the journal Machine Learning and the Journal of Artificial Intelligence Research. He co-organized the 1994 International Machine Learning Conference and has served on more than 20 program or advisory committees. His research interests include information extraction, information integration and machine learning, particularly text categorization and learning from large datasets.
Andrew McCallum is an Associate Professor at the University of Massachusetts, Amherst. He was previously Vice President of Research and Development at WhizBang Labs, a company that used machine learning for information extraction from the World Wide Web. In the late 1990s, he was a Research Coordinator at Justsystem Pittsburgh Research Center. He received his PhD in computer science from the University of Rochester in 1995 and was a post-doctoral fellow at Carnegie Mellon University in 1996. He is on the editorial board of the Journal of Machine Learning Research and has co-organized numerous technical workshops. For the past eight years, McCallum has been active in research on statistical machine learning applied to text, especially information extraction, document classification, finite state models, and learning from combinations of labeled and unlabeled data.
|
| II. Multi-Relational Data Mining |
|
Title:
Multi-Relational Data Mining
Presenters:
Saso Dzeroski
Luc De Raedt
Abstract: Multi Relational Data Mining (MRDM) is the multi-disciplinary field dealing with knowledge discovery from relational databases consisting of multiple tables (relations). The adjective multi relational is used to emphasize the contrast to typical data mining approaches that look for patterns in a single relation of a database. The field aims at integrating results from existing fields such as inductive logic programming, KDD, data mining, machine learning and relational databases; producing new techniques for mining multi-relational data; and practical applications of such techniques. Present MRDM approaches consider all of the main data mining tasks, including association analysis, classification, clustering, learning probabilistic models and regression. The pattern languages used by single-table data mining approaches for these data mining tasks have been extended to the multiple-table case. Relational pattern languages now include relational association rules, relational classification rules, relational decision trees, and probabilistic relational models, among others. MRDM methods have been successfully applied across many application areas, ranging from the analysis of business data, through bioinformatics (including the analysis of complete genomes) and pharmacology (drug design) to Web mining (e.g., information extraction from Web sources).
Intended audience: The tutorial on Relational Data Mining will provide a coherent introduction to the basic concepts, techniques and applications of relational data mining. The tutorial is therefore intended for data mining researchers and practitioners, as well as domain experts interested in mining truly relational data (structured or multi-table data). Basic knowledge of relational databases and/or data mining techniques is helpful, but is not a prerequisite.
Biographies: Saso Dzeroski, Assistant Professor, is a Senior Scientific Associate of the Department of Intelligent Systems, Jozef Stefan Institute, Ljubljana, Slovenia. He has actively participated in the creation, shaping and setting of research directions of the area of inductive logic programming (ILP), and more recently relational data mining (RDM). He has performed research on the theory of ILP, techniques of ILP (handling noisy and numeric data) and their use on practical problems of knowledge discovery from environmental data, life sciences data and natural language resources. He is the co-author/co-editor of three books in the area of ILP: Inductive Logic Programming: Techniques and Applications, the first authored book on ILP; Learning Language in Logic, concerned with learning from natural language resources; and finally the book Relational Data Mining. The latter provides a comprehensive and coherent overview of basic concepts, techniques and applications of RDM.
Luc De Raedt is a full Professor at the Albert-Ludwigs-University Freiburg and head of the Machine Learning and Natural Language Processing Lab. His research interests are in Multi-Relational Data Mining, Inductive Logic Programming, Inductive Databases and Constraint-Based Data Mining, as well as their applications. He was the co-coordinator of the European projects on Inductive Logic Programming I and II and the initiator, organiser and program co-chairman of the first co-located ECML / PKDD conferences in Freiburg 2001. He has delivered tutorials on "Inductive Logic Programming" (and more recently on "Inductive Databases") on numerous occasions.
Saso Dzeroski and Luc De Raedt are editing a special issue of SIGKDD Explorations on Multi Relational Data Mining, to appear in 2003.
|
| III. Data Mining for Computer Security |
|
Title:
Data Mining for Computer Security
Presenters:
Carla E. Brodley
Philip Chan
Abstract: In the past few years there has been a monumental surge of interest in computer security in the private, public, university and government sectors. This tutorial consists of an introduction to computer security as well as an overview of existing research on applications of KDD to computer security. For KDD researchers and practitioners, the tutorial will provide background knowledge and opportunities for applying KDD to computer security. For computer security researchers and practitioners, it provides knowledge on how KDD can benefit and enhance computer security.
Audience: The expected audience for this tutorial is KDD practitioners interested in applying KDD to a new application domain, KDD researchers interested in KDD issues related to computer security, and computer security professionals/researchers/government employees interested in the state of the art in applications of KDD to security. The audience is not expected to be familiar with computer security. However, the audience is expected to have basic knowledge in computer science and KDD.
Biographies: Carla E. Brodley is an associate professor in the School of Electrical and Computer Engineering at Purdue University. She received her bachelor's degree from McGill University in 1985 and her PhD in computer science from the University of Massachusetts in 1994. In 2001 she served as program co-chair for the International Conference on Machine Learning. Currently she is an associate editor of the Journal of Artificial Intelligence Research and serves on the editorial board of the Journal of Machine Learning Research. She has worked in the areas of intrusion detection, hardware support for security, anomaly detection in networks, classifier formation, and feature selection for unsupervised learning. Prof. Brodley has taught undergraduate computer security classes at Purdue and graduate classes on data mining and machine learning.
Philip Chan is an associate professor of computer science at Florida Institute of Technology. He is currently on sabbatical leave at the Laboratory of Computer Science, Massachusetts Institute of Technology. He received his PhD, MS, and BS in computer science from Columbia University, Vanderbilt University, and Southwest Texas State University respectively. His main research interests include scalable adaptive methods, machine learning, data mining, distributed and parallel computing, and intelligent systems. His recent research focuses on machine learning techniques for anomaly detection. He has published papers and received support from DARPA in the area of machine learning and intrusion detection. Prof. Chan has served as a program committee member for the major data mining conferences (KDD, ICDM, and SDM) and is on the editorial board of the Journal of Database Management. With Prof. Chan's efforts, ICDM 2003 will be held in Melbourne, Florida. He co-edited the book "Advances in Parallel and Distributed Knowledge Discovery," AAAI/MIT Press, 2000.
|
| IV. Data Mining for Machine Learners |
|
Title: Data Mining for Machine Learners
Presenters:
Johannes Gehrke (Cornell University)
Abstract: This tutorial introduces an audience with a machine learning background to recent developments in data mining. In the first part of the tutorial, we survey methodologies for scaling data mining algorithms to large datasets, using decision tree construction as an example. We will develop the algorithms by reviewing the computational bottlenecks of existing algorithms, describing some general principles for overcoming such scalability bottlenecks, and then showing how existing algorithms from the data mining literature have applied these principles. In the second part of the tutorial, we survey recent developments on efficient and scalable methods for mining association rules, correlations, frequent, sequential, and structured patterns. Variations of concepts and techniques will be examined, including how to mine interesting patterns, how to perform constraint-based mining, and so on. We will show that frequent pattern mining is an important research direction with many applications, including classification, clustering, correlation analysis, and outlier detection.
Audience: This tutorial is targeted at practitioners and researchers with a background in machine learning who would like to gain a solid understanding of principles for scaling data mining algorithms, and a thorough introduction to scalable association rule mining algorithms.
Biographies: Johannes Gehrke is an Assistant Professor in the Department of Computer Science at Cornell University. He obtained his Ph.D. in computer science from the University of Wisconsin-Madison in 1999; his graduate studies were supported by a Fulbright fellowship and an IBM fellowship. Johannes' research interests are in the areas of data mining, data stream processing, and distributed data management for sensor networks and peer-to-peer networks. Johannes has received a National Science Foundation Career Award, an Alfred P. Sloan Fellowship, an IBM Faculty Award, and the Cornell College of Engineering James and Mary Tien Excellence in Teaching Award. He is the author of numerous publications on data mining and database systems, and he co-authored the undergraduate textbook "Database Management Systems" (McGraw-Hill, 2002, currently in its third edition). Johannes has given courses and tutorials on data mining and data stream processing at international conferences and on Wall Street, and he has extensive industry experience as a technical advisor.
Jiawei Han is a Professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign. Previously, he was an Endowed University Professor at Simon Fraser University, Canada. He has been working on research into data mining, data warehousing, database systems, spatial databases, deductive and object-oriented databases, and bio-medical databases, with over 200 journal and conference publications. He has chaired or served on many program committees of international conferences and workshops, including ACM SIGKDD conferences (2001 best paper award chair, 2002 student award chair), SIAM-Data Mining Conference (2001 and 2002 PC co-chair), ACM SIGMOD conferences (2000 exhibit program chair), and International Conference on Data Engineering (2004 and 2002 PC vice-chair). He has also been serving on the Board of Directors for the Executive Committee of ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). Jiawei has received an IBM Faculty Award, the Outstanding Contribution Award at the 2002 International Conference on Data Mining, and an ACM Service Award. He is the first author of the textbook "Data Mining: Concepts and Techniques" (Morgan Kaufmann, 2001). |
| V. Privacy-Preserving Data Mining |
|
Title:
Privacy Preserving Data Mining
How do we mine data when we aren't allowed to see it?
Presenter:
Dr. Chris Clifton, Associate Professor
Abstract: One of the key requirements of data mining is access to the relevant data. Privacy concerns can constrain such access, threatening to derail data mining projects. This tutorial discusses constraints imposed by privacy, including sources of constraints such as legal and societal issues, and how they impact the mining process. There has recently been a surge in solutions to these issues: data mining techniques that work in spite of constrained access to data [AS00, LP00, AA01, KC02, VC02, ESAG02, CE02, Geh03]. The broad types of methods for privacy preserving data mining will be discussed, and several example techniques will be detailed.
Problems: There are many data mining situations where these privacy and security issues arise. A few examples are:
- Identifying public health problem outbreaks (e.g., epidemics, biological warfare instances). There are many data collectors (insurance companies, HMOs, public health agencies). Individual privacy concerns limit the willingness of the data custodians to share data, even with government agencies such as the U.S. Centers for Disease Control. Can we accomplish the desired results while still preserving privacy of individual entities?
- Collaborative corporations or entities. Ford and Firestone shared a problem with a jointly produced product: Ford Explorers with Firestone tires. Ford and Firestone may have been able to use association rule techniques to detect problems earlier. This would have required extensive data sharing. Factors such as trade secrets and agreements with other manufacturers stand in the way of the necessary sharing. Could we obtain the same results, while still preserving the secrecy of each side's data? Government entities face similar problems, such as limitations on sharing between law enforcement, intelligence agencies, and tax collection.
- Multi-national corporations. An individual country's legal system may prevent sharing of customer data between a subsidiary and its parent.
These examples each define a different problem, or set of problems. The problems can be characterized by the following three parameters:
- Outcome: What is the desired data mining result? Do we want to cluster the data, as in the disease outbreak example? Are we looking for association rules identifying relationships among the attributes? There are several such data mining tasks, and each poses a new set of challenges.
- Control of Data: Who controls the data? Is each entity found only at a single site (as with medical insurance records)? Or do different sites contain different types of data (Ford on vehicles, Firestone on tires)?
- Privacy: What are the privacy requirements? If the concern is solely that values associated with an individual entity not be released (e.g., personally identifiable information) we can develop techniques that provably protect such information. In other cases, the notion of sensitive may not be known in advance. This would lead to human vetting of the intermediate results.
Sometimes it may be difficult (or impossible) to develop an exact solution that meets the privacy constraints. In data mining an approximate solution is often sufficient. The goal, then, is to obtain a solution with bounded error.
Intended Audience: This tutorial meets the needs of two audiences. Practitioners will learn to recognize when privacy and security concerns threaten to derail a data mining project. They will become able to develop alternative approaches that enable project completion while meeting privacy constraints. They will learn of technical solutions that enable those approaches, and of contacts within the research community who can develop new solutions where existing ones do not meet their needs. Researchers will become familiar with the current state of research in this relatively new area. They will learn the constraints that lead to privacy and security problems with data mining, enabling them to identify new challenges and develop new solutions in this rapidly developing field.
Prerequisites: Participants will need a general knowledge of data mining methods and techniques.
Material to be covered: The first half of this tutorial will discuss privacy and security constraints that are likely to affect data mining projects. The second half will introduce technical solutions to these problems: how to obtain data mining results without violating privacy, where normal data mining approaches would require a level of data access that violates privacy and security constraints.
Outline:
- Introduction: Examples of conflict between data mining and privacy, example of a simple solution to one example (distributed association rules in horizontally partitioned distributed data where global access to individual values is restricted).
- Privacy and Security Constraints: Individual privacy, Collection privacy, Limitations on use of results.
- Sources of Privacy and Security Constraints: Regulatory, Contractual, Secrecy.
- Classes of solutions: Data obfuscation. Example: United States Census Bureau Public Use Microdata [RAM96]. Summarization. Example: Statistical Queries [Den80]. Data separation. Horizontal vs. vertical partitioning of data. Example: Hospital records.
- Part 1 summary and break: When privacy concerns should be addressed, in terms of the CRoss Industry Standard Process for Data Mining (CRISP-DM). Articulating that data mining results rarely violate these concerns. The problem is for the data miners to obtain access to the data.
- Setup for Part 2: technical solutions to obtain data mining results in spite of limited data access.
- Data obfuscation based techniques: Classifier generation method of [AS00]. Association rule mining technique of [ESAG02].
- Data separation based techniques: Computing results without sharing data. Overview of secure multiparty computation. Definition of secure, semi-honest and malicious model. Brief overview of the general two-party protocol [Yao86]. Secure decision tree construction [LP00]. Secure association rules: Horizontally partitioned data [KC02], Vertically partitioned data [VC02].
Biography: Chris Clifton is an Associate Professor of Computer Science at Purdue University. He has a Ph.D. from Princeton University, and Bachelor's and Master's degrees from the Massachusetts Institute of Technology. Prior to joining Purdue in 2001, Chris had served as a Principal Scientist at The MITRE Corporation and as an Assistant Professor of Computer Science at Northwestern University. His research interests include data mining, data security, database support for text, and heterogeneous databases.
References:
[AS00] Rakesh Agrawal and Ramakrishnan Srikant. Privacy-preserving data mining. In Proceedings of the 2000 ACM SIGMOD Conference on Management of Data, Dallas, TX, May 14-19 2000. ACM.
[CE02] IEEE International Conference on Data Mining Workshop on Privacy, Security, and Data Mining, Chris Clifton and Vladimir Estivill-Castro, eds., Maebashi City, Japan, December 9, 2002.
[CM96] Chris Clifton and Don Marks. Security and privacy implications of data mining. In Workshop on Data Mining and Knowledge Discovery, pages 15-19, Montreal, Canada, June 2 1996. ACM SIGMOD.
[Den80] Dorothy E. Denning. Secure statistical databases with random sample queries. ACM Transactions on Database Systems, 5(3):291-315, September 1980.
[ESAG02] Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, and Johannes Gehrke. Privacy preserving mining of association rules. In The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 23-26, 2002.
[Geh03] Special Section on Privacy and Security in SIGKDD Explorations 4(2), January 2003, Johannes Gehrke, ed.
[KC02] Murat Kantarcioglu and Chris Clifton. Distributed association rule mining without sharing. Submitted to Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'02). ACM SIGMOD, June 2 2002.
[LP00] Yehuda Lindell and Benny Pinkas. Privacy preserving data mining. In Advances in Cryptology - CRYPTO 2000, pages 36-54. Springer-Verlag, August 20-24 2000.
[RAM96] Richard A. Moore, Jr. Controlled data-swapping techniques for masking public use microdata sets. Statistical Research Division Report Series RR 96-04, U.S. Bureau of the Census, Washington, DC., 1996.
[VC02] Jaideep Shrikant Vaidya and Chris Clifton. Privacy preserving association rule mining in vertically partitioned data. Submitted to The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26 2002.
[Yao86] Andrew C. Yao. How to generate and exchange secrets. In Proceedings of the 27th IEEE Symposium on Foundations of Computer Science, pages 162-167. IEEE, 1986. |
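Illustration (not part of the tutorial materials): the Privacy-Preserving Data Mining outline mentions computing a global result over horizontally partitioned data without any site revealing its individual values. A minimal sketch of one commonly described building block, a masked ring-based secure sum under the semi-honest model, is shown below. The function name `secure_sum`, the `modulus` parameter, and the example counts are hypothetical, and this is not claimed to be the protocol of [KC02] or any other cited work.

```python
import random


def secure_sum(local_counts, modulus=2**32, rng=None):
    """Sketch of a masked ring sum (assumed setup, semi-honest sites).

    Site 0 adds a secret random mask to its own count and passes the
    result on. Each later site adds its own count to the running value
    it receives, so it only ever sees a masked partial sum. Site 0
    finally removes the mask to recover the true total.
    Assumes the true total is smaller than `modulus`.
    """
    rng = rng or random.SystemRandom()
    mask = rng.randrange(modulus)            # known only to site 0
    running = (mask + local_counts[0]) % modulus
    for count in local_counts[1:]:           # message passed around the ring
        running = (running + count) % modulus
    return (running - mask) % modulus        # site 0 unmasks the total


# Hypothetical per-site support counts for one itemset.
site_counts = [120, 75, 310]
total_support = secure_sum(site_counts)
print(total_support)
```

In this sketch the sites are simulated inside one function for brevity; in a real deployment each addition would happen at a different site, and the scheme offers no protection against colluding neighbours.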
| VI. Sequence Data Mining Techniques and Applications |
|
Title: Sequence Data Mining Techniques and Applications
Presenters: Sunita Sarawagi
Mark Craven
Abstract: Many interesting real-life mining applications rely on modeling data as sequences of discrete multi-attribute records. Sequences are the primary data types in several sensor and monitoring applications. Mining models for network intrusion detection view data as sequences of TCP/IP packets. Text information extraction systems model the input text as a sequence of words and delimiters. Customer data mining applications profile buying habits of customers as a sequence of items purchased. In computational biology, DNA, RNA and protein data are all naturally modeled as sequences. Existing literature on sequence mining tends to be partitioned on application-specific boundaries. In this tutorial we will attempt to distill the basic operations and techniques that are common to these applications. These include conventional mining operations like classification, clustering and frequent pattern mining and sequence specific operations like tagging, segmentation and prediction. We will present case studies from network intrusion detection and biological sequence mining to illustrate these techniques.
Audience: Researchers interested in core learning techniques for sequences and in applications like bio-informatics, network intrusion detection, information extraction, time series analysis, sensor data mining and customer modeling. Basic knowledge of core mining operations is helpful but not mandatory.
Biographies: Sunita Sarawagi has been an associate professor at the School of Information Technology at IIT Bombay since February 1999. Prior to that she was a research staff member in the database department of IBM Almaden Research Center. She received her PhD from the University of California at Berkeley. Her research interests include database mining, machine learning, data warehousing and database query processing. She has several publications in international conferences on databases and data mining. She has served as a program committee member for SIGMOD, VLDB, SIGKDD, ICDE and ICML conferences and is an associate editor of the ACM SIGKDD newsletter.
Mark Craven is an Assistant Professor in the Department of Biostatistics and Medical Informatics and in the Department of Computer Sciences at the University of Wisconsin. His research interests are centered in machine learning and bioinformatics. He has more than 30 publications in these areas. Mark served as co-chair of KDD Cup in 2002, is on the editorial board of the Machine Learning journal, and was awarded an NSF CAREER award in 2001. His current research projects involve developing computational methods for automatically mining the biomedical literature, and for uncovering gene-regulatory networks in bacterial genomes. |
| VII. The Top 10 Data Mining Mistakes -- and How to Avoid Them |
|
Title: The Top Ten Data Mining Mistakes
-- and How to Avoid Them
Presenter:
John Elder, Ph.D.
Abstract: The tutorial will reveal the top mistakes we Data Miners can make, from the simple to the subtle, using case studies of real projects and the (often overlooked) symptoms that suggested something might be amiss. The goal will be to learn "best practices" from their flip side -- mistakes. (But we also should have time for brief summaries of how to do it right.)
Mistakes to be covered: Lack data, Focus on training, Rely on one technique, Ask the wrong question, Listen (only) to the data, Accept leaks from the future, Discount pesky cases, Extrapolate (practically and theoretically), Answer every inquiry, Sample without care, Believe the best model.
Audience: The best background for attendees to have is a problem they want to solve and experience trying any analysis technique. We'll focus on how to think rightly about a problem, and not on technical equations or terms. The practical illustrations emphasize the "uncommon common sense" necessary to practice well the art of Data Mining.
Biography: Dr. John Elder heads a small Data Mining firm with offices in Charlottesville, Virginia, and Washington, DC. John earned degrees in Electrical Engineering at Rice University, then worked in the Defense consulting industry for 5 years, where he authored an early Data Mining tool for the Air Force which led to improved guidance and flight control applications. He then earned a Ph.D. in Systems Engineering from the University of Virginia while working as Director of Research for an investment management firm, and wrote an influential tool for global optimization. After two years of post-doctoral research at Rice in the Computational and Applied Mathematics Department, John returned to Virginia and started Elder Research, Inc. in 1995, where he has led projects successfully applying Data Mining to a wide variety of financial, commercial, and medical applications -- including cross-selling, customer segmentation, direct marketing, credit scoring, sales forecasting, stock selection, drug efficacy, biometrics, market timing, and fraud detection. Dr. Elder has written several book chapters and articles on pattern discovery techniques, and is a frequently invited conference speaker. He is active on Statistical and Engineering journals and boards, and his popular Data Mining courses are acclaimed for clarity. He has been named to Who's Who in the World for his contributions to the field. Dr. Elder has been honored, since Fall 2001, to serve on a panel formed by Congress to guide critical defense technology for the National Security Agency.
|