Tutorial T6: New Directions in Data Quality Mining
As data types and data structures change to keep pace with evolving technologies and applications, data quality problems have also evolved, becoming more complex and interwoven. Data streams, web logs, wikis, biomedical applications, video streams, and social networking websites generate a mind-boggling variety of data types. However, data quality mining, the use of data mining to manage, measure, and improve data quality, has mostly addressed each category of data glitch separately, as a static entity.
In this tutorial we provide a technical, KDD-focused account of recent research and developments in discovering and treating complex data anomalies in a broad range of data. In particular, we highlight new directions in data quality mining: (a) the applicability and effectiveness of the methodologies for various data types, such as structured, semi-structured, and stream data; (b) the detection of concomitant data glitches and patterns, such as outliers occurring in data that also contain missing values and duplicates, or missing values co-occurring with duplicates; and (c) the design of sequential approaches to data quality mining, such as workflows composed of a sequence of tasks for data quality exploration and analysis. We give an overview of past work, introduce current research in this area, including recent methods and techniques for discovering complex patterns of anomalies (e.g., multivariate outliers, disguised missing values, and combinations of different types of noise), and highlight open problems in data quality mining.
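To make the idea of concomitant glitches in (b) concrete, the following minimal Python sketch flags three glitch types per record and counts how often they co-occur. It is an illustration only, not a method taken from the tutorial: the function names `glitch_flags` and `co_occurrence`, the toy records, the use of `None` to mark missing values, the key-based duplicate test, and the modified z-score outlier rule (median absolute deviation, cutoff 3.5) are all assumptions chosen for brevity.

```python
from collections import Counter
from itertools import combinations
from statistics import median


def glitch_flags(records, key_fields, numeric_field, threshold=3.5):
    """Return one set of glitch labels per record (hypothetical helper)."""
    # Outlier scoring uses only the values that are present.
    values = [r[numeric_field] for r in records if r[numeric_field] is not None]
    med = median(values)
    mad = median([abs(v - med) for v in values])

    # Records sharing the same key are treated as duplicates (an assumption).
    key_counts = Counter(tuple(r[f] for f in key_fields) for r in records)

    flags = []
    for r in records:
        labels = set()
        if any(v is None for v in r.values()):
            labels.add("missing")
        if key_counts[tuple(r[f] for f in key_fields)] > 1:
            labels.add("duplicate")
        v = r[numeric_field]
        # Modified z-score based on the median absolute deviation.
        if v is not None and mad > 0 and 0.6745 * abs(v - med) / mad > threshold:
            labels.add("outlier")
        flags.append(labels)
    return flags


def co_occurrence(flags):
    """Count how often each pair of glitch types appears on the same record."""
    counts = Counter()
    for labels in flags:
        for pair in combinations(sorted(labels), 2):
            counts[pair] += 1
    return counts


# Toy data, invented for illustration.
records = [
    {"id": 1, "name": "ann", "usage": 10.0},
    {"id": 2, "name": "bob", "usage": 12.0},
    {"id": 2, "name": "bob", "usage": None},   # duplicate key and missing value
    {"id": 3, "name": None, "usage": 500.0},   # missing value and outlier
    {"id": 4, "name": "dee", "usage": 11.0},
    {"id": 5, "name": "eve", "usage": 9.0},
]

flags = glitch_flags(records, key_fields=["id"], numeric_field="usage")
pairs = co_occurrence(flags)
for pair, count in sorted(pairs.items()):
    print(pair, count)
```

Treating each glitch type in isolation would report two missing values, two duplicates, and one outlier; the pair counts additionally show which glitches land on the same records, which is the kind of pattern a joint analysis is meant to surface.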
The tutorial includes extensive case studies and practical examples of mining data quality problems in a variety of large datasets and data types, e.g., relational data, XML, and data streams. We discuss illustrative examples drawn from a variety of domains, such as CRM, networking, biology, and mobility.