Tutorial T9: Real World Text Mining
The proliferation of documents available on the Web and on corporate intranets is driving a new wave of text mining research and application. Earlier research addressed extraction of information from relatively small collections of well-structured documents such as newswire or scientific publications. Text mining from other corpora, such as the Web, requires new techniques drawn from data mining, machine learning, NLP and IR. Text mining requires preprocessing document collections (text categorization, information extraction, term extraction), storage of the intermediate representations, analysis of these intermediate representations (distribution analysis, clustering, trend analysis, association rules, etc.), and visualization of the results. In this tutorial we will present the algorithms and methods used to build text mining systems, including preprocessing techniques, supervised learning (e.g., CRF), entity resolution, relationship extraction, unsupervised learning and machine reading.
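As a rough illustration of the pipeline stages named above (preprocessing, an intermediate representation, and analysis), the following is a minimal Python sketch, not the tutorial's own material. The stopword list, the sample documents, and all function names are hypothetical, and simple pair co-occurrence counting stands in for full association-rule mining.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}

def extract_terms(text):
    """Preprocessing: lowercase, tokenize on letters, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def build_representation(docs):
    """Intermediate representation: one set of terms per document."""
    return [set(extract_terms(d)) for d in docs]

def document_frequency(rep):
    """Distribution analysis: number of documents each term occurs in."""
    df = Counter()
    for terms in rep:
        df.update(terms)
    return df

def cooccurrence(rep, min_support=2):
    """Simplified association analysis: term pairs that appear together
    in at least min_support documents."""
    pairs = Counter()
    for terms in rep:
        pairs.update(combinations(sorted(terms), 2))
    return {p: c for p, c in pairs.items() if c >= min_support}

# Made-up sample documents (assumption, not from the tutorial).
docs = [
    "Acme acquired Widget Corp in March.",
    "Widget Corp is a supplier of Acme.",
    "The merger of Acme and Globex failed.",
]
rep = build_representation(docs)
df = document_frequency(rep)
pairs = cooccurrence(rep)
print(df.most_common(3))
print(pairs)
```

Keeping the per-document term sets as a separate stored step means the analysis functions can be rerun or swapped without repeating the preprocessing, which is the point of having an intermediate representation.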
The tutorial will cover the state of the art in this rapidly growing area of research, including recent advances in unsupervised methods for extracting facts from text and methods used for web-scale mining. We will also present several real-world applications of text mining. Special emphasis will be given to lessons learned from years of experience in developing real-world text mining systems, including how to handle informal texts such as blogs and user reviews and how to build scalable systems.
The instructors are Ronen Feldman and Lyle Ungar. Ronen is an Associate Professor of Information Systems at the Business School of the Hebrew University in Jerusalem. He is the founder of the ClearForest text mining corporation, and the author of the book "The Text Mining Handbook" published by Cambridge University Press in 2007. Lyle is an Associate Professor of Computer and Information Science at the University of Pennsylvania. He recently returned from a sabbatical at Google, where he and a team built what is probably the world’s largest named entity recognition system.