BEGIN:VCALENDAR
PRODID;X-RICAL-TZSOURCE=TZINFO:-//com.denhaven2/NONSGML ri_cal gem//EN
CALSCALE:GREGORIAN
VERSION:2.0
METHOD:PUBLISH
X-WR-TIMEZONE:Europe/Paris
BEGIN:VTIMEZONE
TZID;X-RICAL-TZSOURCE=TZINFO:Europe/Paris
BEGIN:DAYLIGHT
DTSTART:20090329T020000
RDATE:20090329T020000
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T145000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T142500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4888
DESCRIPTION:This paper studies web object classification problem with the
  novel exploration of social tags. Automatically classifying web objects
  into manageable semantic categories has long been a fundamental preproc
 ess for indexing\, browsing\, searching\, and mining these objects. The 
 explosive growth of heterogeneous web objects\, especially non-textual o
 bjects such as products\, pictures\, and videos\, has made the problem o
 f web classification increasingly challenging. Such objects often suffer
  from a lack of easy-extractable features with semantic information\, in
 terconnections between each other\, as well as training examples with ca
 tegory labels.\n\nIn this paper\, we explore the social tagging data to 
 bridge this gap. We cast web object classification problem as an optimiz
 ation problem on a graph of objects and tags. We then propose an efficie
 nt algorithm which not only utilizes social tags as enriched semantic fe
 atures for the objects\, but also infers the categories of unlabeled obj
 ects from both homogeneous and heterogeneous labeled objects\, through t
 he implicit connection of social tags. Experiment results show that the 
 exploration of social tags effectively boosts web object classification.
  Our algorithm significantly outperforms the state-of-the-art of general
  classification methods.
SUMMARY:Exploring Social Tagging Graph for Web Object Classification
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T165000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T162500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4889
DESCRIPTION:User browsing information\, particularly their non-search rel
 ated activity\, reveals important contextual information on the preferen
 ces and the intent of web users. In this paper\, we expand the use of br
 owsing information for web search ranking and other applications\, with 
 an emphasis on analyzing individual user sessions for creating aggregate
  models. In this context\, we introduce ClickRank\, an efficient\, scala
 ble algorithm for estimating web page and web site importance from brows
 ing information. We lay out the theoretical foundation of ClickRank base
 d on an intentional surfer model and analyze its properties. We evaluate
  its effectiveness for the problem of web search ranking\, showing that 
 it contributes significantly to retrieval performance as a novel web sea
 rch feature. We demonstrate that the results produced by ClickRank for w
 eb search ranking are highly competitive with those produced by other ap
 proaches\, yet achieved at better scalability and substantially lower co
 mputational costs. Finally\, we discuss novel applications of ClickRank 
 in providing enriched user web search experience\, highlighting the usef
 ulness of our approach for non-ranking tasks.
SUMMARY:Mining Rich Session Context to Improve Web Search
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T120000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T114500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4890
DESCRIPTION:Co-clustering is based on the duality between data points (e.
 g. documents) and features (e.g. words)\, i.e. data points can be groupe
 d based on their distribution on features\, while features can be groupe
 d based on their distribution on the data points. In the past decade\, s
 everal co-clustering algorithms have been proposed and shown to be super
 ior to traditional one-side clustering. However\, existing co-clustering
  algorithms fail to consider the geometric structure in the data\, which
  is essential for clustering data on manifold. To address this problem\,
  in this paper\, we propose a Dual Regularized Co-Clustering (DRCC) meth
 od based on semi-nonnegative matrix tri-factorization. We deem that not 
 only the data points\, but also the features are sampled from some manif
 olds\, namely data manifold and feature manifold respectively. As a resu
 lt\, we construct two graphs\, i.e. data graph and feature graph\, to ex
 plore the geometric structure of data manifold and feature manifold. The
 n our co-clustering method is formulated as semi-nonnegative matrix tri-
 factorization with two graph regularizers\, requiring that the cluster l
 abels of data points are smooth with respect to the data manifold\, whil
 e the cluster labels of features are smooth with respect to the feature 
 manifold. We will show that DRCC can be solved via alternating minimizat
 ion\, and its convergence is theoretically guaranteed. Experiments of cl
 ustering on many benchmark data sets demonstrate that the proposed metho
 d outperforms many state of the art clustering methods.
SUMMARY:Co-Clustering on Manifolds
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T170500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T165000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4891
DESCRIPTION:The central challenge in temporal data analysis  is to obtain
  knowledge about its underlying dynamics. In this paper\, we address the
  observation of noisy\, stochastic processes and attempt to detect tempo
 ral segments that are related to inconsistencies and irregularities in i
 ts dynamics. Many conventional anomaly detection approaches detect anoma
 lies based on the distance between patterns\, and often provide only lim
 ited intuition  about the generative process of the anomalies. Meanwhile
 \, model-based approaches have difficulty in identifying a small\, clust
 ered set of anomalies.\n\nWe propose Information-theoretic Meta-clusteri
 ng (ITMC)\, a formalization of model-based clustering principled by the 
 theory of lossy data compression. ITMC identifies a 'unique' cluster who
 se distribution diverges significantly from the entire dataset. Furtherm
 ore\, ITMC employs a regularization term derived from the preference for
  high compression rate\, which is critical to the precision of detection
 .\n\nFor empirical evaluation\, we apply ITMC to two temporal anomaly de
 tection tasks. Datasets are taken from generative processes involving he
 terogeneous and inconsistent dynamics. A comparison to baseline methods 
 shows that the proposed algorithm detects segments from irregular states
  with significantly high precision and recall.
SUMMARY:Detection of Unique Temporal Segments by Information Theoretic Me
 ta-clustering
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T170500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T165000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4892
DESCRIPTION:We present a new approach to large-scale graph mining based o
 n so-called backbone refinement classes. The method efficiently mines tr
 ee-shaped subgraph descriptors under minimum frequency and significance 
 constraints\, using classes of fragments to reduce feature set size and 
 running times.\n\nThe classes are defined in terms of fragments sharing 
 a common backbone. The method is able to optimize structural inter-featu
 re entropy as opposed to occurrences\, which is characteristic for open 
 or closed fragment mining. In the experiments\, the proposed method redu
 ces feature set sizes by >90 % and >30 % compared to complete tree minin
 g and open tree mining\, respectively. Evaluation using crossvalidation 
 runs shows that their classification accuracy is similar to the complete
  set of trees but significantly better than that of open trees. Compared
  to open or closed fragment mining\, a large part of the search space ca
 n be pruned due to an improved statistical constraint (dynamic upper bou
 nd adjustment)\, which is also confirmed in the experiments in lower run
 ning times compared to ordinary (static) upper bound pruning. Further an
 alysis using large-scale datasets yields insight into important properti
 es of the proposed descriptors\, such as the dataset coverage and the cl
 ass size represented by each descriptor. A final cross-validation run co
 nfirms that the novel descriptors render large training sets feasible wh
 ich previously might have been intractable.
SUMMARY:Large-Scale Graph Mining Using Backbone Refinement Classes
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T105500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T103000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4893
DESCRIPTION:We study in this paper the problem of incremental crawling of
  web forums\, which is a very fundamental yet challenging step in many w
 eb applications. Traditional approaches mainly focus on scheduling the r
 evisiting strategy of each individual page. However\, simply assigning d
 ifferent weights for different individual pages is usually inefficient i
 n crawling forum sites because of the different characteristics between 
 forum sites and general websites. Instead of treating each individual pa
 ge independently\, we propose a list-wise strategy by taking into accoun
 t the site-level knowledge. Such site-level knowledge is mined through r
 econstructing the linking structure\, called sitemap\, for a given forum
  site. With the sitemap\, posts from the same thread but distributed on 
 various pages can be concatenated according to their timestamps. After t
 hat\, for each thread\, we employ a regression model to predict the time
  when the next post arrives. Based on this model\, we develop an efficie
 nt crawler which is 260% faster than some state-of-the-art methods in te
 rms of fetching newly generated content\; and meanwhile our crawler also
  ensures a high coverage ratio. Experimental results show promising
  performance of Coverage\, Bandwidth utilization\, and Timeliness of our
  crawler on 18 various forums.
SUMMARY:Incorporating Site-Level Knowledge for Incremental Crawling of We
 b Forums: A List-wise Strategy
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T120000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T114500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4894
DESCRIPTION:This paper studies the problem of  frequent pattern mining wi
 th uncertain data. We will show how broad classes of algorithms can be e
 xtended to the uncertain data setting. In particular\, we will study can
 didate generate-and-test algorithms\, hyper-structure algorithms  and  p
 attern growth based algorithms. One of our insightful  observations is t
 hat the experimental behavior of different classes of algorithms is very
  different in the uncertain case as compared to the deterministic case. 
 In particular\, the hyper-structure and the candidate generate-and-test 
 algorithms perform much better than  tree-based algorithms. This counter
 -intuitive behavior is an important observation from the perspective of 
  algorithm design of the uncertain variation of the problem. We will tes
 t the approach on a number of real and synthetic data sets\, and show th
 e effectiveness of two of our approaches over competitive techniques.\n\
 nExecutable and Data Sets:  Available at: http://dbgroup.cs.tsinghua.edu
 .cn/liyan/u_mining.tar.gz
SUMMARY:Frequent Pattern Mining with Uncertain Data
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T114500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T112000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4895
DESCRIPTION:Rock art is an archaeological term for human-made markings on
  stone. It is believed that there are millions of petroglyphs in North A
 merica alone\, and the study of this valued cultural resource has implic
 ations even beyond anthropology and history. Surprisingly\, although ima
 ge processing\, information retrieval and data mining have had large imp
 acts on many human endeavors\, they have had essentially zero impact on 
 the study of rock art. In this work we identify the reasons for this\, a
 nd introduce a novel distance measure and algorithms which allow efficie
 nt and effective data mining of large collections of rock art.
SUMMARY:Augmenting the Generalized Hough Transform to Enable the Mining o
 f Petroglyphs
LOCATION:La Seine C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T165000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T162500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4896
DESCRIPTION:Classification of time series has been attracting great inter
 est over the past decade. Recent empirical evidence has strongly suggest
 ed that the simple nearest neighbor algorithm is very difficult to beat 
 for most time series problems. While this may be considered good news\, 
 given the simplicity of implementing the nearest neighbor algorithm\, th
 ere are some negative consequences of this. First\, the nearest neighbor
  algorithm requires storing and searching the entire dataset\, resulting
  in a time and space complexity that limits its applicability\, especial
 ly on resource-limited sensors. Second\, beyond mere classification accu
 racy\, we often wish to gain some insight into the data. \nIn this work 
 we introduce a new time series primitive\, time series shapelets\, which
  addresses these limitations. Informally\, shapelets are time series sub
 sequences which are in some sense maximally representative of a class. A
 s we shall show with extensive empirical evaluations in diverse domains\
 , algorithms based on the time series shapelet primitives can be interpr
 etable\, more accurate and significantly faster than state-of-the-art cl
 assifiers.
SUMMARY:Time Series Shapelets: A New Primitive for Data Mining
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T162500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T160000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4897
DESCRIPTION:Spatial classification is the task of learning models to pred
 ict class labels based on the features of entities as well as the spatia
 l relationships to other entities and their features. Spatial data can b
 e represented as multi-relational data\, however it presents novel chall
 enges not present in multi-relational problems. One such problem is that
  spatial relationships are embedded in space\, unknown a priori\, and it
  is part of the algorithms task to determine which relationships are im
 portant and what properties to consider. In order to determine when two 
 entities are spatially related in an adaptive and non-parametric way\, w
 e propose a Voronoi-based neighbourhood definition upon which spatial li
 terals can be built. Properties of these neighbourhoods also need to be 
 described and used for classification purposes. Non-spatial aggregation 
 literals already exist within the multi-relational framework\, but are n
 ot sufficient for comprehensive spatial classification. A formal set of 
 additions to the multi-relational data mining framework is proposed\, to
  be able to represent spatial aggregations as well as spatial features a
 nd literals. These additions allow for capturing more complex interactio
 ns and spatial occurrences such as spatial trends. In order to more effi
 ciently perform the rule learning and exploit powerful multi-processor m
 achines\, a scalable parallelized method capable of reducing the runtime
  by several factors is presented. The method is compared against existin
 g methods by experimental evaluation on a real world crime dataset which
  demonstrate the importance of the neighbourhood definition and the adva
 ntages of parallelization.
SUMMARY:A Multi-Relational Approach to Spatial Classification
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T172000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T170500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4898
DESCRIPTION:Counting the number of triangles in a graph is a beautiful al
 gorithmic problem which has gained importance over the last years due to
  its  significant role in complex network analysis.  Metrics frequently 
 computed such as the clustering coefficient  and the transitivity ratio 
 involve the execution of a triangle  counting algorithm. Furthermore\, s
 everal interesting graph mining applications rely on computing the numbe
 r of triangles in the graph of interest. \n\nIn this paper\, we focus on
  the problem of counting triangles in a graph. We propose a practical me
 thod\, out of which all triangle counting algorithms can potentially ben
 efit. Using a straight-forward triangle counting algorithm as a black bo
 x\, we performed 166 experiments on real-world networks and on synthetic
  datasets as well\, where we  show that our method works with high accur
 acy\, typically more than 99% and gives significant speedups\,
  resulting in up to about 130 times faster performance.
SUMMARY:Doulion: Counting triangles in massive graphs with a coin
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T123500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T122000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4899
DESCRIPTION:This paper addresses Named Entity Mining (NEM)\, in which we 
 mine knowledge about named entities such as movies\, games\, and books f
 rom a huge amount of data. NEM is potentially useful in many application
 s including web search\, online advertisement\, and recommender system. 
 There are three challenges for the task: finding suitable data source\, 
 coping with the ambiguities of named entity classes\, and incorporating 
 necessary human supervision into the mining process. This paper proposes
  conducting NEM by using click-through data collected at a web search en
 gine\, employing a topic model that generates the click-through data\, a
 nd learning the topic model by weak supervision from humans. Specificall
 y\, it characterizes each named entity by its associated queries and URL
 s in the click-through data. It uses the topic model to resolve ambiguit
 ies of named entity classes by representing the classes as topics. It em
 ploys a method\, referred to as Weakly Supervised Latent Dirichlet Alloc
 ation (WS-LDA)\, to accurately learn the topic model with partially labe
 led named entities. Experiments on a large scale click-through data cont
 aining over 1.5 billion query-URL pairs show that the proposed approach 
 can conduct very accurate NEM and significantly outperforms the baseline
 .
SUMMARY:Named Entity Mining from Click-Through Data Using Weakly Supervis
 ed Latent Dirichlet Allocation
LOCATION:Louis Armstrong A\,B\,C+D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T105500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T103000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4900
DESCRIPTION:BEST RESEARCH PAPER AWARD WINNER\n\nCustomer preferences for 
 products are drifting over time. Product perception and popularity are c
 onstantly changing as new selection emerges. Similarly\, customer inclin
 ations are evolving\, leading them to ever redefine their taste. Thus\, 
 modeling temporal dynamics should be a key when designing recommender sy
 stems or general customer preference models. However\, this raises uniqu
 e challenges. Within the eco-system intersecting multiple products and c
 ustomers\, many different characteristics are shifting simultaneously\, 
 while many of them influence each other and often those shifts are delic
 ate and associated with a few data instances. This distinguishes the pro
 blem from concept drift explorations\, where mostly a single concept is 
 tracked. Classical time-window or instance-decay approaches cannot work\
 , as they lose too much signal when discarding data instances.  A more s
 ensitive approach is required\,  which can make better distinctions betw
 een transient effects and long term patterns. The paradigm we offer is c
 reating a model tracking the time changing behavior throughout the life 
 span of the data. This allows us to exploit the relevant components of a
 ll data instances\, while discarding only what is modeled as being irrel
 evant. Accordingly\, we revamp two leading collaborative filtering recom
 mendation approaches. Evaluation is made on a large movie rating dataset
  by Netflix. Results are encouraging and better than those previously re
 ported on this dataset.
SUMMARY:Collaborative Filtering with Temporal Dynamics
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T122000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T120500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4901
DESCRIPTION:Automatic news extraction from news pages is important in man
 y Web applications such as news aggregation. However\, the existing news
  extraction methods based on template-level wrapper induction have three
  serious limitations. First\, the existing methods cannot correctly extr
 act pages belonging to an unseen template. Second\, it is costly to main
 tain up-to-date wrappers for a large amount of news websites\, because a
 ny change of a template may invalidate the corresponding wrapper. Last\,
  the existing methods can merely extract unformatted plain texts\, and t
 hus are not user friendly. In this paper\, we tackle the problem of temp
 late-independent Web news extraction in a user-friendly way. We formaliz
 e Web news extraction as a machine learning problem and learn a template
 -independent wrapper using a very small number of labeled news pages fro
 m a single site. Novel features dedicated to news titles and bodies are 
 developed. Correlations between news titles and news bodies are exploite
 d. Our template-independent wrapper can extract news pages from differen
 t sites regardless of templates. Moreover\, our approach can extract not
  only texts\, but also images and animations within the news bodies and th
 e extracted news articles are in the same visual style as in the origina
 l pages. In our experiments\, a wrapper learned from 40 pages from a sin
 gle news site achieved an accuracy of 98.1% on 3\,973 news pages from 12
  news sites.
SUMMARY:Can We Learn a Template-Independent Wrapper for News Article Extr
 action from a Single Training Site?
LOCATION:Louis Armstrong A\,B\,C+D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T162500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T160000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4902
DESCRIPTION:Given a spatial data set placed on an n × n grid\, our goal 
 is to find the rectangular regions within which subsets of the data set 
 exhibit anomalous behavior. We develop algorithms that\, given any user-
 supplied arbitrary likelihood function\, conduct a likelihood ratio hypo
 thesis test (LRT) over each rectangular region in the grid\, rank all of
  the rectangles based on the computed LRT statistics\, and return the to
 p few most interesting rectangles. To speed this process\, we develop me
 thods to prune rectangles without computing their associated LRT statist
 ics.
SUMMARY:A LRT Framework for Fast Spatial Anomaly Detection
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T105500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T103000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4903
DESCRIPTION:Clustering validation is a long standing challenge in the clu
 stering literature. While many validation measures have been developed f
 or evaluating the performance of clustering algorithms\, these measures 
 often provide inconsistent information about the clustering performance 
 and the best suitable measures to use in practice remain unknown. This p
 aper thus fills this crucial void by giving an organized study of 16 ext
 ernal validation measures for K-means clustering. Specifically\, we firs
 t introduce the importance of measure normalization in the evaluation of
  the clustering performance on data with imbalanced class distributions.
  We also provide normalization solutions for several measures. In additi
 on\, we summarize the major properties of these external measures. These
  properties can serve as the guidance for the selection of validation me
 asures in different application scenarios. Finally\, we reveal the inter
 relationships among these external measures. By mathematical transformat
 ion\, we show that some validation measures are equivalent. Also\, some 
 measures have consistent validation performances. Most importantly\, we 
 provide a guideline to select the most suitable validation measures for
  K-means clustering.
SUMMARY:Adapting the Right Measures for K-means Clustering
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T121500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4904
DESCRIPTION:Data analytics tools and frameworks abound\, yet rapid deploy
 ment of analytics solutions that deliver actionable insights from busine
 ss data remains a challenge. The primary reason is that on-field practit
 ioners are required to be both technically proficient and knowledgeable 
 about the business. \n\nThe recent abundance of unstructured business da
 ta has thrown up new opportunities for analytics\, but has also multipli
 ed the deployment challenge\, since interpretation of concepts derived f
 rom textual sources requires a deep understanding of the business.  In su
 ch a scenario\, a managed service for analytics comes up as the best alt
 ernative. A managed analytics service is centered around a business anal
 yst who acts as a liaison between the business and the technology.  This
  calls for new tools that assist the analyst to be efficient in the
  tasks that she needs to execute. Also\, the analytics needs to be
  repeatable\, in that the delivered insights should not depend heavi
 ly on the expertise of specific analysts.  These factors lead us to iden
 tify new areas that open up for KDD research in terms of 'time-to-insigh
 t' and repeatability for these analysts.  We present our analytics frame
 work in the form of a managed service offering for CRM analytics. We des
 cribe different analyst-centric tools using a case study from real-life 
 engagements and demonstrate their effectiveness.
SUMMARY:Enabling Analysts in Managed Services for CRM Analytics
LOCATION:La Seine C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T120000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T114500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4905
DESCRIPTION:Identifying similar keywords\, known as broad matches\, is an
  important task in online advertising that has become a standard feature
  on all major keyword advertising platforms. Effective broad matching le
 ads to improvements in both relevance and monetization\, while increasin
 g advertisers' reach and making campaign management easier. In this pape
 r\, we present a learning-based approach to broad matching that is based
  on exploiting implicit feedback in the form of advertisement clickthrou
 gh logs. Our method can utilize arbitrary similarity functions by incorp
 orating them as features. We present an online learning algorithm\, Amne
 siac Averaged Perceptron\, that is highly efficient yet able to quickly 
 adjust to the rapidly-changing distributions of bidded keywords\, advert
 isements and user behavior. Experimental results obtained from (1) histo
 rical logs and (2) live trials on a large-scale advertising platform dem
 onstrate the effectiveness of the proposed algorithm and the overall suc
 cess of our approach in identifying high-quality broad match mappings.
SUMMARY:Catching the Drift: Learning Broad Matches from Clickthrough Data
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T105500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T103000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4906
DESCRIPTION:There is a wide variety of data mining methods available\, an
 d it is generally useful in exploratory data analysis to use many differ
 ent methods for the same dataset. This\, however\, leads to the problem 
 of whether the results found by one method are a reflection of the pheno
 menon shown by the results of another method\, or whether the results de
 pict in some sense unrelated properties of the data. For example\, using
  clustering can give indication of a clear cluster structure\, and compu
 ting correlations between variables can show that there are many signifi
 cant correlations in the data. However\, it can be the case that the cor
 relations are actually determined by the cluster structure.\n\nIn this p
 aper\, we consider the problem of randomizing data so that previously di
 scovered patterns or models are taken into account. The randomization me
 thods can be used in iterative data mining. At each step in the data min
 ing process\, the randomization produces random samples from the set of 
 data matrices satisfying the already discovered patterns or models. That
  is\, given a data set and some statistics (e.g.\, cluster centers or co
 -occurrence counts) of the data\, the randomization methods sample data 
 sets having similar values of the given statistics as the original data 
 set. We use Metropolis sampling based on local swaps to achieve this. We
  describe experiments on real data that demonstrate the usefulness of ou
 r approach. Our results indicate that in many cases\, the results of\, e
 .g.\, clustering actually imply the results of\, say\, frequent pattern 
 discovery.
SUMMARY:Tell Me Something I Don't Know: Randomization Strategies for Iter
 ative Data Mining
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T120500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T115000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4907
DESCRIPTION:Web content analysis often has two sequential and separate st
 eps: Web Classification to identify the target Web pages\, and Web Infor
 mation Extraction to extract the metadata contained in the target Web pa
 ges. This decoupled strategy is highly ineffective since the errors in W
 eb classification will be propagated to Web information extraction and e
 ventually accumulate to a high level. In this paper we study the mutual 
 dependencies between these two steps and propose to combine them by usin
 g a model of Conditional Random Fields (CRFs). This model can be used to
  simultaneously recognize the target Web pages and extract the correspon
 ding metadata. Systematic experiments in our project OfCourse for online
  course search show that this model significantly improves the F1 value 
 for both of the two steps. We believe that our model can be easily gener
 alized to many Web applications.
SUMMARY:Towards Combining Web Classification and Web Information Extracti
 on: A Case Study
LOCATION:Louis Armstrong A\,B\,C+D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T153000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T151500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4908
DESCRIPTION:When labeled examples are limited and difficult to obtain\, t
 ransfer learning employs knowledge from a source domain to improve learn
 ing accuracy in the target domain. However\, the assumption made by exis
 ting approaches\, that the marginal and conditional probabilities are di
 rectly related between source and target domains\, has limited applicabi
 lity in either the original space or its linear transformations. To sol
 ve this problem\, we propose an adaptive kernel approach that maps the m
 arginal distribution of target-domain and source-domain data into a comm
 on kernel space\, and utilize a sample selection strategy to draw condit
 ional probabilities between the two domains closer. We formally show tha
 t under the kernel-mapping space\, the difference in distributions betwe
 en the two domains is bounded\; and the prediction error of the propose
 d approach can also be bounded. Experimental results demonstrate that th
 e proposed method outperforms both traditional inductive classifiers and
  the state-of-the-art boosting-based transfer algorithms on most domains
 \, including text categorization and web page ratings. In particular\, i
 t can achieve around 10% higher accuracy than other approaches for the t
 ext categorization problem. The source code and datasets are available f
 rom the authors.
SUMMARY:Cross Domain Distribution Adaptation via Kernel Mapping
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T113500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T112000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4909
DESCRIPTION:Learning probabilistic graphical models from high-dimensional
  datasets is a computationally challenging task. In many interesting app
 lications\, the domain dimensionality is such as to prevent state-of-the
 -art statistical learning techniques from delivering accurate models in 
 reasonable time. This paper presents a hybrid random field model for pse
 udo-likelihood estimation in high-dimensional domains. A theoretical ana
 lysis proves that the class of pseudo-likelihood distributions represent
 able by hybrid random fields strictly includes the class of joint probab
 ility distributions representable by Bayesian networks. In order to lear
 n hybrid random fields from data\, we develop the Markov Blanket Merging
  algorithm. Theoretical and experimental evidence shows that Markov Blan
 ket Merging scales up very well to high-dimensional datasets. As compare
 d to other widely used statistical learning techniques\, Markov Blanket 
 Merging delivers accurate results in a number of link prediction tasks\,
  while achieving also significant improvements in terms of computational
  efficiency.\n\nOur software implementation of the models investigated i
 n this paper is publicly available at http://www.dii.unisi.it/~freno/. T
 he same website also hosts the datasets used in this work that are not a
 vailable elsewhere in the same preprocessing used for our experiments.
SUMMARY:Scalable Pseudo-Likelihood Estimation in Hybrid Random Fields
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T115000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T113500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4910
DESCRIPTION:Nowadays\, enormous amounts of data are continuously generate
 d not only in massive scale\, but also from different\, sometimes confli
 cting\, views. Therefore\, it is important to consolidate different conc
 epts for intelligent decision making. For example\, to predict the resea
 rch areas of some people\, the best results are usually achieved by comb
 ining and consolidating predictions obtained from the publication networ
 k\, co-authorship network and the textual content of their publications.
  Multiple supervised and unsupervised hypotheses can be drawn from these
  information sources\, and negotiating their differences and consolidati
 ng decisions usually yields a much more accurate model due to the divers
 ity and heterogeneity of these models. In this paper\, we address the pr
 oblem of "consensus learning" among competing hypotheses\, which either 
 rely on outside knowledge (supervised learning) or internal structure (u
 nsupervised clustering). We argue that consensus learning is an NP-hard 
 problem and thus propose to solve it by an efficient heuristic method. W
 e construct a belief graph to first propagate predictions from supervise
 d models to the unsupervised\, and then negotiate and reach consensus am
 ong them. Their final decision is further consolidated by calculating ea
 ch model's weight based on its degree of consistency with other models. 
 Experiments are conducted on 20 Newsgroups data\, Cora research papers\,
  DBLP author-conference network\, and Yahoo! Movies datasets\, and the r
 esults show that the proposed method improves the classification accurac
 y and the clustering quality measure (NMI) over the best base model by u
 p to 10%. Furthermore\, it runs in time proportional to the number of in
 stances\, which is very efficient for large scale data sets.
SUMMARY:Heterogeneous Source Consensus Learning via Decision Propagation 
 and Negotiation
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T142500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T140000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4911
DESCRIPTION:In several applications involving regression or classificatio
 n\, along with making predictions it is important to assess how accurate
  or reliable individual predictions are. This is particularly important 
 in cases where due to finite resources or domain requirements\, one want
 s to make decisions based only on the most reliable rather than on the e
 ntire set of predictions. This paper introduces novel and effective ways
  of ranking predictions by their accuracy for problems involving large-s
 cale\, heterogeneous data with a dyadic structure\, i.e.\, where the ind
 ependent variables can be naturally decomposed into three groups associa
 ted with two sets of elements and their combination. These approaches ar
 e based on modeling the data by a collection of localized models learnt 
 while simultaneously partitioning (co-clustering) the data.  For regress
 ion this leads to the concept of "certainty lift".  We also develop a ro
 bust predictive modeling technique that identifies and models only the m
 ost coherent regions of the data to give high predictive accuracy on the
  selected subset of response values. Extensive experimentation on real l
 ife datasets highlights the utility of our proposed approaches.
SUMMARY:Mining for the Most Certain Predictions from Dyadic Data
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T153000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T151500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4912
DESCRIPTION:The increasing availability of electronic communication data\
 , such as that arising from e-mail exchange\, presents social and inform
 ation scientists with new possibilities for characterizing individual be
 havior and\, by extension\, identifying latent structure in human popula
 tions.  Here\, we propose a model of individual e-mail communication tha
 t is sufficiently rich to capture meaningful variability across individu
 als\, while remaining simple enough to be interpretable.  We show that t
 he model\, a cascading non-homogeneous Poisson process\, can be formulat
 ed as a double-chain hidden Markov model\, allowing us to use an efficie
 nt inference algorithm to estimate the model parameters from observed da
 ta. We then apply this model to two e-mail data sets consisting of 404 a
 nd 6\,164 users\, respectively\, that were collected from two universiti
 es in different countries and years.  We find that the resulting best-es
 timate parameter distributions for both data sets are surprisingly simil
 ar\, indicating that at least some features of communication dynamics ge
 neralize beyond specific contexts.  We also find that variability of ind
 ividual behavior over time is significantly less than variability across
  the population\, suggesting that individuals can be classified into per
 sistent "types".  We conclude that communication patterns may prove usef
 ul as an additional class of attribute data\, complementing demographic 
 and network data\, for user classification and outlier detection---a poi
 nt that we illustrate with an interpretable clustering of users based on
  their inferred model parameters.
SUMMARY:Characterizing Individual Communication Patterns
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T170500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T165000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4913
DESCRIPTION:Many applications in surveillance\, monitoring\, scientific d
 iscovery\, and data cleaning require the identification of anomalies. Al
 though many methods have been developed to identify statistically signif
 icant anomalies\, a more difficult task is to identify anomalies that ar
 e both interesting and statistically significant. Category detection is 
 an emerging area of machine learning that can help address this issue us
 ing a "human-in-the-loop" approach. In this interactive setting\, the al
 gorithm asks the user to label a query data point under an existing cate
 gory or declare the query data point to belong to a previously undiscove
 red category. The goal of category detection is to bring to the user's a
 ttention a representative data point from each category in the data in a
 s few queries as possible. In a data set with imbalanced categories\, th
 e main challenge is in identifying the rare categories or anomalies\; he
 nce\, the task is often referred to as rare category detection. W
 e present a new approach to rare category detection based on hierarchica
 l mean shift. In our approach\, a hierarchy is created by repeatedly app
 lying mean shift with an increasing bandwidth on the data. This hierarch
 y allows us to identify anomalies in the data set at different scales\, 
 which are then posed as queries to the user. The main advantage of this 
 methodology over existing approaches is that it does not require any kno
 wledge of the dataset properties such as the total number of categories 
 or the prior probabilities of the categories. Results on real-world data
  sets show that our hierarchical mean shift approach performs consistent
 ly better than previous techniques.
SUMMARY:Category Detection Using Hierarchical Mean Shift
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T150500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T145000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4914
DESCRIPTION:Controlled experiments\, also called randomized experiments a
 nd A/B tests\, have had a profound influence on multiple fields\, includ
 ing medicine\, agriculture\, manufacturing\, and advertising. While the 
 theoretical aspects of offline controlled experiments have been well stu
 died and documented\, the practical aspects of running them in online se
 ttings\, such as web sites and services\, are still being developed. As 
 the usage of controlled experiments grows in these online settings\, it 
 is becoming more important to understand the opportunities and pitfalls 
 one might face when using them in practice. A survey of online controlle
 d experiments and lessons learned were previously documented in Controll
 ed Experiments on the Web: Survey and Practical Guide (Kohavi\, et al.\,
  2009). In this follow-on paper\, we focus on pitfalls we have seen afte
 r running numerous experiments at Microsoft.  The pitfalls include a wid
 e range of topics\, such as assuming that common statistical formulas us
 ed to calculate standard deviation and statistical power can be applied 
 and ignoring robots in analysis (a problem unique to online settings). O
 nline experiments allow for techniques like gradual ramp-up of treatment
 s to avoid the possibility of exposing many customers to a bad (e.g.\, b
 uggy) Treatment. With that ability\, we discovered that it's easy to inc
 orrectly identify the winning Treatment because of Simpson's paradox.
SUMMARY:Seven Pitfalls to Avoid when Running Controlled Experiments on th
 e Web
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T115000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T112500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4915
DESCRIPTION:Learning from data streams is a research area of increasing i
 mportance. Nowadays\, several stream learning algorithms have been devel
 oped. Most of them learn decision models that continuously evolve over t
 ime\, run in resource-aware environments\, detect and react to changes i
 n the environment generating data. One important issue\, not yet conveni
 ently addressed\, is the design of experimental work to evaluate and com
 pare decision models that evolve over time.  There are no golden standar
 ds for assessing performance in non-stationary environments. This paper 
 proposes a general framework for assessing predictive stream learning al
 gorithms. We defend the use of Predictive Sequential methods for error e
 stimate -- the prequential error. The prequential error allows us to mon
 itor the evolution of the performance of models that evolve over time. N
 evertheless\, it is known to be a pessimistic estimator in comparison to
  holdout estimates. To obtain more reliable estimators we need some forg
 etting mechanism. Two viable alternatives are: sliding windows and fadin
 g factors. We observe that the prequential error converges to a holdout
  estimator when estimated over a sliding window or using fading factors.
  We present illustrat
 ive examples of the use of prequential error estimators\, using fading f
 actors\, for the tasks of: \n i. assessing performance of a learning alg
 orithm\; \n ii. comparing learning algorithms\;\n iii. hypothesis testin
 g using McNemar test\; and \n iv. change detection using Page-Hinkley te
 st.\n\nIn these tasks\, the prequential error estimated using fading fac
 tors provides reliable estimators. In comparison to sliding windows\, fa
 ding factors are faster and memory-less\, a requirement for streaming ap
 plications. This paper is a contribution to a discussion in the good-pra
 ctices on performance assessment when learning dynamic models that evolv
 e over time.
SUMMARY:Issues in Evaluation of Stream Learning Algorithms
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T171500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T165000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4916
DESCRIPTION:Tracking new topics\, ideas\, and "memes" across the Web has 
 been an issue of considerable interest. Recent work has developed method
 s for tracking topic shifts over long time scales\, as well as abrupt sp
 ikes in the appearance of particular named entities. However\, these app
 roaches are less well suited to the identification of content that sprea
 ds widely and then fades over time scales on the order of days --- the t
 ime scale at which we perceive news and events.\n\nWe develop a framewor
 k for tracking short\, distinctive phrases that travel relatively intact
  through on-line text\; developing scalable algorithms for clustering te
 xtual variants of such phrases\, we identify a broad class of memes that
  exhibit wide spread and rich variation on a daily basis. As our princip
 al domain of study\, we show how such a meme-tracking approach can provi
 de a coherent representation of the news cycle --- the daily rhythms in 
 the news media that have long been the subject of qualitative interpreta
 tion but have never been captured accurately enough to permit actual qua
 ntitative analysis. We tracked 1.6 million mainstream media sites and bl
 ogs over a period of three months with a total of 90 million articles 
 and we find a set of novel and persistent temporal patterns in the news 
 cycle. In particular\, we observe a typical lag of 2.5 hours between the
  peaks of attention to a phrase in the news media and in blogs respectiv
 ely\, with divergent behavior around the overall peak and a "heartbeat"
 -like pattern in the handoff between news and blogs. We also develop an
 d analyze a mathematical model for the kinds of temporal variation that 
 the system exhibits.
SUMMARY:Meme-tracking and the dynamics of the news cycle
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T142500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T140000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4917
DESCRIPTION:Overall performance of the data mining process depends not ju
 st on the value of the induced knowledge but also on various costs of th
 e process itself such as the cost of acquiring and pre-processing traini
 ng examples\, the CPU cost of model induction\, and the cost of committe
 d errors. Recently\, several progressive sampling strategies for maximiz
 ing the overall data mining utility have been proposed. All these strate
 gies are based on repeated acquisitions of additional training examples 
 until a utility decrease is observed. In this paper\, we present an alte
 rnative\, projective sampling strategy\, which fits functions to a parti
 al learning curve and a partial run-time curve obtained from a small sub
 set of potentially available data and then uses these projected function
 s to analytically estimate the optimal training set size. The proposed a
 pproach is evaluated on a variety of benchmark datasets using the RapidM
 iner environment for machine learning and data mining processes. The res
 ults show that the learning and run-time curves projected from only seve
 ral data points can lead to a cheaper data mining process than the commo
 n progressive sampling methods.
SUMMARY:Improving Data Mining Utility with Projective Sampling
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T121500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T120000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4918
DESCRIPTION:Spectral clustering refers to a flexible class of clustering 
 procedures that can produce high-quality clusterings on small data sets 
 but which has limited applicability to large-scale problems due to its c
 omputational complexity of O(n^3) in general\, with n the number of data 
 points. We extend the range of spectral clustering by developing a gener
 al framework for fast approximate spectral clustering in which a distort
 ion-minimizing local transformation is first applied to the data. This f
 ramework is based on a theoretical analysis that provides a statistical 
 characterization of the effect of local distortion on the mis-clustering
  rate. We develop two concrete instances of our general framework\, one 
 based on local k-means clustering (KASP) and one based on random project
 ion trees (RASP).  Extensive experiments show that these algorithms can 
 achieve significant speedups with little degradation in clustering accur
 acy. Specifically\, our algorithms outperform k-means by a large margin 
 in terms of accuracy\, and run several times faster than approximate spe
 ctral clustering based on the Nystrom method\, with comparable accuracy 
 and significantly smaller memory footprint. Remarkably\, our algorithms 
 make it possible for a single machine to spectral cluster data sets with
  a million observations within several minutes.
SUMMARY:Fast Approximate Spectral Clustering
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T171500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T165000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4919
DESCRIPTION:Best Student Paper Award Winner\n\nSharing healthcare data ha
 s become a vital requirement in healthcare system management\; however\,
  inappropriate sharing and usage of healthcare data could threaten patie
 nts' privacy. In this paper\, we study the privacy concerns of the blood
  transfusion information-sharing system between the Hong Kong Red Cross 
 Blood Transfusion Service (BTS) and public hospitals\, and identify the 
 major challenges that make traditional data anonymization methods not ap
 plicable. Furthermore\, we propose a new privacy model called LKC-privac
 y\, together with an anonymization algorithm\, to meet the privacy and i
 nformation requirements in this BTS case. Experiments on the real-life d
 ata demonstrate that our anonymization algorithm can effectively retain 
 the essential information in anonymous data for data analysis and is sca
 lable for anonymizing large datasets.
SUMMARY:Anonymizing Healthcare Data: A Case Study on the Blood Transfusio
 n Service
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T170500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T165000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4920
DESCRIPTION:The problem of ethnicity identification from names has a vari
 ety of important applications\, including biomedical research\, demograp
 hic studies\, and marketing.  Here we report on the development of an et
 hnicity classifier where all training data is extracted from public\, no
 n-confidential (and hence somewhat unreliable) sources.  Our classifier 
 uses hidden Markov models (HMMs) and decision trees to classify names in
 to 13 cultural/ethnic groups with individual group accuracy comparable t
 o earlier binary (e.g.\, Spanish/non-Spanish) classifiers.  We 
 have applied this classifier to over 20 million names from a large-scale
  news corpus\, identifying interesting temporal and spatial trends on th
 e representation of particular cultural/ethnic groups.
SUMMARY:Name-Ethnicity Classification from Open Sources
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T124500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T123000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4921
DESCRIPTION:In realistic settings the prevalence of a class may change af
 ter a classifier is induced and this will degrade the performance of the
  classifier. Further complicating this scenario is the fact that labeled
  data is often scarce and expensive. In this paper we address the proble
 m where the class distribution changes and only unlabeled examples are a
 vailable from the new distribution. We design and evaluate a number of m
 ethods for coping with this problem and compare the performance of these
  methods. Our quantification-based methods estimate the class distributi
 on of the unlabeled data from the changed distribution and adjust the or
 iginal classifier accordingly\, while our semi-supervised methods build 
 a new classifier using the examples from the new (unlabeled) distributio
 n which are supplemented with predicted class values. We also introduce 
 a hybrid method that utilizes both quantification and semi-supervised le
 arning. All methods are evaluated using accuracy and F-measure on a set 
 of benchmark data sets. Our results demonstrate that our methods yield s
 ubstantial improvements in accuracy and F-measure.
SUMMARY:Quantification and Semi-supervised Classification Methods for Han
 dling Changes in Class Distribution
LOCATION:La Seine C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T144000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T142500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4922
DESCRIPTION:Influence maximization is the problem of finding a small subs
 et of nodes (seed nodes) in a social network that could maximize the spr
 ead of influence.  In this paper\, we study the efficient influence maxi
 mization from two complementary directions.  One is to improve the origi
 nal greedy algorithm of [KKT03] and its improvement [LKGFVG07] to furthe
 r reduce its running time\, and the second is to propose new degree disc
 ount heuristics that improve influence spread.  We evaluate our algorit
 hms by experiments on two large academic collaboration graphs obtained f
 rom the online archival database arXiv.org.  Our experimental results sh
 ow that (a) our improved greedy algorithm achieves better running time c
 ompared with the improvement of [LKGFVG07] with matching influence spre
 ad\, (b) our degree discount heuristics achieve much better influence sp
 read than classic degree and centrality-based heuristics\, and when tune
 d for a specific influence cascade model\, it achieves almost matching i
 nfluence spread with the greedy algorithm\, and more importantly (c) the
  degree discount heuristics run only in milliseconds while even the impr
 oved greedy algorithms run in hours in our experiment graphs with a few 
 tens of thousands of nodes.\n\nBased on our results\, we believe that fi
 ne-tuned heuristics may provide truly scalable solutions to the influenc
 e maximization problem with satisfying influence spread and blazingly fa
 st running time. Therefore\, contrary to what is implied by the conclusi
 on of [KKT03] that traditional heuristics are outperformed by the greedy
  approximation algorithm\, our results shed new light on the research of
  heuristic algorithms.
SUMMARY:Efficient Influence Maximization in Social Networks
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T120500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T115000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4923
DESCRIPTION:The pervasiveness of mobile devices and location based servic
 es is leading to an increasing volume of mobility data. This side effect
  provides the opportunity for innovative methods that analyze the behavi
 ors of movements.\n\nIn this paper we propose WhereNext\, which is a met
 hod aimed at predicting with a certain level of accuracy the next locati
 on of a moving object. The prediction uses previously extracted movement
  patterns named Trajectory Patterns\, which are a concise representation
  of behaviors of moving objects as sequences of regions frequently visit
 ed with a typical travel time.\n\nA decision tree\, named T-pattern Tree
 \, is built and evaluated with a formal training and test process. The t
 ree is learned from the Trajectory Patterns that hold a certain area and
  it may be used as a predictor of the next location of a new trajectory 
 finding the best matching path in the tree. Three different best matchin
 g methods to classify a new moving object are proposed and their impact 
 on the quality of prediction is studied extensively.\n\nUsing Trajectory
  Patterns as predictive rules has the following implications: (I) the le
 arning depends on the movement of all available objects in a certain are
 a instead of on the individual history of an object\; (II) the predictio
 n tree intrinsically contains the spatio-temporal properties that have e
 merged from the data and this allows us to define matching methods that 
 strictly depend on the properties of such movements.\n\nIn addition\, we
  propose a set of other measures\, that evaluate a priori the predictive
  power of a set of Trajectory Patterns. These measures were tuned on a re
 al life case study. Finally\, an exhaustive set of experiments and resul
 ts on the real dataset are presented.\n
SUMMARY:WhereNext: a Location Predictor on Trajectory Pattern Mining
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T115000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T113500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4924
DESCRIPTION:Given a real\, and weighted person-to-person network which ch
 anges over time\, what can we say about the cliques that it contains? Do
  the incidents of communication\, or weights on the edges of a clique fo
 llow any pattern? Real\, and in-person social networks have many more tr
 iangles than chance would dictate. As it turns out\, there are many more
  cliques than one would expect\, in surprising patterns.\n\nIn this pape
 r\, we study massive real-world social networks formed by direct contact
 s among people through various personal communication services\, such as
  Phone-Call\, SMS\, IM etc.  The contributions are the following: (a) we
  discover surprising patterns with the cliques\, (b) we report power-law
 s of the weights on the edges of cliques\, (c) our real networks follow 
 these patterns such that we can trust them to spot outliers and finally\
 , (d) we propose the first utility-driven graph generator for weighted t
 ime-evolving networks\, which match the observed patterns. Our study foc
 used on three large datasets\, each of which is a different type of comm
 unication service\, with over one million records\, and spans several mo
 nths of activity.
SUMMARY:Large Human Communication Networks: Patterns and a Utility-Driven
  Generator
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T120500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T115000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4925
DESCRIPTION:We present novel semi-supervised boosting algorithms that inc
 rementally build linear combinations of weak classifiers through generic
  functional gradient descent using both labeled and unlabeled training d
 ata. Our approach is based on extending information regularization frame
 work to boosting\, bearing loss functions that combine log loss on label
 ed data with the information-theoretic measures to encode unlabeled data
 . Even though the information-theoretic regularization terms make the op
 timization non-convex\, we propose simple sequential gradient descent op
 timization algorithms\, and obtain impressively improved results on synt
 hetic\, benchmark and real world tasks over supervised boosting algorith
 ms which use the labeled data alone and a state-of-the-art semi-supervis
 ed boosting algorithm.
SUMMARY:Information Theoretic Regularization for Semi-Supervised Boosting
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T153500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T152000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4926
DESCRIPTION:Given multiple time sequences with missing values\, we propos
 e DynaMMo which summarizes\, compresses\, and finds latent variables. Th
 e idea is to discover hidden variables and learn their dynamics\, making
  our algorithm able to function even when there are missing values. We p
 erformed experiments on both real and synthetic datasets spanning 
 several megabytes\, including motion capture sequences and chlorine leve
 ls in drinking water.   \n\nWe show that our proposed DynaMMo method \n(
 a) can successfully learn the latent variables and their evolution\; \n(
 b) can provide high compression for little loss of reconstruction accura
 cy\; \n(c) can extract compact but powerful features for segmentation\, 
 interpretation\, and forecasting\; \n(d) has complexity linear in the du
 ration of sequences.
SUMMARY:DynaMMo: Mining and Summarization of Coevolving Sequences with Mi
 ssing Values
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T120500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T115000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4927
DESCRIPTION:Recent innovations have resulted in a plethora of social appl
 ications on the Web\, such as blogs\, social networks\, and community ph
 oto and video sharing applications. Such applications can typically be r
 epresented as evolving interaction graphs with nodes denoting entities a
 nd edges representing their interactions. The study of entities and comm
 unities and how they evolve in such large dynamic graphs is both importa
 nt and challenging.\n\nWhile much of the past work in this area has focu
 sed on static analysis\, more recently researchers have investigated dyn
 amic analysis. In this paper\, in a departure from recent efforts\, we c
 onsider the problem of analyzing patterns and critical events that affec
 t the dynamic graph from the viewpoint of a single node\, or a selected 
 subset of nodes. Defining and extracting a relevant viewpoint neighborho
 od efficiently\, while also quantifying the key relationships among node
 s involved are the key challenges we address. We also examine the evolut
 ion of viewpoint neighborhoods for different entities over time to ident
 ify key structural and behavioral transformations that occur.
SUMMARY:A Viewpoint-based Approach for Interaction Graph Analysis
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T142500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T140000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4928
DESCRIPTION:Corruption of data by class-label noise is an important pract
 ical concern impacting many classification problems. Studies of data cle
 aning techniques often assume a uniform label noise model\, however\, wh
 ich is seldom realized in practice. Relatively little is understood as
  to how the natural label noise distribution can be measured or simulate
 d. Using email spam-filtering data\, we demonstrate that class noise can
  have substantial content-specific bias. We also demonstrate that noise
  detection techniques based on classifier confidence tend to identify ins
 tances that human assessors are likely to label in error. We show that g
 enre modeling can be very informative in identifying potential areas of 
 mislabeling. Moreover\, we are able to show that genre decomposition can
  also be used to substantially improve spam filtering accuracy\, with ou
 r results outperforming the best published figures for the trec05-p1 and
  ceas-2008 benchmark collections.
SUMMARY:Genre-based Decomposition of Email Class Noise
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T120500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T115000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4929
DESCRIPTION:We describe a recommender system in the domain of grocery sho
 pping. While recommender systems have been widely studied\, this is most
 ly in relation to leisure products (e.g. movies\, books and music) with 
 non-repeated purchases. In grocery shopping\, however\, consumers will m
 ake multiple purchases of the same or very similar products more frequen
 tly than buying entirely new items. The proposed recommendation scheme o
 ffers several advantages in addressing the grocery shopping problem\, na
 mely: 1) a product similarity measure that suits a domain where no ratin
 g information is available\; 2) a basket sensitive random walk model to 
 approximate product similarities by exploiting incomplete neighborhood i
 nformation\; 3) online adaptation of the recommendation based on the cur
 rent basket and 4) a new performance measure focusing on products that c
 ustomers have not purchased before or purchase infrequently. Empirical r
 esults benchmarking on three real-world data sets demonstrate a performa
 nce improvement of the proposed method over other existing collaborative
  filtering models.
SUMMARY:Grocery Shopping Recommendations Based on Basket-Sensitive Random
  Walk
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T120000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T114500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4930
DESCRIPTION:In recent years\, the number of patents filed by the business
  enterprises in the technology industry is growing rapidly\, thus provi
 ding unprecedented opportunities for knowledge discovery in patent data.
  One important task in this regard is to employ data mining techniques t
 o rank patents in terms of their potential to earn money through licensi
 ng.  Availability of such ranking can substantially reduce enterprise IP
  (Intellectual Property) management costs. Unfortunately\, the existing 
  software systems in the IP domain do not address this task directly. Th
 rough our research\, we built patent ranking software\, named COA (Cla
 im Originality Analysis) that rates a patent based on its value by measu
 ring the recency and the impact of the important phrases that appear in 
 the "claims" section of a patent. Experiments show that COA produces mea
 ningful ranking when comparing it with other indirect patent evaluation 
 metrics: citation count\, patent status\, and attorney's rating. In re
 al-life settings\, this tool was used by beta-testers in the IBM IP depa
 rtment. Lawyers found it very useful in patent rating\, specifically\, i
 n highlighting potentially valuable patents in a patent cluster.  In thi
 s article\, we describe the ranking techniques and system architecture o
 f COA. We also present the results that validate its effectiveness.
SUMMARY:COA: Finding Novel Patents through Text Analysis
LOCATION:La Seine C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T145000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T142500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4931
DESCRIPTION:In traditional text clustering methods\, documents are repres
 ented as bags of words without considering the semantic information of
  each document. For instance\, if two documents use different collection
 s of core words to represent the same topic\, they may be falsely assi
 gned to different clusters due to the lack of shared core words\, althou
 gh the core words they use are probably synonyms or semantically associa
 ted in other forms. The most common way to solve this problem is to enri
 ch document representation with the background knowledge in an ontology.
  There are two major issues for this approach: (1) the coverage of the o
 ntology is limited\, even for WordNet or MeSH\; (2) using ontology terms
  as replacement or additional features may cause information loss\, or i
 ntroduce noise. In this paper\, we present a novel text clustering metho
 d to address these two issues by enriching document representation with 
 Wikipedia concept and category information. We develop two approaches\, 
 exact match and relatedness-match\, to map text documents to Wikipedia c
 oncepts\, and further to Wikipedia categories. Then the text documents a
 re clustered based on a similarity metric which combines document conten
 t information\, concept information as well as category information. The
  experimental results using the proposed clustering framework on three d
 atasets (20-newsgroup\, TDT2\, and LA Times) show that clustering perfor
 mance improves significantly by enriching document representation with W
 ikipedia concepts and categories.
SUMMARY:Exploiting Wikipedia as External Knowledge for Document Clusterin
 g
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T113500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T112000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4932
DESCRIPTION:Algorithms based on simulating stochastic flows are a simple 
 and natural solution for the problem of clustering graphs\, but their w
 idespread use has been hampered by their lack of scalability and fragmen
 tation of output. In this article we present a multi-level algorithm for
  graph clustering using flows that delivers significant improvements in 
 both quality and speed. The graph is first successively coarsened to a m
 anageable size\, and a small number of iterations of flow simulation is 
 performed on the coarse graph. The graph is then successively refined\, 
 with flows from the previous graph used as initializations for brief flo
 w simulations on each of the intermediate graphs. When we reach the fina
 l refined graph\, the algorithm is run to convergence and the high-flow 
 regions are clustered together\, with regions without any flow forming
  the natural boundaries of the clusters. Extensive experimental results o
 n several real and synthetic datasets demonstrate the effectiveness of o
 ur approach when compared to state-of-the-art algorithms.
SUMMARY:Scalable Graph Clustering Using Stochastic Flows: Applications to
  Community Discovery
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T105500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T103000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4933
DESCRIPTION:Software is a ubiquitous component of our daily life. We ofte
 n depend on the correct working of software systems.  Due to the difficu
 lty and complexity of software systems\, bugs and anomalies are prevalen
 t. Bugs have caused billions of dollars in losses\, in addition to
  privacy and security threats. In this work\, we address software
  reliability issue
 s by proposing a novel method to classify software behaviors based on pa
 st history or runs. With the technique\, it is possible to generalize pa
 st known errors and mistakes to capture failures and anomalies. Our tech
 nique first mines a set of discriminative features capturing repetitive 
 series of events from program execution traces. It then performs feature
  selection to select the best features for classification. These feature
 s are then used to train a classifier to detect failures. Experiments an
 d case studies on traces of several benchmark software systems and a rea
 l-life concurrency bug from MySQL server show the utility of the techniq
 ue in capturing failures and anomalies. On average\, our pattern-based c
 lassification technique outperforms the baseline approach by 24.68% in a
 ccuracy.
SUMMARY:Classification of Software Behaviors for Failure Detection: A Dis
 criminative Pattern Mining Approach
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T113500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T111000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4934
DESCRIPTION:Address standardization is a very challenging task in data
  cleansing. To provide better customer relationship management and
  business intelligence for customer-oriented corporations\, millions of
  free-text addresses need to be converted to a standard format for data
  integration\, de-duplication and householding. Existing commercial
  tools usually employ many hand-crafted\, domain-specific rules and
  reference data dictionaries of cities\, states\, etc. These rules work
  well for the region they are designed for. However\, rule-based
  methods usually require considerable human effort to rewrite these
  rules for each new domain\, since address data are very irregular and
  vary with countries and regions. Supervised learning methods are
  usually more adaptable than rule-based approaches. However\,
  supervised methods need large-scale labeled training data\, and
  building a large-scale annotated corpus for each target domain is
  labor-intensive and time-consuming. To minimize human effort and the
  size of the labeled training data set\, we present a free-text address
  standardization method with latent semantic association (LaSA). A LaSA
  model is constructed to capture latent semantic associations among
  words from the unlabeled corpus. The original term space of the target
  domain is first projected to a concept space using the LaSA model\;
  the address standardization model is then actively learned from LaSA
  features and informative samples. The proposed method effectively
  captures the data distribution of the domain. Experimental results on
  large-scale English and Chinese corpora show that the proposed method
  significantly enhances the performance of standardization with less
  effort and training data.
SUMMARY:Address Standardization with Latent Semantic Association
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T172000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T170500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4935
DESCRIPTION:This paper addresses the issue of unsupervised network anomal
 y detection. In recent years\, networks have played more and more critic
 al roles. Since their outages cause serious economic losses\, it is quit
 e significant to monitor their changes over time and to detect anomalies
  as early as possible. In this paper\, we specifically focus on the mana
 gement of the whole network. In it\, it is important to detect anomalies
  which have a great impact on the whole network\, and the other local anom
 alies should be ignored. Further\, when we detect the former anomalies\,
  it is required to localize nodes responsible for them. It is challengin
 g to simultaneously perform the above two tasks taking into account th
 e nonstationarity and strong correlations between nodes.\n\nWe propose a
  network anomaly detection method which resolves the above two tasks in 
 a unified way. The key ideas of the method are: (1) construction of quant
 ities representing features of a whole network and each node from the sam
 e input based on eigen equation compression\, and (2) incremental anomalo
 usness scoring based on learning the probability distribution of the qua
 ntities.\n\nWe demonstrate through the experimental results using two be
 nchmark data sets and a simulation data set that anomalies of a whole ne
 twork and nodes responsible for them can be detected by the proposed met
 hod.
SUMMARY:Network Anomaly Detection based on Eigen Equation Compression
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T112000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T105500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4936
DESCRIPTION:Conditional random fields (CRFs) are a class of undirected gra
 phical models which have been widely used for classifying and labeling s
 equence data. The training of CRFs is typically formulated as an unconst
 rained optimization problem that maximizes the conditional likelihood. H
 owever\, maximum likelihood training is prone to overfitting. To address
  this issue\, we propose a novel constrained nonlinear optimization form
 ulation in which the prediction accuracy of cross-validation sets is in
 cluded as constraints. Instead of requiring multiple passes of training\
 , the constrained formulation allows the cross-validation to be handled in
  one pass of constrained optimization.\n\nThe new formulation is discont
 inuous\, and classical Lagrangian based constraint handling methods are 
 not applicable. A new constrained optimization algorithm based on the re
 cently proposed extended saddle point theory is developed to learn the c
 onstrained CRF model. Experimental results on gene and stock-price predi
 ction tasks show that the constrained formulation is able to significan
 tly improve the generalization ability of CRF training.
SUMMARY:Constrained Optimization for Validation-Guided Conditional Random
  Field Learning
LOCATION:Foyer Rives de Seine - Pont des Arts A & B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T151500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T145000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4937
DESCRIPTION:Search logs\, which contain rich and up-to-date information a
 bout users' needs and preferences\, have become a critical data source f
 or search engines. Recently\, more and more data-driven applications are
  being developed in search engines based on search logs\, such as query 
 suggestion\, keyword bidding\, and dissatisfactory query analysis. In th
 is paper\, by observing that many data-driven applications in search eng
 ines rely heavily on online mining of search logs\, we develop an OLAP sy
 stem on search logs which serves as an infrastructure supporting various
  data-driven applications. An empirical study using real data of over tw
 o billion query sessions demonstrates the usefulness and feasibility of 
 our design.
SUMMARY:OLAP on Search Logs: An Infrastructure Supporting Data-Driven App
 lications in Search Engines
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T144000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T142500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4938
DESCRIPTION:Lately there exist increasing demands for online abnormality 
 monitoring over trajectory streams\, which are obtained from moving obje
 ct tracking devices. This problem is challenging due to the requirement 
 of high speed data processing within limited space cost. In this paper\,
  we present a novel framework for monitoring anomalies over continuous t
 rajectory streams. First\, we illustrate the importance of distance-base
 d anomaly monitoring over moving object trajectories. Then\, we utilize 
 the local continuity characteristics of trajectories to build local clus
 ters upon trajectory streams and monitor anomalies via efficient pruning
  strategies. Finally\, we propose a piecewise metric index structure to 
 reschedule the joining order of local clusters to further reduce the tim
 e cost. Our extensive experiments demonstrate the effectiveness and effi
 ciency of our methods.
SUMMARY:Efficient Anomaly Monitoring Over Moving Object Trajectory Stream
 s
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T142500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T140000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4939
DESCRIPTION:Linear classifiers have been shown to be effective for many d
 iscrimination tasks.  Irrespective of the learning algorithm itself\, th
 e final classifier has a weight to multiply by each feature. This sugges
 ts that ideally each input feature should be linearly correlated with th
 e target variable (or anti-correlated)\, whereas raw features may be hig
 hly non-linear.  In this paper\, we attempt to re-shape each input featu
 re so that it is appropriate to use with a linear weight and to scale th
 e different features in proportion to their predictive value.  We demons
 trate that this pre-processing is beneficial for linear SVM classifiers 
 on a large benchmark of text classification tasks as well as UCI dataset
 s.
SUMMARY:Feature Shaping for Linear SVM Classifiers
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T170500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T165000
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4940
DESCRIPTION:As the number and size of large timestamped collections (e.g.
  sequences of digitized newspapers\, periodicals\, blogs) increase\, the
  problem of efficiently indexing and searching such data becomes more im
 portant. Term burstiness has been extensively researched as a mechanism 
 to address event detection in the context of such collections. In this p
 aper\, we explore how burstiness information can be further utilized to 
 enhance the search process. We present a novel approach to model the bur
 stiness of a term\, using discrepancy theory concepts. This allows us to
  build a parameter-free\, linear-time approach to identify the time inte
 rvals of maximum burstiness for a given term.\nFinally\, we describe the
  first burstiness-driven search framework and thoroughly evaluate our ap
 proach in the context of different scenarios.
SUMMARY:On Burstiness-aware Search for Document Sequences
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T122000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T120500
DTSTAMP;VALUE=DATE-TIME:20120516T153738Z
UID:4941
DESCRIPTION:Recently many data types arising from data mining and Web sea
 rch applications can be modeled as bipartite graphs. Examples include qu
 eries and URLs in query logs\, and authors and papers in scientific lite
 rature. However\, one of the issues is that previous algorithms only con
 sider the content and link information from one side of the bipartite gr
 aph. There is a lack of constraints to make sure the final relevance of 
 the score propagation on the graph\, as there are many noisy edges withi
 n the bipartite graph. In this paper\, we propose a novel and general Co
 -HITS algorithm to incorporate the bipartite graph with the content info
 rmation from both sides as well as the constraints of relevance. Moreove
 r\, we investigate the algorithm based on two frameworks\, including the
  iterative and the regularization frameworks\, and illustrate the genera
 lized Co-HITS algorithm from different views. For the iterative framewor
 k\, it contains HITS and personalized PageRank as special cases. In the 
 regularization framework\, we successfully build a connection with HITS\
 , and develop a new cost function to consider the direct relationship be
 tween two entity sets\, which leads to a significant improvement over th
 e baseline method. To illustrate our methodology\, we apply the Co-HITS 
 algorithm\, with many different settings\, to the application of query s
 uggestion by mining the AOL query log data. Experimental results demonst
 rate that CoRegu-0.5 (i.e.\, a model of the regularization framework) ac
 hieves the best performance with consistent and promising improvements.
SUMMARY:A Generalized Co-HITS Algorithm and Its Application to Bipartite 
 Graphs
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T121500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T115000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4942
DESCRIPTION:Advanced analysis of data streams is quickly becoming a key a
 rea of data mining research as the number of applications demanding such
  processing increases. Online mining when such data streams evolve over 
 time\, that is when concepts drift or change completely\, is becoming on
 e of the core issues. When tackling non-stationary concepts\, ensembles 
 of classifiers have several advantages over single classifier methods: t
 hey are easy to scale and parallelize\, they can adapt to change quickly
  by pruning under-performing parts of the ensemble\, and they therefore 
 usually also generate more accurate concept descriptions.\n\nThis paper 
 proposes a new experimental data stream framework for studying concept d
 rift\, and two new variants of Bagging: ADWIN Bagging and Adaptive-Size
  Hoeffding Tree (ASHT) Bagging. Using the new experimental framework\, a
 n evaluation study on synthetic and real-world datasets comprising up to
  ten million examples shows that the new ensemble methods perform very w
 ell compared to several known methods.
SUMMARY:New ensemble methods for evolving data streams
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T145500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T144000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4943
DESCRIPTION:How can we automatically spot all outstanding observations in
  a data set? This question arises in a large variety of applications\, e
 .g. in economy\, biology and medicine. Existing approaches to outlier de
 tection suffer from one or more of the following drawbacks: The results
  of many methods strongly depend on suitable parameter settings that
  are very difficult to estimate without background knowledge on the
  data\, e.g
 . the minimum cluster size or the number of desired outliers. Many metho
 ds implicitly assume Gaussian or uniformly distributed data\, and/or the
 ir result is difficult to interpret. To cope with these problems\, we pr
 opose CoCo\, a technique for parameter-free outlier detection. The basic
  idea of our technique relates outlier detection to data compression: Ou
 tliers are objects which cannot be effectively compressed given the dat
 a set. To avoid the assumption of a certain data distribution\, CoCo rel
 ies on a very general data model combining the Exponential Power Distri
 bution with Independent Components. We define an intuitive outlier facto
 r based on the principle of the Minimum Description Length together with
  a novel algorithm for outlier detection. An extensive experimental eva
 luation on synthetic and real world data demonstrates the benefits of ou
 r technique. Availability: The source code of CoCo and the data sets use
 d in the experiments are available at:\nhttp://www.dbs.ifi.lmu.de/Forsch
 ung/KDD/Boehm/CoCo.
SUMMARY:CoCo: Coding Cost for Parameter-free Outlier Detection
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T112500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T111000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4944
DESCRIPTION:Time series prediction is an important issue in a wide range 
 of areas. There are various real world processes whose states vary conti
 nuously\, and those processes may have influences on each other. If the 
 past information of one process X improves the predictability of another
  process Y\, X is said to have a causal influence on Y. In order to make
  good predictions\, it is necessary to identify the appropriate causal r
 elationships. In addition\, the processes to be modeled may include symb
 olic data as well as numerical data. Therefore\, it is important to deal
  with symbolic and numerical time series seamlessly when attempting to d
 etect causality.\n\nIn this paper\, we propose a new method for quantify
 ing the strength of the causal influence from one time series to another
 . The proposed method can represent the strength of causality as the num
 ber of bits\, whether each of two time series is symbolic or numerical. 
 The proposed method can quantify causality even from a small number of s
 amples. In addition\, we propose structuring and modeling methods for mu
 ltivariate time series using causal relationships of two time series. Ou
 r structuring and modeling methods can also deal with data sets which in
 clude both types of time series. Experimental results demonstrate that o
 ur methods can perform well even if the number of samples is small.
SUMMARY:Causality Quantification and Its Applications: Structuring and Mo
 deling of Multivariate Time Series
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T173500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T172000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4945
DESCRIPTION:One common predictive modeling challenge in text minin
 g problems is that the training data and the operational (testing) data 
 are drawn from different underlying distributions. This poses a great di
 fficulty for many statistical learning methods. However\, when the distr
 ibutions in the source domain and the target domain are not identical but
  related\, there may exist a shared concept space to preserve the relati
 on. Consequently a good feature representation can encode this concept s
 pace and minimize the distribution gap. To formalize this intuition\, we
  propose a domain adaptation method that parameterizes this concept spac
 e by a linear transformation under which we explicitly minimize the distri
 bution difference between the source domain with sufficient labeled data
  and target domains with only unlabeled data\, while at the same time mi
 nimizing the empirical loss on the labeled data in the source domain. An
 other characteristic of our method is its capability for considering mul
 tiple classes and their interactions simultaneously. We have conducted e
 xtensive experiments on two common text mining problems\, namely\, infor
 mation extraction and document classification to demonstrate the effecti
 veness of our proposed method.
SUMMARY:Extracting Discriminative Concepts for Domain Adaptation in Text 
 Mining
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T165500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T164000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4946
DESCRIPTION:Research in relational data mining has two major directions: 
 finding global models of a relational database and the discovery of loca
 l relational patterns within a database. While relational patterns show 
 how attribute values co-occur in detail\, their huge numbers hamper thei
 r usage in data analysis. Global models\, on the other hand\, only provi
 de a summary of how different tables and their attributes relate to each
  other\, lacking detail of what is going on at the local level.\n\nIn th
 is paper we introduce a new approach that combines the positive properti
 es of both directions: it provides a detailed description of the complet
 e database using a small set of patterns. In particular\, we utilis
 e a rich pattern language and show how a database can be encoded by such
  patterns. Then\, based on the MDL principle\, the novel RDB-KRIMP algori
 thm selects the set of patterns that allows for the most succinct encodi
 ng of the database. This set\, the code table\, is a compact description
  of the database in terms of local relational patterns. We show that thi
 s resulting set is very small\, both in terms of database size and in nu
 mber of its local relational patterns: a reduction of up to 4 orders of 
 magnitude is attained.
SUMMARY:Characteristic Relational Patterns
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T105500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T103000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4947
DESCRIPTION:Website traffic varies through time in consistent and predict
 able ways\, with highest traffic in the middle of the day.  When providi
 ng media content to visitors\, it is important to present repeat visitor
 s with new content so that they keep coming back.  In this paper we pres
 ent an algorithm to balance the need to keep a website fresh with new co
 ntent with the desire to present the best content to the most visitors a
 t times of peak traffic.  We formulate this as the media scheduling prob
 lem\, where we attempt to maximize total clicks\, given the overall traf
 fic pattern and the time varying clickthrough rates of available media c
 ontent.  We present an efficient algorithm to perform this scheduling un
 der certain conditions and apply this algorithm to real data obtained fr
 om server logs\, showing evidence of significant improvements in traffic
  from our algorithmic schedules.  Finally\, we analyze the click data\, 
 presenting models for why and how the clickthrough rate for new content 
 declines as it ages.
SUMMARY:Optimizing Web Traffic via the Media Scheduling Problem
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T112000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T105500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4948
DESCRIPTION:In this study\, we formalize a multi-focal learning problem\,
  where training data are partitioned into several different focal groups
  and the prediction model will be learned within each focal group. The m
 ulti-focal learning problem is motivated by numerous real-world learning
  applications. For instance\, for the same type of problems encountered 
 in a customer service center\, the problem descriptions from different c
 ustomers can be quite different. The experienced customers usually give 
 more precise and focused descriptions about the problem. In contrast\, t
 he inexperienced customers usually provide more diverse descriptions. In
  this case\, the examples from the same class in the training data can b
 e naturally in different focal groups. As a result\, it is necessary to 
 identify those natural focal groups and exploit them for learning at dif
 ferent focuses. The key developmental challenge is how to identify those
  focal groups in the training data. As a case study\, we exploit multi-f
 ocal learning for profiling problems in customer service centers. The re
 sults show that multi-focal learning can significantly boost the learning
  accuracies of existing learning algorithms\, such as Support Vector Mac
 hines (SVMs)\, for classifying customer problems.
SUMMARY:Multi-focal Learning and Its Application to Customer Service Supp
 ort
LOCATION:La Seine C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T162500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T160000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4949
DESCRIPTION:Network data is ubiquitous\, encoding collections of relation
 ships between entities such as people\, places\, genes\, or corporatio
 ns.  While many resources for networks of interesting entities are em
 erging\, most of these can only annotate connections in a limited fash
 ion.  Although relationships between entities are rich\, it is impracti
 cal to manually devise complete characterizations of these relationshi
 ps for every pair of entities on large\, real-world corpora.\n\nIn thi
 s paper we present a novel probabilistic topic model to analyze text c
 orpora and infer descriptions of its entities and of relationships bet
 ween those entities.  We develop variational methods for performing ap
 proximate inference on our model and demonstrate that our model can be p
 ractically deployed on large corpora such as Wikipedia.  We show quali
 tatively and quantitatively that our model can construct and annotate
  graphs of relationships and make useful predictions.
SUMMARY:Connections between the Lines: Augmenting Social Networks with Te
 xt
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T105500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T103000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4950
DESCRIPTION:Probabilistic frequent itemset mining in uncertain transactio
 n databases semantically and computationally differs from traditional te
 chniques applied to standard certain transaction databases. The consid
 eration of existential uncertainty of item(sets)\, indicating the probab
 ility that an item(set) occurs in a transaction\, makes traditional tech
 niques inapplicable. In this paper\, we introduce new probabilistic form
 ulations of frequent itemsets based on possible world semantics. In this
  probabilistic context\, an itemset X is called frequent if the probabil
 ity that X occurs in at least minSup transactions is above a given thres
 hold τ. To the best of our knowledge\, this is the first approach addres
 sing this problem under possible worlds semantics. In consideration of t
 he probabilistic formulations\, we present a framework which is able to 
 solve the Probabilistic Frequent Itemset Mining (PFIM) problem efficient
 ly. An extensive experimental evaluation investigates the impact of our 
 proposed techniques and shows that our approach is orders of magnitude f
 aster than straight-forward approaches.
SUMMARY:Probabilistic Frequent Itemset Mining in Uncertain Databases
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T145000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T142500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4951
DESCRIPTION:Max-margin Markov networks (M3N) have shown great promise in 
 structured prediction and relational learning. Due to the KKT conditions
 \, the M3N enjoys dual sparsity. However\, the existing M3N formulation 
 does not enjoy primal sparsity\, which is a desirable property for selec
 ting significant features and reducing the risk of over-fitting. In this
  paper\, we present an l1-norm regularized max-margin Markov network (l1
 -M3N)\, which enjoys dual and primal sparsity simultaneously. To learn a
 n l1-M3N\, we present three methods including projected sub-gradient\, c
 utting-plane\, and a novel EM-style algorithm\, which is based on an equ
 ivalence between l1-M3N and an adaptive M3N. We perform extensive empiri
 cal studies on both synthetic and real data sets. Our experimental resul
 ts show that: (1) l1-M3N can effectively select significant features\; (
 2) l1-M3N can perform as well as the pseudo-primal sparse Laplace M3N in
  prediction accuracy\, while consistently outperforming other competing me
 thods that enjoy either primal or dual sparsity\; and (3) the EM-algorit
 hm is more robust than the other two in prediction accuracy and time eff
 iciency.
SUMMARY:Primal Sparse Max-Margin Markov Networks
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T120000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T114500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4952
DESCRIPTION:Tag recommendation is the task of predicting a personalized l
 ist of tags for a user given an item. This is important for many website
 s with tagging capabilities like last.fm or delicious. In this paper\, w
 e propose a method for tag recommendation based on tensor factorization 
 (TF). In contrast to other TF methods like higher order singular value d
 ecomposition (HOSVD)\, our method RTF ('ranking with tensor factorizatio
 n') directly optimizes the factorization model for the best personalized
  ranking. RTF handles missing values and learns from pairwise ranking co
 nstraints. Our optimization criterion for TF is motivated by a detailed 
 analysis of the problem and of interpretation schemes for the observed d
 ata in tagging systems. In all\, RTF directly optimizes for the actual p
 roblem using a correct interpretation of the data. We provide a gradient
  descent algorithm to solve our optimization problem. We also provide an
  improved learning and prediction method with runtime complexity analysi
 s for RTF. The prediction runtime of RTF is independent of the number of
  observations and only depends on the factorization dimensions. Besides 
 the theoretical analysis\, we empirically show that our method outperfor
 ms other state-of-the-art tag recommendation methods like FolkRank\, Pag
 eRank and HOSVD both in quality and prediction runtime.
SUMMARY:Learning Optimal Ranking with Tensor Factorization for Tag Recomm
 endation
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T145500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T144000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4953
DESCRIPTION:In large social networks\, nodes (users\, entities) are influ
 enced by others for various reasons. For example\, the colleagues have s
 trong influence on one's work\, while the friends have strong influence 
 on one's daily life. How to differentiate the social influences from dif
 ferent angles (topics)? How to quantify the strength of those social infl
 uences? How to estimate the model on real large networks?\n\nTo address 
 these fundamental questions\, we propose Topical Affinity Propagation (T
 AP) to model the topic-level social influence on large networks. In part
 icular\, TAP can take results of any topic modeling and the existing net
 work structure to perform topic-level influence propagation. With the he
 lp of the influence analysis\, we present several important applications
  on real data sets such as 1) what are the representative nodes on a giv
 en topic? 2) how to identify the social influences of neighboring nodes 
 on a particular node? \n\nTo scale to real large networks\, TAP is desig
 ned with efficient distributed learning algorithms that are implemented a
 nd tested under the Map-Reduce framework. We further present the common 
 characteristics of distributed learning algorithms for Map-Reduce. Final
 ly\, we demonstrate the effectiveness and efficiency of TAP on real larg
 e data sets.
SUMMARY:Social Influence Analysis in Large-scale Networks
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T113500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T112000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4954
DESCRIPTION:Opinion mining has become an important topic of study in recent y
 ears due to its wide range of applications. There are also many companie
 s offering opinion mining services. One problem that has not been studie
 d so far is the assignment of entities that have been talked about in ea
 ch sentence. Let us use forum discussions about products as an example t
 o make the problem concrete. In a typical discussion post\, the author m
 ay give opinions on multiple products and also compare them. The issue i
 s how to detect what products have been talked about in each sentence. I
 f the sentence contains the product names\, they need to be identified. 
 We call this problem entity discovery. If the product names are not expl
 icitly mentioned in the sentence but are implied due to the use of prono
 uns and language conventions\, we need to infer the products. We call th
 is problem entity assignment. These problems are important because witho
 ut knowing what products each sentence talks about the opinion mined fro
 m the sentence is of little use. In this paper\, we study these problems
  and propose two effective methods to solve the problems. Entity discove
 ry is based on pattern discovery and entity assignment is based on minin
 g of comparative sentences. Experimental results using a large number of
  forum posts demonstrate the effectiveness of the technique. Our system 
 has also been successfully tested in a commercial setting.
SUMMARY:Entity Discovery and Assignment for Opinion Mining Applications
LOCATION:Louis Armstrong A\,B\,C+D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T121500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4955
DESCRIPTION:Correlated or discriminative pattern mining is concerned with
  finding the highest scoring patterns w.r.t. a correlation measure (such
  as information gain). By reinterpreting correlation measures in ROC spa
 ce and formulating correlated itemset mining as a constraint programming
  problem\, we obtain new theoretical insights with practical benefits. M
 ore specifically\, we contribute 1) an improved bound for correlated ite
 mset miners\, 2) a novel iterative pruning algorithm to exploit the boun
 d\, and 3) an adaptation of this algorithm to mine all itemsets on the c
 onvex hull in ROC space. The algorithm does not depend on a minimal freq
 uency threshold and is shown to outperform several alternative approache
 s by orders of magnitude\, both in runtime and in memory requirements.
SUMMARY:Correlated Itemset Mining in ROC Space: A Constraint Programming 
 Approach
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T151000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T145500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4956
DESCRIPTION:Given a task T\, a pool of individuals X with different skill
 s\, and a social network G that captures the compatibility among these i
 ndividuals\, we study the problem of finding X'\, a subset of X\, to perf
 orm the task. We call this the Team Formation problem. We require that m
 embers of X' not only meet the skill requirements of the task\, but can a
 lso work effectively together as a team. We measure effectiveness using
  the communication cost incurred by the subgraph in G that only involves
  X'. We study two variants of the problem for two different communication-
 cost functions\, and show that both variants are NP-hard. We explore the
 ir connections with existing combinatorial problems and give novel algor
 ithms for their solution. To the best of our knowledge\, this is the fir
 st work to consider the Team Formation problem in the presence of a soci
 al network of individuals. Experiments on the DBLP dataset show that our
  framework works well in practice and gives useful and intuitive results
 .
SUMMARY:Finding a Team of Experts in Social Networks
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T165000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T162500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4957
DESCRIPTION:Currently\, the most significant line of defense against malw
 are is anti-virus products which focus on authenticating valid software 
 from a white list\, blocking invalid software from a black list\, and ru
 nning any unknown software (i.e.\, the gray list) in a controlled manner
 . The gray list\, containing unknown software programs which could be ei
 ther normal or malicious\, is usually authenticated or rejected manually
  by virus analysts. Unfortunately\, along with the development of the ma
 lware writing techniques\, the number of file samples in the gray list t
 hat need to be analyzed by virus analysts on a daily basis is constantly
  increasing. In this paper\, we develop an intelligent file scoring syst
 em (IFSS for short) for malware detection from the gray list by an ensem
 ble of heterogeneous base-level classifiers derived by different learnin
 g methods\, using different feature representations on dynamic training 
 sets. To the best of our knowledge\, this is the first work of applying 
 such ensemble methods for malware detection. IFSS makes it practical for
  virus analysts to identify malware samples from the huge gray list and
  improves the detection ability of anti-virus software. It has already b
 een incorporated into the scanning tool of Kingsoft's Anti-Virus softwar
 e. The case studies on large and real daily collection of the gray list 
 illustrate that the detection ability and efficiency of our IFSS system 
 outperform other popular scanning tools such as NOD32 and Kaspersky.
SUMMARY:Intelligent File Scoring System for Malware Detection from the Gr
 ay List
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T105500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T103000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4958
DESCRIPTION:For implementing content management solutions and enabling ne
 w applications associated with data retention\, regulatory compliance\, 
 and litigation issues\, enterprises need to develop advanced analytics t
 o uncover relationships among the documents\, e.g.\, content similarity\
 , provenance\, and clustering.  In this paper\, we evaluate the performa
 nce of four syntactic similarity algorithms. Three algorithms are based 
 on Broder's "shingling" technique while the fourth algorithm employs a
  more recent approach\, "content-based chunking". For our experiments\
 , we use a specially designed corpus of documents that includes a set of
  "similar" documents with a controlled number of modifications. Our pe
 rformance study reveals that the similarity metric of all four algorithm
 s is highly sensitive to settings of the algorithms' parameters: slidin
 g window size and fingerprint sampling frequency. We identify a useful r
 ange of these parameters for achieving good practical results\, and comp
 are the performance of the four algorithms in a controlled environment. 
  We validate our results by applying these algorithms to finding near-du
 plicates in two large collections of HP technical support documents.
SUMMARY:Applying Syntactic Similarity Algorithms for Enterprise Informati
 on Management
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T121500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T120000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4959
DESCRIPTION:Fault-tolerant frequent itemsets (FTFI) are variants of frequ
 ent itemsets for representing and discovering generalized knowledge. How
 ever\, despite growing interest in this field\, no previous approach min
 es proportional FTFIs with their exact support (FT-support).\n\nThis pro
 blem is difficult because of two concerns: (a) non anti-monotonic proper
 ty of FT-support when relaxation is proportional\, and (b) difficulty in
  computing FT-support. Previous efforts on this problem either simplify 
 the general problem by adding constraints\, or provide approximate solut
 ions without any error guarantees.\n\nIn this paper\, we address these c
 oncerns in the general FTFI mining problem. We limit the search space by
  providing provably correct anti-monotone bounds for FT-support and deve
 lop practically efficient means of achieving them. Besides\, we also pro
 vide an efficient and exact FT-support counting procedure.\n\nExtensive 
 experiments using real datasets validate that our solution is reasonably
  efficient for completely mining FTFIs. Implementations for the algorith
 ms are available from http://www3.ntu.edu.sg/home/asvivek/pubs/ftfim09/.
SUMMARY:Towards Efficient Mining of Proportional Fault-Tolerant Frequent 
 Itemsets
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T112000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T105500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4960
DESCRIPTION:This paper explores an important and relatively unstudied qua
 lity measure of a sponsored search advertisement: bounce rate. The bounc
 e rate of an ad can be informally defined as the fraction of users who c
 lick on the ad but almost immediately move on to other tasks. A high bou
 nce rate can lead to poor advertiser return on investment\, and suggests
  search engine users may be having a poor experience following the clic
 k. In this paper\, we first provide quantitative analysis showing that b
 ounce rate is an effective measure of user satisfaction. We then address
  the question\, can we predict bounce rate by analyzing the features of 
 the advertisement? An affirmative answer would allow advertisers and sea
 rch engines to predict the effectiveness and quality of advertisements b
 efore they are shown. We propose solutions to this problem involving lar
 ge-scale learning methods that leverage features drawn from ad creatives
  in addition to their keywords and landing pages.
SUMMARY:Predicting Bounce Rates in Sponsored Search Advertisements
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T111000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T105500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4961
DESCRIPTION:The importance of event logs as a source of information in
  systems and network management cannot be overemphasized. With the ever i
 ncreasing size and complexity of today's event logs\, the task of analyz
 ing event logs has become cumbersome to carry out manually. For this rea
 son recent research has focused on the automatic analysis of these log f
 iles. In this paper we present IPLoM (Iterative Partitioning Log Mining)
 \, a novel algorithm for the mining of clusters from event logs. Through
  a 3-Step hierarchical partitioning process IPLoM partitions log data in
 to its respective clusters. In its 4th and final stage IPLoM produces cl
 uster descriptions or line formats for each of the clusters produced. Un
 like other similar algorithms IPLoM is not based on the Apriori algorith
 m and it is able to find clusters in data whether or not its instances a
 ppear frequently. Evaluations show that IPLoM outperforms the other algo
 rithms statistically significantly\, and it is also able to achieve an a
 verage F-Measure performance of 78% when the closest other algorithm achiev
 es an F-Measure performance of 10%.
SUMMARY:Clustering Event Logs Using Iterative Partitioning
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T115000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T113500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4962
DESCRIPTION:Modern communication networks generate massive volumes of oper
 ational event data\, e.g.\, alarm\, alert\, and metrics\, which can be u
 sed by a network management system (NMS) to diagnose potential faults. I
 n this work\, we introduce a new class of indexable fault signatu
 res that encode temporal evolution of events generated by a network fa
 ult as well as topological relationships among the nodes where these eve
 nts occur. We present an efficient learning algorithm to extract such fa
 ult signatures from noisy historical event data\, and  with the help of 
 novel space-time indexing structures\, we show how to perform efficient\
 , online signature matching. We provide results from extensive experimen
 tal studies to explore the efficacy of our approach and point out potent
 ial applications of such signatures for many different types of networks
  including social and information networks.
SUMMARY:Learning\, Indexing\, and Diagnosing Network Faults
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T152000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T150500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4963
DESCRIPTION:All Netflix Prize algorithms proposed so far are prohibitivel
 y costly for large-scale production systems. In this paper\, we describe
  an efficient dataflow implementation of a collaborative filtering (CF) 
 solution to the Netflix Prize problem [1] based on weighted coclustering
  [5]. The dataflow library we use facilitates the development of sophist
 icated parallel programs designed to fully utilize commodity multicore h
 ardware\, while hiding traditional difficulties such as queuing\, thread
 ing\, memory management\, and deadlocks. The dataflow CF implementation 
 first compresses the large\, sparse training dataset into co-clusters. T
 hen it generates recommendations by combining the average ratings of the
  co-clusters with the biases of the users and movies. When configured to
  identify 20x20 co-clusters in the Netflix training dataset\, the implem
 entation predicted over 100 million ratings in 16.31 minutes and achieve
 d an RMSE of 0.88846 without any fine-tuning or domain knowledge. This i
 s an effective real-time prediction runtime of 9.7 µs per rating which
  is far superior to previously reported results. Moreover\, the implement
 ed co-clustering framework supports a wide variety of other large-scale 
 data mining applications and forms the basis for predictive modeling on 
 large\, dyadic datasets [4\, 7].
SUMMARY:Pervasive Parallelism in Data Mining: Dataflow solution to Co-clu
 stering Large and Sparse Netflix Data
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T173500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T172000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4964
DESCRIPTION:In this paper\, we consider the problem of combining link and
  content analysis for community detection from networked data\, such as 
  paper citation networks and the World Wide Web. Most existing approaches comb
 ine link and content information by a generative model that generates bo
 th links and contents via a shared set of community memberships. These g
 enerative models have some shortcomings in that they fail to consider
  additional factors that could affect the community memberships and isol
 ate the contents that are irrelevant to community memberships. To explic
 itly address these shortcomings\, we propose a discriminative model for 
 combining the link and content analysis for community detection. First\,
  we propose a conditional model for link analysis and in the model\, we 
 introduce hidden variables to explicitly model the popularity of nodes. 
 Second\, to alleviate the impact of irrelevant content attributes\, we d
 evelop a discriminative model for content analysis. These two models are
  unified seamlessly via the community memberships. We present efficient 
 algorithms to solve the related optimization problems based on bound opt
 imization and alternating projection. Extensive experiments with benchma
 rk data sets show that the proposed framework significantly outperforms 
 the state-of-the-art approaches for combining link and content analysis 
 for community detection.
SUMMARY:Combining Link and Content for Community Detection: A Discriminat
 ive Approach
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T105500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T103000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4965
DESCRIPTION:There is a growing number of service providers that a consume
 r can interact with over the web to learn their service terms. The servi
 ce terms\, such as price and time to completion of the service\, depend 
 on the consumer's particular specifications. For instance\, a printing s
 ervices provider would need from its customers specifications such as th
 e size of paper\, type of ink\, proofing and perforation. In a few secto
 rs\, there exist marketplace sites that provide consumers with specifica
 tions forms\, which the consumer can fill out to learn the service terms
  of multiple service providers. Unfortunately\, there are only a few suc
 h marketplace sites\, and they cover a few sectors.\n\nAt HP Labs\, we a
 re working towards building a universal marketplace site\, i.e.\, a mark
 etplace site that covers thousands of sectors and hundreds of providers 
 per sector. One issue in this domain is the automated discovery/retrieva
 l of the specifications for each sector. We address it through extractin
 g and analyzing content from the websites of the service providers liste
 d in business directories. The challenge is that each service provider i
 s often listed under multiple service categories in a business directory
 \, making it infeasible to utilize standard supervised learning techniqu
 es. We address this challenge through employing a multilabel statistical
  clustering approach within an expectation-maximization framework. We im
 plement our solution to retrieve specifications for 3000 sectors\, repre
 senting more than 300\,000 service providers. We discuss our results wit
 hin the context of the services needed to design a marketing campaign fo
 r a small business.
SUMMARY:Towards a Universal Marketplace over the Web: Statistical Multi-l
 abel Classification of Service Provider Forms with Simulated Annealing
LOCATION:Louis Armstrong A\,B\,C+D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T165000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T162500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4966
DESCRIPTION:Anomalous windows are the contiguous groupings of data points
 . In this paper\, we propose an approach for discovering anomalous windo
 ws using Scan Statistics for Linear Intersecting Paths (SSLIP). A linear
  path refers to a path represented by a line with a single dimensional s
 patial coordinate marking an observation point. Our approach for discove
 ring anomalous windows along linear paths comprises the following dis
 tinct steps: (a) Cross Path Discovery: where we identify a subset of int
 ersecting paths to be considered\, (b) Anomalous Window Discovery: where
  we outline three order invariant algorithms\, namely SSLIP\, Brute Forc
 e-SSLIP and Central Brute Force-SSLIP\, for the traversal of the cross p
 aths to identify varying size directional windows along the paths. For i
 dentifying an anomalous window we compute an unusualness metric\, in the
  form of a likelihood ratio to indicate the degree of unusualness of thi
 s window with respect to the rest of the data. We identify the window wi
 th the highest likelihood ratio as our anomalous window\, and (c) Monte 
 Carlo Simulations: to ascertain whether this window is truly anomalous a
 nd not just a random occurrence we perform hypothesis testing by computi
 ng a p-value using Monte Carlo Simulations. We present extensive experim
 ental results in real world accident datasets for various highways with 
 known issues (code and data available from [27]\, [21]). Our results show
  that our approach indeed is effective in identifying anomalous traffic 
 accident windows along multiple intersecting highways.
SUMMARY:Anomalous Window Discovery through Scan Statistics for Linear Int
 ersecting Paths (SSLIP)
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T162500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T160000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4967
DESCRIPTION:Classifying nodes in networks is a task with a wide range of 
 applications. It can be particularly useful in anomaly and fraud detecti
 on. Many resources are invested in the task of fraud detection due to th
 e high cost of fraud\, and being able to automatically detect potential 
 fraud quickly and precisely allows human investigators to work more eff
 iciently. Many data analytic schemes have been put into use\; however\, 
 schemes that bolster link analysis prove promising. This work builds upo
 n the belief propagation algorithm for use in detecting collusion and ot
 her fraud schemes. We propose an algorithm called SNARE (Social Network 
 Analysis for Risk Evaluation). By allowing one to use domain knowledge a
 s well as link knowledge\, the method was very successful for pinpointin
 g misstated accounts in our sample of general ledger data\, with a signi
 ficant improvement over the default heuristic in true positive rates\, a
 nd a lift factor of up to 6.5 (more than twice that of the default heuri
 stic). We also apply SNARE to the task of graph labeling in general on p
 ublicly-available datasets. We show that with only some information abou
 t the nodes themselves in a network\, we get surprisingly high accuracy 
 of labels. Not only is SNARE applicable in a wide variety of domains\, b
 ut it is also robust to the choice of parameters and highly scalable li
 nearly with the number of edges in a graph.
SUMMARY:SNARE: A Link Analytic System for Graph Labeling and Risk Detecti
 on
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T112000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T105500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4968
DESCRIPTION:In this paper\, we consider a novel scheme referred to as Car
 tesian contour to concisely represent the collection of frequent itemset
 s. Different from the existing works\, this scheme provides a complete v
 iew of these itemsets by covering the entire collection of them. More in
 terestingly\, it takes a first step in deriving a generative view of the
  frequent pattern formulation\, i.e.\, how a small number of patterns in
 teract with each other and produce the complexity of frequent itemsets. 
 We perform a theoretical investigation of the concise representation pro
 blem and link it to the biclique set cover problem and prove its NP-hard
 ness. We develop a novel approach utilizing the techniques developed in f
 requent itemset mining\, set cover\, and max k-cover to approximate the 
 minimal biclique set cover problem. In addition\, we consider several he
 uristic techniques to speed up the construction of Cartesian contour. The
  detailed experimental study demonstrates the effectiveness and efficien
 cy of our approach.
SUMMARY:Cartesian Contour: A Concise Representation for a Collection of F
 requent Sets
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T162500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T160000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4969
DESCRIPTION:Given a quarter of a petabyte of click log data\, how can we
  estim
 ate the relevance of each URL for a given query? In this paper\, we prop
 ose the Bayesian Browsing Model (BBM)\, a new modeling technique with th
 e following advantages:\n (a) it does exact inference\;\n (b) it is sin
 gle-p
 ass and parallelizable\;\n (c) it is effective.\n\nWe present two sets o
 f experiments to test model effectiveness and efficiency. On the first s
 et of over 50 million search instances of 1.1 million distinct queries\,
  BBM outperforms the state-of-the-art competitor by 29.2% in log-likelih
 ood while being 57 times faster. On the second click-log set\, spanning 
 a quarter of a petabyte of data\, we showcase the scalability of BBM: we
  impl
 emented it on a commercial MapReduce cluster\, and it took only 3 hours 
 to compute the relevance for 1.15 billion distinct query-URL pairs.
SUMMARY:BBM: Bayesian Browsing Model from Petabyte-scale Data
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T115000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T113500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4970
DESCRIPTION:A recent study by two prominent finance researchers\, Fama an
 d French\, introduces a new framework for studying risk vs. return: the 
 migration of stocks across size-value portfolio space.  Given the financ
 ial events of 2008\, this first attempt to disentangle the relationships
  between migration behavior and stock returns is especially timely.  The
 ir work\, however\, derives results only for market segments\, not indiv
 idual companies\, and only for one-year moves. Thus\, we see a new chall
 enge for financial data mining: how to capture and categorize the migrat
 ion of individual companies\, and how such behavior affects their return
 s.\n\nWe propose a novel data mining approach to study the multi-year mo
 vement of individual companies. Specifically\, we address the question: 
 "How does one discover frequent migration patterns in the stock market?
 " We present a new trajectory mining algorithm to discover migration mo
 tifs in financial markets. Novel features of this algorithm are its hand
 ling of approximate pattern matching through a graph theoretical method\
 , maximal clique identification\, and incorporation of temporal and spat
 ial constraints. We have performed a detailed study of the NASDAQ\, NYSE
 \, and AMEX stock markets\, over a 43-year span. We successfully find mi
 gration motifs that confirm existing finance theories and other motifs t
 hat may lead to new financial models.
SUMMARY:Migration Motif: A Spatial-Temporal Pattern Mining Approach for F
 inancial Markets
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T114500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T112000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4971
DESCRIPTION:This paper tackles the problem of summarizing frequent itemse
 ts. We observe that previous notions of summaries cannot be directly use
 d for analyzing frequent itemsets. In order to be used for analysis\, on
 e requirement is that the analysts should be able to browse all frequent
  itemsets by only having the summary.\n\nFor this purpose\, we propose t
 o build the summary based upon a novel formulation\, conditional profile
  (or c-profile). Several features of our proposed summary are: (1) each 
 profile in the summary can be analyzed independently\, (2) it provides e
 rror guarantee (e-adequate)\, and (3) it produces no false positives or 
 false negatives.\n\nHaving the formulation\, the next challenge is to pr
 oduce the most concise summary which satisfies the requirement. In this 
 paper\, we also designed an algorithm which is both effective and effici
 ent for this task. The quality of our approach is justified by extensive
  experiments.\n\nThe implementations for the algorithms are available fr
 om www.ntu.edu.sg/home/asvivek/pubs/cprofile09.
SUMMARY:CP-Summary: A Concise Representation for Browsing Frequent Itemse
 ts
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T162500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T160000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4972
DESCRIPTION:In our work\, we address the problem of modeling social netwo
 rk generation which explains both link and group formation. Recent studi
 es on social network evolution propose generative models which capture t
 he statistical properties of real-world networks related only to node-to
 -node link formation. We propose a novel model which captures the co-evo
 lution of social and affiliation networks. We provide surprising insight
 s into group formation based on observations in several real-world netwo
 rks\, showing that users often join groups for reasons other than their 
 friends. Our experiments show that the model is able to capture both the
  newly observed and previously studied network properties. This work is 
 the first to propose a generative model which captures the statistical p
 roperties of these complex networks. The proposed model facilitates cont
 rolled experiments which study the effect of actors' behavior on the evo
 lution of affiliation networks\, and it allows the generation of realist
 ic synthetic datasets.
SUMMARY:Co-evolution of Social and Affiliation Networks
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T105500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T103000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4973
DESCRIPTION:Effective diagnosis of Alzheimer's disease (AD)\, the most co
 mmon type of dementia in elderly patients\, is of primary importance in 
 biomedical research. Recent studies have demonstrated that AD is closely
  related to the structure change of the brain network\, i.e.\, the conne
 ctivity among different brain regions. The connectivity patterns will pr
 ovide useful imaging-based biomarkers to distinguish Normal Controls (N
 C)\, patients with Mild Cognitive Impairment (MCI)\, and patients with A
 D. In this paper\, we investigate the sparse inverse covariance estimati
 on technique for identifying the connectivity among different brain regi
 ons. In particular\, a novel algorithm based on the block coordinate des
 cent approach is proposed for the direct estimation of the inverse covar
 iance matrix. One appealing feature of the proposed algorithm is that it
  allows user feedback (e.g.\, prior domain knowledge) to be incorpor
 ated into the estimation process\, while the connectivity patterns can b
 e discovered automatically. We apply the proposed algorithm to a collect
 ion of FDG-PET images from 232 NC\, MCI\, and AD subjects. Our experimen
 tal results demonstrate that the proposed algorithm is promising in reve
 aling the brain region connectivity differences among these groups.
SUMMARY:Mining Brain Region Connectivity for Alzheimer's Disease Study vi
 a Sparse Inverse Covariance Estimation
LOCATION:La Seine C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T171000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T165500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4974
DESCRIPTION:This paper aims at discovering community structure in rich me
 dia social networks\, through analysis of time-varying\, multi-relationa
 l data. Community structure represents the latent social context of user
  actions. It has important applications in information tasks such as sea
 rch and recommendation. Social media has several unique challenges. (a) 
 In social media\, the context of user actions is constantly changing and
  co-evolving\; hence the social context contains time-evolving multi-dim
 ensional relations. (b) The social context is determined by the availabl
 e system features and is unique in each social media website. In this pa
 per we propose MetaFac (MetaGraph Factorization)\, a framework that extr
 acts community structures from various social contexts and interactions.
  Our work has three key contributions: (1) metagraph\, a novel relationa
 l hypergraph representation for modeling multi-relational and multi-dime
 nsional social data\; (2) an efficient factorization method for communit
 y extraction on a given metagraph\; (3) an on-line method to handle time
 -varying relations through incremental metagraph factorization. Extensiv
 e experiments on real-world social data collected from the Digg social m
 edia website suggest that our technique is scalable and is able to extra
 ct meaningful communities based on the social media contexts. We illustr
 ate the usefulness of our framework through prediction tasks. We outperf
 orm baseline methods (including aspect model and tensor analysis) by an 
 order of magnitude.
SUMMARY:MetaFac: Community Discovery via Relational Hypergraph Factorizat
 ion
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T173500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T172000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4975
DESCRIPTION:This paper describes and evaluates privacy-friendly methods f
 or extracting quasi-social networks from browser behavior on user-genera
 ted content sites\, for the purpose of finding good audiences for brand 
 advertising (as opposed to click maximizing\, for example).  Targeting s
 ocial-network neighbors resonates well with advertisers\, and on-line br
 owsing behavior data counterintuitively can allow the identification of 
 good audiences anonymously.  Besides being one of the first papers to ou
 r knowledge on data mining for on-line brand advertising\, this paper ma
 kes several important contributions. We introduce a framework for evalu
 ating brand audiences\, in analogy to predictive-modeling holdout evalua
 tion. We introduce methods for extracting quasi-social networks from dat
 a on visitations to social networking pages\, without collecting a
 ny information on the identities of the browsers or the content of the s
 ocial-network pages.\n\nWe introduce measures of brand proximity in the 
 network\, and show that audiences with high brand proximity indeed show 
 substantially higher brand affinity. Finally\, we provide evidence that 
 the quasi-social network embeds a true social network\, which along with
  results from social theory offers one explanation for the increase in b
 rand affinity of the selected audiences.
SUMMARY:Audience Selection for On-line Brand Advertising: Privacy-friendl
 y Social Network Targeting
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T112000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T105500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4976
DESCRIPTION:Mining discrete patterns in binary data is important for subs
 ampling\, compression\, and clustering. We consider rank-one binary matr
 ix approximations that identify the dominant patterns of the data\, whil
 e preserving its discrete property. A best approximation on such data ha
 s a minimum set of inconsistent entries\, i.e.\, mismatches between the 
 given binary data and the approximate matrix. Due to the hardness of the
  problem\, previous accounts of such problems employ heuristics and the 
 resulting approximation may be far away from the optimal one. In this pa
 per\, we show that the rank-one binary matrix approximation can be refor
 mulated as a 0-1 integer linear program (ILP). However\, the ILP formula
 tion is computationally expensive even for small-size matrices. We propo
 se a linear program (LP) relaxation\, which is shown to achieve a guaran
 teed approximation error bound. We further extend the proposed formulati
 ons using the regularization technique\, which is commonly employed to a
 ddress overfitting. The LP formulation is restricted to medium-size matr
 ices\, due to the large number of variables involved for large matrices.
  Interestingly\, we show that the proposed approximate formulation can b
 e transformed into an instance of the minimum s-t cut problem\, which ca
 n be solved efficiently by finding maximum flows. Our empirical study sh
 ows the efficiency of the proposed algorithm based on the maximum flow. 
 Results also confirm the established theoretical bounds.
SUMMARY:Mining Discrete Patterns via Binary Matrix Factorization
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T121500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T120000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4977
DESCRIPTION:The Drosophila gene expression pattern images document the sp
 atial and temporal dynamics of gene expression and they are valuable too
 ls for explicating the gene functions\, interaction\, and networks durin
 g Drosophila embryogenesis. To provide text-based pattern searching\, th
 e images in the Berkeley Drosophila Genome Project (BDGP) study are anno
 tated with ontology terms manually by human curators. We present a syste
 matic approach for automating this task\, because the number of images n
 eeding text descriptions is now rapidly increasing. We consider both imp
 roved feature representation and novel learning formulation to boost the
  annotation performance. For feature representation\, we adapt the bag-o
 f-words scheme commonly used in visual recognition problems so that the 
 image group information in the BDGP study is retained. Moreover\, images
  from multiple views can be integrated naturally in this representation.
  To reduce the quantization error caused by the bag-of-words representat
 ion\, we propose an improved feature representation scheme based on the 
 sparse learning technique. In the design of learning formulation\, we pr
 opose a local regularization framework that can incorporate the correlat
 ions among terms explicitly. We further show that the resulting optimiza
 tion problem admits an analytical solution. Experimental results show th
 at the representation based on sparse learning outperforms the bag-of-wo
 rds representation significantly. Results also show that incorporation o
 f the term-term correlations improves the annotation performance consist
 ently.
SUMMARY:Drosophila Gene Expression Pattern Annotation Using Sparse Featur
 es and Term-term Interactions
LOCATION:La Seine C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T172000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T170500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4978
DESCRIPTION:Topic models provide a powerful tool for analyzing large text
  collections by representing high dimensional data in a low dimensional 
 subspace. Fitting a topic model given a set of training documents requir
 es approximate inference techniques that are computationally expensive. 
 With today's large-scale\, constantly expanding document collections\, i
 t is useful to be able to infer topic distributions for new documents wi
 thout retraining the model.\n\nIn this paper\, we empirically evaluate t
 he performance of several methods for topic inference in previously unse
 en documents\, including methods based on Gibbs sampling\, variational i
 nference\, and a new method inspired by text classification. The classif
 ication-based inference method produces results similar to iterative inf
 erence methods\, but requires only a single matrix multiplication. In ad
 dition to these inference methods\, we present SparseLDA\, an algorithm 
 and data structure for evaluating Gibbs sampling distributions. Empirica
 l results indicate that SparseLDA can be approximately 20 times faster t
 han traditional LDA and provide twice the speedup of previously publishe
 d fast sampling methods\, while also using substantially less memory.
SUMMARY:Efficient Methods for Topic Model Inference on Streaming Document
  Collections
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T111000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T105500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4979
DESCRIPTION:We develop and evaluate an approach to causal modeling based 
 on time series data\, collectively referred to as "grouped graphical Gra
 nger modeling methods." Graphical Granger modeling uses graphical modeli
 ng techniques on time series data and invokes the notion of "Granger cau
 sality" to make assertions on causality among a potentially large number
  of time series variables through inference on time-lagged effects. The 
 present paper proposes a novel enhancement to the graphical Granger meth
 odology by developing and applying families of regression methods that a
 re sensitive to group information among variables\, to leverage the grou
 p structure present in the lagged temporal variables according to the ti
 me series they belong to. Additionally\, we propose a new family of algo
 rithms we call group boosting\, as an improved component of grouped grap
 hical Granger modeling over the existing regression methods with grouped
  variable selection in the literature (e.g.\, group Lasso). The introduc
 tio
 n of group boosting methods is primarily motivated by the need to deal w
 ith non-linearity in the data. We perform empirical evaluation to confir
 m the advantage of the grouped graphical Granger methods over the standa
 rd (non-grouped) methods\, as well as that specific to the methods based
  on group boosting. This advantage is also demonstrated for the real wor
 ld application of gene regulatory network discovery from time-course mic
 roarray data.
SUMMARY:Grouped Graphical Granger Modeling Methods for Temporal Causal Mo
 deling
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T153000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T151500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4980
DESCRIPTION:Online forums represent one type of social media that is part
 icularly rich for studying human behavior in information seeking and dif
 fusing. The way users join communities is a reflection of the changing a
 nd expanding of their interests toward information. In this paper\, we s
 tudy the patterns of user participation behavior\, and the feature facto
 rs that influence such behavior on different forum datasets. We find tha
 t\, despite the relative randomness and lesser commitment of structural 
 relationships in online forums\, users' community joining behaviors disp
 lay some strong regularities. One particularly interesting observation i
 s that the very weak relationships between users defined by online repli
 es have similar diffusion curves as those of real friendships or co-auth
 orships. We build social selection models\, Bipartite Markov Random Fiel
 d (BiMRF)\, to quantitatively evaluate the prediction performance of tho
 se feature factors and their relationships. Using these models\, we show
  that some features carry supplementary information\, and the effectiven
 ess of different features varies in different types of forums. Moreove
 r\, the results of BiMRF with two-star configurations suggest that the f
 eatu
 re of user similarity defined by frequency of communication or number of
  common friends is inadequate to predict grouping behavior\, but adding 
 node-level features can improve the fit of the model.
SUMMARY:User Grouping Behavior in Online Forums
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T165000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T162500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4981
DESCRIPTION:Our dynamic graph-based relational mining approach has been d
 eveloped to learn structural patterns in biological networks as they cha
 nge over time. The analysis of dynamic networks is important not only to
  understand life at the system-level\, but also to discover novel patter
 ns in other structural data. Most current graph-based data mining approa
 ches overlook dynamic features of biological networks\, because they are
  focused on only static graphs. Our approach analyzes a sequence of grap
 hs and discovers rules that capture the changes that occur between pairs
  of graphs in the sequence. These rules represent the graph rewrite rule
 s that the first graph must go through to be isomorphic to the second gr
 aph. Then\, our approach feeds the graph rewrite rules into a machine le
 arning system that learns general transformation rules describing the ty
 pes of changes that occur for a class of dynamic biological networks. Th
 e discovered graph-rewriting rules show how biological networks change o
 ver time\, and the transformation rules show the repeated patterns in th
 e structural changes. In this paper\, we apply our approach to biologica
 l networks to evaluate our approach and to understand how the biosystems
  change over time. We evaluate our results using coverage and prediction
  metrics\, and compare to biological literature.
SUMMARY:Learning Patterns in the Dynamics of Biological Networks
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T151500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T145000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4982
DESCRIPTION:The most common environment in which ranking is used takes a 
 very specific form. Users sequentially generate queries in a digital lib
 rary. For each query\, ranking is applied to order a set of relevant ite
 ms from which the user selects his favorite. This is the case when ranki
 ng search results for pages on the World Wide Web or for merchandize on 
 an e-commerce site. In this work\, we present a new online ranking algor
 ithm\, called NoRegret KLRank. Our algorithm is designed to use "clickth
 rough" information as it is provided by the users to improve future rank
 ing decisions. More importantly\, we show that its long term average per
 formance will converge to the best rate achievable by any competing fixe
 d ranking policy selected with the benefit of hindsight. We show how to 
 ensure that this property continues to hold as new items are added to th
 e set thus requiring a richer class of ranking policies. Finally\, our e
 mpirical results show that\, while in some contexts NoRegret KLRank might
  be considered conservative\, a greedy variant of this algorithm actuall
 y outperforms many popular ranking algorithms.
SUMMARY:Regret-based Online Ranking for a Growing Digital Library
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T111000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T105500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4983
DESCRIPTION:We consider the problem of producing recommendations from col
 lective user behavior while simultaneously providing guarantees of priva
 cy for these users. Specifically\, we consider the Netflix Prize data se
 t\, and its leading algorithms\, adapted to the framework of differentia
 l privacy.\n\nUnlike prior privacy work concerned with cryptographically
  securing the  computation of recommendations\, differential privacy con
 strains a computation in a way that precludes any inference about the un
 derlying records from its output. Such algorithms necessarily introduce 
 uncertainty---i.e.\, noise---to computations\, trading accuracy for priv
 acy.\n\nWe find that several of the leading approaches in the Netflix Pr
 ize competition can be adapted to provide differential privacy\, without
  significantly degrading their accuracy. To adapt these algorithms\, we 
 explicitly factor them into two parts\, an aggregation/learning phase th
 at can be performed with differential privacy guarantees\, and an indivi
 dual recommendation phase that uses the learned correlations and an indi
 vidual's data to provide personalized recommendations. The adaptations a
 re non-trivial\, and involve both careful analysis of the per-record sen
 sitivity of the algorithms to calibrate noise\, as well as new post-proc
 essing steps to mitigate the impact of this noise.\n\nWe measure the emp
 irical trade-off between accuracy and privacy in these adaptations\, and
  find that we can provide non-trivial formal privacy guarantees while st
 ill outperforming the Cinematch baseline Netflix provides.
SUMMARY:Differentially Private Recommender Systems: Building Privacy into
  the Netflix Prize Contenders
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T121500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4984
DESCRIPTION:The Affinity Propagation (AP) clustering algorithm proposed b
 y Frey and Dueck (2007) provides an understandable\, nearly optimal summ
 ary of a dataset\, albeit with quadratic computational complexity. This 
 paper\, motivated by Autonomic Computing\, extends AP to the data stream
 ing framework. Firstly a hierarchical strategy is used to reduce the com
 plexity to ${\\cal O}(N^{1+\\e})$\; the distortion loss incurred is anal
 yzed in relation with the dimension of the data items. Secondly\, a coup
 ling with a change detection test is used to cope with non-stationary da
 ta distribution\, and rebuild the model as needed. The presented approac
 h StrAP is applied to the stream of jobs submitted to the EGEE Grid\, pr
 oviding an understandable description of the job flow and enabling the s
 ystem administrator to spot online some sources of failures.
SUMMARY:Toward Autonomic Grids: Analyzing the Job Flow with Affinity Stre
 aming
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T121500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T120000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4985
DESCRIPTION:Most recommender systems try to find items that are most r
 elevant to the older choices of a given user. Here we focus on the "surp
 rise me" query: A user may be bored with his/her usual genre of items (e
 .g.\, books\, movies\, hobbies)\, and may want a recommendation that is 
 related\, but off the beaten path\, possibly leading to a new genre of b
 ooks/movies/hobbies.\n\nHow would we define\, as well as automate\, this
  seemingly self-contradicting request? We introduce TANGENT\, a novel rec
 ommendation algorithm to solve this problem. The main idea behind TANGEN
 T is to envision the problem as node selection on a graph\, giving high 
 scores to nodes that are well connected to the older choices\, and at th
 e same time well connected to unrelated choices. The method is carefully
  designed to be (a) parameter-free (b) effective and (c) fast. We illust
 rate the benefits of TANGENT with experiments on both synthetic and real
  data sets. We show that TANGENT makes reasonable\, yet surprising\, hor
 izon-broadening recommendations. Moreover\, it is fast and scalable\, si
 nce it can easily use existing fast algorithms on graph node proximity.
SUMMARY:TANGENT: A Novel\, 'Surprise-me' Recommendation Algorithm
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T121500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T120000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4986
DESCRIPTION:In this article\, we report our efforts in mining the informa
 tion encoded as clickthrough data in the server logs to evaluate and mon
 itor the relevance ranking quality of a commercial web search engine. We
  describe a metric called pSkip that aims to quantify the ranking qualit
 y by estimating the probability of users encountering non-relevant resul
 ts that cost them the effort to read and skip. A search engine with a l
 ower pSkip is regarded as having a better ranking quality. A key design 
 goal of pSkip is to integrate the findings from two sets of user studies
  that utilize eye-tracking devices to track users' browsing patterns on
  the search result pages\, and that use specially instrumented browsers t
 o actively solicit users' explicit judgments on their search activities.
  We present the derivation of the maximum likelihood estimation of pSkip
  and demonstrate its efficacy in describing the user study data. The mat
 hematical properties of pSkip are further analyzed and compared with sev
 eral objective metrics as well as the cumulated gain method that uses su
 bjective judgments. Experimental data show that pSkip can measure aspect
 s of the search quality that these existing metrics are not designed or 
 fail to address\, such as identifying the real search intents expressed 
 in the ambiguous queries. Although effective and superior in many ways\,
  we also report a series of experiments that show pSkip may be influence
 d by system issues that are not directly related to relevance ranking\, 
 suggesting that measurements complementary to pSkip are still needed in 
 order to form a holistic and accurate characterization of the ranking qu
 ality.
SUMMARY:PSkip: Estimating relevance ranking quality from web search click
 through data
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T173500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T172000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4987
DESCRIPTION:Learning temporal graph structures from time series data reve
 als important dependency relationships between current observations and 
 histories. Most previous work focuses on learning and predicting with "s
 tatic" temporal graphs only. However\,  in many applications such as mec
 hanical systems and biology systems\, the temporal dependencies might ch
 ange over time. In this paper\, we develop a dynamic temporal graphical 
 model based on hidden Markov model regression and lasso-type algorithms
 . Our method is able to integrate two usually separate tasks\, i.e. infe
 rring underlying states and learning temporal graphs\, in one unified mo
 del. The output temporal graphs provide better understanding about compl
 ex systems\, i.e. how their dependency graphs evolve over time\, and ach
 ieve more accurate predictions.  We examine our model on two synthetic d
 atasets as well as a real application dataset for monitoring oil-product
 ion equipment to capture different stages of the system\, and achieve pr
 omising results.
SUMMARY:Learning Dynamic Temporal Graphs for Oil-production Equipment Mon
 itoring System
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T114500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T112000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4988
DESCRIPTION:A heterogeneous information network is an information network
  composed of multiple types of objects. Clustering on such a network may
  lead to better understanding of both hidden structures of the network a
 nd the individual role played by every object in each cluster.  However\
 , although clustering on homogeneous networks has been studied over deca
 des\, clustering on heterogeneous networks has not been addressed until 
 recently.\n\nA recent study proposed a new algorithm\, RankClus\, for cl
 ustering on bi-typed heterogeneous networks.  However\, a real-world net
 work may consist of more than two types\, and the interactions among mul
 ti-typed objects play a key role in disclosing the rich semantics that a
  network carries.  In this paper\, we study clustering of multi-typed he
 terogeneous networks with a star network schema and propose a novel algo
 rithm\, NetClus\, that utilizes links across multi-typed objects to ge
 nerate high-quality net-clusters. An iterative enhancement method is dev
 eloped that leads to effective ranking-based clustering in such heteroge
 neous networks. Our experiments on DBLP data show that NetClus gener
 ates more accurate clustering results than the baseline topic model algo
 rithm PLSA and the recently proposed algorithm\, RankClus. Further\, N
 etClus generates informative clusters\, presenting good ranking and cl
 uster membership information for each attribute object in each net-clust
 er.
SUMMARY:Ranking-Based Clustering of Heterogeneous Information Networks wi
 th Star Network Schema
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T172500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T171000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4989
DESCRIPTION:Active and semi-supervised learning are important techniques 
 when labeled data are scarce. Recently a method was suggested for combin
 ing active learning with a semi-supervised learning algorithm that uses 
 Gaussian fields and harmonic functions. This classifier is relational in
  nature: it relies on having the data presented as a partially labeled g
 raph (also known as a within-network learning problem). This work showed
  yet again that empirical risk minimization (ERM) was the best method to
  find the next instance to label and provided an efficient way to comput
 e ERM with the semi-supervised classifier. The computational problem wit
 h ERM is that it relies on computing the risk for all possible instances
 . If we could limit the candidates that should be investigated\, then we
  can speed up active learning considerably. In the case where the data i
 s graphical in nature\, we can leverage the graph structure to rapidly i
 dentify instances that are likely to be good candidates for labeling. Th
 is paper describes a novel hybrid approach of using community finding
  and social network analytic centrality measures to identify good candid
 ates for labeling and then using ERM to find the best instance in this c
 andidate set. We show on real-world data that we can limit the ERM compu
 tations to a fraction of instances with comparable performance.
SUMMARY:Using Graph-based Metrics with Empirical Risk Minimization to Spe
 ed Up Active Learning on Networked Data
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T115000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T113500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4990
DESCRIPTION:Merchants selling products on the Web often ask their custome
 rs to share their opinions and hands-on experiences on products they hav
 e purchased. Unfortunately\, reading through all customer reviews is dif
 ficult\, especially for popular items\, where the number of reviews can
  reach hundreds or even thousands. This makes it difficult for a
  potential c
 ustomer to read them to make an informed decision. The OpinionMiner syst
 em designed in this work aims to mine customer reviews of a product and 
 extract highly detailed product entities on which reviewers express
  their opinions. Opinion expressions are identified and opinion
  orientations fo
 r each recognized product entity are classified as positive or negative.
  Different from previous approaches that employed rule-based or statisti
 cal techniques\, we propose a novel machine learning approach built unde
 r the framework of lexicalized HMMs. The approach naturally integrates m
 ultiple important linguistic features into automatic learning. In this p
 aper\, we describe the architecture and main components of the system. T
 he evaluation of the proposed method is presented based on processing th
 e online product reviews from Amazon and other publicly available datase
 ts.
SUMMARY:OpinionMiner: A Novel Machine Learning System for Web Opinion Min
 ing and Extraction
LOCATION:Louis Armstrong A\,B\,C+D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T172000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T170500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4991
DESCRIPTION:In the past few years there has been an increasing interest
  in the analysis of process logs. Several proposed techniques\, such as w
 orkflow mining\, are aimed at automatically deriving the underlying work
 flow models. However\, current approaches pay little attention to a
 n important piece of information contained in process logs: the timestam
 ps\, which are used to define a sequential ordering of the performed tas
 ks. In this work we try to overcome these limitations by explicitly incl
 uding time in the extracted knowledge\, thus making the temporal informa
 tion a first-class citizen of the analysis process.  This makes it possi
 ble to discern between apparently identical process executions that are 
 performed with different transition times between consecutive tasks.\n\n
 This paper proposes a framework for the user-interactive exploration of 
 a condensed representation of groups of executions of a given process. T
 he framework is based on the use of an existing mining paradigm: Tempora
 lly-Annotated Sequences (TAS). These are aimed at extracting sequential 
 patterns where each transition between two events is annotated with a ty
 pical transition time that emerges from input data. With the extracted T
 AS\, which represent sets of possible frequent executions with their typ
 ical transition times\, a few factorizing operators are built. These ope
 rators condense such executions according to possible parallel or possib
 le mutually exclusive executions. Lastly\, this condensed representation i
 s rendered to the user via the exploration graph\, namely the Temporally
 -Annotated Graph (TAG). The user\, the domain expert\, is allowed to exp
 lore the different and alternative factorizations corresponding to diffe
 rent interpretations of the actual executions. According to the user's cho
 ices\, the system discards or retains certain hypotheses on actual execu
 tions and shows the consequent scenarios resulting from the corresponding
  re-aggregation of the actual data.
SUMMARY:Temporal mining for interactive workflow data analysis
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T152500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T151000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4992
DESCRIPTION:Email is one of the most prevalent communication tools today\
 , and solving the email overload problem is pressing. A good wa
 y to alleviate email overload is to automatically prioritize received me
 ssages according to the priorities of each user. However\, research on s
 tatistical learning methods for fully personalized email prioritization 
 (PEP) has been sparse due to privacy issues\, since people are reluctant
  to share personal messages and importance judgments with the research c
 ommunity. It is therefore important to develop and evaluate PEP methods 
 under the assumption that only limited training examples can be availabl
 e\, and that the system can only have the personal email data of each us
 er during the training and testing of the model for that user. This pape
 r presents the first study (to the best of our knowledge) under such an 
 assumption. Specifically\, we focus on analysis of personal social netwo
 rks to capture user groups and to obtain rich features that represent th
 e social roles from the viewpoint of a particular user.  We also develop
 ed a novel semi-supervised (transductive) learning algorithm that propag
 ates importance labels from training examples to test examples through m
 essage and user nodes in a personal email network.  These methods togeth
 er enable us to obtain an enriched vector representation of each new ema
 il message\, which consists of both standard features of an email messag
 e (such as words in the title or body\, sender and receiver IDs\, etc.) 
 and the induced social features from the sender and receivers of the mes
 sage.  Using the enriched vector representation as the input in SVM clas
 sifiers to predict the importance level for each test message\, we obtai
 ned significant performance improvement over the baseline system (withou
 t induced social features) in our experiments on a multi-user data colle
 ction: the relative error reduction in MAE was 31% in micro-averaging\,
  and 14% in macro-averaging.
SUMMARY:Mining Social Networks for Personalized Email Prioritization
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T173500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T172000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4993
DESCRIPTION:The Border Gateway Protocol (BGP) is one of the fundamental c
 omputer communication protocols. Monitoring and mining BGP update messag
 es can directly reveal the health and stability of Internet routing. Her
 e we make two contributions: firstly we find patterns in BGP updates\, l
 ike self-similarity\, power-law and lognormal marginals\; secondly using
  these patterns\, we find anomalies. Specifically\, we develop BGP-lens\
 , an automated BGP updates analysis tool\, that has three desirable prop
 erties: (a) It is effective\, able to identify phenomena that would othe
 rwise go unnoticed\, such as a peculiar 'clothesline' behavior or prolon
 ged 'spikes' that last as long as 8 hours\; (b) It is scalable\, using a
 lgorithms that are all linear in the number of time-ticks\; and (c) It
  is adm
 in-friendly\, giving useful leads for phenomena of interest.\n\nWe show
 case the capabilities of BGP-lens by identifying surprising phenomena ve
 rified by sysadmins\, over a massive trace of BGP updates spanning 2 year
 s\, from the publicly available site datapository.net.
SUMMARY:BGP-lens: Patterns and Anomalies in Internet Routing Updates
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T172000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T170500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4994
DESCRIPTION:Search queries are typically very short\, which means they ar
 e often underspecified or have senses that the user did not think of.  A
  broad latent query aspect is a set of keywords that succinctly represen
 ts one particular sense\, or one particular information need\, that can 
 aid users in reformulating such queries.  We extract such broad latent a
 spects from query reformulations found in historical search session logs
 .  We propose a framework under which the problem of extracting such bro
 ad latent aspects reduces to that of optimizing a formal objective funct
 ion under constraints on the total number of aspects the system can stor
 e\, and the number of aspects that can be shown in response to any given
  query.  We present algorithms to find a good set of aspects\, and also 
 to pick the best k aspects matching any query. Empirical results on re
 al-world search engine logs show significant gains over a strong baselin
 e that uses single-keyword reformulations: a gain of 14% and 23%
  in terms of human-judged accuracy and click-through data respectively\,
  and around 20% in terms of consistency among aspects predicted for
  "similar" queries. This demonstrates both the importance of broad quer
 y aspects\, and the efficacy of our algorithms for extracting them.
SUMMARY:Mining Broad Latent Query Aspects from Search Sessions
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T113500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T112000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4995
DESCRIPTION:The discovery of biclusters\, which denote groups of items th
 at show coherent values across a subset of all the transactions in a dat
 a set\, is an important type of analysis performed on real-valued data s
 ets in various domains\, such as biology. Several algorithms have been p
 roposed to find different types of biclusters in such data sets. However
 \, these algorithms are unable to search the space of all possible biclu
 sters exhaustively. Pattern mining algorithms in association analysis al
 so essentially produce biclusters as their result\, since the patterns c
 onsist of items that are supported by a subset of all the transactions. 
 However\, a major limitation of the numerous techniques developed in ass
 ociation analysis is that they are only able to analyze data sets with b
 inary and/or categorical variables\, and their application to real-value
 d data sets often involves some lossy transformation such as discretizat
 ion or binarization of the attributes. In this paper\, we propose a nove
 l association analysis framework for exhaustively and efficiently mining
  "range support" patterns from such a data set. On one hand\, this frame
 work reduces the loss of information incurred by the binarization- and d
 iscretization-based approaches\, and on the other\, it enables the exhau
 stive discovery of coherent biclusters. We compared the performance of
  our framework with two standard biclustering algorithms through the eva
 luation of the similarity of the cellular functions of the genes constit
 uting the patterns/biclusters derived by these algorithms from microarra
 y data. These experiments show that the real-valued patterns discovered 
 by our framework are better enriched by small biologically interesting f
 unctional classes. Also\, through specific examples\, we demonstrate the
  ability of the RAP framework to discover functionally enriched patterns
  that are not found by the commonly used biclustering algorithm ISA. The
  source code and data sets used in this paper\, as well as the supplemen
 tary material\, are available at http://www.cs.umn.edu/vk/gaurav/rap.
SUMMARY:An Association Analysis Approach to Biclustering
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T164000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T162500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4996
DESCRIPTION:Social media such as blogs\, Facebook\, Flickr\, etc.\, prese
 nts data in a network format rather than classical IID distribution.  To
  address the interdependency among data instances\, relational learning 
 has been proposed\, and collective inference based on network connectivi
 ty is adopted for prediction. However\, connections in social media are 
 often multi-dimensional. An actor can connect to another actor for diffe
 rent reasons\, e.g.\, alumni\, colleagues\, living in the same city\, sh
 aring similar interests\, etc. Collective inference normally does not di
 fferentiate these connections.  In this work\, we propose to extract lat
 ent social dimensions based on network information\, and then utilize th
 em as features for discriminative learning. These social dimensions de
 scribe diverse affiliations of actors hidden in the network\, and the di
 scriminative learning can automatically determine which affiliations are
  better aligned with the class labels.  Such a scheme is preferred whe
 n multiple diverse relations are associated with the same network. We co
 nduct extensive experiments on social media data (one from a real-world 
 blog site and the other from a popular content sharing site). Our model 
 outperforms representative relational learning methods based on collecti
 ve inference\, especially when few labeled data are available. The sensi
 tivity of this model and its connection to existing methods are also exa
 mined.
SUMMARY:Relational Learning via Latent Social Dimensions
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T165000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T162500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4997
DESCRIPTION:A major source of information (often the most crucial and inf
 ormative part) in scholarly articles from scientific journals\, proceedi
 ngs and books are the figures that directly provide images and other gra
 phical illustrations of key experimental results and other scientific co
 ntents. In biological articles\, a typical figure often comprises multip
 le panels\, accompanied by either scoped or global captioned text. Moreo
 ver\, the text in the caption contains important semantic entities such 
 as protein names\, gene ontology\, tissue labels\, etc.\, relevant to t
 he images in the figure. Due to the avalanche of biological literature i
 n recent years\, and increasing popularity of various bio-imaging techni
 ques\, automatic retrieval and summarization of biological information f
 rom literature figures has emerged as a major unsolved challenge in comp
 utational knowledge extraction and management in the life sciences. We pr
 esent a new structured probabilistic topic model built on a realistic fi
 gure generation scheme to model the structurally annotated biological fi
 gures\, and we derive an efficient inference algorithm based on collapse
 d Gibbs sampling for information retrieval and visualization. The result
 ing program constitutes one of the key IR engines in our SLIF system tha
 t has recently entered the final round (4 out of 70 competing systems)
  of t
 he Elsevier Grand Challenge on Knowledge Enhancement in the Life Science
 . Here we present various evaluations on a number of data mining tasks t
 o illustrate our method.
SUMMARY:Structured Correspondence Topic Models for Mining Captioned Figur
 es in Biological Literature
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T121500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4998
DESCRIPTION:One-Class Collaborative Filtering (OCCF) is a task that natur
 ally emerges in recommender system settings. Typical characteristics inc
 lude: Only positive examples can be observed\, classes are highly imbala
 nced\, and the vast majority of data points are missing. The idea of int
 roducing weights for missing parts of a matrix has recently been shown t
 o help in OCCF. While existing weighting approaches mitigate the first t
 wo problems above\, a sparsity-preserving solution that would allow the
  efficient use of data sets with\, e.g.\, hundreds of thousands of users
  and items has not yet been reported. In this paper\, we study three
  differen
 t collaborative filtering frameworks: Low-rank matrix approximation\, pr
 obabilistic latent semantic analysis\, and maximum-margin matrix factori
 zation. We propose two novel algorithms for large-scale OCCF that allow 
 weighting the unknowns. Our experimental results demonstrate their effec
 tiveness and efficiency on different problems\, including the Netflix Pr
 ize data.
SUMMARY:Mind the Gaps: Weighting the Unknown in Large-Scale One-Class Col
 laborative Filtering
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T152000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T145500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:4999
DESCRIPTION:For a training dataset with a nonexhaustive list of classes\,
  i.e. some classes are not yet known and hence are not represented\, the
  resulting learning problem is ill-defined. In this case a sample from a
  missing class is incorrectly classified to one of the existing classes.
  For some applications the cost of misclassifying a sample could be negl
 igible. However\, the significance of this problem can better be acknowl
 edged when the potentially undesirable consequences of incorrectly class
 ifying a food pathogen as a nonpathogen are considered. Our research is 
 directed towards the real-time detection of food pathogens using optical
 -scattering technology. Bacterial colonies consisting of the progeny of 
 a single parent cell scatter light at 635 nm to produce unique forward-s
 catter signatures. These spectral signatures contain descriptive charact
 eristics of bacterial colonies\, which can be used to identify bacteria 
 cultures in real time. One bottleneck that remains to be addressed is th
 e nonexhaustive nature of the training library. It is very difficult if 
 not impractical to collect samples from all possible bacteria colonies a
 nd construct a digital library with an exhaustive set of scatter signatu
 res.  This study deals with the real-time detection of samples from a mi
 ssing class and the associated problem of learning with a nonexhaustive 
 training dataset. Our proposed method assumes a common prior for the set
  of all classes\, known and missing. The parameters of the prior are est
 imated from the samples of the known classes. This prior is then used to
  generate a large number of samples to simulate the space of missing cla
 sses. Finally a Bayesian maximum likelihood classifier is implemented us
 ing samples from real as well as simulated classes. Experiments performe
 d with samples collected for 28 bacteria subclasses favor the proposed a
 pproach over the state of the art.
SUMMARY:Learning with a Non-exhaustive Training Dataset: Detection of Bac
 teria Cultures using Optical-Scattering Technology
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T112000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T105500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5000
DESCRIPTION:Collaborative filtering is the most popular approach to build
  recommender systems and has been successfully employed in many applicat
 ions. However\, it cannot make recommendations for so-called cold start 
 users that have rated only a very small number of items. In addition\, t
 hese methods do not know how confident they are in their recommendations
 . Trust-based recommendation methods assume the additional knowledge of 
 a trust network among users and can better deal with cold start users\, 
 since users only need to be simply connected to the trust network. On th
 e other hand\, the sparsity of the user item ratings forces the trust-ba
 sed approach to consider ratings of indirect neighbors that are only wea
 kly trusted\, which may decrease its precision. In order to find a good 
 trade-off\, we propose a random walk model combining the trust-based and
  the collaborative filtering approach for recommendation. The random wal
 k model allows us to define and to measure the confidence of a recommend
 ation. We performed an evaluation on the Epinions dataset and compared o
 ur model with existing trust-based and collaborative filtering methods.
SUMMARY:TrustWalker: A Random Walk Model for Combining Trust-based and It
 em-based Recommendation
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T145000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T142500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5001
DESCRIPTION:Labeling text data is quite time-consuming but essential for 
 automatic text classification. In particular\, manually creating
  multiple l
 abels for each document may become impractical when a very large amount 
 of data is needed for training multi-label text classifiers. To minimize
  the human-labeling efforts\, we propose a novel multi-label active lear
 ning approach which can reduce the required labeled data without sacrifi
 cing the classification accuracy. Traditional active learning algorithms
  can only handle single-label problems\, that is\, each data point is
  restrict
 ed to have one label. Our approach takes into account the multi-label in
 formation\, and selects the unlabeled data which can lead to the largest
  reduction of the expected model loss. Specifically\, the model loss is a
 pproximated by the size of version space\, and the reduction rate of the
  size of version space is optimized with Support Vector Machines (SVM). 
 An effective label prediction method is designed to predict possible lab
 els for each unlabeled data point\, and the expected loss for multi-labe
 l data is approximated by summing up losses on all labels according to t
 he most confident result of label prediction. Experiments on several rea
 l-world data sets (all are publicly available) demonstrate that our appr
 oach can obtain promising classification results with far fewer labeled
  data than state-of-the-art methods.
SUMMARY:Effective Multi-Label Active Learning for Text Classification
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T112500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T111000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5002
DESCRIPTION:In data publishing\, anonymization techniques such as general
 ization and bucketization have been designed to provide privacy protecti
 on. At the same time\, they reduce the utility of the data. It is import
 ant to consider the tradeoff between privacy and utility. In a paper tha
 t appeared in KDD 2008\, Brickell and Shmatikov proposed an evaluation m
 ethodology by comparing privacy gain with utility gain resulting from ano
 nymizing the data\, and concluded that "even modest privacy gains requir
 e almost complete destruction of the data-mining utility". This conclusi
 on seems to undermine existing work on data anonymization. In this paper
 \, we analyze the fundamental characteristics of privacy and utility\, a
 nd show that it is inappropriate to directly compare privacy with utilit
 y. We then observe that the privacy-utility tradeoff in data publishing 
 is similar to the risk-return tradeoff in financial investment\, and pro
 pose an integrated framework for considering privacy-utility tradeoff\, 
 borrowing concepts from the Modern Portfolio Theory for financial invest
 ment. Finally\, we evaluate our methodology on the Adult dataset from th
 e UCI machine learning repository. Our results clarify several common mi
 sconceptions about data utility and provide data publishers with useful
  guide
 lines on choosing the right tradeoff between privacy and utility.
SUMMARY:On the Tradeoff Between Privacy and Utility in Data Publishing
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T153000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T151500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5003
DESCRIPTION:Many scalable data mining tasks rely on active learning to pr
 ovide the most useful accurately labeled instances. However\, what if th
 ere are multiple labeling sources ('oracles' or 'experts') with differen
 t but unknown reliabilities?  With the recent advent of inexpensive and 
 scalable online annotation tools\, such as Amazon's Mechanical Turk\, th
 e labeling process has become more vulnerable to noise - and without pri
 or knowledge of the accuracy of each individual labeler.  This paper add
 resses exactly such a challenge: how to jointly learn the accuracy of la
 beling sources and obtain the most informative labels for the active lea
 rning task at hand while minimizing total labeling effort.  More
  specifically\, we present IEThresh (Interval Estimate Threshold) as a
  strategy to int
 elligently select the expert(s) with the highest estimated labeling accu
 racy.  IEThresh estimates a confidence interval for the reliability of e
 ach expert and filters out the one(s) whose estimated upper-bound confi
 dence interval is below a threshold - which jointly optimizes expected a
 ccuracy (mean) and the need to better estimate the expert's accuracy
  (varian
 ce).  Our framework is flexible enough to work with a wide range of diff
 erent noise levels and outperforms baselines such as asking all availabl
 e experts and random expert selection. In particular\, IEThresh achieves
  a given level of accuracy with less than half the queries issued by all
 -experts labeling and less than a third of the queries required by
  random e
 xpert selection on datasets such as the UCI mushroom one. The results sh
 ow that our method naturally balances exploration and exploitation as it
  gains knowledge of which experts to rely upon\, and selects them with i
 ncreasing frequency.
SUMMARY:Efficiently Learning the Accuracy of Labeling Sources for Selecti
 ve Sampling
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T105500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T103000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5004
DESCRIPTION:The availability and the accuracy of the data dictate the suc
 cess of a data mining application.  Increasingly\, there is a need to re
 sort to on-line data collection to address the problem of data availabil
 ity. However\, participants in on-line data collection applications are 
 naturally distrustful of the data collector as well as their peer respon
 dents\, resulting in inaccurate data collected as the respondents refuse
  to provide truthful data for fear of collusion attacks. The current ano
 nymity-preserving solutions for on-line data collection are unable to ad
 equately resist such attacks in a scalable fashion. In this paper\, we
  present an efficient anonymous data collection protocol for a malicious
  environment such as the Internet. The protocol employs cryptographic and
  random shuffling techniques to preserve participants' anonymity. The pr
 oposed method is collusion-resistant and guarantees that an attacker wil
 l be unable to breach an honest participant's anonymity unless she contr
 ols all N-1 participants. In addition\, our method is efficient and achi
 eved 15-42% communication overhead reduction in comparison to the prior 
 state-of-the-art methods.
SUMMARY:Collusion-Resistant Anonymous Data Collection Method
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T112000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T105500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5005
DESCRIPTION:In recent years\, the blogosphere has experienced a substanti
 al increase in the number of posts published daily\, forcing users to co
 pe with information overload. The task of guiding users through this flo
 od of information has thus become critical. To address this issue\, we p
 resent a principled approach for picking a set of posts that best covers
  the important stories in the blogosphere. \n\nWe define a simple and el
 egant notion of coverage and formalize it as a submodular optimization p
 roblem\, for which we can efficiently compute a near-optimal solution. I
 n addition\, since people have varied interests\, the ideal coverage alg
 orithm should incorporate user preferences in order to tailor the select
 ed posts to individual tastes. We define the problem of learning a perso
 nalized coverage function by providing an appropriate user-interaction m
 odel and formalizing an online learning framework for this task. We then
  provide a no-regret algorithm which can quickly learn a user's preferen
 ces from limited feedback.\n\nWe evaluate our coverage and personalizati
 on algorithms extensively over real blog data. Results from a user study
  show that our simple coverage algorithm does as well as most popular bl
 og aggregation sites\, including Google Blog Search\, Yahoo! Buzz\, and 
 Digg. Furthermore\, we demonstrate empirically that our algorithm can su
 ccessfully adapt to user preferences. We believe that our technique\, es
 pecially with personalization\, can dramatically reduce information over
 load.
SUMMARY:Turning Down the Noise in the Blogosphere
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T165000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T162500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5006
DESCRIPTION:We propose two approximation algorithms for identifying
  communities in dynamic social networks. Communities are intuitively
  characterized as unusually densely knit subsets of a social network.
  This notion becomes more problematic if the social interactions change
  over time. Aggregating social networks over time can radically
  misrepresent the existing and changing community structure. Recently\,
  we have proposed an optimization-based framework for modeling dynamic
  community structure. Also\, we have proposed an algorithm for finding
  such structure based on maximum weight bipartite matching. In this
  paper\, we analyze its performance guarantee for a special case where
  all actors can be observed at all times. In such instances\, we show
  that the algorithm is a small constant factor approximation of the
  optimum. We use a similar idea to design an approximation algorithm
  for the general case where some individuals are possibly unobserved at
  times\, and to show that the approximation factor increases twofold but
  remains a constant regardless of the input size. This is the first
  algorithm for inferring communities in dynamic networks with a provable
  approximation guarantee. We demonstrate the general algorithm on real
  data sets. The results confirm the efficiency and effectiveness of the
  algorithm in identifying dynamic communities.
SUMMARY:Constant-Factor Approximation Algorithms for Identifying Dynamic 
 Communities
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T114500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T112000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5007
DESCRIPTION:We propose a novel latent factor model to accurately predict 
 response for large scale dyadic data in the presence of features. Our ap
 proach is based on a model that predicts response as a multiplicative fu
 nction of row and column latent factors that are estimated through separ
 ate regressions on known row and column features. In fact\, our model pr
 ovides a single unified framework to address both cold and warm start sc
 enarios that are commonplace in practical applications like recommender 
 systems\, online advertising\, web search\, etc. We provide scalable and
  accurate model fitting methods based on Iterated Conditional Mode and M
 onte Carlo EM algorithms. We show our model induces a stochastic process
  on the dyadic space with kernel (covariance) given by a polynomial func
 tion of features. Methods that generalize our procedure to estimate fact
 ors in an online fashion for dynamic applications are also considered. O
 ur method is illustrated on benchmark datasets and a novel content recom
 mendation application that arises in the context of Yahoo! Front Page. W
 e report significant improvements over several commonly used methods on 
 all datasets.
SUMMARY:Regression-based Latent Factor Models
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T113500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T112000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5008
DESCRIPTION:Conjoint analysis is one of the most popular market research 
 methodologies for assessing how customers with heterogeneous preferences
  appraise various objective characteristics in products or services\, wh
 ich provides critical inputs for many marketing decisions\, e.g. optimal
  design of new products and target market selection. Nowadays it has
  become practical in e-commerce applications to collect millions of
  samples q
 uickly. However\, the large-scale data sets make traditional conjoint an
 alysis coupled with sophisticated Monte Carlo simulation for parameter e
 stimation computationally prohibitive. In this paper\, we report a succe
 ssful large-scale case study of conjoint analysis on click-through strea
 m in a real-world application at Yahoo!. We consider identifying users' 
 heterogeneous preferences from millions of click/view events and building
  predictive models to classify new users into segments of distinct behav
 ior patterns. A scalable conjoint analysis technique\, known as tensor
  segmentation\, is developed by utilizing logistic tensor regression in
  the standard partworth framework for solutions. In offline analysis on
  the samp
 les collected from a random bucket of Yahoo! Front Page Today Module\, w
 e compare tensor segmentation against other segmentation schemes using d
 emographic information\, and study user preferences on article content w
 ithin tensor segments. Our knowledge acquired in the segmentation result
 s also provides assistance to editors in content management and user tar
 geting. The usefulness of our approach is further verified by the observ
 ations in a bucket test launched in Dec. 2008.
SUMMARY:A Case Study of Behavior-driven Conjoint Analysis on Yahoo! Front
  Page Today Module
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T142500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T140000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5009
DESCRIPTION:Motivated by structural properties of the Web graph that supp
 ort efficient data structures for in-memory adjacency queries\, we study
  the extent to which a large network can be compressed.  Boldi and Vigna
  (WWW 2004) showed that Web graphs can be compressed down to three bi
 ts of storage per edge\; we study the compressibility of social networks
  where again adjacency queries are a fundamental primitive. To this end\
 , we propose simple combinatorial formulations that encapsulate efficien
 t compressibility of graphs. We show that some of the problems are NP-ha
 rd yet admit effective heuristics\, some of which can exploit properties
  of social networks such as link reciprocity. Our extensive experiments 
 show that social networks and the Web graph exhibit vastly different com
 pressibility characteristics.
SUMMARY:On Compressing Social Networks
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T121500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5010
DESCRIPTION:The aim of data mining is to find novel and actionable insigh
 ts in data. However\, most algorithms typically just find a single (poss
 ibly non-novel/actionable) interpretation of the data even though altern
 atives could exist. The problem of finding an alternative to a given ori
 ginal clustering has received little attention in the literature. Curren
 t techniques (including our previous work) are unfocused/unrefined in th
 at they broadly attempt to find an alternative clustering but do not spe
 cify which properties of the original clustering should or should not be
  retained. In this work\, we explore a principled and flexible framework
  in order to find alternative clusterings of the data. The approach is p
 rincipled since it poses a constrained optimization problem\, so its exa
 ct behavior is understood. It is flexible since the user can formally sp
 ecify positive and negative feedback based on the existing clustering\, 
 which ranges from which clusters to keep (or not) to making a trade-off 
 between alternativeness and clustering quality.
SUMMARY:A Principled and Flexible Framework for Finding Alternative Clust
 erings
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T142500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T140000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5011
DESCRIPTION:Various online social networks (OSNs) have been developed rap
 idly on the Internet.  Researchers have analyzed different properties of
  such OSNs\, mainly focusing on the formation and evolution of the netwo
 rks as well as the information propagation over the networks.  In knowle
 dge-sharing OSNs\, such as blogs and question answering systems\, issues
  on how users participate in the network and how users "generate/contri
 bute" knowledge are vital to the sustained and healthy growth of the ne
 tworks.  However\, related discussions have not been reported in the res
 earch literature.  \n\nIn this work\, we empirically study workloads fro
 m three popular knowledge-sharing OSNs\, including a blog system\, a soc
 ial bookmark sharing network\, and a question answering social network t
 o examine these properties.  Our analysis consistently shows that (1) us
 ers' posting behavior in these networks exhibits strong daily and weekly
  patterns\, but the user active time in these OSNs does not follow expon
 ential distributions\; (2) the user posting behavior in these OSNs follo
 ws stretched exponential distributions instead of power-law distribution
 s\, indicating the influence of a small number of core users cannot domi
 nate the network\; (3) the distributions of user contributions on high-q
 uality and effort-consuming contents in these OSNs have smaller stretch 
 factors for the stretched exponential distribution.  Our study provides 
 insights into user activity patterns and lays out an analytical foundati
 on for further understanding various properties of these OSNs.
SUMMARY:Analyzing Patterns of User Content Generation in Online Social Ne
 tworks
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T120500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T115000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5012
DESCRIPTION:Stability is an important yet under-addressed issue in featur
 e selection from high-dimensional and small sample data. In this paper\,
  we show that stability of feature selection has a strong dependency on 
 sample size. We propose a novel framework for stable feature selection w
 hich first identifies consensus feature groups from subsampling of train
 ing samples\, and then performs feature selection by treating each cons
 ensus feature group as a single entity. Experiments on both synthetic an
 d real-world data sets show that an algorithm developed under this frame
 work is effective at alleviating the problem of small sample size and le
 ads to more stable feature selection results and comparable or better ge
 neralization performance than state-of-the-art feature selection algorit
 hms. Synthetic data sets and algorithm source code are available at http
 ://www.cs.binghamton.edu/~lyu/KDD09/.
SUMMARY:Consensus Group Stable Feature Selection
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T151500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T145000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5013
DESCRIPTION:Logistic Regression is a well-known classification method tha
 t has been used widely in many applications of data mining\, machine lea
 rning\, computer vision\, and bioinformatics. Sparse logistic regression
  embeds feature selection in the classification framework using the L1-n
 orm regularization\, and is attractive in many applications involving hi
 gh-dimensional data. In this paper\, we propose Lassplore for solving la
 rge-scale sparse logistic regression. Specifically\, we formulate the pr
 oblem as an L1-ball constrained smooth convex optimization\, and propose
  to solve it using Nesterov's method\, an optimal first-order black-box
  method for smooth convex optimization. One of the critical issues in
  the use of Nesterov's method is the estimation of the step size at
  each optimization iteration. Previous approaches either apply a
  constant step size\, which assumes that the Lipschitz constant of the
  gradient is known in advance\, or require a sequence of decreasing
  step sizes\, which leads to slow convergence in practice. In this
  paper\, we propose an adaptive line search scheme which tunes the step
  size adaptively while guaranteeing the optimal convergence rate.
  Empirical co
 mparisons with several state-of-the-art algorithms demonstrate the effic
 iency of the proposed Lassplore algorithm for large-scale problems.
SUMMARY:Large-Scale Sparse Logistic Regression
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T115000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T112500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5014
DESCRIPTION:We present an algorithm\, called the Offset Tree\, for learni
 ng to make decisions in situations where the payoff of only one choice i
 s observed\, rather than all choices.  The algorithm reduces this settin
 g to binary classification\, allowing one to reuse any existing\, fully 
 supervised binary classification algorithm in this partial information s
 etting.  We show that the Offset Tree is an optimal reduction to binary 
 classification.  In particular\, it has regret at most (k-1) times the r
 egret of the binary classifier it uses (where k is the number of choices
 )\, and no reduction to binary classification can do better.  This reduc
 tion is also computationally optimal\, both at training and test time\, 
 requiring just O(log k) work to train on an example or make a prediction
 .\n\nExperiments with the Offset Tree show that it generally performs be
 tter than several alternative approaches.
SUMMARY:The Offset Tree for Learning with Partial Labels
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T121500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5015
DESCRIPTION:Implicit user feedback\, including click-through and subseque
 nt browsing behavior\, is crucial for evaluating and improving the quali
 ty of results returned by search engines. Several recent studies have
  used post-result browsing behavior including the sites visited\,
  the number of clicks\, and the dwell time on site in order to improve t
 he ranking of search results.\n\nIn this paper\, we first study user beh
 avior on sponsored search results  (i.e.\, the advertisements displayed 
 by search engines next to the ordinary\, organic results)\, and compare 
 this behavior to that of organic results. Second\, to exploit post-resul
 t user behavior for better ranking of sponsored results\, we focus on id
 entifying patterns in user behavior and predicting expected on-site
  actions in future instances. In particular\, we show how post-result be
 havior depends on various properties of the queries\, advertisement\, si
 tes\, and users\, and build a classifier using properties such as these 
 to predict certain aspects of the user behavior. Additionally\, we devel
 op a generative model able to simulate observed user activity by utiliz
 ing a mixture of Pareto distributions. We conduct experiments based on b
 illions of real navigation trails collected by a major search engine's b
 rowser toolbar.
SUMMARY:Modeling and Predicting User Behavior in Sponsored Search
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T105500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T103000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5016
DESCRIPTION:Attribution of climate change to causal factors has been base
 d predominantly on simulations using physical climate models\, which hav
 e inherent limitations in describing such a complex and chaotic system. 
 We propose an alternative\, data-centric\, approach that relies on actua
 l measurements of climate observations and human and natural forcing fac
 tors. Specifically\, we develop a novel method to infer causality from s
 patial-temporal data\, as well as a procedure to incorporate extreme val
 ue modeling into our method in order to address the attribution of extre
 me climate events\, such as heatwaves. Our experimental results on a rea
 l world dataset indicate that changes in temperature are not solely acco
 unted for by solar radiance\, but attributed more significantly to CO2 a
 nd other greenhouse gases. Combined with extreme value modeling\, we als
 o show that there has been a significant increase in the intensity of ex
 treme temperatures\, and that such changes in extreme temperature are al
 so attributable to greenhouse gases. These preliminary results suggest t
 hat our approach can offer a useful alternative to the simulation-based 
 approach to climate modeling and attribution\, and provide valuable insi
 ghts from a fresh perspective.
SUMMARY:Spatial-temporal causal modeling for climate change attribution
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T162500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T160000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5017
DESCRIPTION:Graphs or networks can be used to model complex systems. Dete
 cting community structures from large network data is a classic and chal
 lenging task. In this paper\, we propose a novel community detection alg
 orithm\, which utilizes a dynamic process by contradicting the network t
 opology and the topology-based propinquity\, where the propinquity is a 
 measure of the probability for a pair of nodes involved in a coherent co
 mmunity structure. Through several rounds of mutual reinforcement betwee
 n topology and propinquity\, the community structures are expected to na
 turally emerge. The overlapping vertices shared between communities can 
 also be easily identified by an additional simple postprocessing. To ach
 ieve better efficiency\, the propinquity is incrementally calculated. We
  implement the algorithm on a vertex-oriented bulk synchronous parallel (
 BSP) model so that the mining load can be distributed on thousands of ma
 chines. We obtained interesting experimental results on several real net
 work data sets.
SUMMARY:Parallel Community Detection on Large Networks with Propinquity D
 ynamics
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T112000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T105500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5018
DESCRIPTION:As clustering methods are often sensitive to parameter tuning
 \, obtaining stability in clustering results is an important task. In th
 is work\, we aim at improving clustering stability by attempting to dimi
 nish the influence of algorithmic inconsistencies and enhance the signal
  that comes from the data. We propose a mechanism that takes m clusterin
 gs as input and outputs m clusterings of comparable quality\, which ar
 e in higher agreement with each other. We call our method the Clustering
  Agreement Process (CAP). To preserve the clustering quality\, CAP uses 
 the same optimization procedure as used in clustering. In particular\, w
 e study the stability problem of randomized clustering methods (which us
 ually produce different results at each run). We focus on methods that a
 re based on inference in a combinatorial Markov Random Field (or Comraf\
 , for short) of a simple topology. We instantiate CAP as inference withi
 n a more complex\, bipartite Comraf. We test the resulting system on fou
 r datasets\, three of which are medium-sized text collections\, while th
 e fourth is a large-scale user/movie dataset. First\, in all the four ca
 ses\, our system significantly improves the clustering stability measure
 d in terms of the macro-averaged Jaccard index. Second\, in all the four
  cases our system managed to significantly improve clustering quality as
  well\, achieving the state-of-the-art results. Third\, our system signi
 ficantly improves stability of consensus clustering built on top of the 
 randomized clustering solutions.
SUMMARY:Improving Clustering Stability with Combinatorial MRFs
LOCATION:Miles Davis A\,B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T162500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T160000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5019
DESCRIPTION:Best Application Paper Award Runner-Up\n\n\nMotivation: Data 
 centers are a critical component of modern IT infrastructure but are als
 o among the worst environmental offenders through their increasing energ
 y usage and the resulting large carbon footprints. Efficient management 
 of data centers\, including power management\, networking\, and cooling 
 infrastructure\, is hence crucial to sustainability. In the absence of a
  'first-principles' approach to manage these complex components and thei
 r interactions\, data-driven approaches have become attractive and tenab
 le.\n\nResults: We present a temporal data mining solution to model and 
 optimize performance of data center chillers\, a key component of the co
 oling infrastructure. It helps bridge raw\, numeric\, time-series inform
 ation from sensor streams toward higher level characterizations of chill
 er behavior\, suitable for a data center engineer. To aid in this transd
 uction\, temporal data streams are first encoded into a symbolic represe
 ntation\, next run-length encoded segments are mined to form frequent mo
 tifs in time series\, and finally these metrics are evaluated by their c
 ontributions to sustainability. A key innovation in our application is t
 he ability to intersperse "don't care" transitions (e.g.\, transients)
  in continuous-valued time series data\, an advantage we inherit by the 
 application of frequent episode mining to symbolized representations of 
 numeric time series. Our approach provides both qualitative and quantita
 tive characterizations of the sensor streams to the data center engineer
 \, to aid him in tuning chiller operating characteristics. This system i
 s currently being prototyped for a data center managed by HP and experim
 ental results from this application reveal the promise of our approach.
SUMMARY:Sustainable Operation and Management of Data Center Chillers usin
 g Temporal Data Mining
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T153500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T152000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5020
DESCRIPTION:Classification is a core task in knowledge discovery and data
  mining\, and  there has been substantial research effort in developing 
 sophisticated classification models. In a parallel thread\,  recent work
  from the NLP community suggests that for tasks such as natural language
  disambiguation even a simple algorithm can outperform a sophisticated o
 ne\, if it is provided with large quantities of high quality training da
 ta. In those applications\,  training data occurs naturally in text corp
 ora\, and high quality training data sets running into billions of words
  have been reportedly used.\n\nWe explore how we can apply the lessons f
 rom the NLP community to KDD tasks. Specifically\, we investigate how to
  identify data sources that can yield training data at low cost and stud
 y whether the quantity of the automatically extracted training data can 
 compensate for its lower quality. We carry out this investigation for th
 e specific task of inferring whether a search query has commercial inte
 nt. We mine toolbar and click logs to extract queries from sites that ar
 e predominantly commercial (e.g.\, Amazon) and non-commercial (e.g.\, Wi
 kipedia).  We compare the accuracy obtained using such training data aga
 inst  manually labeled training data. Our results show that we can have 
 large accuracy gains using automatically extracted training data at much
  lower cost.
SUMMARY:Improving Classification Accuracy Using Automatically Extracted T
 raining Data
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T153000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T151500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5021
DESCRIPTION:Query result clustering has recently attracted a lot of atten
 tion to provide users with a succinct overview of relevant results. Howe
 ver\, little work has been done on organizing the query results for obje
 ct-level search. Object-level search result clustering is challenging be
 cause we need to support diverse similarity notions over object-specific
  features (such as the price and weight of a product) of heterogeneous d
 omains. To address this challenge\, we propose a hybrid subspace cluster
 ing algorithm called Hydra. Algorithm Hydra captures the user perception
  of diverse similarity notions from millions of Web pages and disambigua
 tes different senses using feature-based subspace locality measures. Our
  proposed solution\, by combining wisdom of crowds and wisdom of data\, 
 achieves robustness and efficiency over existing approaches. We extensiv
 ely evaluate our proposed framework and demonstrate how to enrich user e
 xperiences in object-level search using real-world product search scen
 arios.
SUMMARY:Query Result Clustering for Object-level Search
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T112000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T105500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5022
DESCRIPTION:The explosion of user-generated content on the Web has led to
  new opportunities and significant challenges for companies that are i
 ncreasingly concerned about monitoring the discussion around their produ
 cts. Tracking such discussion on weblogs provides useful insight on ho
 w to improve products or market them more effectively. An important comp
 onent of such analysis is to characterize the sentiment expressed in blo
 gs about specific brands and products. Sentiment Analysis focuses on thi
 s task of automatically identifying whether a piece of text expresses a 
 positive or negative opinion about the subject matter. Most previous wor
 k in this area uses prior lexical knowledge in terms of the sentiment-po
 larity of words. In contrast\, some recent approaches treat the task as 
 a text classification problem\, where they learn to classify sentiment b
 ased only on labeled training data. In this paper\, we present a unified
  framework in which one can use background lexical information in terms 
 of word-class associations\, and refine this information for specific do
 mains using any available training examples.  Empirical results on diver
 se domains show that our approach performs better than using background 
 knowledge or training data in isolation\, as well as alternative approac
 hes to using lexical knowledge with text classification.
SUMMARY:Sentiment Analysis of Blogs by Combining Lexical Knowledge with T
 ext Classification
LOCATION:Louis Armstrong A\,B\,C+D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T151500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T145000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5023
DESCRIPTION:Best Application Paper Award Winner\nBehavioral targeting (BT
 ) leverages historical user behavior to select the ads most relevant to 
 users to display. The state-of-the-art of BT derives a linear Poisson re
 gression model from fine-grained user behavioral data and predicts click
 -through rate (CTR) from user history. We designed and implemented a hig
 hly scalable and efficient solution to BT using the Hadoop MapReduce
  framework. With our parallel algorithm and the resulting system\, we
  can build more than 450 BT-category models from Yahoo's entire user
  base within one day\, a scale that one cannot even imagine with prior
  systems. More
 over\, our approach has yielded 20% CTR lift over the existing productio
 n system by leveraging the well-grounded probabilistic model fitted from
  a much larger training dataset.\n\nSpecifically\, our major contributio
 ns include: (1) A MapReduce statistical learning algorithm and implement
 ation that achieve optimal data parallelism\, task parallelism\, and loa
 d balance in spite of the typically skewed distribution of domain data. 
 (2) An in-place feature vector generation algorithm with linear time com
 plexity O(n) regardless of the granularity of sliding target window. (3)
  An in-memory caching scheme that significantly reduces the number of di
 sk IOs to make large-scale learning practical. (4) Highly efficient data
  structures and sparse representations of models and data to enable fast
  model updates. We believe that our work makes significant contributions
  to solving large-scale machine learning problems of industrial relevanc
 e in general. Finally\, we report comprehensive experimental results\, u
 sing industrial proprietary codebase and datasets.
SUMMARY:Large-Scale Behavioral Targeting
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T171500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5024
DESCRIPTION:Social networks have become a major focus of research in rece
 nt years\, initially directed towards static networks but increasingly
  towards dynamic ones.  In this paper\, we investigate how different pre
 -processing decisions and different network forces such as selection and
  influence affect the modeling of dynamic networks. We also present empi
 rical justification for some of the modeling assumptions made in dynamic
  network analysis (e.g.\, first-order Markovian assumption) and develop 
 metrics to measure the alignment between links and attributes under diff
 erent strategies of using the historical network data. We also demonstra
 te the effect of attribute drift\, that is\, the importance of individua
 l attributes in forming links changes over time.
SUMMARY:Measuring the Effects of Preprocessing Decisions and Network Forc
 es in Dynamic Network Analysis
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T151500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T145000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5025
DESCRIPTION:To take the first step beyond keyword-based search toward ent
 ity-based search\, suitable token spans ("spots") on documents must be i
 dentified as references to real-world entities from an entity catalog.  
 Several systems have been proposed to link spots on Web pages to entitie
 s in Wikipedia.  They are largely based on local compatibility between t
 he text around the spot and textual metadata associated with the entity.
   Two recent systems exploit inter-label dependencies\, but in limited w
 ays.  We propose a general collective disambiguation approach.  Our prem
 ise is that coherent documents refer to entities from one or a few relat
 ed topics or domains.  We give formulations for the trade-off between lo
 cal spot-to-entity compatibility and measures of global coherence betwee
 n entities.  Optimizing the overall entity assignment is NP-hard. We inv
 estigate practical solutions based on local hill-climbing\, rounding int
 eger linear programs\, and pre-clustering entities followed by local opt
 imization within clusters.  In experiments involving over a hundred manu
 ally-annotated Web pages and tens of thousands of spots\, our approaches
  significantly outperform recently-proposed algorithms.
SUMMARY:Collective Annotation of Wikipedia Entities in Web Text
LOCATION:Louis Armstrong & Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T171500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5026
DESCRIPTION:Malicious Web sites are a cornerstone of Internet criminal ac
 tivities.  As a result\, there has been broad interest in developing sys
 tems to prevent the end user from visiting such sites.  In this paper\, 
 we describe an approach to this problem based on automated URL classific
 ation\, using statistical methods to discover the tell-tale lexical and 
 host-based properties of malicious Web site URLs.  These methods are abl
 e to learn highly predictive models by extracting and automatically anal
 yzing tens of thousands of features potentially indicative of suspicious
  URLs.  The resulting classifiers obtain 95-99% accuracy\, detecting lar
 ge numbers of malicious Web sites from their URLs\, with only modest fal
 se positives.
SUMMARY:Beyond Blacklists: Learning to Detect Malicious Web Sites from Su
 spicious URLs
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T100000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5027
DESCRIPTION:Data mining techniques use score functions to quantify how we
 ll a model fits a given data set.  Parameters are estimated by optimisin
 g the fit\, as measured by the chosen score function\, and model choice 
 is guided by the size of the scores for the different models.  Since dif
 ferent score functions summarise the fit in different ways\, it is impor
 tant to choose a function which matches the objectives of the data minin
 g exercise.  For predictive classification problems\, a wide variety of 
 score functions exist\, including measures such as precision and recall\
 , the F measure\, misclassification rate\, the area under the ROC curve 
 (the AUC)\, and others.  The first four of these require a classificatio
 n threshold to be chosen\, a choice which may not be easy\, or may even 
 be impossible\, especially when the classification rule is to be applied
  in the future.  In contrast\, the AUC does not require the specificatio
 n of a classification threshold\, but summarises performance over the ra
 nge of possible threshold choices.  However\, unfortunately\, and despit
 e the widespread use of the AUC\, it has a previously unrecognised funda
 mental incoherence lying at the core of its definition.  This means that
  using the AUC can lead to poor model choice and unnecessary misclassific
 ations.  The AUC is set in context\, its deficiency explained and the im
 plications illustrated - with the bottom line being that the AUC should 
 not be used.  A family of coherent alternative scores is described.  The
  ideas are illustrated with examples from bank loans\, fraud\, face reco
 gnition\, and health screening.
SUMMARY:Mismatched Models\, Wrong Results\, and Dreadful Decisions
LOCATION:La Seine
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T100000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5028
DESCRIPTION:Data mining research has developed many algorithms for variou
 s analysis tasks on large and complex datasets. However\, assessing the 
 significance of data mining results has received less attention. Analyti
 cal methods are rarely available\, and hence one has to use computationa
 lly intensive methods. Randomization approaches based on null models pro
 vide\, at least in principle\, a general approach that can be used to ob
 tain empirical p-values for various types of data mining approaches.  I 
 review some of the recent work in this area\, outlining some of the open
  questions and problems.
SUMMARY:Randomization Methods in Data Mining
LOCATION:La Seine
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T100000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5029
DESCRIPTION:Network science focuses on relationships between social enti
 ties.  It is used widely in the social and behavioral sciences\, as well
  as in political science\, economics\, organizational science\, and indu
 strial engineering.  The social network perspective has been developed o
 ver the last sixty years by researchers in psychology\, sociology\, and 
 anthropology\, and more recently\, to a lesser extent\, in physics.  Net
 work science is gaining recognition and standing in the general social 
 and behavioral science communities as the theoretical basis for examinin
 g social structures.  This basis has been clearly defined by many theori
 sts\, and the paradigm convincingly applied to important substantive pro
 blems.  However\, the paradigm requires a new and different set of conce
 pts and analytic tools\, beyond those provided by standard quantitativ
 e (particularly\, statistical) methods.  These concepts and tools are th
 e topics of this talk.
SUMMARY:Network Science: An Introduction to Recent Statistical Approach
 es
LOCATION:La Seine
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T190000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T173000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5030
DESCRIPTION:At KDD-2009 in Paris\, a panel on open standards and cloud co
 mputing addressed emerging trends for data mining applications in scienc
 e and industry. This report summarizes the answers from a distinguished 
 group of thought leaders representing key software vendors in the data m
 ining industry.\n\nSupporting open standards and the Predictive Model Ma
 rkup Language (PMML) in particular\, the panel members discuss topics re
 garding the adoption of prevailing standards\, benefits of interoperabil
 ity for business users\, and the practical application of predictive mod
 els.  We conclude with an assessment of emerging technology trends and 
 the impact that cloud computing will have on applications as well as lic
 ensing models for the predictive analytics industry.
SUMMARY:Panel: Open Standards and Cloud Computing
LOCATION:room tbd
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5031
DESCRIPTION:Many organizations now devote significant fractions of their 
 advertising/outreach budgets to online advertising\; ad-networks like Ya
 hoo!\, Google\, MSN have responded by constructing new kinds of economic
  models and perform the fundamental task of matching the most relevant a
 ds (selected from a large inventory) for a (query\,user) pair in a given
  context. Nearly all of the challenges that arise are substantially data
 - or model-driven (or both). Computational Advertising is a relatively n
 ew scientific sub-discipline at the intersection of large scale search 
 and text analysis\, information retrieval\, statistical modeling\, machi
 ne learning\, optimization and microeconomics that addresses this match-ma
 king problem and provides unprecedented opportunities to data miners.\n\
 nTopics covered include a comprehensive introduction to several advertis
 ing forms (sponsored search\, contextual advertising\, display advertising
 )\, revenue models (pay-per-click\, pay-per-view\, pay-per-conversion) a
 nd data mining challenges involved\, along with an overview of state-of-
 the-art techniques in the area with a detailed discussion of open proble
 ms. We will cover information retrieval techniques and their limitations
 \; data mining challenges involved in performing ad matching through cli
 ckstream data and challenging optimization issues that arise in display 
 advertising. In particular\, we will cover statistical modeling techniqu
 es for clickstream data and explore/exploit schemes to perform online ex
 periments for better long-term performance using multi-armed bandit sche
 mes. We also discuss the close relationship of techniques used in recomm
 ender systems to our problem but indicate several additional issues that
  need to be addressed before they become routine in computational adver
 tising.\n\nWe will only assume basic knowledge of statistical methods\; 
 no prior knowledge of online advertising is required. In fact\, the firs
 t hour that provides an introduction to the area would be appropriate fo
 r all registered attendees of KDD 2009. The second half would require fa
 miliarity with basic concepts like regression\, probability distribution
 s and appreciation of issues involved in fitting statistical models to l
 arge scale applications. No prior knowledge of multi-armed bandits would
  be assumed.
SUMMARY:Tutorial T1: Statistical Challenges in Computational Advertising
LOCATION:Louis Armstrong A & B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5032
DESCRIPTION:While SIGKDD has traditionally enjoyed an unusually high qual
 ity of reviewing\, there is no doubt that publishing in SIGKDD (and othe
 r high quality data mining conferences) is very challenging. This is esp
 ecially true for young faculty\, grad students whose primary advisor is 
 not an experienced SIGKDD author\, or people from outside the community 
 (e.g. a biologist or mathematician who has a result that might greatly i
 nterest the data mining community).\n\nIn this tutorial Dr. Keogh will d
 emonstrate some simple ideas to enhance the probability of success in ge
 tting your paper published in a top data mining conference\; and after t
 he work is published\, getting it highly cited.\n\nThese tips and tricks
  are based on 12 years' experience as a SIGKDD author and reviewer\, and 
 wisdom solicited from many of the most prolific data mining researchers/
 reviewers.\n\nTopics covered in the tutorial include:\n\n    * Finding t
 he right problems to work on (80% of the battle).\n    * Don't summarize
 \, sell! Writing abstracts that put the reviewer on your side from the s
 tart.\n    * Getting or creating the perfect dataset.\n    * Experiments
  that tell a story.\n    * Making effective and interesting figures.\n  
   * Getting the reviewers on your side.\n    * The top-ten avoidable rea
 sons why papers get rejected from SIGKDD.\n    * Three simple tricks to 
 increase the number of citations to your work.\n\nWhile Dr. Keogh does n
 ot claim to have a “magic bullet” for publishing in SIGKDD\, his sig
 nificant track record of publishing in top data mining venues\, combined
  with extensive (and deliberately uncredited) experience in helping youn
 ger researchers “break-in” to SIGKDD has placed him in a unique pos
 ition to share useful and actionable advice.\n\nWhile writing this tutor
 ial\, Dr. Keogh sought and received advice from many respected data mini
 ng researchers\; their advice is incorporated into this tutorial.
SUMMARY:Tutorial T2: How to do good research\, get it published in SIGKDD
  and get it cited!
LOCATION:Miles Davis C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5033
DESCRIPTION:Numerous real-world datasets are in matrix form\, thus matrix
  algebra\, linear and multilinear\, provides important algorithmic tools
  for analyzing them. The main type of datasets of interest in this tutor
 ial are graphs. Important datasets modeled as graphs include the Interne
 t\, the Web\, social networks (e.g.\, Facebook\, LinkedIn)\, computer netw
 orks\, biological networks and many more.\n\nWe will discuss how we repr
 esent a graph as a matrix (adjacency matrix\, Laplacian) and the importa
 nt properties of those representations. We will then show how these prop
 erties are used in several important problems\, including node importanc
 e via random walks (PageRank)\, community detection (METIS\, Cheeger ine
 quality)\, graph isomorphism and graph similarity. Important dimensional
 ity reduction techniques (SVD and random projections) will be discussed 
 in the context of graph mining problems.\n\nFurthermore\, we provide a s
 urvey of the work on the epidemic threshold\, node proximity and center-
 piece subgraphs. State-of-the-art graph mining tools for analyzing time ev
 olving graphs will also be presented. Throughout the tutorial\, patterns i
 n static and time evolving\, weighted and unweighted real-world graphs w
 ill be presented.\n\nThe target audience are data mining professionals w
 ho wish to know the most important matrix algebra tools\, their applicat
 ions in large graph mining and the theory behind them.\n\nPrerequisites:
  Computer science background (B.Sc or equivalent)\; familiarity with und
 ergraduate linear algebra.\n\nDemos will be presented.
SUMMARY:Tutorial T3: Large Graph-Mining: Power Tools and a Practitioner's
  Guide
LOCATION:La Seine C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5034
DESCRIPTION:The web provides an unprecedented opportunity to evaluate ide
 as quickly using controlled experiments\, also called randomized experim
 ents\, A/B tests (and their generalizations)\, split tests\, and MultiVa
 riable Tests (MVT). Controlled experiments embody the best scientific de
 sign for establishing a causal relationship between changes and their in
 fluence on user-observable behavior. Data Mining and Knowledge Discovery
  techniques can then be used to analyze the data from such experiments. 
 The tutorial will provide a survey and practical guide to running contro
 lled experiments based on the recently published survey article in the D
 ata Mining and Knowledge Discovery Journal\, co-authored with two of
  the tutorial co-presenters (http://exp-platform.com/dmkd_survey.aspx)\,
  and based on the book “Always Be Testing” co-authored by the 3rd tu
 torial co-presenter (http://www.amazon.com/Always-Be-Testing-Complete-Op
 timizer/dp/0470290633). The book includes use of industry tools\, such a
 s Google Website Optimizer and recently ranked #2 on Amazon’s sales ra
 nk for computers/e-commerce books. The tutorial includes multiple real-w
 orld examples of actual controlled experiments (many with surprising res
 ults)\, a review of the theory and the statistics used to plan and analyze
  such experiments\, and a discussion of the limitations and pitfalls that
  might face experimenters. Demos will be shown of some tools that suppor
 t controlled experiments.\n\nA video of a related talk can be found on t
 he videolectures website:\nhttp://videolectures.net/cikm08_kohavi_pgtce/
 \n\n\nThe shorter version of the DMKD survey paper is now part of the cl
 ass reading for several classes at Stanford University (CS147\, CS376)\,
  UCSD (CSE 291)\, and at the University of Washington (CSEP 510).\n\nTop
 ics covered include:\n\n   1. Why online experimentation using controlle
 d experiments is important\n   2. What you need in order to conduct a va
 lid experiment\n   3. Planning and Analysis of basic experiments\n   4. 
 Benefits and limitations of experimentation\n   5. Multivariable experim
 ents: setup\, analysis\, interpretation\, and interactions\n   6. Archit
 ectures\n   7. Using online free and low-cost software services (demos)\
 n   8. Challenges and advanced statistical concepts for experiments\n
SUMMARY:Tutorial T4: Planning\, Running\, and Analyzing Controlled Experi
 ments on the Web
LOCATION:Les Invalides B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T140000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5035
DESCRIPTION:In this tutorial\, we give our perspective on the keys to suc
 cess in application of predictive modeling to competitions like KDD Cup 
 and real-life business intelligence projects. We argue that these two mo
 des of applying predictive modeling share many similarities\, but also h
 ave some important differences. We discuss the main success factors in p
 redictive modeling: domain understanding\, statistical acumen\, and appr
 opriate algorithmic approaches. We describe our relevant experiences in 
 the context of three recent predictive modeling competitions where our t
 eam has had success (KDD Cup 2007 and 2008 and INFORMS DM challenge 2008
 ) and two case studies of projects we have led at IBM Research. We also 
 survey some of the recurring challenges and complexities in practical pr
 edictive modeling applications. One key issue is information leakage\, a
 nd we discuss its definition\, influence\, detection and avoidance. We c
 onsider leakage to be the silent killer of many predictive modeling proj
 ects\, and we demonstrate its impact on the competitions\, and discuss t
 he challenges in addressing it in the real-life projects. Other challeng
 es include framing real-life modeling objectives into predictive modelin
 g\, and usefully applying relational learning concepts when modeling "re
 al-life" complex\, relational datasets.
SUMMARY:Tutorial T5: Predictive Modelling in the Wild: Success Factors in
  Data Mining Competitions and Real-World Applications
LOCATION:Les Invalides B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T140000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5036
DESCRIPTION:As data types and data structures change to keep up with evol
 ving technologies and applications\, data quality problems too have evol
 ved and become more complex and interwoven. Data streams\, web logs\, Wi
 kipedias\, biomedical applications\, video streams and social networking
  websites generate a mind-boggling variety of data types. However\, data
  quality mining\, the use of data mining to manage\, measure and improve
  data quality\, has focused mostly on addressing each category of data g
 litch separately as a static entity.\n\nIn this tutorial we provide a te
 chnical\, KDD-focused account of recent research and developments in dis
 covering and treating complex data anomalies in a broad range of data. I
 n particular\, we highlight new directions in data quality mining: (a) t
 he applicability and effectiveness of the methodologies for various data
  types such as structured\, semi-structured and stream data\, (b) the de
 tection of concomitant data glitches and patterns like the occurrence of
  outliers in data with missing values and duplicates\, or the co-occurre
 nce of missing values and duplicates\, (c) the design of sequential appr
 oaches to data quality mining\, such as workflows composed of a sequence
  of tasks for data quality exploration and analysis. We give an overview
  of past work\, introduce current research in this area including recent
  methods and techniques for discovering complex patterns of anomalies (e
 .g.\, multivariate outliers\, disguised missing values\, combination of 
 different types of noise)\, and highlight new directions and open proble
 ms in data quality mining.\n\nThe tutorial includes extensive case studi
 es and practical examples of mining data quality problems for a variety 
 of large datasets and data types e.g.\, relational\, XML\, data streams.
  We discuss illustrative examples drawn from a variety of domains like C
 RM\, networking\, biology\, and mobility.
SUMMARY:Tutorial T6: New Directions in Data Quality Mining
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T140000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5037
DESCRIPTION:A common task in surveillance\, scientific discovery and data
  cleaning involves monitoring routinely collected data for anomalous eve
 nts. Detecting events in univariate time series data can be effectively 
 accomplished using well-established techniques such as Box-Jenkins model
 s\, regression\, and statistical quality control methods. In recent year
 s\, however\, routinely collected data has become increasingly complex. 
 At each time step\, the data collected can consist of multivariate vecto
 rs and/or be spatial in nature. For instance\, healthcare data used in d
 isease surveillance often consists of multivariate patient records or sp
 atially distributed pharmaceutical sales data. Consequently\, new event 
 detection algorithms have been developed that not only consider temporal
  information but also detect spatial patterns and integrate information 
 from multiple spatio-temporal data streams.\n\nThis tutorial will presen
 t algorithms for event detection\, with a focus on algorithms dealing wi
 th multivariate temporal and spatio-temporal data. We will introduce eve
 nt detection by providing a general formulation of the event detection p
 roblem and describing its unique challenges. In the first half of the tu
 torial\, we will cover algorithms for detecting events in both univariat
 e and multivariate temporal data. The second half will present methods f
 or detecting events in spatio-temporal data\, including several recently
  proposed multivariate approaches.
SUMMARY:Tutorial T7: Event Detection
LOCATION:Louis Armstrong C & D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T140000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5038
DESCRIPTION:The Web has changed our way of life and the Web 2.0 has chang
 ed our way of perceiving and using the Web. Data analysis is now require
 d in a plethora of applications that aim to enrich the experience of peo
 ple with the Web. We first discuss data mining for the social Web. We el
 aborate on social network analysis and focus on community mining\, then 
 go over to recommendation engines and personalization. We discuss the ch
 allenges that emerged through the shift from the traditional Web to Web 
 2.0. We then focus on two issues - the need to protect Web applications 
 from manipulation and the need to make them adaptive towards change. We 
 first discuss manipulations/attacks in recommender systems and present c
 ounter-measures. We then elaborate on how changes/concept drifts can be 
 dealt with in applications that analyze clickstream data\, monitor topic
 s in news and blogs\, or monitor communities and their evolution.\n\nThi
 s tutorial is aimed at novice researchers that have a general background i
 n data mining and are interested in understanding the potential and chal
 lenges pertinent to the social Web. The participants should have a basic
  understanding of recommendation engines\, personalization and text mode
 ling for mining (vector space models). They will learn how basic techniq
 ues are extended and new techniques are designed for mining the Web\, es
 pecially the social Web. They will also learn about issues that are stil
 l open and require further research - research that the tutorial partici
 pants may decide to perform themselves.\n\nOUTLINE\nPART I: Mining the S
 ocial Web [Osmar Zaiane]\nPART II: Recommendations and Personalization i
 n the Social Web [Bamshad Mobasher]\nPART III: Dealing with Evolution in
  the Web [Myra Spiliopoulou]\nPART IV: Mining Web Data Streams [Olfa Nas
 raoui]
SUMMARY:Tutorial T8: Advances in Mining the Web
LOCATION:Miles Davis C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T140000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5039
DESCRIPTION:The proliferation of documents available on the Web and on co
 rporate intranets is driving a new wave of text mining research and appl
 ication. Earlier research addressed extraction of information from relat
 ively small collections of well-structured documents such as newswire or
  scientific publications. Text mining from other corpora such as the
  web requires new techniques drawn from data mining\, machine learning\,
  NLP and IR. Text mining requires preprocessing document collections (te
 xt categorization\, information extraction\, term extraction)\, storage 
 of the intermediate representations\, analysis of these intermediate rep
 resentations (distribution analysis\, clustering\, trend analysis\, asso
 ciation rules\, etc.)\, and visualization of the results. In this tutori
 al we will present the algorithms and methods used to build text mining 
 systems including pre-processing techniques\, supervised learning (e.g.
 \, CRF)\, entity resolution\, relationship extraction\, unsupervised lea
 rning and machine reading.\n\nThe tutorial will cover the state of the a
 rt in this rapidly growing area of research\, including recent advances 
 in unsupervised methods for extracting facts from text and methods used 
 for web-scale mining. We will also present several real world applicatio
 ns of text mining. Special emphasis will be given to lessons learned fro
 m years of experience in developing real world text mining systems\, inc
 luding how to handle informal texts such as blogs and user reviews and h
 ow to build scalable systems.\n\nThe instructors are Ronen Feldman and L
 yle Ungar. Ronen is an Associate Professor of Information Systems at the
  Business School of the Hebrew University in Jerusalem. He is the founde
 r of the ClearForest text mining corporation\, and the author of the boo
 k "The Text Mining Handbook" published by Cambridge University Press in 
 2007. Lyle is an Associate Professor of Computer and Information Science
  at the University of Pennsylvania. He recently returned from a sabbatica
 l at Google\, where he and a team built what is probably the world’s l
 argest named entity recognition system.
SUMMARY:Tutorial T9: Real World Text Mining
LOCATION:Louis Armstrong A & B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5040
DESCRIPTION:Bioinformatics is an application domain where information is 
 naturally represented in terms of relations between heterogeneous objects
 . Modern experimentation and data acquisition techniques allow the study
  of complex interactions in biological systems. This raises interesting 
 challenges because the amount of data is huge\, some information cannot 
 be observed\, and measurements may be noisy.\n\n* Using Random Forests t
 o uncover bivariate interactions in high dimensional small data sets. Jo
 rge M. Arevalillo and Hilario Navarro\n* Identification of structurally 
 important amino acids in proteins by graph-theoretic measures. Tammy M.K.
  Cheng\, Yu-En Lu and Pietro Liò\n* Lift-based search for significant 
 dependencies in dense data sets. Wilhelmiina Hamalainen\n* Finding Optim
 al Parameters for Edit-Distance Based Sequence Classification is NP-Hard
 . Vlado Keselj\, Haibin Liu\, Norbert Zeh\, Christian Blouin and Chris W
 hidden\n* Multi-Class Protein Fold Recognition using Large Margin Logic 
 based Divide and Conquer Learning. Huma Lodhi\, Stephen Muggleton and Mi
 ke J.E. Sternberg\n* Protein Sequence Alignment and Intrinsic Disorder: 
 A Substitution Matrix for an Extended Alphabet. Uros Midic\, A. Keith Du
 nker and Zoran Obradovic\n* Handling missing values and censored data in
  PCA of pharmacological matrices. Jan Ramon and Fabrizio Costa\n* Compar
 ing Graph-based Representations of Protein for Mining Purposes. Rabie Sa
 idi\, Mondher Maddouri and Engelbert M. Nguifo\n* Can we improve on the 
 identification of Transcription Factor Binding Sites? Hugh P. Shanahan
SUMMARY:W01 WORKSHOP: Statistical and Relational Learning and Mining in B
 ioinformatics (StReBio'09)
LOCATION:Saint Michel
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T140000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5041
DESCRIPTION:This workshop will discuss the results of the KDD Cup 2009. T
 he competition is organized around a large dataset provided by the Frenc
 h telecom company Orange. It is a problem of Customer Relationship Manag
 ement (CRM)\, a key element of modern marketing strategies. Orange offer
 ed the opportunity to work on a large marketing database to predict the 
 propensity of customers to switch provider (churn)\, buy new products or
  services (appetency)\, or buy upgrades or add-ons proposed to them to m
 ake the sale more profitable (up-selling).
SUMMARY:W10 WORKSHOP: KDD-Cup 2009: Fast Scoring on a Large Database (KDD
 cup09)
LOCATION:St Germain des Prés B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T140000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5042
DESCRIPTION:The First ACM SIGKDD International Workshop on Knowledge Disc
 overy from Uncertain Data (U'09) is to discuss in depth the challenges\,
  opportunities and techniques on the topic of analyzing and mining uncer
 tain data. The theme of this workshop is to make connections among the r
 esearch areas of probabilistic databases\, probabilistic reasoning\, and
  data mining\, as well as to build bridges among the aspects of models\,
  data\, applications\, novel mining tasks and effective solutions. By ma
 king connections among different communities\, we aim at understanding e
 ach other in terms of scientific foundation as well as commonality and d
 ifferences in research methodology.
SUMMARY:W11 WORKSHOP: The First ACM SIGKDD Workshop on Knowledge Discover
 y from Uncertain Data (U'09)
LOCATION:Les Invalides B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5043
DESCRIPTION:Wide-area sensor infrastructures\, remote sensors\, RFIDs\, a
 nd wireless sensor networks yield massive volumes of disparate\, dynamic
 \, and geographically distributed data. The Sensor-KDD 2009 workshop sol
 icits papers that describe innovative solutions in offline data mining a
 nd/or real-time analysis of sensor or streaming data. Position papers th
 at describe the challenges and requirements for sensor data based knowle
 dge discovery in high-priority application domains\, as well as relevant
  case studies\, are particularly encouraged.
SUMMARY:W02 WORKSHOP: The 3rd International Workshop on Knowledge Discove
 ry from Sensor Data (SensorKDD-2009)
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5044
DESCRIPTION:Computer supported communication and infrastructure are integ
 ral parts of modern economy. Their security is of incredible importance 
 to a wide variety of practical domains ranging from Internet service pro
 viders to the banking industry and e-commerce\, from corporate networks 
 to the intelligence community. Of interest to this workshop are novel kn
 owledge discovery methods addressing this field\, e.g. adaptive\, active
  or anticipatory approaches integrating new types of contents and protoc
 ols. Equally important are innovative applications demonstrating the eff
 ectiveness of data mining in solving real-world security problems.
SUMMARY:W03 WORKSHOP: Workshop on CyberSecurity and Intelligence Informat
 ics (CSI-KDD)
LOCATION:Miles Davis A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5045
DESCRIPTION:The goal of Visual Analytics is to derive insight from massiv
 e\, dynamic\, ambiguous\, and often conflicting data\; detect the expect
 ed and discover the unexpected\; provide timely\, defensible\, and under
 standable assessments\; and communicate the assessment effectively for a
 ction. The goal of this workshop is to raise the awareness of the KDD co
 mmunity for the importance of Visual Analytics and bring together resear
 chers from the underlying fields to bridge the gap between them - to writ
 e a KDD research roadmap on Visual Analytics.
SUMMARY:W04 WORKSHOP: Visual Analytics and Knowledge Discovery (VAKD '09)
LOCATION:Ella Fitzgerald
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5046
DESCRIPTION:Advertising\, especially online advertising\, is growing rapi
 dly and brings about large volumes of data along with challenging data m
 ining problems. Following on the success of ADKDD 2007 and 2008\, ADKDD 
 2009 is to be held in Paris\, France\, in conjunction with KDD 2009\, to
  provide a high-level international forum for the academic community and
  t
 he industry to present the state of the art of algorithms and applicatio
 ns of advertising.\n\nWe encourage papers that bring up and formalize ne
 w research problems in online advertising\, or propose novel data mining
  techniques for existing problems. We plan to cover (but are not restric
 ted to) the following areas: Mining for Ad Relevance and Ranking\; Audie
 nce Intelligence & User Modeling\; Content Understanding\; Search Engine
  Marketing\, Optimization (SEMs\, SEOs) and Other Topics in Advertising.
  Accepted papers will be archived in the ACM Digital Library\, and one o
 r two papers will be recommended to SIGKDD Explorations.\n
SUMMARY:W05 WORKSHOP: The Third International Workshop on Data Mining and
  Audience Intelligence for Advertising (ADKDD)
LOCATION:room tbd
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5047
DESCRIPTION:Social network research has come a long way since the notable
  “six degrees of separation” experiment. In recent years\, social
  network research has advanced significantly\, thanks to the prevalence
  of the
  online social websites and the availability of a variety of offline lar
 ge-scale social network systems such as collaboration networks. These so
 cial network systems are usually characterized by the complex network st
 ructures and rich accompanying contextual information. Researchers are i
 ncreasingly interested in addressing a wide range of challenges residing
  in these disparate social network systems\, including identifying commo
 n static topological properties and dynamic properties during the format
 ion and evolution of these social networks\, and how contextual informat
 ion can help in analyzing the pertaining social networks. These issues h
 ave important implications on community discovery\, anomaly detection\, 
 trend prediction and can enhance applications in multiple domains such a
 s information retrieval\, recommendation systems\, security and so on.\n
 \n\nThe third SNA-KDD '2009 aims to bring together practitioners and res
 earchers with a specific focus on the emerging trends and industry needs
  associated with the traditional Web\, the social Web\, and other forms 
 of social networking systems.  Both theoretical and experimental submiss
 ions are encouraged. The interesting topics include (1) data mining adva
 nces on the discovery and analysis of communities\, on personalization f
 or solitary activities (like search) and social activities (like discove
 ry of potential friends)\, on the analysis of user behavior in open fora
  (like conventional sites\, blogs and fora) and in commercial platforms 
 (like e-auctions) and on the associated security and privacy-preservatio
 n challenges\; (2) social network modeling\, scalable\, customizable soc
 ial network infrastructure construction\, dynamic growth and evolution p
 atterns identification and discovery using machine learning approaches o
 r multi-agent based simulation.
SUMMARY:W06 WORKSHOP: The 3rd Workshop on Social Network Mining and Analy
 sis (SNA-KDD)
LOCATION:Le Jardin du Luxembourg D & E
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T083000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5048
DESCRIPTION:Human computation is a new research area that studies the pro
 cess of channeling the vast internet population to perform tasks or prov
 ide data towards solving difficult problems that no known computer algor
 ithms can yet solve perfectly and efficiently\, e.g. digitize books\, re
 cognize objects in images and songs\, translate sentences\, summarize ne
 ws articles\, annotate videos etc. The goal of HCOMP 2009 is to bring to
 gether academic and industry researchers in a stimulating discussion of 
 existing human computation applications\, such as Games With A Purpose (
 e.g. the ESP game)\, Mechanical Turk and CAPTCHAs\, and future direction
 s of this new subject area.\n\nIncluded in the workshop are invited talk
 s\, presentations\, posters\, and a demo session where participants are 
 invited to showcase their human computation applications.
SUMMARY:W07 WORKSHOP: Human Computation Workshop (HCOMP 2009)
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5049
DESCRIPTION:This workshop will present recent advances in algorithms and 
 methods using matrix and scientific computing/applied mathematics for mo
 deling and analyzing massive\, high-dimensional\, and nonlinear-structur
 ed data. One main goal of the workshop is to bring together leading rese
 archers on many topic areas (e.g.\, computer scientists\, computational 
 and applied mathematicians) to assess the state-of-the-art\, share ideas
 \, and form collaborations. We also wish to attract practitioners who se
 ek novel ideas for applications.
SUMMARY:W08 WORKSHOP: Data Mining using Matrices and Tensors (DMMT'09)
LOCATION:St Germain des Prés B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5050
DESCRIPTION:The Data Mining Case Studies Workshop and Practice Prize was 
 established to recognize the very best data mining deployments for the y
 ear. Data Mining Case Studies will highlight data mining implementations
  that have been responsible for a significant and measurable improvement
  in business operations\, advanced scientific discoveries\, or provided 
 other benefits to humanity. The best paper will be awarded the Practice 
 Prize. Do you have an outstanding data mining application? This is a uni
 que opportunity to be recognized for your work.
SUMMARY:W09 WORKSHOP: Data Mining Case Studies and Practice Prize (DMCS #
 3)
LOCATION:Louis Armstrong C & D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T145000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T140000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5051
DESCRIPTION:Web logs record the primary interaction of users with web pag
 es in general and search engines in particular.  There are two sources f
 or such logs: user trails obtained from toolbars and query/click informa
 tion obtained from search engines.  In this talk we will address the tas
 k of mining  this rich data to improve user experience on the web.  We w
 ill illustrate a few applications\, together with the modeling and algor
 ithmic challenges that stem from these applications.  We will also discu
 ss the privacy issues  that arise in this context.
SUMMARY:Invited Talk - Mining Web Logs: Applications and Challenges
LOCATION:Le Jardin du Luxembourg A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T145000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T140000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5052
DESCRIPTION:NASA has some of the largest and most complex data sources in
  the world\, with data sources ranging from the earth sciences\, space s
 ciences\, and massive distributed engineering data sets from commercial 
 aircraft and spacecraft.  This talk will discuss some of the issues and 
 algorithms developed to analyze and discover patterns in these data sets
 .  We will also provide an overview of a large research program in Integ
 rated Vehicle Health Management.  The goal of this program is to develop
  advanced technologies to automatically detect\, diagnose\, predict\, an
 d mitigate adverse events during the flight of an aircraft.  A case stud
 y will be presented on a recent data mining analysis performed to suppor
 t the Flight Readiness Review of the Space Shuttle Mission STS-119.
SUMMARY:Invited Talk - Data Mining at NASA: from Theory to Applications
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T220000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T193000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5053
DESCRIPTION:This demo highlights our efforts in designing a prototype rec
 ommender system for use within IBM Switzerland. The goal of the system is
  to enable the sales and marketing teams to explore the ever-changing wea
 lth of information about existing and prospective customers\, and to fac
 ilitate the identification of previously undiscovered sales opportunitie
 s. The presented system achieves its goals first by enabling the effecti
 ve data aggregation from diverse sources (financial data\, RSS feeds\, h
 ardware and software install-base\, etc)\, and then by allowing the inte
 ractive exploration of the customer space. Data exploration is achieved 
 through the use of neighborhood graphs and bi-cluster formation techniqu
 es\, that offer both proximity and cluster navigation of the customer/pr
 oduct search space. The demo\, which is developed as a Web 2.0 platform\
 , offers to the users the capability to interact with state-of-the-art a
 nalytics for recommender systems\, and also highlights the significance 
 of exploratory data visualization techniques.
SUMMARY:Demo D01 - Exploratory Recommender Systems for Sales and Marketin
 g
LOCATION:room tbd
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T220000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T193000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5054
DESCRIPTION:There is an emerging focus on real-time data stream analysis 
 on mobile/ubiquitous devices. A wide range of data stream processing app
 lications are targeted to run on mobile handheld devices with limited co
 mputational capabilities such as patient monitoring\, driver monitoring\
 , providing real-time analysis and visualization for emergency calls\, o
 ptimization of logistics for courier pick-up and delivery etc. In this p
 aper\, we present the first generic toolkit for mobile data mining. The 
 Open Mobile Miner (OMM) toolkit is easy to use\, can be deployed on a ra
 nge of mobile devices\, is extensible and can be customized for applicat
 ion specific needs. A video of the system in operation for three differe
 nt settings is available at: http://www.csse.monash.edu.au/~shonali/OMM/
 OMMVideoDemo.asf and can be viewed using Windows Media Player.
SUMMARY:Demo D02 - Open Mobile Miner: A Toolkit for Mobile Data Stream Mi
 ning
LOCATION:room tbd
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T220000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T193000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5055
DESCRIPTION:Web spam\, which refers to any deliberate actions bringing to
  selected web pages an unjustifiable favorable relevance or importance\,
  is one of the major obstacles for high quality information retrieval on
  the web. Most of the existing web spam detection methods are supervised
  and require a large and representative training set of web pages. More
 over\, they often assume some global information such as a large web gra
 ph and snapshots of a large collection of web pages. However\, in many s
 ituations such assumptions may not hold. Recently\, we studied the probl
 em of online web spam detection\, and proposed the notion of spamicity t
 o measure how likely a page is a spam web page [9\, 7]. Spamicity is a m
 ore flexible and user-controllable measure than the traditional supervis
 ed classification methods. We developed efficient online link spam and te
 rm spam detection methods using spamicity. In this paper\, we present a 
 demonstration of OSD\, an Online Spam Detection system which can efficie
 ntly calculate a spamicity score online for any page on the web.
SUMMARY:Demo D03 - OSD: An Online Web Spam Detection System
LOCATION:room tbd
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T220000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T193000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5056
DESCRIPTION:This paper presents Visalix\, a Web-based interface aimed at 
 facilitating human-computer cooperation in complex data analysis tasks. 
 It implements an interactive visualization paradigm which assists users 
 in matching their domain knowledge with the algorithmic power of data an
 alysis and mining techniques. Visalix integrates a number of Visual Inte
 ractive Learning components for better understanding and easier interpret
 ation of complex datasets\, and for training prediction models.
SUMMARY:Demo D04 - Visalix: A Web Application for Visual Data Analysis an
 d Clustering
LOCATION:room tbd
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T220000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T193000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5057
DESCRIPTION:With the flourishing of various Web applications\, the Intern
 et has become one of the most important means to access news. According 
 to one investigation\, in the population of Internet users\, 78.5% are l
 ooking for news. Unfortunately\, although the Internet provides a platfo
 rm for easily sharing information\, it also brings a fast explosion of t
 he news data. It leads to the fact that people spend more and more time 
 to digest the data. Can we design new ways to help the users quickly und
 erstand and explore the news data? Information retrieval is one way\, bu
 t it is insufficient. In this paper\, we propose a flexible topic-driven
  framework\, namely NewsInsight\, for news exploration. This framework i
 nnovatively integrates a probabilistic topic model\, graphical data anal
 ysis\, and natural language processing. It performs news mining at the t
 opic level and presents news information with topics\, entities (e.g.\, 
 people\, organization\, and events)\, and relations derived from the new
 s data. Based on this framework\, we have developed a system which can h
 elp people to understand and explore news from multiple dimensions. The s
 ystem has been in trial operation at Xinhua News Agency\, one of the bigg
 est news publishers in China. Feedback from users shows that the
  system has achieved its primary objectives.
SUMMARY:Demo D05 - A Flexible Topic-driven Framework for News Exploration
LOCATION:room tbd
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T220000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T193000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5058
DESCRIPTION:This paper presents Model Monitor (M2)\, a Java toolkit for r
 obustly evaluating machine learning algorithms in the presence of changi
 ng data distributions. M2 provides a simple and intuitive framework in w
 hich users can evaluate classifiers under hypothesized shifts in distrib
 ution and therefore determine the best model (or models) for their data 
 under a number of potential scenarios. Additionally\, M2 is fully integr
 ated with the WEKA machine learning environment\, so that a variety of c
 ommodity classifiers can be used if desired.
SUMMARY:Demo D06 - Model Monitor: Tracking Model Performance in the Real 
 World
LOCATION:room tbd
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T220000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T193000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5059
DESCRIPTION:We present SHIFTR\, a system that assists users in making sen
 se of large scale graph data. Making sense of information represented as
  large graphs is a fundamental challenge in many data-intensive domains.
  We suggest the potential of strong synergies between the data mining\, 
 cognitive psychology\, and HCI communities in matching powerful graph mi
 ning tools with insights into how people learn and interact with informa
 tion\, and here we present SHIFTR as one such application. SHIFTR adapts
  the Belief Propagation algorithm to target important sensemaking tasks 
 such as flexibly reorganizing graph entities into multiple groups based 
 on both positive and negative examples. SHIFTR scales linearly with the 
 graph size through its fast algorithm\, novel mList data structure\, and
  externalization of graph meta data.\n\nWe demonstrate SHIFTR’s usage 
 and benefits through real-world sensemaking scenarios using the DBLP dat
 aset that has almost 2 million author-publication relationships.\nA demo
  video of SHIFTR can be downloaded at http://www.cs.cmu.edu/~dchau/shift
 r/shiftr.mov.
SUMMARY:Demo D07 - SHIFTR: A Fast and Scalable System for Ad Hoc Sensemak
 ing of Large Graphs
LOCATION:room tbd
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T220000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T193000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5060
DESCRIPTION:We demonstrate CSAW\, a system for Curating and Searching the
  Annotated Web. CSAW annotates named entities and quantities in Web-scal
 e text corpora\, and\, where confident\, connects these annotations with
  entries in an entity and type catalog such as Wikipedia. The semistruct
 ured catalog\, together with the unstructured corpus\, forms a composite
  database that CSAW can then search using powerful reachability\, proxim
 ity and aggregation primitives. Specifically\, we can look for snippets 
 with mentions of specific entities\, entities of a specified type\, quan
 tities with specified types or units\, find unions and intersections of
  snippet sets\, and then aggregate evidence from snippet sets into ranke
 d responses. Responses are not page URLs as in standard Web search\, but
  ranked tables where the cells can be entity references\, quantities\, o
 r token snippets. We will show a subset of CSAW’s capabilities\, and d
 escribe the beginnings of a next-generation Web search API that signific
 antly extends the capabilities of APIs provided by popular search engine
 s today.
SUMMARY:Demo D08 - Curating and Searching the Annotated Web
LOCATION:room tbd
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T220000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T193000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5061
DESCRIPTION:Expert finding\, aiming to answer the question: "Who are expe
 rts on topic X?"\, is becoming one of the biggest challenges for informa
 tion management. Much work has been conducted for expert finding. Method
 s based on language model\, topic model\, and random walk have been prop
 osed. However\, little work has studied why people want to find experts.
 \n\nIn this work\, we describe Expert2Bole\, a search tool that offers e
 xpert finding for various purposes. Specifically\, we first employ the l
 earning-to-rank techniques to learn a function for ranking experts. We f
 urther investigate a specific case of why people search for experts\, i.e.
  Bole search\, which tries to identify the best supervisors in a given f
 ield. How to learn a good ranking function for Bole search is a very chal
 lengi
 ng issue\, since there would be very limited or even no supervised infor
 mation which can be used to learn the ranking function. We propose a uni
 fied knowledge transfer approach which takes advantage of the expert fin
 ding knowledge to learn the ranking function for Bole search. A prototyp
 e system has been developed for expert finding and Bole search based on 
 the proposed approach. Experiment results show the effectiveness of the 
 proposed approach.
SUMMARY:Demo D09 - Expert2Bólè: From Expert Finding to Bólè Search
LOCATION:room tbd
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T220000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T193000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5062
DESCRIPTION:This demo presents Spam Miner\, an online system designed for
  real-time monitoring and characterization of spam traffic over the Inte
 rnet. Our system is based on high-level abstractions such as spam messag
 e attributes\, spam campaigns and spamming strategies. A campaign is a c
 luster of messages that are generated from a single message template\; c
 ampaign identification is a challenging problem because it has to handle
  spammer evolution\, while seeking a spam similarity function that combi
 nes different message characteristics and strategies that efficiently pr
 ocess large volumes of spam. Moreover\, spam campaigns need to be identif
 ied on-the-fly\, to allow incident response teams and security
  specialists to react to the threat adequately. Spam Miner addresses cam
 paign identification as a data clustering problem and campaigns are iden
 tified dynamically using a novel incremental approach based on the conce
 pt of Frequent Pattern Tree. Spam Miner is being used by NIC.br (Brazili
 an Network Information Center) and mined more than 350 million spam mess
 ages\, detecting meaningful clusters and patterns\, and helping the orga
 nization to better understand the spam problem in Brazil and how the Bra
 zilian Internet infrastructure is being abused by spammers.\n
SUMMARY:Demo D10 - Spam Miner: A Platform for Detecting and Characterizin
 g Spam Campaigns
LOCATION:room tbd
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T181500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T180000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5066
DESCRIPTION:
SUMMARY:Opening Remarks
LOCATION:La Seine
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T184500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T181500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5067
DESCRIPTION:Chair: Gregory Piatetsky-Shapiro\n* Best Paper Awards\, Wei W
 ang\n* Student Travel Awards\, Jure Leskovec\n* SIGKDD Dissertation Awar
 ds\, Bamshad Mobasher\n* Best Data Mining Case Study award from the DM
 CS workshop\, Gabor Melli\n* KDD Cup Winners\, David Vogel & Isabelle Gu
 yon\n* SIGKDD Service and Innovation Awards\, Robert Grossman\n
SUMMARY:Award Presentations
LOCATION:La Seine
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T190000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T184500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5068
DESCRIPTION:Dynamics of Large Networks 
SUMMARY:Dissertation Award Talk
LOCATION:La Seine
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T193000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T190000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5069
DESCRIPTION:
SUMMARY:Innovation Award Winner Talk
LOCATION:La Seine
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T140000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T140000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5070
DESCRIPTION:
SUMMARY:S08: Session Chair
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T220000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T193000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5071
DESCRIPTION:This session will include posters for papers that were presen
 ted earlier on Monday\, and that are scheduled to be presented on Wednes
 day.
SUMMARY:Poster Session #1
LOCATION:Hôtel de Ville
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T220000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T193000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5072
DESCRIPTION:This session will include posters for papers that were presen
 ted earlier in the day (Tuesday).
SUMMARY:Poster Session #2
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T140000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T123000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5073
DESCRIPTION:
SUMMARY:Lunch
LOCATION:Foyer Rives de Seine
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T140000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T120000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5074
DESCRIPTION:* Progress Report by Gregory Piatetsky-Shapiro \n* Vision for
  the Future by new SIGKDD Chair
SUMMARY:Lunch and SIGKDD business meeting
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T181500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T173000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5075
DESCRIPTION:
SUMMARY:KDD Transfer Meeting (SIGKDD organizers only)
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T090500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5076
DESCRIPTION:
SUMMARY:Plenary Keynote #2 - Session Chair
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T090500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090701T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5078
DESCRIPTION:
SUMMARY:Plenary Keynote #3 - Session Chair
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T180000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T173000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5137
DESCRIPTION:The speaker will present the Europe Media Monitor (EMM) famil
 y of applications developed at the Joint Research Centre (JRC) of the Eu
 ropean Commission. EMM consists of four publicly accessible news analysi
 s systems (see http://press.jrc.it/overview.html):\n(1)  NewsBrief – 
  presents the current state of affairs and detects sudden changes in rea
 l time\;\n(2)  MedISys –  is the Medical Information System focusing 
 specifically on health-related news\;\n(3)  NewsExplorer – allows users
  to navigate news over time and across languages\; also gathers informat
 ion about people and organisations from multilingual news in the course
  of t
 ime.\n(4)  EMM-Labs –  gives access to various data visualisation and
  advanced text processing tools.\n\nEMM collects more than 80\,000 artic
 les per day from about 2\,200 online news sources (e.g. BBC\, Le Monde) 
 in 43 different languages\, including non-Latin character set languages 
 such as Chinese\, Arabic and Russian.  EMM applications employ robust an
 d efficient techniques using statistics and Language Technology to clust
 er news articles into major news stories\, monitor the development of a 
 story over time and across languages\, extract information about entitie
 s (locations\, persons and organisation) covered in the media\, and more
 . The major objective of EMM is to serve the needs of users in the Europ
 ean Commission and in European Union Member State institutions. However\
 , the service is freely accessible online so that a wide range of other 
 users benefit from the applications: EMM web sites get between one and t
 wo million hits per day (approximately 30\,000 visitors per day). Additi
 onally\, many users are subscribed to email notifications. For more tech
 nical details and related research publications\, see http://langtech.jr
 c.it.\n\nMijail Kabadjov works at the European Commission’s  Joint Res
 earch Centre (JRC) in Ispra (Italy)\, in the field of multilingual text 
 summarisation and text mining. He joined the JRC from the School of Info
 rmatics of the University of Edinburgh (UK) where he spent two years wit
 h the Language Technology Group developing text mining applications for 
 the biomedical and recruitment domains. He holds a Ph.D. in Computer Sci
 ence from the University of Essex (UK)\, defended with a thesis on gener
 al-purpose anaphora resolution. Before commencing the Ph.D.\, he worked 
 on various projects in industry\, ranging from fraud detection in credit
  card transactions to customer service optimisation of a large manufactu
 ring firm.
SUMMARY:The Europe Media Monitor family of news analysis applications
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T183000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T180000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5138
DESCRIPTION:Organizations can use data mining and advanced analytics to d
 ramatically improve their bottom line in three basic ways\, by (1) strea
 mlining a process\, (2) eliminating the bad\, or (3) improving the good.
   But modern organizations are so effective at their core tasks that dat
 a mining usually results in an iterative\, rather than transformative\, 
 improvement.  This talk will discuss the traits and culture within a bus
 iness unit that tend to turn data mining technical successes into busine
 ss successes\, by highlighting some projects for some of America's most 
 innovative agencies and corporations.\n\nAntonia de Medinaceli is a Seni
 or Business Analyst at Elder Research\, where she has been involved in a
 ll aspects of the data mining process over the past decade. Antonia has 
 applied data mining technologies to a wide range of projects\, including
  direct marketing\, crime pattern analysis\, credit scoring\, and fraud 
 detection. Her consulting experience is both domestic and international\
 , and she is fluent in her native French and proficient in Spanish.  Ant
 onia is experienced with most of the leading statistical software packag
 es\, and has taught data mining short courses both on concepts and speci
 fic software.  She has degrees in Computer Science and Systems Engineeri
 ng from the University of Virginia.
SUMMARY:Organizational Traits leading to High ROI for Data Mining
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T190000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090629T183000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5139
DESCRIPTION:On-line marketing spend is rapidly growing and relatively poo
 rly understood when compared to many of the traditional marketing channe
 ls. On-line marketing has characteristics that are similar to both indir
 ect (e.g. television\, radio) and direct (e.g. catalog) marketing. The a
 bility to track Web visits through “cookies” all the way through a p
 urchase is analogous to a catalog marketer’s use of source codes to tr
 ace orders back to specific catalog versions. However\, the concept of a
 d impressions that are viewed by an untracked audience is similar to tel
 evision audiences viewing a commercial. The unique nature of on-line mar
 keting and the available data sources leads to a need for unique applica
 tions for managing and optimizing the spend. This talk will present the 
 results from three live tests of the use of data mining and constraint-b
 ased optimization to improve the business results for paid search\, one 
 of the major areas of online marketing. The case studies will cover a la
 rge US retail bank\, a large US e-commerce site\, and a US real estate l
 ead aggregator.\n\nRobert Cooley holds a PhD in Computer Science from th
 e University of Minnesota. Currently\, Dr. Cooley is the Chief Technolog
 y Officer for OptiMine Software\, which specializes in analytic applicat
 ions for on-line marketing. Prior to OptiMine\, Dr. Cooley was a VP of T
 echnical Operations for KXEN Inc.\, a data mining software company. Dr. 
 Cooley is known for his groundbreaking work in Web Mining and has over a
  decade of experience applying data mining to business problems. He has 
 published numerous papers on the topics of CRM Mining\, Web Usage Mining
 \, and Text Mining.
SUMMARY:Improving Online Marketing Performance through Data Mining and Op
 timization
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T180000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T173000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5140
DESCRIPTION:Data mining is expanding its scope towards the combination of
  hard factual measured data with soft declared surveyed data. This new t
 rend enables us to understand causality of events and enables the taggin
 g of qualitative segmentations in the customer base. However\, both act
 ivities still have a lot of challenges\, mainly due to the discrepancy b
 etween what people say and what they actually do. Perception versus Real
 ity.\n
 \nAlain joined the France Telecom/Orange Group in Jan '04 as Director of
  Customer Insight. Prior to that he worked 5 years at Mobistar\, an affi
 liate of France Telecom in Belgium\, after being 5 years in Belgacom\, t
 he fixed incumbent operator in Belgium. He specialises in Segmentation w
 ith its customer base 'tagging'\, and in Social Network Analysis. His we
 ak point is his relationship with his shaver.
SUMMARY:People aren't always doing what they are saying. Perception vers
 us Reality.
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T183000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T180000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5141
DESCRIPTION:The telecommunications industry has undergone tremendous chan
 ges in recent years. Advances in increased bandwidth capabilities and fu
 nctionality of mobile devices\, along with other factors have resulted i
 n more sophisticated customers\, highly competitive global markets\, new
  “players”\, and the emergence of new data-intensive services. All o
 f these changes are leading to a search for new business models\, partne
 rships\, and open frameworks\, which constitute a departure from the way
  the telecommunications industry had “traditionally” functioned. The
 se factors combined with a tremendous increase in data (not just in traf
 fic\, but also in what is available to and generated by consumers in new
  services) are creating huge opportunities for KDD and tremendous challe
 nges\, not just from a business intelligence perspective\, but also for 
 user modeling applied to new business models and services. In this talk 
 I will discuss how this landscape has evolved and its impact on KDD task
 s\, giving specific examples\, describing technical challenges and point
 ing to the links from research to business applications in current and f
 uture settings. \n\nAlejandro Jaimes is Senior Research Scientist at Tel
 efónica Research in Madrid where he works closely with engineering tea
 ms on applied research solutions on data mining and user modeling from
  a human-centered perspective. Dr. Jaimes obtained his Ph.D. from Columb
 ia University in 2003 and has worked for IBM (USA\, Japan)\, Fuji Xerox 
 (Japan)\, Siemens (USA)\, AT&T Bell Labs (USA)\, and IDIAP Research Inst
 itute-EPFL (Switzerland). He holds several patents and has given numerou
 s invited talks and participated in panels at several international conf
 erences.
SUMMARY:The Telecom Revolution: Where does KDD go from here?
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T190000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090630T183000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5142
DESCRIPTION:Telco operators are now realizing the tremendous value laced 
 in the relationships among their subscribers as revealed by their callin
 g pattern. Social Network Analysis (SNA) goes beyond calling circle anal
 ysis by also analyzing how circle members communicate among themsel
 ves. These network patterns reveal who is influential and who experience
 s pressure from other subscribers. This information is critical in deter
 mining which subscribers are worth more marketing investment due to thei
 r potential influence on other subscribers to adopt new products. Also\,
  it reveals who is suddenly more vulnerable to churn even though nothing
  changed in their behavior! In this talk we will discuss the notions
  of SNA\, a case study based on Rogers Wireless in Canada\, and how the 
 new KXEN module KSN enabled this work.\n\nDr. Edouard Servan-Schreiber i
 s Assistant Director of Advanced Analytics for Europe\, Middle East and 
 Africa. His specialty is to help businesses extract value from their dat
 a and ensure that the sophisticated techniques of automated learning are
  serving business needs. Edouard has worked across industries and market
 s within the EMEA region. Among the topics Edouard has actively worked o
 n: clickstream data for customer affinity\, mobile marketing\, pricing o
 ptimization\, early warning in manufacturing reliability\, text mining\,
  and social network analysis. Edouard began practicing artificial intell
 igence and statistical learning models at Carnegie Mellon University for
  his bachelor’s degree\, before going to UC Berkeley for his PhD in Co
 mputer Science. After returning to his native France\, Edouard co-founde
 d newsfutures.com\, a technology startup offering prediction market tech
 nology.
SUMMARY:Social network analysis for telco operators. 
LOCATION:Auditorium
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5144
DESCRIPTION:In 2001 the Internal Revenue Service (IRS) estimated the tax 
 gap\, i.e. the gap between revenue owed and revenue collected\, to be ap
 proximately $345 billion\, of which they were able to recover only $55 b
 illion. It is critical for the government to reduce the tax gap and an i
 mportant process for doing so is audit selection. In this paper\, we pre
 sent a case study where data mining based methods are used to improve th
 e audit selection procedure at the Minnesota Department of Revenue. We d
 escribe the current tax audit selection process\, discuss the data from 
 various sources as well as the issues regarding feature selection\, and 
 explain the data mining techniques employed. On evaluation data\, data m
 ining methods showed an increase of 63.1% in efficiency. We also present
  results from actual field experiments (i.e. results of field audits per
 formed by auditors at the Minnesota Department of Revenue) validating th
 e effectiveness of data mining for audit selection. The impact of this s
 tudy will be a refinement of the current audit selection and tax collect
 ion procedures.
SUMMARY:W09> Data Mining Based Tax Audit Selection: A Case Study from Min
 nesota Department of Revenue
LOCATION:Louis Armstrong C & D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5145
DESCRIPTION:Knowledge Discovery and Data Mining techniques are now commo
 nly used to find novel\, potentially useful\, patterns in data. Most KDD
  applications involve post-hoc analysis of data and are therefore mostly
  limited to the identification of correlations. Recent seminal work on Q
 uasi-Experimental Designs (Jensen\, et al.\, 2008) attempts to identify 
 causal relationships. Controlled experiments are a standard technique us
 ed in multiple fields. Through randomization and proper design\, experim
 ents allow establishing causality scientifically\, which is why they are
  the gold standard in drug tests. In software development\, multiple tec
 hniques are used to define product requirements\; controlled experiments
  provide a way to assess the impact of new features on customer behavior
 . The Data Mining Case Studies workshop calls for describing completed i
 mplementations related to data mining. Over the last three years\, we bu
 ilt an experimentation platform system (ExP) at Microsoft\, capable of r
 unning and analyzing controlled experiments on web sites and services. T
 he goal was to accelerate innovation through trustworthy experimentation
  and to enable a more scientific approach to planning and prioritization
  of features and designs (Foley\, 2008). Along the way\, we ran many exp
 eriments on over a dozen Microsoft properties and had to tackle both tec
 hnical and cultural challenges. We previously surveyed the literature on
  controlled experiments and shared technical challenges (Kohavi\, et al.
 \, 2009). This paper focuses on problems not commonly addressed in techn
 ical papers: cultural challenges\, lessons\, and the ROI of running cont
 rolled experiments.
SUMMARY:W09> Online Experimentation at Microsoft
LOCATION:Louis Armstrong C & D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5146
DESCRIPTION:Original Equipment Manufacturer companies (OEMs) are facing
  more and more the challenge to increase the efficiency and reduce the co
 st for the service of their equipment over their lifecycle. A predictive
  maintenance strategy\, where the optimal time to schedule a service vis
 it is forecasted based on the condition of the equipment\, is often prop
 osed as an answer to this challenge. However\, predictive maintenance ap
 proaches are frequently hampered. First\, by the lack of knowledge of th
 e features that give a good indication of the condition of the equipmen
 t. Second\, by the processing power needed to predict the future evoluti
 on of the features\, which in most cases is not available from the machi
 ne’s processor.\n\nTo overcome these problems\, we propose in this pap
 er to combine two approaches that are currently used separately: data mi
 ning and prognostics.\n\nThe proposed method consists of two steps. Firs
 t\, data mining and reliability estimation techniques are applied to his
 torical data from the field in order to optimally identify the relevant 
 features for the condition of the equipment and the associated threshold
 s. Secondly\, a prediction model is fitted to the live data of the equip
 ment\, collected from customer’s premises\, for predicting the future 
 evolution of these features and forecasting the time interval to the nex
 t maintenance action. To overcome the limited processing power of the ma
 chine’s processor\, this prediction part is computed on a local server
  which is remotely connected to the machine.\n\nThe proposed method prov
 ed always to retrieve\, from the datasets\, the relevant feature to be f
 orecasted. Validation has been done for two different industrial cases.\
 n\nA first prototype of the predictive module is implemented in some cop
 iers and is running in live conditions\, since November 2008\, in order 
 to check the forecast robustness. First results showed that this module 
 offers a very good indication on when part replacement would be required
 \, with some level of uncertainty which decreases over time.\n\nCalculat
 ed business cases showed that this module will be highly beneficial for 
 the company. Savings of approximately €4\,8 million/year worldwide are
  estimated. This estimate was mainly calculated by reducing labour in re
 active service visits and stock costs.
SUMMARY:W09> A Practical Approach to Combine Data Mining and Prognostics 
 for Improved Predictive Maintenance
LOCATION:Louis Armstrong C & D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5147
DESCRIPTION:Advances in medical imaging technology have resulted in a tre
 mendous increase in information density for a given study. This may resu
 lt from increased spatial resolution\, facilitating greater anatomical d
 etail\, or increased contrast resolution allowing evaluation of more sub
 tle structures than previously possible. An increased temporal image acq
 uisition rate also increases the study information content. Finally\, ne
 w technologies enable visualization or quantification of additional tiss
 ue properties or contrast mechanisms.\n\nHowever\, such technological ad
 vances\, while potentially improving the diagnostic benefits of a study\,
  typically result in “data overload” overwhelming the ability of radi
 ologists to process this information. This often manifests as increased 
 total study time\, defined as the combination of acquisition\, processin
 g and interpretation times\; even more critically\, the vast increase in
  data does not always translate to improved diagnosis/treatment selectio
 n. This paper describes a related series of clinically motivated data mi
 ning algorithms & products that extract the key\, actionable information
  from the vast amount of imaging data in order to ensure an improvement 
 in patient care (via more accurate/early diagnosis) and a simultaneous r
 eduction in total study time.\n\nIn addition\, these applications yield 
 decreased inter-user variability and more accurate quantitative image-ba
 sed measurements. While each application targets a specific clinical tas
 k\, they share the common methodology of transforming raw imaging data\,
  through knowledge-based data mining algorithms\, into clinically releva
 nt information. This enables users to spend less time interacting with a
 n image volume to extract the clinical information it contains\, while s
 upporting improved diagnostic accuracy by reducing the risk of accidenta
 l oversight of critical information.
SUMMARY:W09> Mining Medical Images
LOCATION:Louis Armstrong C & D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5148
DESCRIPTION:The benefit of data mining for police seems tremendous\, yet 
 only a few limited applications are documented. This paper starts with d
 escribing the implementation problems of police data mining and introduc
 es a new approach that tries to overcome these problems in the form of a
  data mining system with associative memory as the main technique. This 
 technique makes the system easier to use\, allows uncomplicated data han
 dling and supports many different data types. Consequently\, data prepar
 ation becomes easier and results contain more information. A number of D
 utch police forces have already been using this system for several years
  with over 30 users. Since the analytical process within the police is v
 ery knowledge-intensive\, a high level of domain expertise is essential\
 , which makes it harder to find a police data miner with sufficient doma
 in knowledge plus technical skills in the area of databases\, statistics
  and data mining. The police domain also has data quality issues and a v
 ery diverse information need. This is why the system design tries to red
 uce the need for technical skills as much as possible by working with on
 e standard data warehouse\, techniques that can be configured automatical
 ly and active user guidance. The ease of use is also ensured by integrat
 ing many tools and techniques from statistics\, business intelligence an
 d data mining into one interactive environment that does not require the
  analytical process to be designed beforehand. Instead\, the analysis is
  performed through step by step interaction. This paper discusses the be
 nefit of police data mining\, the design of the system\, a number of pra
 ctical applications\, best practices and success stories. Experiments ha
 ve shown a factor 20 efficiency gain\, a factor 2 prediction accuracy in
 crease\, a 15% drop in crime rate\, and 50% more suspect recognition.
SUMMARY:W09> Data mining for intelligence led policing
LOCATION:Louis Armstrong C & D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5149
DESCRIPTION:Health care research has led to the development and availabi
 lity of safe and effective drugs to treat debilitating chronic diseases.
  The impact of this advancement has been the ever increasing costs to ob
 tain these medications. Prime Therapeutics\, a pharmacy benefit manager 
 (PBM)\, strives to ameliorate some of the financial burden through the d
 elivery of medications via mail order pharmacy. Mail order pharmacy prov
 ides cost savings to both the insurer and their members\, in addition to
  improving overall customer satisfaction through a convenient home deliv
 ery system. The objective of this project was to apply various Data Mini
 ng techniques\, like Classification\, Clustering and Association analysi
 s\, to member profile and health care claims data\, to identify the link
 s between member characteristics and their mail order behavior. Identify
 ing the individual characteristics influencing mail order acceptance beh
 avior will help Prime Therapeutics in creating target marketing programs
  to improve mail order utilization.
SUMMARY:W09> Identification of Independent Individualistic Predictors of 
 Mail Order Pharmacy Prescription Utilization of Healthcare Claims
LOCATION:Louis Armstrong C & D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5150
DESCRIPTION:We describe the problem of credit risk evaluation of online p
 ersonal loan applicants. Credit risk scoring is implemented within the d
 ata mining universe\, using the stochastic gradient boosting algorithm. 
 Discussion is concentrated around the specificity of the data and proble
 m\, the selection of an appropriate modeling method\, determining driver
 s of the probability of being a good customer\, and estimation of the im
 pact of different predictors on this probability. The synergy of data mi
 ning and spatial techniques is useful for this type of problem.
SUMMARY:W09> Data Mining Approach to Credit Risk Evaluation of Online Per
 sonal Loan Applicants
LOCATION:Louis Armstrong C & D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5151
DESCRIPTION:Nowadays there is a significant amount of data mining work pe
 rformed outside the DBMS. This article discusses recommendations to push
  data mining analysis into the DBMS paying attention to data preprocessi
 ng (i.e. data cleaning\, summarization and transformation)\, which tends
  to be the most time-consuming task in data mining projects. We present 
 a discussion of practical issues and common solutions when transforming 
 and preparing data sets with the SQL language for data mining purposes\,
  based on experience from real-life projects. We then discuss general gu
 idelines to create variables (features) for analysis. We introduce a sim
 ple prototype tool that translates statistical language programs into SQ
 L\, focusing on data manipulation statements. Based on experience from s
 uccessful projects\, we present actual time performance comparisons runn
 ing SQL code inside the DBMS and outside running programs on a statistic
 al package\, obtained from data mining projects in a store\, a bank and 
 a phone company. We highlight which steps in data mining projects are mu
 ch faster in the DBMS\, compared to external servers or workstations. We
  discuss advantages\, disadvantages and concerns from a practical standp
 oint based on user feedback. This article should be useful for data min
 ing practitioners.
SUMMARY:W09> Migration of Data Mining Pre-processing into the DBMS
LOCATION:Louis Armstrong C & D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5152
DESCRIPTION:This study outlines a method of determining individual custom
 er potential\, based solely on data present in the customer database: de
 scriptive information and transaction records. We define potential as th
 e incremental turnover that any particular company could do with their p
 resent customers. In order to successfully calculate this potential in a
  large database with multiple variables\, we propose grouping together cu
 stomers who “look like each other” (known as clones)\, by means of a
 n appropriate clustering technique: Kohonen Networks. This method is app
 lied to actual data sets\, and various techniques are employed to check 
 the stability of the clusters obtained. Real potential is then determine
 d by means of an empirical approach: practical application to a major Fr
 ench retailer’s database of 5 million customers.
SUMMARY:W09> Estimating Potential Customer Value using a classification t
 echnique to determine customer value
LOCATION:Louis Armstrong C & D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5153
DESCRIPTION:Click Fraud is a challenging problem which some have called t
 he “Achilles Heel” of online advertising. Publishers sign up with Ad
  networks to display ads on their web pages. The Publishers receive a pa
 yout for clicks on these ads. Unfortunately this creates an incentive fo
 r the publisher to generate artificial clicks on ads and essentially pri
 nt their own money. In order to maximize scam effectiveness\, Publishers
  can employ sophisticated methods to cloak their attacks\, including the
  use of distributed networks\, hijacked browsers\, and click fraud softw
 are designed to mimic humans. In this paper we will describe some of the
  attack vectors ranging from malware to “click fraud penetrators”. W
 e will also describe the large-scale data mining technologies employed t
 o detect these programs. We conclude with some reflections on the advers
 arial nature of the field and some strategies for disrupting attacker ev
 olution.
SUMMARY:W09> Clickfraud Bot Detection: High Stakes Adversarial Signal Det
 ection
LOCATION:Louis Armstrong C & D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5154
DESCRIPTION:This paper presents a new method to analyze the link between 
 the probabilities produced by a classification model and the variation o
 f its input values. The goal is to increase the predictive probability o
 f a given class by exploring the possible values of the input variables 
 taken independently. The proposed method is presented in a general frame
 work\, and then detailed for naive Bayesian classifiers. We also demonst
 rate the importance of "lever variables"\, variables which can conceivab
 ly be acted upon to obtain specific results as represented by class prob
 abilities\, and consequently can be the target of specific policies. The
  application of the proposed method to several data sets (data proposed 
 in the PAKDD 2007 challenge and in the KDD Cup 2009) shows that such an 
 approach can lead to useful indicators.
SUMMARY:W09> Correlation Explorations in a Classification Model
LOCATION:Louis Armstrong C & D
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5164
DESCRIPTION:In this paper we consider an application of data mining techn
 ology to the analysis of time series data from a pilot circulating fluid
 ized bed (CFB) reactor. We focus on the problem of the online mass predi
 ction in CFB boilers. We present a framework based on switching regressi
 on models depending on perceived changes in the data. We analyze three a
 lternatives for change detection. Additionally\, noise canceling\, state
  determination\, and windowing mechanisms are used for improving the
  robustness of online prediction. We validate our ideas on real data col
 lected from the pilot CFB boiler.
SUMMARY:W02> Handling Outliers and Concept Drift in Online Mass Flow Pred
 iction in CFB Boilers
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5165
DESCRIPTION:It is a consensus among earth scientists that climate change 
 will result in an increased frequency of extreme events (e.g.\, precipit
 ation\, snow). Streamflow forecasts and flood/drought analyses\, given t
 his high variability in the climatic driver (snowpack)\, are vital in th
 e western United States. However\, the ability to produce accurate forec
 asts and analyses is dependent upon the quality (accuracy) of these pred
 ictors (snowpack). Current snowpack datasets are based upon in-situ tele
 metry. Recent satellite deployments offer an alternative remote sensing 
 data source of snowpack. The proposed research will investigate (compare
 ) remote sensing datasets in western U.S. watersheds in which snowpack i
 s the primary driver of streamflow. A comparison is made between snow wa
 ter equivalent (SWE) data from in-situ snowpack telemetry (SNOTEL) sites
  and the advanced microwave scanning radiometer – earth observing syst
 em (AMSR-E) aboard NASA’s Aqua satellite. Principal component techniqu
 es and Singular Value Decomposition are applied to determine similaritie
 s and differences between the datasets and investigate regional snowpack
  behaviors. Given the challenges (including costs\, operation and mainte
 nance) of deploying SNOTEL stations\, the objective of the research is t
 o determine if satellite based remote sensed SWE data provide a comparab
 le option to in-situ datasets. Watersheds investigated include the North
  Platte River\, the Upper Green River\, and the Upper Colorado River. Th
 e time period analyzed is 2003-2008\, due to the recent deployment of th
 e NASA Aqua satellite. Two distinct snow regions were found to behave si
 milarly between both datasets using principal component analysis. Singul
 ar Value Decomposition linked both data products with streamflow in the 
 region and found similar behaviors among datasets. However\, only 11 of 
 the 84 SNOTEL sites investigated correlated at a significance of 90% or 
 greater with its corresponding AMSR-E cell. Also\, when comparing SNOTEL
  data with the corresponding satellite cell\, there was a consistent dif
 ference in the magnitude (Snow Water Equivalent) of the datasets. Finall
 y\, both datasets were utilized and compared in a statistically based st
 reamflow forecast of several gages.
SUMMARY:W02> A Comparison of SNOTEL and AMSR-E Snow Water Equivalent Data
 sets in Western U.S. Watersheds
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5166
DESCRIPTION:To discover patterns in historical data\, climate scientists 
 have applied various clustering methods with the goal of identifying reg
 ions that share some common climatological behavior. However\, past appr
 oaches are limited by the fact that they either consider only a single t
 ime period (snapshot) of multivariate data\, or they consider only a sin
 gle variable by using the time series data as multi-dimensional feature 
 vector. In both cases\, potentially useful information may be lost. More
 over\, clusters in high-dimensional data space can be difficult to inter
 pret\, prompting the need for a more effective data representation. We a
 ddress both of these issues by employing a complex network (graph) to re
 present climate data\, a more intuitive model that can be used for analy
 sis while also having a direct mapping to the physical world for interpr
 etation. A cross correlation function is used to weight network edges\, 
 thus respecting the temporal nature of the data\, and a community detect
 ion algorithm identifies multivariate clusters. Examining networks for c
 onsecutive periods allows us to study structural changes over time. We s
 how that communities have a climatological interpretation and that distu
 rbances in structure can be an indicator of climate events (or lack ther
 eof). Finally\, we discuss how this model can be applied for the discove
 ry of more complex concepts such as unknown teleconnections or the devel
 opment of multivariate climate indices and predictive insights.
SUMMARY:W02> An Exploration of Climate Data Using Complex Networks 
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5167
DESCRIPTION:Clustering is an established data mining technique for groupi
 ng objects based on similarity. For sensor networks one aims at grouping
  sensor measurements in groups of similar measurements. As sensor networ
 ks have limited resources in terms of available memory and energy\, a ma
 jor task in sensor clustering is efficient computation on sensor nodes.
  As a dominating energy consuming task\, communication has to be reduced
  for better energy efficiency. Considering memory\, one has to reduce the
  amount of stored information on each sensor node. For in-network cluster
 ing\, k-center based approaches provide k representatives out of the col
 lected sensor measurements. We propose EDISKCO\, an outlier aware increm
 ental method for efficient detection of k-center clusters. Our novel app
 roach is energy aware and reduces required transmissions while
  producing high quality clustering results. In thorough experiments on s
 ynthetic and real world data sets\, we show that our approach outperform
 s a competing technique in both clustering quality and energy efficiency
 . Thus\, we achieve overall significantly better life times of our senso
 r networks.
SUMMARY:W02> EDISKCO: Energy Efficient Distributed in-Sensor-Network K-ce
 nter Clustering with Outliers
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5168
DESCRIPTION:Monitoring biomass over large geographic regions for seasonal
  changes in vegetation and crop phenology is important for many applicat
 ions. In this paper we present a novel clustering based change detecti
 on method using MODIS NDVI time series data. We used well known EM techni
 que to find GMM parameters and Bayesian Information Criterion (BIC) for d
 etermining the number of clusters. KL Divergence measure is then used to
  establish the cluster correspondence across two years (2001 and 2006) t
 o determine changes between these two years. The changes identified were
  further analyzed for understanding phenological events. This preliminary
  study shows interesting relationships between key phenological events s
 uch as onset\, length\, and end of growing seasons.
SUMMARY:W02> Phenological Event Detection from Multitemporal Image Data
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5169
DESCRIPTION:Distributed PRocessing in Mobile Environments (DPRiME) is a f
 ramework for processing large data sets across an ad-hoc network. Develo
 ped to address the shortcomings of Google’s MapReduce outside of a full
 y-connected network\, DPRiME separates nodes on the network into a maste
 r and workers\; the master distributes sections of the data to available
  one-hop workers to process in parallel. Upon returning results to its m
 aster\, a worker is assigned an unfinished task. Five data mining classi
 fiers were implemented to process the data: decision trees\, k-means\, k
 -nearest neighbor\, Naive Bayes\, and artificial neural networks. Ensemb
 les were used so the classification tasks could be performed in parallel
 . This framework is well-suited for many tasks because it handles commun
 ications\, node movement\, node failure\, packet loss\, data partitionin
 g\, and result collection automatically. Therefore\, DPRiME allows users
  with little knowledge of networking or distributed systems to harness t
 he processing power of an entire network of single- and multi-hop nodes.
SUMMARY:W02> Mining in a Mobile Environment
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5170
DESCRIPTION:Intra-seasonal changes in the Indian summer monsoon are gener
 ally characterized by its active and break (A&B) states. Existing method
 s for identifying the A&B states using rainfall data rely on subjective 
 thresholds\, ignore temporal dependence in the data\, and disregard inhe
 rent uncertainty in their identification. This paper develops a method to
  identify intra-seasonal changes in the monsoon using a hidden Markov mo
 del (HMM) that allows objective classification of the monsoon states. Th
 e method facilitates probabilistic interpretation which is especially us
 eful during the transition period between the two monsoon states. The de
 veloped method can also be used to (i) identify monsoon states in real
  time\, (ii) forecast rainfall values\, and (iii) generate synthetic dat
 a. Comparisons of the results from the proposed model with those from ex
 isting methods suggest that the new method is promising for detecting
  intra-seasonal changes in the Indian summer monsoon.
SUMMARY:W02> On the Identification of Intra-seasonal Changes in the India
 n Summer Monsoon
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5171
DESCRIPTION:In many remote sensing applications it is important to use mu
 ltiple sensors to be able to understand the major spatiotemporal distrib
 ution patterns of an observed phenomenon. A particular remote sensing ap
 plication addressed in this study is estimation of an important property
  of atmosphere\, called Aerosol Optical Depth (AOD). Remote sensing data
  for AOD estimation are collected from ground and satellite-based sensor
 s. Satellite based measurements can be used as attributes for estimation
  of AOD and in this way could lead to better understanding of spatiotemp
 oral aerosol patterns on a global scale. Ground-based AOD estimation is 
 more accurate and is traditionally used as ground-truth information in va
 lidation of satellite-based AOD estimations. In contrast to this traditi
 onal role of ground-based sensors\, a data mining approach allows more a
 ctive use of ground-based measurements as labels in supervised learning 
 of a regression model for AOD estimation from satellite measurements. Co
 nsidering the high operational costs of ground-based sensors\, we are st
 udying a budget-cut scenario that requires a reduction in the number of
  ground-based sensors. To minimize loss of information\, the objective
  is to
  retain sensors that are the most useful as a source of labeled data. Th
 e proposed goodness criterion for the selection is how close the accurac
 y of a regression model built on data from a reduced sensor set is to th
 e accuracy of a model built on the entire set of sensors. We developed a
 n iterative method that removes sensors one by one from locations where 
 AOD can be predicted most accurately using training data from the remain
 ing sites. Extensive experiments on two years of globally distributed AE
 RONET ground-based sensor data provide strong evidence that sensors sele
 cted using the proposed algorithm are more informative than the competin
 g approaches that select sensors at random or that select sensors based 
 on spatial diversity.
SUMMARY:W02> Reduction of Ground-Based Sensor Sites for Spatio-Temporal A
 nalysis of Aerosols
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5172
DESCRIPTION:Current research on data stream classification mainly focuses
  on supervised learning\, in which a fully labeled data stream is needed
  for training. However\, fully labeled data streams are expensive to obt
 ain\, which makes the supervised learning approach difficult to apply
  to real-life applications. In this paper\, we model applications\, suc
 h as credit fraud detection and intrusion detection\, as a one-class dat
 a stream classification problem. The cost of fully labeling the data str
 eam is reduced as users only need to provide some positive samples toget
 her with the unlabeled samples to the learner. Based on VFDT and POSC4.5
 \, we propose our OcVFDT (One-class Very Fast Decision Tree) algorithm. 
 Experimental study on both synthetic and real-life datasets shows that t
 he OcVFDT has excellent classification performance. Even when 80% of the
  samples in the data stream are unlabeled\, the classification
  performance of OcVFDT is still very close to that of VFDT\, which is
  trained on a fully labeled stream.
SUMMARY:W02> OcVFDT: One-class Very Fast Decision Tree for One-class Clas
 sification of Data Streams
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5173
DESCRIPTION:In this paper\, we presented a frequent-pattern-based
  framework for event detection in stream data. It consists of three
  phases: frequent pattern discovery\, frequent pattern selection\, and
  modeling. In the first phase\, an MNOE (Mining Non-Overlapping Episode)
  algorithm is proposed to find the non-overlapping frequent patterns in
  time series. In the frequent pattern selection phase\, we proposed an
  EGMAMC (Episode Generated Memory Aggregation Markov Chain) model to
  help us select episodes that describe the stream data significantly.
  Then we defined feature flows to represent the instances of discovered
  frequent patterns and cat
 egorized the distribution of frequent pattern instances into three categ
 ories according to the spectrum of their feature flows. Finally\, we pro
 posed a clustering algorithm EDPA (Event Detection by Pattern Aggregatio
 n) to aggregate strongly correlated frequent patterns together. We argue
  that strongly correlated frequent patterns form events and frequent pat
 terns in different categories can be aggregated to form different kinds 
 of events. Experiments on real-world sensor network datasets demonstrate
  that the proposed MNOE algorithm is more efficient than the existing no
 n-overlapping episode mining algorithm and EDPA performs better when the
  input frequent patterns are maximal\, significant and non-overlapping.
SUMMARY:W02> A Frequent Pattern Based Framework for Event Detection in Se
 nsor Network Stream Data
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5174
DESCRIPTION:In regression problems where the number of predictors exceeds
  the number of observations and the correlation between the predictors i
 s high\, a dimensionality reduction or a variable selection approach is 
 demanded. In this paper we deal with a real application where we want to
  retrieve the physical characteristics of a combustion process from the 
 measurements obtained with a spectroscopic sensor. This application
  exhibits a multicollinearity problem and is furthermore considered an
  ill-posed problem. Guided by this application scenario\, we propose a
  clustering approach to find homogeneous subsets of data that are
  embedded in arbitrarily oriented linear manifolds. This model is
  developed under certain assumptions guided by a priori problem
  knowledge. The resulting division preserves both the a priori
  assumptions and the homogeneity in the models. We thereby break the
  whole problem into n subproblems\, improving the individual prediction
  accuracy compared with a global solution. We show the improvements
  obtained in a real application scenario related to estimating the
  temperature from spectroscopic data in a remote sensing framework.
SUMMARY:W02> Supervised Clustering via Principal Component Analysis in a 
 Retrieval Application
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5175
DESCRIPTION:Clustering validation and clustering interpretation are the
  last two steps of the clustering process. The validation step evaluates
  the goodness of clustering results using some measures. Valid results
  are then generally interpreted and used in cluster analysis. The va
 lidity measures are classified into three categories: unsupervised meas
 ures\, supervised measures and relative measures. Several supervised mea
 sures have been proposed to perform supervised evaluation such as entrop
 y\, purity\, F-measure\, Jaccard coefficient and Rand statistic. General
 ly\, these measures evaluate results according to class labels. However\
 , they are not always able to distinguish interpretable clusters because
  most of them depend on the number of labels. This paper proposes a new
  supervised evaluation measure\, called "homogeneity degree"\, that
  merges the steps of validation and interpretation. Our measure is a
 pplied to a real traffic data set and is used to interpret some traffic 
 situations. Comparison with other evaluation measures shows the performa
 nce of our proposal.
SUMMARY:W02> A Novel Measure for Validating Clustering Results Applied to
  Road Traffic
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5176
DESCRIPTION:Skyline queries have gained attention for supporting multicri
 teria analysis of large-scale datasets. While a lot of skyline algorithm
 s have been proposed\, most of the algorithms build upon pre-computed in
 dex structures that cannot generally be supported over sensor data of dy
 namically changing attribute values. We aim to design a scalable non-ind
 ex skyline computation algorithm for sensor data. More specifically\, we
  propose Algorithm SkyTree constructing a dynamic lattice that divides a
  specific region into several subregions based on a pivot point maximizi
 ng dominance region. Such a structure enables region-wise dominance
  tests\, which eliminate unnecessary point-wise dominance tests. In
  addition\, we ensure the progressiveness that has not been supported by
  any non-index algorithm\, where we can identify k points maximizing the
  sum of dominance regions as the greedy approximation method. The k poin
 ts are used to reduce communication cost between sensors in computing gl
 obal skyline. Our evaluation results validate the efficiency of Algorith
 m SkyTree\, both in terms of response time and communication overhead\, 
 over existing algorithms.
SUMMARY:W02> SkyTree: Scalable Skyline Computation for Sensor Data
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5177
DESCRIPTION:In this paper\, a k-means-based clustering method applied to 
 power quality event data is described. The data are collected by the pow
 er quality (PQ) monitors\, which are developed through the National PQ P
 roject and installed on the electricity network. The PQ monitors detect 
 the PQ events defined as voltage sags\, swells\, and interruptions by th
 e IEC Standard 61000-4-30\, and collect the raw data of the event. The p
 roposed method aims to cope with the huge event data size and cluster th
 e event types so that PQ events are ultimately classified. The method he
 lps to manage the event data to come up with PQ assessments for the spec
 ific measurement points and to make comparisons of various measurement p
 oints in terms of PQ of the electricity network.
SUMMARY:W02> Clustering of Power Quality Event Data Collected via Monitor
 ing Systems Installed on the Electricity Network
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5178
DESCRIPTION:The changes in rainfall and temperature patterns over India w
 ere detected using Mann-Kendall trend test\, Bayesian change point analy
 sis\, and a hidden Markov model. A regionalization method was developed 
 to identify homogeneous regions that experience similar weather states. 
 The regionalization helped in finding contiguous regions with strong chang
 e signals. The data were investigated at different temporal and spatial r
 esolutions to explore the nature of changes. The study found that all Ind
 ia summer monsoon is stable\, but the winter or the north-east monsoon i
 s gradually intensifying. It also detected an abrupt drop in the winter 
 and spring temperature over north-central India and a gradual increase i
 n the summer temperature over peninsular India. The robustness of the de
 tected changes was evaluated using a recent reanalysis dataset.
SUMMARY:W02> Change Detection in Rainfall and Temperature Patterns over I
 ndia
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5179
DESCRIPTION:Knowledge discovery from temporal\, spatial and spatio-tempor
 al data is pivotal for understanding and predicting the behavior of Eart
 h's ecosystem model. An important influence leaving its impact on the e
 cosystem is the global climate system. In this paper\, the Earth Science
  data that we have analyzed consists of daily global air temperature and
  precipitation measurements\, aggregated from heterogeneous sensors for 
 fifty years (1950-1999). The enormous amount of data that is available f
 or analysis requires employment of data mining techniques for discoverin
 g interesting patterns\, detecting significant changes and extracting me
 aningful insights from the data. Our work considers the problem of detec
 ting anomalous (abnormal or unexpected) behavior in the global climate s
 ystem\, discovering teleconnection patterns and providing consequential 
 insights to the analysts.
SUMMARY:W02> Anomaly Detection and Spatio-Temporal Analysis of Global Cli
 mate System
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5180
DESCRIPTION:Climate modeling and analysis of climate change have largely 
 been based on forward simulation with physical models. We propose here a
  data centric approach to climate study based solely on the actual obser
 ved data. This novel approach utilizes a variety of relevant statistical
  modeling and machine learning techniques such as spatial-temporal causa
 l modeling and extreme value modeling\, and suggests multiple future res
 earch directions. We will describe preliminary results using data for No
 rth America from CRU\, NOAA\, NASA\, NCDC\, and CDIAC\, as well as certa
 in technical challenges encountered. It is hoped that this alternative p
 erspective will help uncover new insights\, improve aspects of simulatio
 n models with known uncertainties\, and provide a useful complementary a
 pproach to climate study.
SUMMARY:W02> A data modeling approach to climate change attribution
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5181
DESCRIPTION:Sensor networking is a paradigm getting familiar in space mis
 sions and services. The talk will provide a panoramic view of examples o
 f challenging missions related to earth environment and to space explora
 tion\, where sensor knowledge discovery techniques might become instrume
 ntal to fulfill mission objectives. ESA missions such as GMES (Global Mo
 nitoring for Environment and Security) and the series of possible Mars e
 xploration missions will be presented and put in context with the topic 
 of the workshop.
SUMMARY:W02> Space Missions & Sensor Networking: Challenging Scenarios
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5182
DESCRIPTION:In this talk\, we tackle a fundamental problem that arises wh
 en using sensors to monitor the ecological condition of rivers and lakes
 \, the network of pipes that bring water to our taps\, or the activities
  of an elderly individual when sitting on a chair: Where should we place
  the sensors in order to make effective and robust predictions? Such sen
 sing problems are typically NP-hard\, and in the past\, heuristics witho
 ut theoretical guarantees about the solution quality have often been use
 d. In this talk\, we present algorithms which efficiently find provably 
 near-optimal solutions to large\, complex sensing problems. Our algorith
 ms are based on the key insight that many important sensing problems exh
 ibit submodularity\, an intuitive diminishing returns property: Adding a
  sensor helps more the fewer sensors we have placed so far.  In addition
  to identifying most informative locations for placing sensors\, our alg
 orithms can handle settings\, where sensor nodes need to be able to reli
 ably communicate over lossy links\, where mobile robots are used for col
 lecting data or where solutions need to be robust against adversaries an
 d sensor failures. We present results applying our algorithms to several
  real-world sensing tasks\, including environmental monitoring using rob
 otic sensors\, activity recognition using a built sensing chair\, and a 
 sensor placement competition. We conclude by drawing an interesting co
 nnection between sensor placement for water monitoring and addressing th
 e challenges of information overload on the web.  As examples of this co
 nnection\, we address the problem of selecting blogs to read in order to
  learn about the biggest stories discussed on the web\, and personalizin
 g content to turn down the noise in the blogosphere.
SUMMARY:W02> How Optimized Environmental Sensing Helps Address Informatio
 n Overload on the Web
LOCATION:St Germain des Prés A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T100000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T091000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5183
DESCRIPTION: 
SUMMARY:W07> Herd It: Designing A Social Game to Tag Music
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T093500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T092500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5184
DESCRIPTION: 
SUMMARY:W07> KissKissBan: A Competitive Human Computation Game for Image 
 Annotation
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T095500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T094000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5185
DESCRIPTION: 
SUMMARY:W07> Community-based Game Design: Experiments on Social Games for
  Commonsense Data Collection
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T110000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T100000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5186
DESCRIPTION: 
SUMMARY:W07> A Demonstration Of Human Computation Using The Phrase Detect
 ives Annotation Game
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T110000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T100000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5187
DESCRIPTION: 
SUMMARY:W07> Picture This: Preferences for Image Search
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T110000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T100000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5188
DESCRIPTION: 
SUMMARY:W07> Page Hunt: Using Human Computation Games to Improve Web Sear
 ch
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T110000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T100000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5189
DESCRIPTION: 
SUMMARY:W07> TurKit: Tools for Iterative Tasks on Mechanical Turk
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T110000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T100000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5190
DESCRIPTION: 
SUMMARY:W07> Search War: A Game for Improving Web Search
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T110000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T100000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5191
DESCRIPTION: 
SUMMARY:W07> Magic Bullet: A Dual-Purpose Computer Game
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T110000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T100000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5192
DESCRIPTION: 
SUMMARY:W07> Seaweed: A Web Application for End Users to Design Economic 
 Games
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T110000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T100000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5193
DESCRIPTION: 
SUMMARY:W07> Thumbs-Up: A Game for Playing to Rank Search Results
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T110000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T100000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5194
DESCRIPTION: 
SUMMARY:W07> Games for Games: Manipulating Contexts in Human Computation 
 Games
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T110000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T100000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5195
DESCRIPTION: 
SUMMARY:W07> From Active Towards InterActive Learning: Using Consideratio
 n Information to Improve Labeling Correctness
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T110000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T100000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5196
DESCRIPTION: 
SUMMARY:W07> TagCaptcha: Annotating images with CAPTCHAs
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T110000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T100000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5197
DESCRIPTION: 
SUMMARY:W07> CAPTCHA-based Image Labeling on the Soylent Grid\, Peter Fay
 monville
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T110000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T100000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5198
DESCRIPTION: 
SUMMARY:W07> Designing Crowdsourcing Community for the Enterprise
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T110000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T100000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5199
DESCRIPTION: 
SUMMARY:W07> A Reputation System for Selling Human Computation
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T111000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T110000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5200
DESCRIPTION: 
SUMMARY:W07> The Role of Game Theory in Human Computation Systems
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T113000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T111500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5201
DESCRIPTION: 
SUMMARY:W07> On Formal Models for Social Verification
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T115000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T113500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5202
DESCRIPTION: 
SUMMARY:W07> Efficient Human Computation: the Distributed Labeling Proble
 m
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T121000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T115500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5203
DESCRIPTION: 
SUMMARY:W07> Financial Incentives and the "Performance of Crowds"
LOCATION:La Sorbonne A\, B & C
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T092500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5204
DESCRIPTION: 
SUMMARY:W05> Ad Quality on TV: Predicting Television Audience Retention
LOCATION:Miles Davis B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T095000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T092500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5205
DESCRIPTION: 
SUMMARY:W05> Handling Missing Values in GPS Surveys Using Survival Analys
 is
LOCATION:Miles Davis B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T105500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T103000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5206
DESCRIPTION: 
SUMMARY:W05> A Markov Chain Model for Integrating Behavioral Targeting in
 to Contextual Ads
LOCATION:Miles Davis B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T112000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T105500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5207
DESCRIPTION: 
SUMMARY:W05> Probabilistic Latent Semantic User Segmentation for Behavior
 al Targeted Advertising
LOCATION:Miles Davis B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T114500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T112000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5208
DESCRIPTION: 
SUMMARY:W05> Argo: Intelligent Advertising by Mining a User's Interest fr
 om His Photo Collections
LOCATION:Miles Davis B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T121000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T114500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5209
DESCRIPTION: 
SUMMARY:W05> Scalable Clustering and Keyword Suggestion for Online Advert
 isements
LOCATION:Miles Davis B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T121000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5210
DESCRIPTION: 
SUMMARY:W05> Inferring Local Synonyms for Improving Keyword Suggestion in
  an On-line Advertisement System
LOCATION:Miles Davis B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T150000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T140000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5211
DESCRIPTION: 
SUMMARY:W05> Invited Talk: Brand Advertising\, On-line Audiences\, and So
 cial Media
LOCATION:Miles Davis B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T152500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T150000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5212
DESCRIPTION: 
SUMMARY:W05> Data-Driven Text Features for Sponsored Search Click Predict
 ion
LOCATION:Miles Davis B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T162500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T160000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5213
DESCRIPTION: 
SUMMARY:W05> Revenue Optimization with Relevance Constraint in Sponsored 
 Search
LOCATION:Miles Davis B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T165000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T162500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5214
DESCRIPTION: 
SUMMARY:W05> Pricing Guidance in Ad Sale Negotiations: The PrintAds Examp
 le
LOCATION:Miles Davis B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T171500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T165000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5215
DESCRIPTION: 
SUMMARY:W05> Online Allocation of Display Advertisements Subject to Advan
 ced Sales Contracts
LOCATION:Miles Davis B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T100000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5216
DESCRIPTION:Phishing emails usually contain a message from a credible loo
 king source requesting a user to click a link to a website where she/he 
 is asked to enter a password or other confidential information. Most phi
 shing emails aim at withdrawing money from financial institutions or get
 ting access to private information. Phishing has increased enormously ov
 er the last years and is a serious threat to global security and economy
 . There are a number of possible countermeasures to phishing. These rang
 e from communication-oriented approaches like authentication protocols o
 ver blacklisting to content-based filtering approaches.  We argue that t
 he first two approaches are currently not broadly implemented or exhibit
  deficits. Therefore content-based phishing filters are necessary and wi
 dely used to increase communication security. A number of features are e
 xtracted capturing the content and structural properties of the email. S
 ubsequently a statistical classifier is trained using these features on 
 a training set of emails labeled as ham (legitimate)\, spam or phishing.
  This classifier may then be applied to an email stream to estimate the 
 classes of new incoming emails.  AntiPhish is a specific targeted resear
 ch project funded under Framework Program 6 by the European Union. It
  aims at developing improved anti-phishing technologies that help to pro
 tect and secure the global email communication infrastructure. The proje
 ct on the one hand developed the filter methodology in a test laboratory
  setting\, but on the other hand implemented this technology in real wor
 ld settings\, to be used to filter all email traffic online in real time
 . In this talk we summarize our experience with phishing filtering with 
 benchmark data and in addition with different real-life email streams.  
 First we describe a number of novel features that are particularly well-
 suited to identify phishing emails [1]. These include statistical models
  for the low-dimensional descriptions of email topics\, sequential analy
 sis of email text and external links\, the detection of embedded logos a
 s well as indicators for hidden salting [2]. Hidden salting is the inten
 tional addition or distortion of content not perceivable by the reader. 
 For empirical evaluation we have obtained a large realistic corpus of em
 ails pre-labeled as spam\, phishing\, and ham (legitimate). In experimen
 ts with benchmark data our methods outperform other published approaches
  for classifying phishing emails.  The second part of the talk describes
  the application of these approaches to real-life email streams. On the 
 one hand we investigate how we can identify new phishing emails arriving
  from a honeypot system. This allows to spot new types of phishing mails
 . Subsequently the characteristics of these new phishing emails can be u
 sed to update client-based phishing filters. A second experiment investi
 gates the capabilities of the AntiPhish system when monitoring emails in
  an ISP framework. It turns out that active learning approaches are very
  efficient to maintain and improve filtering accuracy. We discuss the im
 plications of these results for the practical application of this approa
 ch in the workflow of an email provider. Finally we describe a strategy 
 how the filters may be updated and adapted to new types of phishing.
SUMMARY:W03> Invited Talk "AntiPhish - Lessons Learnt"
LOCATION:Miles Davis A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T110000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T103000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5217
DESCRIPTION:Chen Yo-Shu\, Chen Yi-Ming.\n\nTraditional Hidden Markov Model
  (HMM) has been successfully applied to anoma
 ly intrusion detection. Incremental HMM (IHMM) further improves the trai
 ning time of HMM. However\, both HMM and IHMM still have the problem of 
 high false positive rate. In this paper\, we propose an Adaboost-IHMM to
  combine IHMM and adaboost for anomaly intrusion detection. As adaboost 
 firstly uses many IHMMs to collectively classify samples then decides th
 e results of samples' classifications\, the Adaboost-IHMM can improve t
 he accuracy rate of classifications. Experimental results with Stide dat
 asets show that the proposed method can significantly improve the false 
 positive rate by 70% without decreasing detection rate. Besides\, we als
 o propose a method to adjust the normal profile for avoiding erroneous d
 etection caused by changes of normal behavior. We perform experimen
 ts with realistic datasets extracted from the use of popular browsers. C
 ompared with traditional HMM method\, our method can improve the trainin
 g time by 90% to build a new normal profile.
SUMMARY:W03> Combining Incremental Hidden Markov Model and Adaboost Algor
 ithm for Anomaly Intrusion Detection
LOCATION:Miles Davis A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T113000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T110000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5218
DESCRIPTION:In network traffic monitoring\, and more particularly in the 
 realm of threat intelligence\, the problem of "attack attribution" refer
 s to the process of effectively attributing new attack events to (un)-kn
 own phenomena\, based on some evidence or traces left on one or several 
 monitoring platforms. Real-world attack phenomena are often largely dist
 ributed on the Internet\, or can sometimes evolve quite rapidly. This ma
 kes them inherently complex and thus difficult to analyze. In general\, 
 an analyst must consider many different attack features (or criteria) in
  order to decide about the plausible root cause of a given attack\, or t
 o attribute it to some given phenomenon. In this paper\, we introduce a 
 global analysis method to address this problem in a systematic way. Our 
 approach is based on a novel combination of a knowledge discovery techni
 que with a fuzzy inference system\, which somehow mimics the reasoning o
 f an expert by implementing a multi-criteria decision-making process bui
 lt on top of the previously extracted knowledge. By applying this method
  on attack traces\, we are able to identify large-scale attack phenomena
  with a high degree of confidence. In most cases\, the observed phenomena
  can be attributed to so-called zombie armies - or botnets\, i.e. groups
  of compromised machines controlled remotely by the same entity. By means
  of experiments with real-world attack traces\, we show how this method c
 an effectively help us to perform a behavioral analysis of those zombie 
 armies from a long-term\, strategic viewpoint.
SUMMARY:W03> Addressing the Attack Attribution Problem using Knowledge Di
 scovery and Multi-criteria Fuzzy Decision-Making
LOCATION:Miles Davis A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T120000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T113000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5219
DESCRIPTION:Commercial anti-virus software is unable to provide protecti
 on against newly launched (a.k.a. "zero-day") malware. In this paper\,
  we propose a novel malware detection technique which is based on the ana
 lysis of byte-level file content. The novelty of our approach\, compared
  with existing content based mining schemes\, is that it does not memori
 ze specific byte-sequences or strings appearing in the actual file conte
 nt. Our technique is non-signature based and therefore has the potential
  to detect previously unknown and zero-day malware. We compute a wide ra
 nge of statistical and information-theoretic features in a block-wise ma
 nner to quantify the byte-level file content. We leverage standard data 
 mining algorithms to classify the file content of every block as normal 
 or potentially malicious. Finally\, we correlate the block-wise classifi
 cation results of a given file to categorize it as benign or malware. Si
 nce the proposed scheme operates on byte-level file content\,
  it does not require any a priori information about the filetype. W
 e have tested our proposed technique using a benign dataset comprising
  six different filetypes (DOC\, EXE\, JPG\, MP3\, PDF and ZIP) and a m
 alware dataset comprising six different malware types (backdoor\, t
 rojan\, virus\, worm\, constructor and miscellaneous). We also perform a
  comparison with existing data mining based malware detection techniques.
  The results of our experiments show that the proposed non-signature base
 d technique surpasses the existing techniques and achieves more than 90%
  detection accuracy.
SUMMARY:W03> Malware Detection Using Statistical Analysis of Byte-Level F
 ile Content
LOCATION:Miles Davis A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T120000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5220
DESCRIPTION:In adversarial systems\, the performance of a classifier de
 creases after it is deployed\, as the adversary learns to defeat it. Rec
 ently\, adversarial data mining was introduced as a solution to this\, w
 here the classification problem is viewed as a game mechanism between an
  adversary and an intelligent and adaptive classifier. Over the last yea
 rs\, phishing fraud through malicious email messages has been a serious 
 threat that affects global security and economy\, where traditional spam
  filtering techniques have proven ineffective. In this domain\, usi
 ng dynamic games of incomplete information\, a game theoretic data minin
 g framework is proposed in order to build an adversary aware classifier 
 for phishing fraud detection. To build the classifier\, an online versio
 n of the Weighted Margin Support Vector Machines with a game theoretic p
 rior knowledge function is proposed. In this paper\, a new content-bas
 ed feature extraction technique for phishing filtering is described. Exp
 eriments show that the proposed classifier is highly competitive compare
 d with previously proposed online classification algorithms in this adve
 rsarial environment\, and promising results were obtained using traditi
 onal machine learning techniques over extracted features.
SUMMARY:W03> Online Phishing Classification Using Adversarial Data Mining
  and Signaling Games
LOCATION:Miles Davis A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T144000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T140000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5221
DESCRIPTION:Data is a critical resource in numerous organizations. One of
  the challenging problems facing these organizations today is to ensure 
 that only authorized individuals have access to data. Data also has to
  be protected from malicious corruption. Much of the early work on data s
 ecurity focused on multilevel secure data management systems where users
  have different clearance levels and data has different sensitivity leve
 ls and access to data is governed by the security policies. There were m
 any efforts on securing relational\, distributed and object oriented dat
 abases. More recently\, several aspects of data security are being inves
 tigated including data confidentiality\, integrity\, trust and privacy. 
 Furthermore\, securing data warehouses\, semantic web\, as well as apply
 ing data mining for solving security problems are getting a lot of atten
 tion. This presentation will review the developments in data security and
  integrity as well as discuss directions for further research and develo
 pment. In particular\, policy management for the semantic web\, assured 
 information sharing\, privacy preserving data mining and novel ways to b
 uild secure data management systems will be discussed.
SUMMARY:W03> Invited Talk: Data Security and Integrity: Developments and 
 Directions
LOCATION:Miles Davis A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T150500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T144000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5222
DESCRIPTION:While millions of dollars have been invested in information t
 echnologies to improve intelligence information sharing among law enforc
 ement agencies at the Federal\, Tribal\, State and Local levels\, there 
 remains a hesitation to share information between agencies. This lack of
  coordination hinders the ability to prevent and respond to crime and te
 rrorism. Work to date has not produced solutions nor widely accepted par
 adigms for understanding the problem. Therefore\, to enhance the current
  intelligence information sharing services between government entities\,
  in this interdisciplinary research\, we have identified three major are
 as of influence: Technical\, Social\, and Legal. Furthermore\, we have
  developed a preliminary model and theory of intelligence information sha
 ring through a literature review\, experience and interviews with practi
 tioners in the field. This model and theory should serve as a basic conc
 eptual framework for further academic work and lead to further investiga
 tion and clarification of the identified factors and the degree of impac
 t they exert on the system so that actionable solutions can be identifie
 d and implemented.
SUMMARY:W03> Towards Trusted Intelligence Information Sharing
LOCATION:Miles Davis A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T153000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T150500
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5223
DESCRIPTION:Intelligence and law enforcement forces make use of terrorist
  and criminal social networks to support their investigations such as ide
 ntifying suspects\, terrorist or criminal subgroups\, and their communic
 ation patterns. Social networks are valuable resources but it is not eas
 y to obtain information to create a complete terrorist or criminal socia
 l network. Missing information in a terrorist or criminal social network
  always diminishes the effectiveness of investigation. An individual agenc
 y only has a partial terrorist or criminal social network due to its l
 imited information sources. Sharing and integration of social networks b
 etween different agencies increase the effectiveness of social network a
 nalysis. Unfortunately\, information sharing is usually forbidden due to
  the concern of privacy preservation. In this paper\, we introduce the K
 NN algorithm for subgraph generation and a mechanism to integrate the ge
 neralized information to conduct social network analysis. Generalized in
 formation such as lengths of the shortest paths\, number of nodes on the
  boundary\, and the total number of nodes is constructed for each genera
 lized subgraph. By utilizing the generalized information shared from ot
 her sources\, an estimation of distance between nodes is developed to co
 mpute closeness centrality. Two experiments have been conducted with ran
 dom graphs and the Global Salafi Jihad terrorist social network. The res
 ult shows that the proposed technique improves the accuracy of closeness
  centrality measures substantially while protecting the sensitive data.
SUMMARY:W03> Social Networks Integration and Privacy Preservation using S
 ubgraph Generalization
LOCATION:Miles Davis A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T163000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T160000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5224
DESCRIPTION:The United States and its Allied Forces have had tremendous s
 uccess in combat operations. This includes combat in Germany\, Japan and
  more recently in Iraq and Afghanistan. However not all of our stabiliza
 tion and reconstruction operations (SARO) have been as successful. Recen
 tly several studies have been carried out on SARO by National Defense Un
 iversity as well as for the Army Science and Technology. One of the majo
 r conclusions is that we need to plan for SARO while we are planning for
  combat. That is\, we cannot start planning for SARO after the enemy reg
 ime has fallen. In addition\, the studies have shown that security\, pow
 er and jobs are key ingredients for success during SARO. It is important
  to give positions to some of the power players from the fallen regime p
 rovided they are trustworthy. It is critical that investments are made t
 o stimulate the local economies. The studies have also analyzed the vari
 ous technologies that are needed for successfully carrying out SARO whic
 h include sensors\, robotics and information management. In this projec
 t we will focus on the information management component for SARO. As sta
 ted in the work by the Naval Postgraduate School\, we need to determine 
 the social\, political and economic relationships between the local comm
 unities as well as determine who the important people are. This work has
  also identified the 5Ws (Who\, When\, What\, Where and Why) and the (H)
 . To address the key technical challenges for SARO\, we are defining a L
 ife cycle for SARO and subsequently developing a Temporal Geosocial Serv
 ice Oriented Architecture System (TGSSOA) that utilizes Temporal Geosoci
 al Semantic Web (TGS-SW) technologies for managing this lifecycle. We ar
 e developing techniques for representing temporal geosocial information 
 and relationships\, integrating such information and relationships\, que
 rying such information and relationships and finally reasoning about suc
 h information and relationships so that the commander can answer questio
 ns related to the 5Ws and H. To our knowledge we believe that this is th
 e first attempt to develop.
SUMMARY:W03> Design of a Temporal Geosocial Semantic Web for Military Sta
 bilization and Reconstruction Operation
LOCATION:Miles Davis A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T170000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T163000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5225
DESCRIPTION:developed\, used\, and criticized frequently in the recent pa
 st. This paper examines several of the more common criticisms and analyz
 es some factors that bear on whether the criticisms are valid and/or can
  be overcome by appropriate design and use of the data mining applicatio
 n.
SUMMARY:W03> On the Efficacy of Data Mining for Security Applications
LOCATION:Miles Davis A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T170000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5226
DESCRIPTION:It is believed that public companies should have put lots of 
 efforts and resources in designing and implementing effective security p
 olicy in their daily information processing and management against poten
 tial cyber attacks. A company web server accessible by the general publi
 c and attackers is usually a common entry point for cyber attacks. This 
 paper studies and reports the security problems in web servers of public
  companies. We applied several commonly used tools and systems to collec
 t information from publicly accessible web servers of selected public co
 mpanies\, and studied some known security aspects in those public compan
 ies. Our findings will provide insight into the effectiveness of web se
 rvers in public companies against cyber attacks. This paper also propose
 s a risk analysis tool for cyber attacks\, which is known as pyramid ris
 k analysis tool.
SUMMARY:W03> A Study of Online Service and Information Exposure of Public
  Companies
LOCATION:Miles Davis A
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5227
DESCRIPTION:Random Forests (RF) is an ensemble method which has become wi
 dely accepted within the machine learning and bioinformatics communities
  in the last few years. Its predictive strength\, along with some of the
  ingredients ---rich in information--- provided by the output\, has made
  RF an efficient Data Mining tool for discovering patterns in data. In t
 his paper we review the learning mechanism of RF within the classificati
 on setting and apply it to uncover bivariate interactions\, carrying
  useful information about an outcome\, in high dimensional low sample dat
 a. We propose a divide and conquer search strategy in the variable space
  that benefits from the ranking of variable importances of RF at a first
  stage\, along with the out of bag error rate (oob) of the ensemble at a
  second stage. The procedure combines both elements in order to capture 
 difficult to uncover patterns in this type of data. We will show the pe
 rformance of our procedure in some synthetic scenarios and will give a r
 eal application to a microarray data set in order to illustrate how it w
 orks.
SUMMARY:W01> Using Random Forests to uncover bivariate interactions in hi
 gh dimensional small data sets
LOCATION:Saint Michel
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153739Z
UID:5228
DESCRIPTION: 
SUMMARY:W01> Identification of structurally important amino acids in prot
 eins by graph-theoretic measures
LOCATION:Saint Michel
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5229
DESCRIPTION:Dependency analysis is an important but computationally deman
 ding problem in all empirical science. It is especially problematic in b
 ioinformatics\, where data sets are often high dimensional\, dense and/o
 r strongly correlated. As a solution\, we introduce a new algorithm whic
 h searches the most significant association rules expressing positive de
 pendencies. The algorithm uses several effective pruning principles\, wh
 ich enable search without any minimum frequency thresholds. According to
  our initial experiments\, the algorithm is especially suited to typic
 al biological and medical data sets.
SUMMARY:W01> Lift-based search for significant dependencies in dense data
  sets
LOCATION:Saint Michel
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5230
DESCRIPTION:Parametric edit distance based classification has been applie
 d to two significant problems in the bioinformatics area: biological seq
 uence analysis (DNA\, RNA\, protein)\, and semantic relationship extract
 ion from biomedical scientific literature.  This method is based on the 
 edit distance measure on sequences\, with parametric costs for matching\
 , mis-matching\, inserts\, and deletes of letters. We present a proof th
 at finding optimal parameter values for such classification based on tra
 ining data is an NP-hard problem\, which is an important claim to justif
 y the use of heuristic methods for determining the best parameter values
 .
SUMMARY:W01> Finding Optimal Parameters for Edit Distance Based Sequence 
 Classification is NP-Hard
LOCATION:Saint Michel
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T123000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T103000
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5231
DESCRIPTION:Inductive Logic Programming (ILP) systems have been successfu
 lly applied to solve complex biological problems by viewing them as binar
 y classification tasks. It remains an open question how an accurate solu
 tion to a multi-class problem can be obtained by using a logic based lea
 rning method. In this paper we present a novel logic based approach to s
 olve complex and challenging multi-class classification problems in bioi
 nformatics by focusing on a particular task\, namely protein fold recogn
 ition. Our technique is based on the use of large margin kernel-based me
 thods in conjunction with first order rules induced by an ILP system. Th
 e proposed  approach learns a multi-class classifier by using a divide a
 nd conquer reduction strategy that splits multi-classes into binary grou
 ps and solves each individual problem recursively hence generating an un
 derlying decision list structure.  The method is applied to assigning pr
 otein domains to folds. Experimental evaluation of the method demonstrat
 es the efficacy of the proposed approach to solving complex multi-class 
 classification problems in bioinformatics.
SUMMARY:W01> Multi-Class Protein Fold Recognition using Large Margin Logi
 c based Divide and Conquer Learning
LOCATION:Saint Michel
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5232
DESCRIPTION:In protein sequence alignment algorithms\, a substitution mat
 rix of 20x20 alignment parameters is used to describe the rates of amino
  acid substitutions over time. Development and evaluation of most substi
 tution matrices including the BLOSUM family [1] was based almost entirel
 y on fully structured proteins. Structurally disordered proteins (i.e. p
 roteins that lack structure\, either in part or as a whole) that have be
 en shown to be very common in nature [2] have a significantly different 
 amino acid composition than ordered (i.e. structured) proteins [3]. Furt
 hermore\, the sequence evolution rate is higher in unstructured as compa
 red to structured regions of proteins containing both structured and uns
 tructured regions [4]. These results cast doubt on appropriateness of th
 e BLOSUM substitution matrices for alignment of structurally disordered 
 proteins [5]. To address this problem\, we take into account the conc
 ept of structural disorder by extending the alphabet for sequence repres
 entation from 20 to 2x20=40 symbols\, 20 for amino acids in disordered r
 egions and 20 for amino acids in ordered regions. A 40x40 substitution m
 atrix is required for alignment of sequences represented in the extended
  alphabet. Such an expanded matrix contains 20x20 submatrices that corre
 spond to matching ordered-ordered\, ordered-disordered\, and disordered-
 disordered pairs of residues. In this paper we describe an iterative pro
 cedure that we used to estimate such a 40x40 substitution matrix. The it
 erative procedure converged with stable results with respect to the choi
 ce of the sequences in the dataset. In the obtained 40x40 matrix we foun
 d substantial differences between the 20x20 submatrices corresponding to
  ordered-ordered\, ordered-disordered\, and disordered-disordered region
  matching. These differences provide evidence that for alignment of prot
 ein sequences that contain disordered segments\, the discovered substitu
 tion matrix is more appropriate than the BLOSUM substitution matrices. A
 t the same time\, the new substitution matrix is applicable for sequence
  alignment of fully ordered proteins as its order-order submatrix is ver
 y similar to a BLOSUM matrix.
SUMMARY:W01> Protein Sequence Alignment and Structural Disorder: A Substi
 tution Matrix for an Extended Alphabet
LOCATION:Saint Michel
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5233
DESCRIPTION:Experimental results often present a substantial fraction of 
 missing and censored values.  Here we propose a strategy to perform prin
 cipal component analysis under this specific incomplete information hypo
 thesis. Finally we show how to reconstruct the missing information in a 
 way consistent with the experimental observations.
SUMMARY:W01> Handling missing values and censored data in PCA of pharmaco
 logical matrices
LOCATION:Saint Michel
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5234
DESCRIPTION:Recently\, the principles of graph theory are being adopted t
 o address molecular and chemical structures investigations such as 3D pr
 otein structure prediction and spatial motifs discovery. Proteins have b
 een parsed into graphs according to several approaches and methods and t
 hen studied based on graph theory concepts and data mining tools. In thi
 s paper we make a brief survey on the most used graph-based representati
 ons and we propose a naïve method to help with the protein graph making
  since a key step of a valuable protein structure mining process is to b
 uild concise and correct graphs holding reliable information. We\, also\
 , show that some existing and widespread methods present remarkable weak
 nesses and don't really reflect the real protein conformation.
SUMMARY:W01> Comparing Graph-based Representations of Protein for Mining 
 Purposes
LOCATION:Saint Michel
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090000
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5235
DESCRIPTION:This paper gives a brief introduction into the biology of Tra
 nscription Factor Binding Sites (TFBSs) and their importance. It discu
 sses the present methods employed in detecting them\, which are typicall
 y based on single-site frequencies\, their shortcomings and how new data
  is likely to revolutionise the area in the near future.  The challenges
  associated with motif detection for methods that do not employ the stan
 dard approaches  are addressed. Finally\, there is a walk through  of so
 me of the available data sets for those interested in working on this pr
 oblem.
SUMMARY:W01> Can we improve on the identification of Transcription Factor
  Binding Sites?
LOCATION:Saint Michel
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T153000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T152000
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5236
DESCRIPTION:Uncertainty is often inherent to data and still there are jus
 t a few data mining algorithms that handle it. In this paper we focus on
  how to account for uncertainty in classification algorithms\, in partic
 ular when data attributes should not be considered completely truthful f
 or classifying a given sample. Our starting point is that each piece of 
 data comes from a potentially different context and\, by estimating cont
 ext probabilities of an unknown sample\, we may derive a weight that qua
 ntifies their influence.  We propose a lazy classification strategy that
  incorporates the uncertainty into both the training and usage of classi
 fiers. We also propose uK-NN\, an extension of the traditional K-NN that
  implements our approach. Finally\, we illustrate uK-NN\, which is curre
 ntly being evaluated experimentally\, using a document classification to
 y example.
SUMMARY:W11> Exploiting Contexts to Deal with Uncertainty in Classificati
 on
LOCATION:Les Invalides B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T152000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T151000
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5237
DESCRIPTION:The maximal clique enumeration (MCE) problem can be used to f
 ind very tightly-coupled collections of objects inside a network or grap
 h of relationships. However\, when such networks are based on noisy or u
 ncertain data\, the solutions to the MCE problem for several closely rel
 ated graphs may be necessary to accurately define the collections.  Thus
 \, we propose an algorithm that efficiently solves the MCE problem on alte
 red\, or perturbed\, graphs. The algorithm utilizes the enumeration of a
  baseline graph and identifies only those maximal cliques that the pertu
 rbation adds and/or removes. We detail the algorithm and the underlying 
 theory required to guarantee correctness. Further\, we report average ru
 ntime speedups of 7 and 9 for our algorithm over traditional enumeration
  techniques in the cases of adding and removing edges\, respectively\, f
 rom graphs constructed from protein interaction data.
SUMMARY:W11> On Perturbation Theory and an Algorithm for Maximal Clique E
 numeration in Uncertain and Noisy Graphs
LOCATION:Les Invalides B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T150500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T145000
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5238
DESCRIPTION:Mining of frequent patterns is one of the popular knowledge d
 iscovery and data mining (KDD) tasks. It also plays an essential role in
  the mining of many other patterns such as correlation\, sequences\, and
  association rules. Hence\, it has been the subject of numerous studies 
 since its introduction. Most of these studies find all the frequent patt
 erns from collections of precise data\, in which the items within each da
 tum or transaction are definitely known and precise. However\, there are
  many real-life situations in which the user is interested in only some 
 tiny portions of these frequent patterns. Finding all frequent patterns 
 would then be redundant and waste lots of computation. This calls for co
 nstrained mining\, which aims to find only those frequent patterns that 
 are interesting to the user. Moreover\, there are also many real-life si
 tuations in which the data are uncertain. This calls for uncertain data 
 mining. In this paper\, we propose an algorithm to efficiently find cons
 trained frequent patterns from collections of uncertain data.
SUMMARY:W11> Efficient Algorithms for Mining Constrained Frequent Pattern
 s from Uncertain Data
LOCATION:Les Invalides B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T144000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T140500
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5239
DESCRIPTION: 
SUMMARY:W11> Invited Talk "Managing and Mining Uncertain Data: What Might
  We Do Better?"
LOCATION:Les Invalides B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T145500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T144000
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5240
DESCRIPTION: 
SUMMARY:W11> Efficient Algorithms for Mining Constrained Frequent Pattern
 s from Uncertain Data
LOCATION:Les Invalides B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T151000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T145500
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5241
DESCRIPTION:
SUMMARY:W11> Identifying Graphs from Noisy and Incomplete Data
LOCATION:Les Invalides B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T163500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T160000
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5242
DESCRIPTION: 
SUMMARY:W11> Invited Talk "Querying and Mining Uncertain Data: Methods\,
  Applications\, and Challenges"
LOCATION:Les Invalides B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T165000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T163500
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5243
DESCRIPTION: 
SUMMARY:W11> Learning from Data with Uncertain Labels by Boosting Credal 
 Classifiers
LOCATION:Les Invalides B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T173000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T140000
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5244
DESCRIPTION: 
SUMMARY:W11> Using Uncertain Chemical and Thermal Data to Predict Product
  Quality in a Casting Process
LOCATION:Les Invalides B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T171500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T170500
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5245
DESCRIPTION: 
SUMMARY:W11> Lazy Naive Credal Classifier
LOCATION:Les Invalides B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T172500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T171500
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5246
DESCRIPTION: 
SUMMARY:W11> Decision Support and Profit Prediction for Online Auction Se
 llers
LOCATION:Les Invalides B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T094500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T090500
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5247
DESCRIPTION: 
SUMMARY:W08> Invited Talk "Tensor Decompositions and Applications: a Surv
 ey"
LOCATION:St Germain des Prés B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T100000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T094500
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5248
DESCRIPTION: 
SUMMARY:W08> Multi-Way Set Enumeration in Real-Valued Tensors
LOCATION:St Germain des Prés B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T110000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T102000
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5249
DESCRIPTION: 
SUMMARY:W08> Invited Talk "Factorizing Matrices with Missing Entries: Alt
 ernative Approaches"
LOCATION:St Germain des Prés B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T111500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T110000
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5250
DESCRIPTION: 
SUMMARY:W08> A Spectral-based Clustering Algorithm for Categorical Data U
 sing Data Summaries
LOCATION:St Germain des Prés B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T113000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T111500
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5251
DESCRIPTION: 
SUMMARY:W08> Sequential Latent Semantic Indexing
LOCATION:St Germain des Prés B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T120000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T113000
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5252
DESCRIPTION: 
SUMMARY:W08> Invited Talk "Tensors and n-d Arrays: Mathematics of Arrays
 \, Psi-Calculus\, and Composition of Tensor and Array Operations"
LOCATION:St Germain des Prés B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T121000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T120000
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5253
DESCRIPTION: 
SUMMARY:W08> Recent Advances in Tensor Decomposition
LOCATION:St Germain des Prés B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T122500
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T121000
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5254
DESCRIPTION: 
SUMMARY:W08> Accuracy of Distance Metric Learning Algorithms
LOCATION:St Germain des Prés B
END:VEVENT
BEGIN:VEVENT
DTEND;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T124000
DTSTART;TZID=Europe/Paris;VALUE=DATE-TIME:20090628T122500
DTSTAMP;VALUE=DATE-TIME:20120516T153740Z
UID:5255
DESCRIPTION: 
SUMMARY:W08> Efficient Computation of PCA with SVD in SQL
LOCATION:St Germain des Prés B
END:VEVENT
END:VCALENDAR

