Christos Faloutsos

Tutorial T3: Large Graph-Mining: Power Tools and a Practitioner's Guide
June 28 9:00AM
Numerous real-world datasets are in matrix form; thus matrix algebra, both linear and multilinear, provides important algorithmic tools for analyzing them. The main datasets of interest in this tutorial are graphs. Important datasets modeled as graphs include the Internet, the Web, social networks (e.g., Facebook, LinkedIn), computer networks, biological networks, and many more.

We will discuss how to represent a graph as a matrix (adjacency matrix, Laplacian) and the important properties of those representations. We will then show how these properties are used in several important problems, including node importance via random walks (PageRank), community detection (METIS, Cheeger inequality), graph isomorphism, and graph similarity. Important dimensionality reduction techniques (SVD and random projections) will be discussed in the context of graph mining problems.
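As a rough illustration of the matrix representations mentioned above, the following Python sketch builds an adjacency matrix and a combinatorial Laplacian for a small undirected graph and scores nodes with a power-iteration PageRank. It is not taken from the tutorial material: the function names, the damping factor of 0.85, and the toy star graph are all assumptions chosen for illustration.

```python
def adjacency_matrix(n, edges):
    """Dense 0/1 adjacency matrix of an undirected graph on nodes 0..n-1."""
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = 1.0
        A[v][u] = 1.0
    return A


def laplacian(A):
    """Combinatorial Laplacian L = D - A, with D the diagonal degree matrix."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]


def pagerank(A, damping=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for a random walk with uniform teleportation.

    damping=0.85 is an assumed, commonly used value.
    """
    n = len(A)
    out_deg = [sum(row) for row in A]
    r = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [(1.0 - damping) / n] * n
        for i in range(n):
            if out_deg[i] == 0:
                # Dangling node: spread its score uniformly over all nodes.
                for j in range(n):
                    nxt[j] += damping * r[i] / n
            else:
                for j in range(n):
                    if A[i][j]:
                        nxt[j] += damping * r[i] * A[i][j] / out_deg[i]
        if sum(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r


# Toy star graph (hypothetical data): node 0 is linked to nodes 1, 2, 3.
edges = [(0, 1), (0, 2), (0, 3)]
A = adjacency_matrix(4, edges)
L = laplacian(A)
scores = pagerank(A)
```

In this toy graph every row of `L` sums to zero, and the hub (node 0) receives the largest score. Dense lists are used only to keep the sketch self-contained; graphs of the size discussed in the tutorial would need sparse representations.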

Furthermore, we provide a survey of the work on the epidemic threshold, node proximity, and center-piece subgraphs. State-of-the-art graph mining tools for analyzing time-evolving graphs will also be presented. Throughout the tutorial, patterns in static and time-evolving, weighted and unweighted real-world graphs will be presented.

The target audience is data mining professionals who wish to know the most important matrix algebra tools, their applications in large graph mining, and the theory behind them.

Prerequisites: Computer science background (B.Sc. or equivalent); familiarity with undergraduate linear algebra.

Demos will be presented.
Demo D07 - SHIFTR: A Fast and Scalable System for Ad Hoc Sensemaking of Large Graphs
June 30 7:30PM
We present SHIFTR, a system that assists users in making sense of large-scale graph data. Making sense of information represented as large graphs is a fundamental challenge in many data-intensive domains. We suggest the potential of strong synergies between the data mining, cognitive psychology, and HCI communities in matching powerful graph mining tools with insights into how people learn and interact with information, and here we present SHIFTR as one such application. SHIFTR adapts the Belief Propagation algorithm to target important sensemaking tasks such as flexibly reorganizing graph entities into multiple groups based on both positive and negative examples. SHIFTR scales linearly with the graph size through its fast algorithm, novel mList data structure, and externalization of graph metadata.

We demonstrate SHIFTR’s usage and benefits through real-world sensemaking scenarios using the DBLP dataset, which has almost 2 million author-publication relationships.
A demo video of SHIFTR can be downloaded at http://www.cs.cmu.edu/~dchau/shiftr/shiftr.mov.
BBM: Bayesian Browsing Model from Petabyte-scale Data
June 29 4:00PM
Given a quarter of a petabyte of click-log data, how can we estimate the relevance of each URL for a given query? In this paper, we propose the Bayesian Browsing Model (BBM), a new modeling technique with the following advantages:
(a) it does exact inference;
(b) it is single-pass and parallelizable;
(c) it is effective.

We present two sets of experiments to test model effectiveness and efficiency. On the first set of over 50 million search instances of 1.1 million distinct queries, BBM outperforms the state-of-the-art competitor by 29.2% in log-likelihood while being 57 times faster. On the second click-log set, spanning a quarter of a petabyte of data, we showcase the scalability of BBM: we implemented it on a commercial MapReduce cluster, and it took only 3 hours to compute the relevance for 1.15 billion distinct query-URL pairs.
SNARE: A Link Analytic System for Graph Labeling and Risk Detection
June 30 4:00PM
Classifying nodes in networks is a task with a wide range of applications. It can be particularly useful in anomaly and fraud detection. Many resources are invested in the task of fraud detection due to the high cost of fraud, and being able to automatically detect potential fraud quickly and precisely allows human investigators to work more efficiently. Many data analytic schemes have been put into use; however, schemes that bolster link analysis prove promising. This work builds upon the belief propagation algorithm for use in detecting collusion and other fraud schemes. We propose an algorithm called SNARE (Social Network Analysis for Risk Evaluation). By allowing one to use domain knowledge as well as link knowledge, the method was very successful at pinpointing misstated accounts in our sample of general ledger data, with a significant improvement over the default heuristic in true positive rates, and a lift factor of up to 6.5 (more than twice that of the default heuristic). We also apply SNARE to the general task of graph labeling on publicly available datasets. We show that with only some information about the nodes themselves in a network, we get surprisingly high accuracy of labels. Not only is SNARE applicable in a wide variety of domains, but it is also robust to the choice of parameters and scales linearly with the number of edges in a graph.
TANGENT: A Novel, 'Surprise-me' Recommendation Algorithm
June 29 12:00PM
Most recommender systems try to find items that are most relevant to the older choices of a given user. Here we focus on the "surprise me" query: A user may be bored with his/her usual genre of items (e.g., books, movies, hobbies), and may want a recommendation that is related, but off the beaten path, possibly leading to a new genre of books/movies/hobbies.

How would we define, as well as automate, this seemingly self-contradicting request? We introduce TANGENT, a novel recommendation algorithm to solve this problem. The main idea behind TANGENT is to envision the problem as node selection on a graph, giving high scores to nodes that are well connected to the older choices, and at the same time well connected to unrelated choices. The method is carefully designed to be (a) parameter-free, (b) effective, and (c) fast. We illustrate the benefits of TANGENT with experiments on both synthetic and real data sets. We show that TANGENT makes reasonable, yet surprising, horizon-broadening recommendations. Moreover, it is fast and scalable, since it can easily use existing fast algorithms on graph node proximity.
DynaMMo: Mining and Summarization of Coevolving Sequences with Missing Values
June 29 3:20PM
Given multiple time sequences with missing values, we propose DynaMMo which summarizes, compresses, and finds latent variables. The idea is to discover hidden variables and learn their dynamics, making our algorithm able to function even when there are missing values. We performed experiments on both real and synthetic datasets spanning several megabytes, including motion capture sequences and chlorine levels in drinking water.

We show that our proposed DynaMMo method
(a) can successfully learn the latent variables and their evolution;
(b) can provide high compression for little loss of reconstruction accuracy;
(c) can extract compact but powerful features for segmentation, interpretation, and forecasting;
(d) has complexity linear in the duration of sequences.
BGP-lens: Patterns and Anomalies in Internet Routing Updates
June 29 5:20PM
The Border Gateway Protocol (BGP) is one of the fundamental computer communication protocols. Monitoring and mining BGP update messages can directly reveal the health and stability of Internet routing. Here we make two contributions: first, we find patterns in BGP updates, like self-similarity, power-law, and lognormal marginals; second, using these patterns, we find anomalies. Specifically, we develop BGP-lens, an automated BGP updates analysis tool that has three desirable properties: (a) it is effective, able to identify phenomena that would otherwise go unnoticed, such as a peculiar 'clothesline' behavior or prolonged 'spikes' that last as long as 8 hours; (b) it is scalable, using algorithms that are all linear in the number of time-ticks; and (c) it is admin-friendly, giving useful leads for phenomena of interest.

We showcase the capabilities of BGP-lens by identifying surprising phenomena verified by sysadmins, over a massive trace of BGP updates spanning 2 years, from the publicly available site datapository.net.
Doulion: Counting triangles in massive graphs with a coin
June 29 5:05PM
Counting the number of triangles in a graph is a beautiful algorithmic problem which has gained importance in recent years due to its significant role in complex network analysis. Frequently computed metrics such as the clustering coefficient and the transitivity ratio involve the execution of a triangle counting algorithm. Furthermore, several interesting graph mining applications rely on computing the number of triangles in the graph of interest.

In this paper, we focus on the problem of counting triangles in a graph. We propose a practical method from which all triangle counting algorithms can potentially benefit. Using a straightforward triangle counting algorithm as a black box, we performed 166 experiments on real-world networks and on synthetic datasets, where we show that our method works with high accuracy, typically more than 99%, and gives significant speedups, with performance up to about 130 times faster.
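The abstract does not spell out the method, so the following Python sketch rests on an assumption suggested by the title: toss a coin for each edge, keep it with probability p, run the black-box counter on the smaller graph, and scale the result by 1/p^3, since a triangle survives only if all three of its edges do. The function names, the seed parameter, and the toy complete graphs are hypothetical.

```python
import random
from itertools import combinations


def count_triangles(edges):
    """Exact black-box counter; expects each undirected edge listed once."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # A triangle is found once from each of its three edges.
    total = sum(len(adj[u] & adj[v]) for u, v in edges)
    return total // 3


def sparsified_triangle_estimate(edges, p, seed=None):
    """Keep each edge with probability p, count, then rescale by 1 / p**3."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]
    return count_triangles(kept) / p ** 3


# Toy inputs (hypothetical): complete graphs on 5 and on 30 nodes.
k5 = list(combinations(range(5), 2))
exact_k5 = count_triangles(k5)
same_as_exact = sparsified_triangle_estimate(k5, p=1.0)

k30 = list(combinations(range(30), 2))
estimate_k30 = sparsified_triangle_estimate(k30, p=0.5, seed=0)
```

With p = 1.0 every edge is kept and the estimate equals the exact count. With smaller p the counter sees fewer edges, which is where a speedup would come from, at the price of a randomized estimate whose accuracy depends on the graph and on p.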
Large Human Communication Networks: Patterns and a Utility-Driven Generator
July 1 11:35AM
Given a real, weighted person-to-person network that changes over time, what can we say about the cliques that it contains? Do the incidents of communication, or the weights on the edges of a clique, follow any pattern? Real, in-person social networks have many more triangles than chance would dictate. As it turns out, there are many more cliques than one would expect, in surprising patterns.

In this paper, we study massive real-world social networks formed by direct contacts among people through various personal communication services, such as phone calls, SMS, and IM. The contributions are the following: (a) we discover surprising patterns in the cliques, (b) we report power laws in the weights on the edges of cliques, (c) our real networks follow these patterns closely enough that we can trust them to spot outliers, and finally (d) we propose the first utility-driven graph generator for weighted time-evolving networks, which matches the observed patterns. Our study focused on three large datasets, each from a different type of communication service, with over one million records and spanning several months of activity.