09.11.14 - 12.11.14, Seminar 14462

Systems and Algorithms for Large-scale Graph Analytics

Diese Seminarbeschreibung wurde vor dem Seminar auf unseren Webseiten veröffentlicht und bei der Einladung zum Seminar verwendet.

Motivation

Graph analytics, the class of data analysis that deals with data forming networks, is emerging as a huge consumer of computational resources due to its complex, data-hungry algorithms. Social networking, personal medicine, text/graphics content analysis and search engines are only a few examples where Tera-, Peta- or even Exa-scale graph processing is required. Processing massive graphs is also applicable in bioinformatics and many other areas. Graph algorithms are becoming increasingly important for solving multiple problems in diverse fields. As these problems grow in scale, parallel computing resources are required to meet the computational and memory requirements. Notably, the algorithms, software and hardware that have worked well for developing mainstream parallel scientific applications are not usually effective for massive-scale graphs from real world, which exhibits more complex and irregular structure than traditional data in scientific computing.

Research into large scale graph processing within the computer science community is currently at an early and fragmented stage, and can be categorized into the broad sub-areas such as applications/ algorithms, databases, computer systems, and computer architecture. The seminar will bring together researchers and engineers from industry.

The seminar will address a range of research questions below.

  1. What is the correct algorithmic abstraction for systems handling large graphs? Algorithmic complexity researchers use PRAM and I/O complexity models to characterize the algorithmic complexity of graph processing. On the other hand, systems researchers largely build systems that implement scatter-gather and label propagation models of computation. These are different world views rendering theory and practice incompatible with each other. We will begin work towards a formal algorithmic model for existing large scale graph processing systems as part of this seminar with a view to answering this question. This model should accurately describe large-scale graph processing systems built by the systems community as well as be formal enough to enable algorithmic complexity researchers to draw useful conclusions about their scalability with data set size. This agenda item will require close co-operation between theoretical computer scientists on the algorithms front to talk to practical systems researchers. Historically this has been a difficult area for interdisciplinary interaction but the diverse background of the seminar organizers and our careful choice of invitees will go some way in surmounting this difficulty.
  2. What is the taxonomy of applications that graph processing systems should support? Can it be reduced to a set of representative benchmarks that researchers in this area need to care about? We can currently identify two main interest areas. The first is large scale graph traversal, of interest to the high performance computing and web-services community; primarily driven by security applications and from data mining needs. The second is spectral approaches, primarily of interest to the machine learning community, building systems such as Graphlab. The output from this agenda item will be a clear set of well defined applications that the community can agree will serve as objects of study for building high performance graph processing systems.
  3. What is the correct interface to the system that may be assumed when building a graph DSL? DSL researchers are interested in productive and easy ways to specify graph computations. However they have given relatively little thought to interfacing in an efficient way to systems that execute graph computation. The agenda item therefore will be a discussion between programming language researchers and systems, algorithms and database people researchers about the correct level of interface between a DSL and the underlying systems. A good model in this regard is the decoupling of database systems from ways to query them using declarative languages like SQL. The litmus test for success for this agenda item will therefore be a sketch for a DSL that exposes opportunities for optimization, is productive to use and at the same time is oblivious to the underlying system.
  4. What is the design trade-off among different graph processing approaches. For example, (i) the general graph processing system vs. the dedicated approaches specially optimized for specific graph problems, (ii) the running time between pre-processing and graph processing, (iii) performance vs. running expense.