##### Dagstuhl Seminar 14462

### Systems and Algorithms for Large-scale Graph Analytics

##### ( Nov 09 – Nov 12, 2014 )

##### Permalink

##### Organizers

- Derek Murray (San Francisco, US)
- Amitabha Roy (EPFL - Lausanne, CH)
- Eiko Yoneki (University of Cambridge, GB)

##### Contact

- Susanne Bach-Bernhard (for administrative matters)

##### Schedule

Graph analytics, the class of data analysis that deals with data forming networks, is emerging as a huge consumer of computational resources due to its complex, data-hungry algorithms. Social networking, personal medicine, text/graphics content analysis and search engines are only a few examples where Tera-, Peta- or even Exa-scale graph processing is required. Processing massive graphs is also applicable in bioinformatics and many other areas. Graph algorithms are becoming increasingly important for solving multiple problems in diverse fields. As these problems grow in scale, parallel computing resources are required to meet the computational and memory requirements. Notably, the algorithms, software and hardware that have worked well for developing mainstream parallel scientific applications are not usually effective for massive-scale graphs from real world, which exhibits more complex and irregular structure than traditional data in scientific computing.

Research into large scale graph processing within the computer science community is currently at an early and fragmented stage, and can be categorized into the broad sub-areas such as applications/ algorithms, databases, computer systems, and computer architecture. The seminar will bring together researchers and engineers from industry.

The seminar will address a range of research questions below.

- What is the correct algorithmic abstraction for systems handling large graphs? Algorithmic complexity researchers use PRAM and I/O complexity models to characterize the algorithmic complexity of graph processing. On the other hand, systems researchers largely build systems that implement scatter-gather and label propagation models of computation. These are different world views rendering theory and practice incompatible with each other. We will begin work towards a formal algorithmic model for existing large scale graph processing systems as part of this seminar with a view to answering this question. This model should accurately describe large-scale graph processing systems built by the systems community as well as be formal enough to enable algorithmic complexity researchers to draw useful conclusions about their scalability with data set size. This agenda item will require close co-operation between theoretical computer scientists on the algorithms front to talk to practical systems researchers. Historically this has been a difficult area for interdisciplinary interaction but the diverse background of the seminar organizers and our careful choice of invitees will go some way in surmounting this difficulty.
- What is the taxonomy of applications that graph processing systems should support? Can it be reduced to a set of representative benchmarks that researchers in this area need to care about? We can currently identify two main interest areas. The first is large scale graph traversal, of interest to the high performance computing and web-services community; primarily driven by security applications and from data mining needs. The second is spectral approaches, primarily of interest to the machine learning community, building systems such as Graphlab. The output from this agenda item will be a clear set of well defined applications that the community can agree will serve as objects of study for building high performance graph processing systems.
- What is the correct interface to the system that may be assumed when building a graph DSL? DSL researchers are interested in productive and easy ways to specify graph computations. However they have given relatively little thought to interfacing in an efficient way to systems that execute graph computation. The agenda item therefore will be a discussion between programming language researchers and systems, algorithms and database people researchers about the correct level of interface between a DSL and the underlying systems. A good model in this regard is the decoupling of database systems from ways to query them using declarative languages like SQL. The litmus test for success for this agenda item will therefore be a sketch for a DSL that exposes opportunities for optimization, is productive to use and at the same time is oblivious to the underlying system.
- What is the design trade-off among different graph processing approaches. For example, (i) the general graph processing system vs. the dedicated approaches specially optimized for specific graph problems, (ii) the running time between pre-processing and graph processing, (iii) performance vs. running expense.

Graph analytics, the class of data analysis that deals with data forming networks, is emerging as a huge consumer of computational resources due to its complex, data-hungry algorithms. Social networking, personal medicine, bioinformatics, text/graphics content analysis and search engines are a few examples where Tera-, Peta- or even Exa-scale graph processing is required. Graph algorithms are becoming increasingly important for solving multiple problems in diverse fields. As these problems grow in scale, parallel computing resources are required to meet the computational and memory requirements. Notably, the algorithms, software and hardware that have worked well for developing mainstream parallel applications are not usually effective for massive-scale graphs from the real world, which exhibits more complex and irregular structures than traditional data in scientific computing.

Research into large scale graph processing within the computer science community is currently at an early and fragmented stage. This seminar brought together researchers from systems, computer architecture, algorithms and databases to discuss emerging trends and to identify opportunities for future advancement. Prior to the seminar, we had prepared a range of research questions below.

- What is the correct algorithmic abstraction for systems handling large graphs? Algorithmic complexity researchers use PRAM and I/O complexity models to characterize the algorithmic complexity of graph processing. On the other hand, systems researchers largely build systems that implement scatter-gather and label propagation models of computation. These are different world views rendering theory and practice incompatible with each other. We will begin work towards a formal algorithmic model for existing large scale graph processing systems as part of this seminar with a view to answering this question. This model should accurately describe large-scale graph processing systems built by the systems community as well as be formal enough to enable algorithmic complexity researchers to draw useful conclusions about their scalability with data set size. This will require close co-operation between theoretical computer scientists on the algorithms front to talk to practical systems researchers.
- What is the taxonomy of applications that graph processing systems should support? Can it be reduced to a set of representative benchmarks that researchers in this area need to care about? We can currently identify two main interest areas. The first is large scale graph traversal, of interest to the high performance computing and web-services community; primarily driven by security applications and from data mining needs. The second is spectral approaches, primarily of interest to the machine learning community, building systems such as Graphlab. The output from this agenda item will be a clear set of well defined applications that the community can agree will serve as objects of study for building high performance graph processing systems.
- What is the correct interface to the system that may be assumed when building a graph DSL? DSL researchers are interested in productive and easy ways to specify graph computations. However they have given relatively little thought to interfacing in an efficient way to systems that execute graph computation. The agenda item therefore will be a discussion between programming language researchers and systems, algorithms and database people researchers about the correct level of interface between a DSL and the underlying systems. A good model in this regard is the decoupling of database systems from ways to query them using declarative languages like SQL. The litmus test for success for this agenda item will therefore be a sketch for a DSL that exposes opportunities for optimization, is productive to use and at the same time is oblivious to the underlying system.
- What is the design trade-off among different graph processing approaches. For example, (i) the general graph processing system vs. the dedicated approaches specially optimized for specific graph problems, (ii) the running time between pre-processing and graph processing, (iii) performance vs. running expense.

The seminar identified whole graph analytics and point queries on graphs that explore neighborhoods of vertices as distinct application domains, which require separate treatment and systems. All the participants agreed that there was an urgent need to standardize benchmarks and datasets in order to make meaningful progress with graph processing - particularly given the diverse nature of the communities involved. In addition, the seminar identified a number of interesting approaches and trends. There was also considerable participation from industry, which included work in graph databases as well as new systems architectures that will require practitioners to rethink traditional approaches for graph processing.

The seminar consisted of 6 sessions on focused topic presentations and discussions, followed by a joint session with the seminar 14461 on "High-performance Graph Algorithms and applications computational Science". At the last day of the seminar, the whole morning was dedicated to the discussion on the challenges and future directions of large-scale graph processing (see Section Challenges and Future directions).

- Paolo Boldi (University of Milan, IT) [dblp]
- Luis Ceze (University of Washington - Seattle, US) [dblp]
- Hassan Chafi (Oracle Labs - Belmont, US) [dblp]
- Lei Chen (HKUST - Kowloon, HK) [dblp]
- Felix Cuadrado (Queen Mary University of London, GB) [dblp]
- Valentin Dalibard (University of Cambridge, GB) [dblp]
- Khuzaima Daudjee (University of Waterloo, CA) [dblp]
- Šarunas Girdzijauskas (KTH Royal Institute of Technology, SE) [dblp]
- Torsten Grust (Universität Tübingen, DE) [dblp]
- Sungpack Hong (Oracle Labs - Belmont, US) [dblp]
- Kimberly Keeton (HP Labs - Palo Alto, US) [dblp]
- Terence P. Kelly (HP Labs - Palo Alto, US) [dblp]
- Arijit Khan (ETH Zürich, CH) [dblp]
- Wolfgang Lehner (TU Dresden, DE) [dblp]
- Andrew Lenharth (University of Texas - Austin, US) [dblp]
- Karthik Nilakant (University of Cambridge, GB) [dblp]
- Mark H. Oskin (University of Washington - Seattle, US) [dblp]
- John Owens (University of California - Davis, US) [dblp]
- M. Tamer Özsu (University of Waterloo, CA) [dblp]
- Roger Pearce (LLNL - Livermore, US) [dblp]
- Ella Peltonen (University of Helsinki, FI) [dblp]
- Stefan Plantikow (Neo Technology Germany, DE) [dblp]
- Amitabha Roy (EPFL - Lausanne, CH) [dblp]
- Sebastian Schelter (TU Berlin, DE) [dblp]
- Alexander Ulrich (Universität Tübingen, DE) [dblp]
- Eiko Yoneki (University of Cambridge, GB) [dblp]
- Willy Zwaenepoel (EPFL - Lausanne, CH) [dblp]

##### Classification

- data structures / algorithms / complexity
- optimization / scheduling
- software engineering

##### Keywords

- Large-scale graph processing
- Graph structured data
- Graph algorithms
- Graph traversal
- Machine Learning
- Databases
- Computer systems
- Computer architecture
- Parallel I/O
- Parallel programming
- Storage