Dagstuhl-Seminar 24472: Regular Expressions: Matching and Indexing

Dagstuhl-Seminar 24472

Regular Expressions: Matching and Indexing

( 17. Nov – 22. Nov, 2024 )

(zum Vergrößern in der Bildmitte klicken)

Permalink

Bitte benutzen Sie folgende Kurz-Url zum Verlinken dieser Seite: https://www.dagstuhl.de/24472

Organisatoren

Inge Li Gørtz (Technical University of Denmark - Lyngby, DK)
Sebastian Maneth (Universität Bremen, DE)
Gonzalo Navarro (University of Chile - Santiago de Chile, CL)
Nicola Prezza (University of Venice, IT)

Kontakt

Marsha Kleinbauer (für wissenschaftliche Fragen)
Jutka Gasiorowski (für administrative Fragen)

Gemeinsame Dokumente

Dagstuhl Materials Page (Use personal credentials as created in DOOR to log in)

Summary

Show Summary

The Dagstuhl Seminar ``Regular Expressions: Matching and Indexing'' (24472) took place from November 17th to 22nd, 2024. The goal of this seminar was to bring together researchers from various research directions dealing with algorithmic aspects of regular expressions and finite automata. Regular expressions and finite automata lie at the foundations of Computer Science and have been used since the sixties in basic problems like compiler design. The key algorithmic challenge is regular expression matching, that is, efficiently identifying words of a regular language within a sequence or within the paths of a labeled graph. Over the years, there have been numerous algorithmic advances around the topic, while at the same time their applications have spread over too many different areas like information retrieval, databases, bioinformatics, security, and others, which not only make use of standard results but also pose new and challenging variants of the regular expression matching problem. The use of regular expressions has made its way even into current standards like SQL:2016 and SPARQL.

The seminar was meant to bring together expertise from communities involved with regular expressions that, despite working on closely-related topics, do not frequently meet: graph databases, compressed data structures, and streaming algorithms. The following specific points were addressed.

String Indexing for Complex Patterns.: The classic and well-studied string indexing problem is to preprocess a string such that subsequent pattern-matching queries (locate all occurrences of the query string within the indexed strings) can be supported efficiently. The goal is to achieve a fast query time in terms of the length of the query string while keeping the memory needed for the index small. Several string indexes that support “complex” pattern matching queries have recently appeared. Examples include indexes that support pattern matching with wildcards, gaps, elastic degenerate, and approximate string matching. Such complex patterns may be viewed as special restricted cases of regular expressions. Thus, studying these in the context of regular expressions may lead to a more unified understanding of these and new research directions.
Advanced Regular Expression Operators.: Nearly all existing methods for regular expression matching consider regular expressions constructed using the concatenation, union, and Kleene star operators. In this setting, the complexity of the problem is well understood, and upper bounds with near-matching conditional lower bounds are known. However, more “advanced” operators that allow increased expressiveness are widely used in practice. These include backreferences (matching a previously matched subexpression), intervals (matching a subexpression a given number of times), character classes (matching a character to a set of characters), variable length gaps (matching any string of a length within a specified interval). In contrast to the classic setting, regular expression matching with these operators is much less well-understood. While some of these are known to be much harder from a complexity-theoretic viewpoint (i.e., regular expression matching with backreferences is, in general, NP-hard), recent developments have led to new methods handling many practically essential variants of the problem. Examples include deterministic regular expressions, pattern matching with variables, pattern matching with variable length gaps, and elastic degenerate strings. This enables a new study area that is theoretically challenging and practically relevant.
Indexing Automata and Regular Expressions.: The counterpart of the “string indexing for complex patterns” problem is to index/pre-process a regular expression to speed up tasks such as deciding membership of a string in its accepted language. An equivalent formulation of the problem, which is lately receiving much attention in the research community, is to index finite-state automata for path queries: given a string at query time, decide whether the string occurs in some walk on the automaton (equivalently, it is a substring of some string in the automaton’s accepted language). While recent research has shown quadratic (size of the regular expression/automaton multiplied by the length of the query string) conditional lower bounds for the problem, particular classes of regular expressions and automata do admit efficient solutions: deterministic regular expressions, elastic degenerate strings, and Wheeler automata are examples of this kind. Even more interestingly, other lines of research showed that the problem can be parameterized, thus classifying all regular expressions and automata according to their propensity to support membership/path queries: examples of such parameters include the number of strings in the regular expression (equivalently, the number of union symbols and Kleene stars), the co-lexicographic width of an automaton, and the density (amount of nondeterminism) of a regular expression. These directions opened an exciting new research area of both theoretical and practical interest with applications including bioinformatics (with the problem of indexing large pan-genomic graphs for path queries) and graph databases.
Graph Databases and Regular Path Queries.: Another flavor of regular expression matching arises in the context of graph databases, where the data consists of a labeled directed graph and queries involve different forms of subgraph pattern matching. The SPARQL standard, as well as several other alternative graph query languages, support in particular “regular path queries (RPQs)”, which are essentially regular expressions that are to be matched against the sequence of labels of graph paths. The basic strategy is to traverse the product graph between the database graph and the automaton of the regular expression. This leads to a quadratic time complexity, but other strategies based on graph traversals, pre-indexing the graph, or multiplication of sparse matrices, perform better in many practical cases. The most basic query aims at returning the endpoints of the paths found, but others ask for all the paths, the shortest paths, and so on. There has also been work on lower bounds on this problem, particularly relating it to the well-known AGM bounds that hold for matching a given subgraph shape. Strategies developed for regular expression matching on strings can be leveraged to solve RPQs, but this requires a cross-fertilization between the stringology and the database community in order to transfer the knowledge: most solutions implemented in actual graph database systems are just heuristics that have been shown to be clearly inferior to solutions grounded on more solid algorithmic foundations.
Regular Expression Matching on Streams.: Another relevant research area, motivated by big data applications, is that of regular expression matching on data streams: pre-process a regular expression R so that, later, we can identify all suffixes S[1, i] of an incoming stream S such that S[j, i] matches R for some j ≤ i, using space sub-linear in |S| and |R|. This problem generalizes a seminal work by Porat and Porat (2009) on exact pattern matching and on the k-mismatches problem on streams. A subsequent prolific line of works improved the original bounds and extended these results to edit distance, to the dictionary problem, and to wildcard matching. Recent research has shown that the problem admits solutions for arbitrary regular expressions using space polynomial in the logarithm of the stream’s length and in the number of union symbols and Kleene stars in the regular expression. This shows that parameterized approaches (as the ones for the problem of indexing automata and regular expressions) are indeed useful also in the streaming setting, and therefore strongly motivates an exchange of ideas between the two sub-fields.

The seminar fully satisfied our expectations. The 26 participants from 14 countries (Chile, Denmark, Finland, France, Germany, Great Britain, Israel, Italy, Japan, Netherlands, Poland, South Africa, Spain, and US) gave invited survey talks covering the state of the art in scientific fields related with the topics of the seminar. The talks presented works related with algorithms for regular expression matching (covering classic algorithms, extensions of regular expressions, indexing, streaming, and compressed data structures), regular expressions in graph databases, and practical tools used at the industrial scale.

We are really thankful to Schloss Dagstuhl for providing an extremely inspiring and professional environment. Scientific talks were interleaved with coffee breaks and cheese tastings which fostered engagements of the participants, and the evening ``sessions'' in the sauna, wine cellar, and music room offered a relaxed environment to continue chatting about research in a less formal environment.

Creative Commons BY 4.0

Inge Li Gørtz, Sebastian Maneth, Gonzalo Navarro, and Nicola Prezza

Motivation

Show Motivation

Regular expressions and finite automata lie at the foundations of Computer Science and have been used since the sixties in basic problems like compiler design. The key algorithmic challenge is regular expression matching, that is, efficiently identifying words of a regular language within a sequence.

Over the years, there have been numerous algorithmic advances around the topic, while at the same time, their applications have spread over too many different areas like information retrieval, databases, bioinformatics, security, and others, which not only make use of standard results but also pose new and challenging variants of the regular expression matching problem. The use of regular expressions has made its way even into current standards like SQL:2016 and SPARQL. Bringing together researchers from core stringology and relevant application areas will benefit both sides, giving the opportunity to exchange novel problems and solutions of theoretical and practical nature.

The seminar aims to bring together researchers from various research directions within algorithmic aspects of regular expressions and finite automata. Furthermore, the seminar will inspire the exchange of theoretical and practical results. Our aims are to identify practically relevant restrictions and extensions of regular expression matching, as well as variants that work on graphs rather than sequences, and propose matching and indexing algorithms to handle those, together with related impossibility results.

Creative Commons BY 4.0

Inge Li Gørtz, Sebastian Maneth, Gonzalo Navarro, and Nicola Prezza

Teilnehmer

Zeige Teilnehmer

Please log in to DOOR to see more details.

Antoine Amarilli (INRIA Lille, FR) [dblp]
Hideo Bannai (Institute of Science Tokyo, JP) [dblp]
Ruben Becker (University of Venice, IT) [dblp]
Giulia Bernardini (University of Trieste, IT) [dblp]
Philip Bille (Technical University of Denmark - Lyngby, DK) [dblp]
Manuel Cáceres (Aalto University, FI) [dblp]
Davide Cenzato (University of Venice, IT) [dblp]
James Davis (Purdue University - West Lafayette, US) [dblp]
Dominik D. Freydenberger (Loughborough University, GB) [dblp]
Pawel Gawrychowski (University of Wroclaw, PL) [dblp]
Adrián Gómez Brandón (University of Coruña, ES) [dblp]
Inge Li Gørtz (Technical University of Denmark - Lyngby, DK) [dblp]
Roberto Grossi (University of Pisa, IT) [dblp]
Moshe Lewenstein (Bar-Ilan University - Ramat Gan, IL) [dblp]
Konstantinos Mamouras (Rice University - Houston, US) [dblp]
Sebastian Maneth (Universität Bremen, DE) [dblp]
Wim Martens (Universität Bayreuth, DE) [dblp]
Yasuhiko Minamide (Institute of Science Tokyo, JP) [dblp]
Gonzalo Navarro (University of Chile - Santiago de Chile, CL) [dblp]
Nicola Prezza (University of Venice, IT) [dblp]
Cristian Riveros (PUC - Santiago de Chile, CL) [dblp]
Markus L. Schmid (HU Berlin, DE) [dblp]
Teresa Steiner (Technical University of Denmark - Lyngby, DK) [dblp]
Michelle Sweering (CWI - Amsterdam, NL) [dblp]
Simon Rumle Tarnow (Technical University of Denmark - Lyngby, DK) [dblp]
Brink van der Merwe (University of Stellenbosch, ZA) [dblp]

Klassifikation

Data Structures and Algorithms

Schlagworte

finite automata
regular expressions
complex patterns
text indexing
graph matching and indexing

Seminar 24472

Suche auf der Schloss Dagstuhl Webseite

Schloss Dagstuhl Services

Seminare

Innerhalb dieser Seite:

Externe Seiten:

Publishing

Innerhalb dieser Seite:

Externe Seiten:

dblp

Innerhalb dieser Seite:

Externe Seiten:

Dagstuhl-Seminar 24472

Regular Expressions: Matching and Indexing

( 17. Nov – 22. Nov, 2024 )

Permalink

Organisatoren

Kontakt

Gemeinsame Dokumente

Publikationen

Impacts

Programm

Summary

Motivation

Teilnehmer

Klassifikation

Schlagworte