Dagstuhl Seminar 25182

Challenges and Opportunities of Table Representation Learning

(Apr 27 – May 02, 2025)

Permalink
Please use the following short url to reference this page: https://www.dagstuhl.de/25182

Summary

The Dagstuhl Seminar 25182, held from April 27 to May 2, 2025, brought together researchers from machine learning, natural language processing, information retrieval, and databases to discuss the challenges and vision for Table Representation Learning (TRL). As structured data continues to grow in volume and importance, TRL aims to build representations that enable downstream tasks such as prediction, question answering, and data preparation. The seminar served as a forum to share long-term visions, highlight challenges, and discuss research directions that bridge these diverse communities.

We opened the seminar with a series of opinionated tutorials that laid out the research landscape. Carsten Binnig, taking a database systems perspective, argued that TRL could help eliminate what he termed the “data tax,” “query tax,” and “tuning tax” that currently burden users of relational databases. By automating tasks such as query authoring, data cleaning, and performance optimization, TRL could significantly reduce entry barriers to database use. Julian Eisenschlos followed with an NLP-centered overview of table question answering. He reviewed benchmarks and encoding strategies, emphasizing the trade-offs between interpretability and computational cost. He highlighted how pre-training tasks and generation-based understanding link table reasoning with broader modalities such as charts and infographics, while also pointing to the challenges of training models in a post-LLM landscape. Madelon Hulsebos shifted the focus to table semantics, stressing that much of TRL happens “before generating insights.” She argued that understanding table and column relationships is foundational for data preparation, search and retrieval, and predictive modeling. Her talk questioned whether billion-parameter general-purpose models are necessary for tabular tasks, or whether modular, specialized systems might be more effective. Finally, Frank Hutter presented recent advances in deep learning for tabular prediction. After reviewing early methods, he discussed TabPFN and its extensions, which aim to overcome previous limitations. His talk underscored TRL’s growing ability to rival or surpass traditional tabular learning methods while also pointing out open challenges in scaling, context integration, and generalization.

The opinionated tutorials were complemented by shorter impulse talks that addressed specific aspects of TRL. Paolo Papotti reviewed transformer-based adaptations for tabular data, covering innovations in inputs, internals, outputs, and pretraining. Gaël Varoquaux analyzed architectural challenges in building table foundation models, particularly around heterogeneity and invariances. Michael Cochez argued for tighter links between graph learning and TRL to handle complex relations and incomplete data. Xue Effy Li emphasized that meaningful table representation requires integrating contextual information such as metadata, documentation, and world knowledge. Gerardo Vitagliano demonstrated the promise of multimodal pipelines that integrate genomic, imaging, textual, and tabular data for scientific discovery, while exposing current limitations. Andreas Müller shared a vision for deep integration of human feedback, LLMs, databases, and table foundation models into agentic systems. Shuaichen Chang discussed the challenges of deploying Text-to-SQL in real-world settings, where ambiguity, noise, and mixed data require robust solutions. Finally, Fatma Özcan showed how long-context reasoning and multi-agent systems can improve Text-to-SQL robustness.
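A recurring technical thread in these talks is how a table is linearized into a token sequence before a transformer or LLM can process it. The following is a minimal sketch of one common serialization scheme (markdown-style: header row, separator, data rows); the function name and format choice are illustrative, and production systems additionally handle row sampling, truncation to a context budget, and schema metadata.

```python
def serialize_table(columns, rows):
    """Linearize a small table into a markdown-style string.

    This is only one of several serialization formats used when
    prompting LLMs over tabular data; alternatives include CSV,
    JSON records, and key-value templates per row.
    """
    header = " | ".join(columns)
    separator = " | ".join("---" for _ in columns)
    body = "\n".join(" | ".join(str(v) for v in row) for row in rows)
    return f"{header}\n{separator}\n{body}"


# Example: a tiny two-column table rendered for a prompt.
prompt_table = serialize_table(
    ["city", "population"],
    [["Amsterdam", 921402], ["Darmstadt", 159207]],
)
print(prompt_table)
```

The choice of serialization matters in practice: format, column order, and included metadata all measurably affect downstream question-answering and Text-to-SQL accuracy, which is one reason the talks treated table encoding as a research question rather than a preprocessing detail.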

After the talks, we divided into working groups to explore research problems in more depth. The Multi-Modal Data Analysis group examined how to query and process data spanning text, images, genomics, and audio, emphasizing the need for new operators, adaptive indexing, and cost-aware query planning. The Predictive ML and Context group discussed how to integrate metadata, external knowledge, and domain expertise into statistical tabular prediction, proposing hybrid architectures that combine foundation models with agentic systems. The Conversational Analytics group envisioned natural language interfaces that go beyond text-to-SQL by supporting explanations, causal reasoning, and iterative dialogue with human oversight. The Architectures for Table Foundation Models group debated whether a universal foundation model for tables is achievable, weighing trade-offs between adaptability, semantic grounding, and efficiency.

Reflecting on the seminar’s discussions, we identified a few cross-cutting themes that seem promising for future research. First, context, whether in the form of metadata, domain knowledge, or multimodal signals, was consistently identified as crucial for making TRL robust and useful. Second, the limitations of current benchmarks remain a bottleneck, as they fail to capture the complexity of real-world tabular reasoning. Finally, participants questioned the pursuit of monolithic “one-size-fits-all” tabular models, favoring instead modular or hybrid systems that can flexibly combine database principles, machine learning, and human expertise. We concluded the seminar with a shared recognition that while TRL has achieved significant progress, its long-term promise lies in deeply integrating methods across disciplines to build more adaptive, interpretable, and context-aware tabular intelligence.

Copyright Carsten Binnig, Julian Martin Eisenschlos, Madelon Hulsebos, and Frank Hutter

Motivation

The increasing amount of data being collected, stored, and analyzed induces a need for efficient, scalable, and robust methods to handle this data. Representation learning, i.e., the practice of leveraging neural networks to obtain generic representations of data objects, has been shown effective for various applications over data modalities such as images and text. More recently, representation learning has shown impressive initial capabilities on structured data (e.g., relational tables in databases) for a limited set of tasks in data management and analysis, such as data cleaning, insight retrieval, and data analytics. Most applications traditionally relied on heuristics and statistics, which are limited in robustness, scale, and accuracy. The ability to learn abstract representations across tables unlocked new opportunities, such as pretrained models for data augmentation and machine learning, that address these limitations. This emerging research area, which we refer to as Table Representation Learning (TRL), receives increasing interest from industry as well as academia, in particular in the communities of data management, machine learning, and natural language processing.

This growing interest is a result of the high potential impact of TRL in industry, given the abundance of tables in the organizational data landscape, the large range of high-value applications relying on tables, and the early state of TRL research so far. Recently, specialized TRL models for embedding (relational) tables, as well as prompting methods for LLMs over structured data residing in databases, have been developed and shown effective for various tasks, e.g., data preparation, machine learning, and question answering. However, studies have revealed shortcomings of existing models regarding their ability to capture the structure of tables, the relationships among tables, the heterogeneity (e.g., numbers, dates, text), biases, and semantics of the data contents, as well as limited generalization to new domains, unaddressed privacy constraints, etc. These challenges are merely the first limitations surfaced so far, and we expect to identify more limitations of existing approaches through discussions, talks, and hands-on sessions at the TRL Dagstuhl Seminar.

As we stand at the starting point of developing and adopting high-capacity neural models (e.g. through representation or generative learning) for structured data, there is a wide range of applications that have not been addressed yet. For example, pretrained models for tabular machine learning have been explored only to a limited extent, whereas “upstream” data management applications, such as automated data validation and query and schema optimization, have not been explored so far. Therefore, another objective of this Dagstuhl Seminar is to identify novel application areas, build first prototypes to assess the potential, and develop research agendas towards further exploration of these applications. Moreover, beyond these unexplored applications, we aim to develop a manifesto that brings forward a common long-term vision for TRL with moon-shot ideas and the road to get there, which requires perspectives from experts in academia and industry.

Copyright Carsten Binnig, Julian Martin Eisenschlos, Madelon Hulsebos, and Frank Hutter

Participants


  • Carsten Binnig (TU Darmstadt, DE) [dblp]
  • Vadim Borisov (tabularis.ai - Tübingen, DE) [dblp]
  • Shuaichen Chang (Amazon Web Services - New York, US) [dblp]
  • Michael Cochez (VU Amsterdam, NL) [dblp]
  • Tianji Cong (University of Michigan - Ann Arbor, US) [dblp]
  • Katharina Eggensperger (Universität Tübingen, DE) [dblp]
  • Julian Martin Eisenschlos (Google Research - Zürich, CH) [dblp]
  • Floris Geerts (University of Antwerp, BE) [dblp]
  • Filip Gralinski (Snowflake - Warsaw, PL)
  • Madelon Hulsebos (CWI - Amsterdam, NL) [dblp]
  • Frank Hutter (Prior Labs - Freiburg, DE & ELLIS Institute Tübingen, DE & Universität Freiburg, DE) [dblp]
  • Myung Jun Kim (INRIA Saclay - Île-de-France, FR) [dblp]
  • Andreas Kipf (TU Nürnberg, DE) [dblp]
  • Tassilo Klein (SAP SE - Walldorf, DE) [dblp]
  • Xue Li (CWI - Amsterdam, NL)
  • Andreas Müller (Microsoft Corp. - Mountain View, US) [dblp]
  • Olga Ovcharenko (TU Berlin, DE)
  • Fatma Özcan (Google - San Jose, US) [dblp]
  • Paolo Papotti (EURECOM - Biot, FR) [dblp]
  • Lennart Purucker (Universität Freiburg, DE) [dblp]
  • Anupam Sanghi (TU Darmstadt, DE) [dblp]
  • Sebastian Schelter (TU Berlin, DE) [dblp]
  • Shivam Sharma (TU Darmstadt, DE)
  • Immanuel Trummer (Cornell University - Ithaca, US) [dblp]
  • Gaël Varoquaux (INRIA Saclay - Île-de-France, FR) [dblp]
  • Gerardo Vitagliano (MIT - Cambridge, US) [dblp]
  • Liane Vogel (TU Darmstadt, DE) [dblp]

Classification
  • Artificial Intelligence
  • Computation and Language
  • Databases

Keywords
  • Representation and Generative Learning for Data Management and Analysis
  • Applications of Table Representation Learning
  • Benchmarks and Datasets for Table Representation Learning
  • Pre-trained (Language) Models for Tables and Databases