The Database Landscape Is Shifting

Q1 2026 · The Sciencer Company


Part 2 of State of the Data Infrastructure 2026


Every five years since the late 1980s, a group of the world's most prominent database researchers convenes to assess the state of their field and chart its future direction. In October 2023, 46 researchers — including Michael Stonebraker, Matei Zaharia, Samuel Madden, Daniel Abadi, Andrew Pavlo, Tim Kraska, Anastasia Ailamaki, and many of the people who built the systems we use today — gathered in Cambridge, Massachusetts. The resulting Cambridge Report on Database Research, published on arXiv in April 2025, is the most authoritative forward-looking document the database community produces.¹

Reading it carefully reveals something remarkable: the database research community has already moved past the BI Assumption. Every major research direction they identify points toward AI-native infrastructure, autonomous operations, and workloads that look nothing like dashboard queries.

Understanding where the research community is heading matters for infrastructure builders because academic database research has an unusually strong track record of becoming production reality. PostgreSQL emerged from the POSTGRES research project at Berkeley. Apache Spark emerged from AMPLab. DuckDB applies decades of query processing research to embedded analytics. Snowflake's architecture draws on shared-disk distributed database research. The people writing this report have collectively built or influenced most of the data systems in production today.


The stable core and the Cambrian explosion

The Cambridge Report's diagnosis of the current landscape aligns with a16z's observation: core data processing systems have stabilized while supporting tools have proliferated. But the report goes further in explaining why.

The stabilization of the core is not because the architectural problems are solved. It's because the core was designed for a mature, well-understood workload: SQL analytics over structured data. Cloud-native architectures with disaggregated storage and compute — the pattern behind Snowflake, BigQuery, Databricks, and Redshift — have become the consensus approach for this workload. The report notes that the industry has "widely adopted the concept of disaggregated storage and compute, enabling a high degree of scalability and flexibility."

The Cambrian explosion at the edges — data observability, data catalogs, feature stores, vector databases, ML platforms, agent frameworks — represents the market's attempt to bolt AI capabilities onto BI infrastructure. Each new tool category fills a gap that the core wasn't designed to address. Monte Carlo fills the data observability gap. Pinecone fills the vector storage gap. Tecton fills the feature serving gap. LangChain fills the agent orchestration gap.

But the Cambridge Report identifies a deeper structural shift underway that goes beyond tool proliferation.


Five signals that the database landscape is moving toward AI-native infrastructure

Signal 1: Composable database building blocks

The report highlights the emergence of "composable building blocks" — reusable, high-performance components that can be assembled into purpose-built data systems. Projects like Velox (Meta's execution engine), DataFusion (Apache's query processing framework), and Apache Calcite (query planning) are creating standardized interfaces for database operations.

The researchers discuss whether databases should aim for "POSIX-style specs for the interfaces between components (i.e., optimizer, storage manager, etc)" — standardized, interchangeable modules that allow mixing and matching components from different systems.

This is profoundly important for AI-native infrastructure. If database components become composable, it becomes feasible to build a system that uses a columnar engine for aggregation queries, a vector engine for similarity search, a document engine for unstructured retrieval, and a graph engine for metadata traversal — all within a single query plan, governed by a single optimizer. This is exactly what AI agents need, and it's exactly what the monolithic architectures of Snowflake and Databricks cannot provide without fundamental restructuring.
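To make the composability idea concrete, here is a minimal sketch of a planner routing nodes of a single plan to interchangeable engines behind one narrow interface. Everything here is hypothetical illustration — the engine names, the `PlanNode` shape, and the registry are inventions, not the APIs of Velox, DataFusion, or Calcite.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PlanNode:
    kind: str        # e.g. "aggregate", "vector_search", "graph_traverse"
    payload: dict

# Hypothetical engine registry: each engine implements the same narrow
# interface, so one planner can mix engines within a single logical plan.
ENGINES: dict[str, Callable[[PlanNode], list]] = {}

def register(kind):
    def deco(fn):
        ENGINES[kind] = fn
        return fn
    return deco

@register("aggregate")
def columnar_aggregate(node):
    # stand-in for a columnar execution engine
    return [sum(node.payload["rows"])]

@register("vector_search")
def vector_topk(node):
    # stand-in for a vector engine: nearest neighbour by absolute distance
    q = node.payload["query"]
    return sorted(node.payload["vectors"], key=lambda v: abs(v - q))[:1]

def execute(plan: list[PlanNode]) -> list:
    # A single "optimizer" owns the whole plan and dispatches each node
    # to whichever engine claims its operation kind.
    return [ENGINES[node.kind](node) for node in plan]
```

The design point is the interface boundary, not the toy engines: once every component speaks the same contract, swapping a columnar engine for a vector engine is a routing decision inside one plan rather than a cross-system integration project.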

The report even considers the radical possibility that agent-based designs could apply to database system internals: "different components of complex systems interact through natural language" — where "a query optimizer might receive an NL description of a query, and translate it into an NL operator plan which an executor agent consumes." The authors are appropriately skeptical, questioning "if NL is useful for its expressivity and generality or merely a source of ambiguity when specifying system interfaces." But the fact that this is being seriously discussed by the people who build query optimizers signals how fundamentally the field's orientation is shifting.

Signal 2: Federation as a first-class concern

The Cambridge Report identifies data federation as an increasingly important research direction. As data platforms become more federated — with data spread across multiple engines, formats, and locations — "new architectural paradigms will emerge" for cross-engine query processing.

The researchers specifically note that federation is "increasingly important in regulated environments where data residency laws require data not to leave specific jurisdictions, such as the EU." Designing systems for this distributed form of execution "involves tackling challenges such as data heterogeneity and cross-border data transmission while ensuring data integrity and security protocols meet regulatory standards."

This is directly relevant to the sovereignty dimension of AI-native infrastructure (which we'll explore in Part 7). But it also speaks to a core architectural requirement for AI agents: the ability to query across heterogeneous data sources — operational databases, data warehouses, document stores, vector databases, external APIs — within a single logical operation, without manually orchestrating the data movement.

Current federation approaches use predicate pushdown to minimize data movement — pushing filter operations to remote data sources before transferring results. But AI agent workloads require a richer federation model: pushing AI functions (embedding generation, entity extraction, classification) to where the data lives, aggregating results across heterogeneous sources, and maintaining governance and lineage across the federated boundary. No current system does this well.
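The richer federation model described above can be sketched as follows. This is a hypothetical illustration, not any existing system's API: `Source`, `toy_embed`, and `federated_query` are invented names, and the "embedding" is a stand-in for a real model running source-side.

```python
def toy_embed(text: str) -> list[float]:
    # stand-in for an embedding model executing *at the data source*
    return [float(len(text)), float(text.count(" "))]

class Source:
    """A remote data source that accepts pushed-down work."""
    def __init__(self, name, rows):
        self.name, self.rows = name, rows

    def run(self, predicate, ai_fn):
        # Both the predicate AND the AI function execute source-side;
        # only small derived results cross the federation boundary.
        return [(self.name, r, ai_fn(r)) for r in self.rows if predicate(r)]

def federated_query(sources, predicate, ai_fn):
    # In practice this would run in parallel, per jurisdiction, with
    # lineage recorded for each source's contribution.
    results = []
    for src in sources:
        results.extend(src.run(predicate, ai_fn))
    return results
```

The contrast with classic predicate pushdown is that the function shipped to the source is no longer just a filter: it can be arbitrary per-row AI computation, so raw data (which may be jurisdiction-bound) never leaves its home system.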

Signal 3: Automated infrastructure management

The report devotes significant attention to automated infrastructure management — what they term "control plane management, i.e., self-healing, auto-provisioning, and always-up features." This is the database research community's way of saying that manual infrastructure configuration is untenable.

Two specific research directions stand out:

Database virtualization: "a single database front-end automatically provisions and routes queries to the best infrastructure." This is precisely the autonomous infrastructure provisioning described in our proposal — where the system decides how to execute a query (which engine, which compute tier, which storage format) based on the query's characteristics, not on manual configuration.

Declarative infrastructure as a service: "declarative interfaces are used for specifying more than queries but also the infrastructure upon which systems run, with search and optimization systems that attempt to allocate infrastructure in the most cost-effective way." This extends the declarative paradigm from queries (SQL: what data, not how to get it) to infrastructure (declare the workload requirements, let the system determine the optimal infrastructure allocation).
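A minimal sketch of the declarative-infrastructure idea, under invented assumptions: the user declares workload requirements, and a search procedure (here, trivially exhaustive) picks the cheapest tier that satisfies them. The tier catalog, prices, and class names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class WorkloadSpec:
    # What the user declares: requirements, not machine choices.
    max_latency_ms: int
    min_throughput_qps: int

@dataclass
class Tier:
    name: str
    latency_ms: int
    throughput_qps: int
    dollars_per_hour: float

# Hypothetical infrastructure catalog the optimizer searches over.
CATALOG = [
    Tier("small", 400, 50, 0.20),
    Tier("medium", 120, 500, 1.10),
    Tier("large", 40, 5000, 6.50),
]

def allocate(spec: WorkloadSpec) -> Tier:
    # "Search and optimization": keep tiers meeting the declared SLOs,
    # then minimize cost.
    feasible = [t for t in CATALOG
                if t.latency_ms <= spec.max_latency_ms
                and t.throughput_qps >= spec.min_throughput_qps]
    if not feasible:
        raise ValueError("no tier satisfies the declared requirements")
    return min(feasible, key=lambda t: t.dollars_per_hour)
```

The shape mirrors SQL's bargain: the spec says what the workload needs, and the allocator, like a query optimizer, is free to change its answer as the catalog or prices change.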

Both of these directions — database virtualization and declarative infrastructure — are research formulations of what we call Autonomous DataOps. The database research community is converging on the same destination from theoretical first principles that the infrastructure market is approaching from practical necessity.

Signal 4: Green computing as a design constraint

The Cambridge Report explicitly calls out the environmental impact of data systems as "an under-addressed issue" and states that "incorporating sustainability as a first-class design principle is an important challenge for the future." This includes both the energy cost of computation and the embodied carbon cost of hardware, particularly data storage and RAM.

This is not a marginal concern. The report positions green computing alongside core data systems research, hardware innovation, and AI integration as one of the field's primary challenges. Coming from a community that historically focused on performance above all else, this signals a genuine shift in research priorities.

The connection to autonomous infrastructure is direct: the manual, over-provisioned, redundantly-copying data stack described in Part 1 is profoundly wasteful. Intelligent infrastructure that right-sizes compute, eliminates redundant processing, and operates on incremental changes rather than full-table scans would dramatically reduce energy consumption — not as a secondary benefit, but as a natural consequence of better engineering. We'll explore this in depth in Part 7.

Signal 5: LLMs and unstructured data as core database concerns

Perhaps the most striking signal in the Cambridge Report is how centrally LLMs and unstructured data figure in the research agenda. The report states unambiguously: "It is no longer science fiction to imagine 'querying' a corpus of thousands of documents or images to identify entities, trends, numbers, and figures. We can apply declarative database interface and principles to these new domains."

The specific research challenges they identify read like a requirements document for AI-native data infrastructure:

  • Efficient computational stacks for AI: "combining traditional techniques — e.g., data partitioning, caching, and embedding indexes — to handle the high cost of inference at scale"
  • Hallucination reduction: "databases and provenance tools playing a crucial role in validating outputs"
  • Complex agentic AI workflows: "compound AI systems improve functionality, robustness, and efficiency but introduce new trade-offs in latency and accuracy"
  • Optimizing AI-driven pipelines: "Drawing inspiration from traditional database transaction models, new abstractions and query optimization techniques may be needed to manage these AI-driven pipelines effectively"

That last point is particularly significant. The researchers are proposing that the transaction model — the foundational abstraction of database systems since the 1970s — needs to be extended to cover AI agent pipelines. When the people who invented transactions say transactions need to be rethought for AI, infrastructure builders should pay attention.


The open-source signal: DuckDB, Polars, and the embedded analytics revolution

Beyond the Cambridge Report, the open-source landscape reinforces the message that the database paradigm is shifting.

DuckDB — an embeddable analytical database engine — has gained extraordinary adoption in data science workflows where traditional client-server databases were never used. The Cambridge Report specifically calls it out as demonstrating "how database research can be translated into practical, widely accessible tools." DuckDB's architecture — vectorized execution, compressed columnar storage, zero-copy data sharing with Python DataFrames — represents a fundamentally different deployment model than Snowflake or BigQuery. It runs inside the application, not as an external service. For AI agents that need fast, local analytical queries as part of a reasoning chain, the embedded model is often more appropriate than the client-server model.

Polars applies Rust-based performance to DataFrame operations, achieving dramatic speedups over pandas. Together with Ibis (a portable DataFrame API), these tools suggest that the future of data processing may be more heterogeneous and more embedded than the centralized warehouse model assumes.

The Cambridge Report notes that "the idea of a single, universal language or paradigm (e.g., extending SQL) covering all data programming needs is unlikely, due to the diversity and specialization of data science tasks." This is an explicit rejection of the SQL-for-everything philosophy that underpins the modern data stack. The future is interoperable, heterogeneous systems — precisely what AI agents need and precisely what the BI stack doesn't provide.


What the research community sees that the market doesn't

The Cambridge Report reveals a research community that has internalized a truth the infrastructure market is still processing: the workloads that matter most in the next decade look nothing like the workloads that shaped the current generation of data systems.

The researchers see:

  • Unstructured data as a first-class database concern, not an adjacent workload handled by separate systems
  • Autonomous infrastructure management as a research imperative, not a nice-to-have feature
  • AI agent pipelines as requiring new transaction models and query optimization techniques, not just API wrappers around existing systems
  • Green computing as a design constraint on par with performance and cost
  • Composable database architectures that can be assembled for specific workload profiles, not monolithic systems optimized for a single pattern
  • Federation as essential for both regulatory compliance and workload heterogeneity

Every one of these research directions points away from the BI Assumption and toward AI-native infrastructure. The database research community is telling us, as clearly as an academic report can, that the foundation needs to change.

In Part 3, we'll examine one specific attempt to bridge the gap between transactional and analytical workloads: HTAP databases. The HTAP story is instructive because it shows both the promise and the limits of trying to evolve BI-era infrastructure toward AI-native workloads without changing the foundational assumptions.


Next: Part 3: The HTAP Bridge and Its Limits

Previous: Part 1: The BI Assumption

This post is part of State of the Data Infrastructure 2026, an eight-part series by The Sciencer Company.

Footnotes

  1. A. Ailamaki, S. Madden, D. Abadi, et al. "The Cambridge Report on Database Research." arXiv:2504.11259, April 2025.