SODI 2026 Part 1: The BI Assumption
Part 1 of State of the Data Infrastructure 2026
In late 2020, Andreessen Horowitz published a set of reference architectures for modern data infrastructure [1]. The post identified three blueprints that data teams were implementing across the industry: Modern Business Intelligence, Multimodal Data Processing, and Artificial Intelligence & Machine Learning. When they updated these architectures in 2022, the core finding was striking: despite what a16z called a "Cambrian explosion" of new tools, the fundamental architecture had barely changed. The core — Fivetran for ingestion, Snowflake or Databricks for storage/compute, dbt for transformation — remained stable. The proliferation happened at the edges: observability, reverse ETL, metrics layers, data catalogs.
The reason the core hadn't changed is the same reason it needs to change now.
Every blueprint in the a16z reference architecture — and every production data stack we've examined in the two years since — is built on an assumption so pervasive it's invisible: the primary consumer of data is a human being. Specifically, a human being looking at a dashboard, running a SQL query, reviewing a report, or exploring a dataset in a notebook.
This assumption is embedded in the storage formats (columnar, optimized for the aggregation patterns humans care about). It's embedded in the query engines (SQL-first, because analysts speak SQL). It's embedded in the governance models (RBAC, because humans have roles). It's embedded in the data flow topology (unidirectional, because humans consume but rarely produce data at the infrastructure layer). It's embedded in the quality model (schema-on-write or schema-on-read, because humans can interpret messy data and compensate for gaps).
We call this the BI Assumption: the idea that data infrastructure exists to serve human analytical decision-making, with everything else — machine learning, automation, real-time systems — treated as secondary use cases bolted on after the fact.
The BI Assumption produced brilliant infrastructure for its intended purpose. Snowflake made SQL analytics effortless. dbt brought software engineering practices to data transformation. Fivetran eliminated the drudgery of connector management. Looker standardized metrics definitions. Monte Carlo told you when the data was wrong. The modern data stack, as crystallized in the a16z reference architectures, is a genuine engineering achievement.
But AI agents are not human analysts. And the BI Assumption is now the single largest obstacle to making enterprise AI work.
The anatomy of a BI-centric architecture
To understand why the BI Assumption matters, we need to trace it through every layer of the stack. Let's take the a16z "Modern Business Intelligence" blueprint as the canonical example, since it represents the architecture most enterprises actually run.
Data flows in one direction
The most fundamental architectural choice in the modern data stack is the data flow topology. In the BI blueprint:
- Sources (SaaS apps, databases, event streams) emit data
- Ingestion (Fivetran, Airbyte, Stitch) replicates it into cloud storage
- Transformation (dbt, Dataform) reshapes it for analysis
- Storage/Compute (Snowflake, BigQuery, Redshift) serves queries
- Applications (Looker, Tableau, Power BI) present results to humans
Data flows left-to-right. The human sits at the terminal end. Once data reaches the warehouse and becomes a dashboard, the cycle is complete. Yes, there's "reverse ETL" (Hightouch, Census) that pushes warehouse insights back to operational systems — but reverse ETL was literally an afterthought: it arrived years after the core architecture solidified, and even its name defines it as the reverse of the primary flow.
This topology makes perfect sense for BI. Humans don't need real-time bidirectional data access. They check dashboards periodically. They run ad-hoc queries when investigating anomalies. They schedule reports. The latency from source change to dashboard update — minutes to hours — is perfectly acceptable for human decision cycles.
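The topology can be made concrete in a few lines. In this toy sketch (the stage names and the `is_unidirectional` check are our own illustration, not any vendor's API), the blueprint is a dependency graph in which every edge points downstream:

```python
# Toy model of the BI blueprint's topology. Stage names are illustrative.
EDGES = {
    "sources": ["ingestion"],         # SaaS apps, databases, event streams
    "ingestion": ["transformation"],  # Fivetran/Airbyte-style replication
    "transformation": ["warehouse"],  # dbt-style modeling
    "warehouse": ["bi_apps"],         # Snowflake/BigQuery-style serving
    "bi_apps": [],                    # the human at the terminal end
}

def is_unidirectional(edges):
    """Return True when the graph has no cycle, i.e. no feedback path."""
    seen, on_path = set(), set()

    def visit(node):
        if node in on_path:
            return False              # found a back-edge: a feedback loop
        if node in seen:
            return True
        on_path.add(node)
        ok = all(visit(n) for n in edges.get(node, []))
        on_path.discard(node)
        seen.add(node)
        return ok

    return all(visit(n) for n in edges)

print(is_unidirectional(EDGES))       # → True: nothing flows back upstream
```

Reverse ETL adds a single warehouse-to-operational back-edge to this graph, years after the fact. The agent loop described below needs feedback edges everywhere.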
a16z noted in their 2022 update that core data processing systems remained "relatively stable" while tools around them proliferated. They attributed this stability to a "platform hypothesis" — the idea that cloud data warehouses and lakehouses had become platforms on which ecosystems of applications are built. This is correct, but the platform that emerged was a BI platform. The stable core was stable because the use case was stable: human analytical decision-making.
AI agents operate on a fundamentally different cycle. Consider what a sophisticated customer service agent does in a single interaction:
- Reads customer profile from the warehouse (structured data)
- Retrieves the customer's recent support tickets (semi-structured JSON)
- Pulls the customer's contract PDF from document storage (unstructured)
- Searches for similar resolved cases using vector embeddings (vector similarity)
- Generates a resolution recommendation via LLM (inference)
- Writes the recommendation back to the CRM (operational write)
- Monitors the outcome over subsequent interactions (feedback loop)
- Adjusts its strategy based on resolution success rates (learning)
This is a closed loop, not a pipeline. The agent is simultaneously a consumer and producer of data. It reads across structured and unstructured stores in a single operation. It writes results back at machine speed, not batch speed. It tracks what it did and why, for governance and debugging. The unidirectional pipeline topology of the modern data stack cannot accommodate this without extensive custom engineering at every integration point.
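The loop above can be sketched in a few lines. Everything here is a stub: the store interfaces and the canned LLM response are hypothetical stand-ins, not real platform SDKs. The shape is the point: reads fan out across four storage modalities, inference runs, and the result is written back and logged.

```python
from types import SimpleNamespace

def make_stub_stores():
    """In-memory stand-ins for warehouse, docs, vectors, LLM, and CRM."""
    return SimpleNamespace(
        warehouse=lambda cid: {"id": cid, "tier": "gold", "contract": "c-42.pdf"},
        tickets=lambda cid: [{"id": 7, "status": "open", "text": "billing issue"}],
        parse_doc=lambda path: f"<parsed text of {path}>",
        similar_cases=lambda text, k: [{"case": 101, "resolution": "refund"}][:k],
        llm=lambda context: {"action": "refund", "reason": "matches case 101"},
        crm_writes=[],                    # operational write-back target
        audit=[],                         # what the agent did, and why
    )

def handle_interaction(customer_id, stores):
    profile = stores.warehouse(customer_id)                 # structured lookup
    tickets = stores.tickets(customer_id)                   # semi-structured JSON
    contract = stores.parse_doc(profile["contract"])        # unstructured parse
    cases = stores.similar_cases(tickets[0]["text"], k=3)   # vector-style retrieval
    plan = stores.llm([profile, tickets, contract, cases])  # inference
    stores.crm_writes.append((customer_id, plan))           # write-back at machine speed
    stores.audit.append({"customer": customer_id, "why": plan["reason"]})
    return plan

stores = make_stub_stores()
plan = handle_interaction("cust-1", stores)
print(plan["action"], len(stores.crm_writes))   # → refund 1
```

Every line of this sketch that crosses a store boundary is, in a real deployment, a separate client, a separate governance domain, and a separate failure mode.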
Storage was optimized for aggregation, not reasoning
Snowflake's micro-partition architecture is a masterpiece of columnar storage engineering. Data is automatically organized into 50–500 MB compressed columnar segments. Each micro-partition stores rich metadata: min/max values per column, null counts, distinct value counts, and Bloom filter membership. This metadata enables partition pruning — the query engine identifies relevant partitions in milliseconds by comparing query predicates against partition statistics, then scans only the needed columns within those partitions.
The optimization target is aggregation queries: SUM(revenue) WHERE region = 'EMEA' AND date BETWEEN '2025-01-01' AND '2025-03-31'. For this pattern, micro-partitions are extraordinary. The query engine prunes partitions by date range (using min/max metadata), reads only the revenue and region columns (columnar access), applies vectorized predicate evaluation, and produces the aggregate result. A query touching billions of rows can complete in seconds.
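A toy version of metadata pruning makes the mechanism concrete. The partition statistics below are invented, and this is a drastic simplification of either platform's internals, but the core trick is this comparison of query predicates against per-partition min/max values:

```python
# Invented partition statistics; real micro-partitions also track null counts,
# distinct values, and Bloom filters per column.
partitions = [
    {"id": 0, "date_min": "2024-10-01", "date_max": "2024-12-31", "rows": 1_000_000},
    {"id": 1, "date_min": "2025-01-01", "date_max": "2025-02-15", "rows": 1_200_000},
    {"id": 2, "date_min": "2025-02-16", "date_max": "2025-04-10", "rows": 900_000},
    {"id": 3, "date_min": "2025-04-11", "date_max": "2025-06-30", "rows": 1_100_000},
]

def prune(parts, lo, hi):
    """Keep only partitions whose [min, max] date range overlaps the predicate.
    ISO date strings compare correctly as plain strings."""
    return [p for p in parts if p["date_max"] >= lo and p["date_min"] <= hi]

hits = prune(partitions, "2025-01-01", "2025-03-31")
print([p["id"] for p in hits])   # → [1, 2]: partitions 0 and 3 are never scanned
```

For the `WHERE date BETWEEN ...` aggregation in the text, this metadata check eliminates half the data before a single byte of column data is read. For a point lookup or a vector similarity search, it eliminates nothing.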
Databricks' Delta Lake takes a different storage approach — open Parquet files with a JSON-based transaction log (_delta_log/) — but optimizes for the same fundamental access patterns. Liquid clustering (replacing the earlier Z-ordering) reorders data within files based on specified columns to improve data skipping. The statistics collected per file (min/max, null counts, number of records) serve the same purpose as Snowflake's micro-partition metadata: enabling the query engine to skip irrelevant data for filter-heavy aggregation queries.
These optimizations are precisely wrong for AI agent workloads. Consider what an agent reasoning about a customer actually needs:
| Data need | Format | Access pattern | Storage optimization |
|---|---|---|---|
| Customer profile | Structured rows | Point lookup by ID | Row-oriented, indexed |
| Support tickets | Semi-structured JSON | Filtered scan + nested field access | Document store, JSON indexes |
| Contract PDF | Unstructured binary | Full document parse + entity extraction | Object store, content index |
| Similar cases | Vector embeddings | Approximate nearest neighbor search | Vector index (HNSW, IVF) |
| Policy rules | Graph relationships | Multi-hop traversal | Graph store, adjacency index |
No single storage architecture handles all five. Micro-partition pruning is irrelevant for vector similarity search. Columnar compression adds overhead for point lookups on individual records. Neither platform was designed to handle the cross-modal JOIN that an agent needs: "find customers with active contracts (structured) whose recent support sentiment (unstructured → AI function) is negative AND who are similar to churned customers (vector similarity)."
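Without engine support, that cross-modal query becomes application glue. Below is a deliberately tiny sketch with stub stand-ins for the warehouse, the sentiment function, and the vector index; none of these are real APIs:

```python
customers = [   # stand-in warehouse rows
    {"id": "a", "contract": "active", "ticket": "great service"},
    {"id": "b", "contract": "active", "ticket": "terrible, still broken"},
    {"id": "c", "contract": "lapsed", "ticket": "works fine"},
]

def sentiment(text):             # stand-in for an AI function over raw text
    return -1.0 if "terrible" in text else 1.0

def churn_similarity(cust_id):   # stand-in for an ANN lookup vs. churned customers
    return {"a": 0.2, "b": 0.9, "c": 0.7}[cust_id]

at_risk = [
    c["id"] for c in customers
    if c["contract"] == "active"          # structured predicate (warehouse)
    and sentiment(c["ticket"]) < 0        # unstructured predicate (AI function)
    and churn_similarity(c["id"]) > 0.8   # vector predicate (similarity search)
]
print(at_risk)   # → ['b']
```

Three engines, one logical join, and all of the glue, plus its error handling, retries, and governance, lives in application code rather than in a query planner.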
Snowflake's FILE data type (GA September 2025) represents a genuine architectural evolution — storing metadata pointers to staged files as first-class column values, queryable alongside structured columns via Cortex AI functions. But FILE values can't appear in GROUP BY, ORDER BY, or CLUSTER BY clauses. They're unsupported in Hybrid tables, Iceberg tables, and External tables. They become stale if underlying files change. These aren't bugs — they're the boundary conditions of a storage architecture designed for columnar aggregation encountering a data type that doesn't fit the paradigm.
Databricks approaches unstructured data through Unity Catalog Volumes — governed file storage with FUSE filesystem access. But there's no FILE-equivalent column type for Delta tables. Unstructured data enters SQL queries through READ_FILES(), a table-valued function that loads binary payloads into transient DataFrames. The path-based JOINs required (ON docs.path = customers.doc_path) are fragile, unoptimized, and lose the governance chain between the structured and unstructured domains.
The query engine assumes human interaction patterns
SQL query engines are optimized for a specific interaction pattern: a human (or scheduled job) submits a query, the engine parses, optimizes, and executes it, and returns a result set. This is a request-response pattern with certain properties:
- Low concurrency per user: A human runs one query at a time, maybe a few concurrent dashboard widgets
- Read-dominant: Analysts read far more than they write
- Latency-tolerant: A 5-second query is acceptable for ad-hoc analysis
- Stateless: Each query is independent; there's no conversational context between queries
- Schema-aware: The human knows (or can discover) the table structure
AI agents violate every one of these properties. An agent processing a batch of customer interactions might issue hundreds of concurrent queries. Agents both read and write — retrieving context then storing decisions. Latency matters at a different scale: an agent serving a customer can't wait 5 seconds for each of a dozen retrieval queries. Agents maintain conversational state across queries — the second query depends on the first query's result. And agents may need to discover schema at runtime, not at design time.
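The contrast shows up in a few lines of agent-side code. The `query` coroutine below is a stand-in for a real warehouse client; the shape to notice is a concurrent fan-out followed by a query built from the fan-out's results, which a stateless request-response engine treats as thirteen unrelated submissions:

```python
import asyncio

async def query(sql, latency=0.01):
    """Stand-in for a warehouse client: pretend network plus execution time."""
    await asyncio.sleep(latency)
    return f"rows for: {sql}"

async def agent_turn(customer_id):
    # Fan-out: a dozen concurrent retrievals, not one human-paced query.
    context = await asyncio.gather(
        *[query(f"/* retrieval {i} for {customer_id} */") for i in range(12)]
    )
    # Stateful follow-up: this query exists only because of the results above.
    return await query(f"/* follow-up over {len(context)} partial results */")

print(asyncio.run(agent_turn("cust-1")))
```

Nothing in the engine knows these thirteen queries belong to one conversational turn, so nothing shares intermediate results, amortizes catalog lookups, or prioritizes them as a unit.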
The query optimization strategies in both Snowflake and Databricks reflect the human interaction assumption. Snowflake's multi-cluster warehouse architecture provisions virtual warehouses sized for expected human workloads: T-shirt sizes (XS through 6XL) selected at creation time. Auto-scaling responds to queue depth, not to the latency requirements of individual query patterns. The cost model charges by compute-time, not by result-value — creating perverse incentives for agent workloads where many small, fast queries are cheaper per-query but expensive per-wall-clock-second of warehouse uptime.
Databricks' serverless SQL compute improves on this with per-query billing, but the query lifecycle — parse → optimize → execute → return — still assumes independence between queries. There's no query session that maintains state, shares intermediate results, or amortizes catalog lookups across a series of related agent queries.
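A back-of-envelope calculation illustrates the billing mismatch. Every rate and duration below is invented for illustration, not a real price:

```python
credit_per_hour = 1.0                 # hypothetical warehouse rate
query_ms, n_queries = 50, 1000        # many small, fast agent queries
wall_clock_s = 30 * 60                # warehouse kept warm for a 30-minute session

work_s = query_ms / 1000 * n_queries  # 50 s of actual compute
billed_by_uptime = wall_clock_s / 3600 * credit_per_hour   # charged for uptime
billed_by_query = work_s / 3600 * credit_per_hour          # charged for work done

print(round(billed_by_uptime / billed_by_query))   # → 36: each query is cheap,
                                                   #   the idle warehouse is not
```

Under these assumed numbers, an agent trickling queries across a session pays 36x more under uptime billing than under per-query billing, even though the queries themselves are trivially cheap.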
Governance assumes humans have roles, not agents have capabilities
Role-Based Access Control (RBAC) is the dominant governance model in both Snowflake and Databricks. The model is straightforward: users are assigned roles, roles are granted privileges on objects, and the system checks privileges at query time.
RBAC works beautifully for BI because the access patterns are predictable, low-frequency, and organizationally structured. An analyst in Marketing needs read access to the marketing schema. A data engineer needs write access to staging tables. The CFO needs dashboard access. Roles map cleanly to org charts, and privilege changes are infrequent.
AI agents shatter this model in at least four ways:
Delegation asymmetry. When a customer service agent processes a refund on behalf of a user, it needs temporary write access to the billing system — an action the human user is authorized to request but wouldn't directly perform. The correct authorization is the intersection of the user's delegation authority and the agent's operational capabilities. RBAC has no concept of this delegation model; it evaluates the agent's role independently of the user context.
Dynamic capability composition. A planning agent invokes a retrieval agent, a recommendation agent, and an execution agent in sequence. Each sub-agent needs different privileges. The composition of individually safe actions can be unsafe: reading customer PII (allowed) followed by writing to an external API (allowed) constitutes data exfiltration (forbidden). RBAC evaluates each operation independently.
Action-level semantics. RBAC controls access to objects — tables, schemas, databases. But the dangerous thing about an AI agent isn't which table it reads; it's what it does with the data. An agent reading customer addresses to generate shipping labels (fine) versus reading customer addresses to build a marketing profile (potentially forbidden under GDPR) performs identical SQL queries with different semantic intent. RBAC cannot distinguish them.
Speed and volume. RBAC was designed for human-speed access patterns — dozens to hundreds of authorization checks per user per hour. Agents operating at machine speed can trigger thousands of authorization checks per second. The governance infrastructure needs to be fast enough to not become a bottleneck, while remaining sophisticated enough to evaluate complex policies.
Databricks' Mosaic AI Gateway adds runtime policy enforcement at the model endpoint level: rate limiting, PII filtering via Microsoft Presidio, safety guardrails via Llama Guard. Snowflake's Cortex Guard provides binary safe/unsafe classification. Neither implements the propose → authorize → execute pattern — where an agent declares what it intends to do as a structured action, a policy engine evaluates the action against the full context (user delegation, agent capabilities, data sensitivity, business rules), and only authorized actions execute.
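A minimal sketch of propose → authorize → execute might look like the following. The action shapes, grant tables, and composition rule are all illustrative; the point is that authorization takes the user's delegation, the agent's capabilities, and the session history as inputs, none of which plain RBAC consults:

```python
def authorize(action, user_grants, agent_caps, session_log):
    """Allow an action only if user delegation, agent capability,
    and the composed session history all permit it."""
    verb, res = action["verb"], action["resource"]
    if verb not in user_grants.get(res, set()):
        return False, "outside the user's delegation authority"
    if verb not in agent_caps.get(res, set()):
        return False, "outside the agent's capabilities"
    # Composition rule: a PII read followed by an external write is
    # exfiltration, even though each step is individually allowed.
    touched_pii = any(a["resource"] == "customer_pii" for a in session_log)
    if touched_pii and res == "external_api" and verb == "write":
        return False, "pii read followed by external write is forbidden"
    return True, "ok"

def execute(action, ctx):
    ok, why = authorize(action, ctx["user"], ctx["agent"], ctx["log"])
    if not ok:
        raise PermissionError(why)
    ctx["log"].append(action)          # recorded for future composition checks
    return f"executed {action['verb']} on {action['resource']}"

ctx = {
    "user": {"customer_pii": {"read"}, "external_api": {"write"}},
    "agent": {"customer_pii": {"read"}, "external_api": {"write"}},
    "log": [],
}
execute({"verb": "read", "resource": "customer_pii"}, ctx)       # fine on its own
try:
    execute({"verb": "write", "resource": "external_api"}, ctx)  # fine on its own,
except PermissionError as e:                                     # forbidden composed
    print(e)   # → pii read followed by external write is forbidden
```

Note that the second action would pass any per-object RBAC check; only a policy engine that sees the whole session can reject the composition.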
Quality means schema conformance, not semantic correctness
The data quality model in the BI stack is defined by three properties:
- Schema conformance: Does the data match the expected types and constraints?
- Statistical stability: Are distributions, volumes, and freshness within expected ranges?
- Referential integrity: Do foreign keys resolve? Are there orphaned records?
Snowflake's Data Metric Functions (Enterprise Edition) evaluate SQL-defined numeric functions against tables on a schedule, with built-in metrics for NULL_COUNT, ROW_COUNT, ACCEPTED_VALUES, and FRESHNESS. Anomaly detection compares current DMF values against historical baselines. Databricks' Lakeflow Expectations verify SQL boolean expressions against Delta table rows at write time.
These tools answer a specific question: is the data structurally correct? They do not answer: will this data produce correct AI outputs?
The difference is enormous. A feature table might have perfect schema conformance, zero nulls, and stable distributions — while containing embeddings generated by a deprecated model version that produces subtly shifted vectors, causing a RAG pipeline to retrieve irrelevant documents 15% of the time. A training dataset might pass every statistical quality check while containing label noise that causes the model to misclassify high-value customer segments. A governance classification column might correctly identify PII in 99.5% of records while the 0.5% failure rate leaks sensitive data to an agent that processes millions of records per day.
AI quality is not data quality. It's the quality of the entire chain from data → features → model → prediction → action → outcome. This chain crosses system boundaries (warehouse → feature store → model serving → agent framework → operational system) and requires evaluation at every transition. No BI-era quality tool monitors this chain end-to-end, because the chain didn't exist when these tools were designed.
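The gap is easy to demonstrate in miniature. In the sketch below, the field names and the `embed_model` column are hypothetical; every structural check passes while a chain-level check catches the vectors produced by the deprecated model:

```python
SERVING_MODEL = "embed-v3"       # version the retrieval pipeline queries with

feature_rows = [
    {"id": 1, "embedding": [0.1, 0.2], "embed_model": "embed-v3"},
    {"id": 2, "embedding": [0.3, 0.1], "embed_model": "embed-v2"},  # deprecated
]

def schema_ok(rows):
    """The BI-era check: right shape, no nulls. Both rows pass."""
    return all(r["embedding"] is not None and len(r["embedding"]) == 2
               for r in rows)

def stale_embeddings(rows):
    """A chain-level check: which vectors came from the wrong model version?"""
    return [r["id"] for r in rows if r["embed_model"] != SERVING_MODEL]

print(schema_ok(feature_rows))         # → True: structurally perfect
print(stale_embeddings(feature_rows))  # → [2]: retrieval silently drifts on row 2
```

A check like `stale_embeddings` only works if model-version provenance is recorded alongside the data, which is exactly the kind of cross-boundary metadata the BI-era quality stack never needed.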
The platform hypothesis and its limits
a16z's "data platform hypothesis" argued that cloud data warehouses and lakehouses were becoming platforms — systems that third-party developers build on top of, creating mutual technical and economic dependence. In 2022, they described the architecture as a clear division: a "backend" (ingestion, storage, processing, transformation) consolidating around a few cloud vendors, with a "frontend" of applications built on top.
This hypothesis has been validated. Snowflake's Marketplace, Databricks' Partner Connect, and both platforms' extensive integration ecosystems demonstrate genuine platform dynamics. But the platform that emerged is a BI platform. The applications built on it — Looker, Sigma, Hex, Monte Carlo, Atlan, Census, Hightouch — are BI applications. They assume human consumers, read-dominant workloads, and unidirectional data flow.
The Cambridge Report on Database Research, published in April 2025 by 46 of the most prominent database researchers in the world — including Stonebraker, Zaharia, Madden, Abadi, and Pavlo — explicitly acknowledges that the platform layer needs to change. The report identifies "automated infrastructure management" as a key research direction, specifically calling out "self-healing, auto-provisioning, and always-up features" as areas where "many interesting advancements have been made and will continue to evolve" [2]. The researchers envision "database virtualization, where a single database front-end automatically provisions and routes queries to the best infrastructure" and "declarative infrastructure as a service" where systems optimize infrastructure allocation automatically.
These are not incremental improvements to the BI platform. They are a different kind of platform — one designed for autonomous operation, heterogeneous workloads, and machine-speed interaction. The database research community sees this clearly. The infrastructure industry hasn't caught up.
The convergence trap
Both Snowflake and Databricks recognize that AI changes the game. Their responses over the past two years have been ambitious:
Snowflake launched Cortex AI (SQL-callable LLM functions), Cortex Agents, Snowflake Intelligence, the FILE data type, Container Runtime for GPU workloads, and a $200M Anthropic partnership. They acquired Crunchy Data for PostgreSQL OLTP capabilities and are acquiring Select Star for lineage intelligence.
Databricks shipped MLflow 3.0, Agent Bricks, Mosaic AI Gateway, Lakebase (PostgreSQL OLTP via the $1B Neon acquisition), serverless GPU compute, and Unity Catalog ABAC. They acquired Tabular (Apache Iceberg's creators) for table format unification.
The pattern is unmistakable: both platforms are converging toward the same destination — a unified data + AI platform — from opposite starting points. Snowflake adds AI to its data warehouse. Databricks adds warehousing to its AI platform.
But convergence through accretion is architecturally different from building for AI from the ground up. Each new capability is constrained by design decisions made years ago for different use cases:
- Snowflake's FILE data type can't appear in GROUP BY or CLUSTER BY because the micro-partition architecture wasn't designed for unstructured column types
- Databricks' MLflow Tracing doesn't integrate with Unity Catalog lineage because they evolved as independent systems
- Snowflake's Cortex Search bypasses dynamic data masking because the owners' rights execution model wasn't designed for AI-mediated retrieval
- Databricks' Mosaic AI Gateway applies guardrails at the model endpoint level but can't inspect agent actions because the serving infrastructure was designed for stateless inference, not stateful agent reasoning
These aren't bugs. They're the inevitable consequence of building AI capabilities on infrastructure designed for BI. The BI Assumption is in the foundation, and you can't change the foundation by adding floors.
What comes next
The BI Assumption produced excellent infrastructure for business intelligence. The modern data stack is a genuine achievement that democratized data analytics for organizations of all sizes.
But AI agents need infrastructure designed for AI agents. Not infrastructure designed for dashboards with AI bolted on. The architectural gaps are not incremental — they're foundational. They explain why, despite tens of billions of dollars in enterprise AI investment, the vast majority of AI projects fail to reach production.
In Part 2, we examine what the database research community itself — the people who built the systems we use today — says about where the field is heading. The 2025 Cambridge Report on Database Research reveals a community that has already pivoted toward AI, autonomous operations, and unstructured data. The implications for infrastructure builders are profound.
Next: Part 2: The Database Landscape Is Shifting
This post is part of State of the Data Infrastructure 2026, an eight-part series by The Sciencer Company.
Footnotes
[1] M. Casado, M. Bornstein, J. Li. "Emerging Architectures for Modern Data Infrastructure." a16z, 2022. https://a16z.com/emerging-architectures-for-modern-data-infrastructure/
[2] A. Ailamaki, S. Madden, D. Abadi, et al. "The Cambridge Report on Database Research." arXiv:2504.11259, April 2025.