SODI 2026 Part 8: A Proposal for AI-Native Data Infrastructure
Part 8 of State of the Data Infrastructure 2026
Over seven posts, we've established that modern data infrastructure was designed for BI (Part 1), that the database research community has pivoted toward AI-native systems (Part 2), that HTAP bridges the OLTP-OLAP gap but not the BI-AI gap (Part 3), that AI projects fail at scale due to data architecture (Part 4), that five specific capabilities are missing (Part 5), that assembly can't solve the problem (Part 6), and that intelligent infrastructure enables sovereignty and sustainability (Part 7).
This final post presents the architecture. Not a specification — a set of design principles, layer boundaries, and key abstractions that define what AI-native data infrastructure looks like when built from first principles.
Design principles
Seven principles guide the architecture. Each responds to a specific failure identified in this series.
Principle 1: Agents are first-class citizens. The primary consumer is an AI agent, not a human analyst. APIs are optimized for programmatic, high-frequency, multi-modal access. Governance evaluates agent actions, not just queries. Quality monitors the full chain from data to action.
Principle 2: Unified data surface. All data modalities — structured, semi-structured, unstructured, vector — accessible through a single query interface with a single governance model and a single lineage system.
Principle 3: Open formats at the foundation. Apache Iceberg and/or Delta Lake for structured data. Parquet for columnar storage. Standard formats for unstructured data. Open lineage and metadata formats. Portability is a hard requirement.
Principle 4: Autonomous operations. Self-configuring, self-healing, self-tuning, self-governing. Human oversight available for every autonomous action. The default is autonomous operation; human intervention is the exception.
Principle 5: Versioned by default. All data versioned with full branching and merging semantics. Every write creates a version. Branches for experimentation. Merges gatable on automated evaluation including ML model impact. Rollback always available.
Principle 6: AI control plane. A metadata knowledge graph connecting data, features, models, agents, actions, and outcomes. Action-level policy enforcement. Semantic observability. Continuous evaluation with human feedback integration.
Principle 7: Sovereignty and sustainability by design. Deployable within any jurisdiction on any infrastructure. Energy efficiency as a design constraint: incremental processing, intelligent caching, workload-adaptive resource management.
Architectural layers
Layer 1: Universal Data Surface
Tables contain columns of any type: scalar, complex, file reference, and embedding — all fully supported in queries, joins, and governance. Content-aware indexes automatically generate and maintain vector representations when files are added. Incremental materialization replaces batch transformation: derived views are declarative, and materialization is incremental by default — only affected records recomputed when sources change.
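The incremental-materialization idea can be sketched in a few lines of Python. This is a toy model under assumed semantics, not the system's implementation: a derived view recomputes only the records whose sources changed, instead of rebuilding the whole view on every run.

```python
def incremental_refresh(view, source, changed_keys, transform):
    """Recompute only the derived rows whose source records changed."""
    for key in changed_keys:
        if key in source:
            view[key] = transform(source[key])
        else:
            view.pop(key, None)  # source record deleted: drop derived row
    return view

# A derived view that doubles each source value.
source = {"a": 2, "b": 3}
view = {k: 2 * v for k, v in source.items()}  # initial full materialization

source["b"] = 5                               # one source record changes
incremental_refresh(view, source, {"b"}, lambda v: 2 * v)
```

Only the row for `"b"` is recomputed; the cost scales with the change set, not the table size.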
The cross-modal query planner decomposes queries spanning multiple data types into sub-plans routed to appropriate engines (columnar scan for structured predicates, vector search for similarity, AI function for document parsing), with unified cost estimation.
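The decomposition step can be illustrated with a minimal routing table. Engine names and costs here are placeholders, not a real optimizer: each predicate is routed by modality, and sub-plans are ordered cheapest-first so selective predicates prune early.

```python
from dataclasses import dataclass

@dataclass
class SubPlan:
    engine: str      # "columnar", "vector", or "ai_function"
    predicate: str
    est_cost: float  # relative unit cost (illustrative)

# Placeholder per-engine costs; a real planner derives these from statistics.
ROUTING = {
    "structured": ("columnar", 1.0),
    "similarity": ("vector", 5.0),
    "document":   ("ai_function", 50.0),
}

def plan(predicates):
    """Route each predicate to the engine suited to its modality."""
    plans = []
    for kind, pred in predicates:
        engine, cost = ROUTING[kind]
        plans.append(SubPlan(engine, pred, cost))
    # Cheapest sub-plans first, so cheap filters run before expensive AI calls.
    return sorted(plans, key=lambda p: p.est_cost)

query = [
    ("document", "parse(invoice_pdf).total > 1000"),
    ("structured", "region = 'EU'"),
    ("similarity", "embedding <-> query_vec < 0.2"),
]
```

For this query, the columnar scan runs first and the expensive document-parsing function last, over an already-reduced candidate set.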
This layer builds on the HTAP community's hybrid scan research (Part 3), where cost-based optimizers route sub-plans to row or column stores, and extends it to handle the full spectrum of data modalities AI agents require.
Layer 2: Versioned Storage
All data stored in a versioned, branchable layer built on open formats. An append-only commit log records every write with full metadata. Branches create instant zero-copy views supporting full DML. Implementation builds on Iceberg branching (table-level) and extends with cross-table atomicity (Nessie-inspired) and lake-wide scope (lakeFS-inspired).
Pre-merge hooks invoke configurable quality gates: schema drift detection, distribution shift testing, downstream model performance evaluation, custom ML scoring, and human approval workflows. This is where ML-driven merge decisions become architecturally native — the genuine research gap identified in Part 5.
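A pre-merge gate chain can be sketched as follows. The gate names echo the ones listed above; the branch summary fields and thresholds are hypothetical stand-ins for real metadata, distribution tests, and model evaluations.

```python
def evaluate_merge(branch, gates):
    """Run every gate; the merge is approved only if all pass.

    Each gate returns (passed, detail). Real gates would inspect table
    metadata, data distributions, and downstream model metrics.
    """
    results = {name: gate(branch) for name, gate in gates.items()}
    approved = all(passed for passed, _ in results.values())
    return approved, results

# Toy gates over a branch summary dict (fields are illustrative).
gates = {
    "schema": lambda b: (not b["schema_changed"], "no breaking schema change"),
    "drift":  lambda b: (b["psi"] < 0.2, f"population stability index {b['psi']}"),
    "model":  lambda b: (b["auc_delta"] > -0.01, f"AUC delta {b['auc_delta']}"),
}

branch = {"schema_changed": False, "psi": 0.05, "auc_delta": 0.002}
approved, results = evaluate_merge(branch, gates)
```

A branch that shifts the data distribution past the threshold is denied with a per-gate explanation, which is exactly the hook point for ML-driven merge decisions.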
Layer 3: AI Control Plane
A metadata knowledge graph connects all entities: data assets, compute assets, actors (humans and agents), actions, and outcomes. Queryable via graph query language (Cypher/SPARQL compatible) and via SQL through virtual tables materializing graph traversals.
Action-level policy enforcement implements propose → authorize → execute:
Agent proposes: {action: "issue_refund", customer_id: 12345, amount: 250.00}
Policy evaluates: Is agent authorized? Amount within limit? Customer eligible?
If authorized → execute. If denied → structured explanation.
All decisions → immutable audit trail in knowledge graph.
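The propose → authorize → execute flow above can be made concrete in a short Python sketch. The policy table, limits, and agent names are invented for illustration; a real control plane would evaluate policies against the knowledge graph.

```python
audit_log = []

# Hypothetical policy table; amounts in the account currency.
POLICY = {
    "issue_refund": {"max_amount": 500.00, "authorized_agents": {"support-bot"}},
}

def authorize(agent, action):
    """Return (allowed, reason); the reason doubles as the structured explanation."""
    rule = POLICY.get(action["action"])
    if rule is None:
        return False, "no policy defined for this action"
    if agent not in rule["authorized_agents"]:
        return False, "agent not authorized for this action"
    if action["amount"] > rule["max_amount"]:
        return False, "amount exceeds the agent's limit"
    return True, "authorized"

def propose(agent, action):
    """propose -> authorize -> execute; every decision is logged."""
    ok, reason = authorize(agent, action)
    audit_log.append({"agent": agent, "action": action,
                      "decision": "execute" if ok else "deny",
                      "reason": reason})
    return ok, reason
```

`propose("support-bot", {"action": "issue_refund", "customer_id": 12345, "amount": 250.00})` executes; the same action with an amount of 900.00 is denied with a reason, and both decisions land in the audit trail.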
Semantic observability captures structured traces for every agent interaction: goal, data accessed, reasoning chain, tools invoked, action proposed, authorization decision, outcome. Continuous evaluation runs scorers against traces: correctness, faithfulness, safety, relevance, custom metrics. Golden-set regression on every configuration change. Human feedback flows back to scoring via ALHF patterns.
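Continuous evaluation reduces to applying a set of scorers to each structured trace. The two scorers below are toy rules under assumed trace fields; production scorers would use judges, rules, or labeled data.

```python
def evaluate_trace(trace, scorers):
    """Apply each scorer to a structured trace; the scores feed
    dashboards, regression gates, and human-feedback loops."""
    return {name: fn(trace) for name, fn in scorers.items()}

scorers = {
    # Toy scorers over hypothetical trace fields.
    "grounded": lambda t: all(s in t["data_accessed"] for s in t["cited_sources"]),
    "safe":     lambda t: t["action"]["amount"] <= 500,
}

trace = {
    "goal": "resolve billing complaint",
    "data_accessed": ["orders", "refund_policy"],
    "cited_sources": ["refund_policy"],
    "action": {"action": "issue_refund", "amount": 250.0},
}
scores = evaluate_trace(trace, scorers)
```

Running the same scorers over a golden set of traces on every configuration change gives the regression signal described above.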
Layer 4: Autonomous Operations Engine
Source discovery: Declare a source (URI, connection string, API endpoint) → engine probes, discovers schema, infers types, configures incremental ingestion. Mode (batch, micro-batch, streaming, CDC) selected based on source capabilities and downstream latency requirements.
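The mode-selection step can be sketched as a decision over source capabilities and a downstream freshness budget. The thresholds are illustrative, not prescribed by the architecture.

```python
def select_mode(source_caps, max_staleness_s):
    """Pick an ingestion mode from source capabilities and the
    downstream freshness requirement (thresholds illustrative)."""
    if "cdc" in source_caps and max_staleness_s <= 60:
        return "cdc"
    if "streaming" in source_caps and max_staleness_s <= 60:
        return "streaming"
    if max_staleness_s <= 3600:
        return "micro-batch"
    return "batch"
```

A CDC-capable database with a 10-second freshness budget gets change-data capture; an API that only supports polling, with an hourly budget, gets micro-batch.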
Automatic data organization: Observes query patterns, access frequency, join relationships → determines optimal storage layout, clustering, indexing, materialization. Hot/cold tiering managed automatically.
Self-healing pipelines: Source unavailable → switch downstream to cached state with staleness metadata → attempt reconnection with backoff → on schema change, generate migration plan → on recovery, backfill and validate.
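The reconnection-with-backoff step is a standard pattern; a minimal sketch, with the sleep function injected so the behavior is testable:

```python
def reconnect(probe, sleep, max_attempts=5, base_delay=1.0, cap=60.0):
    """Retry a source probe with capped exponential backoff.

    Returns True as soon as the probe succeeds, False after
    exhausting max_attempts.
    """
    for attempt in range(max_attempts):
        if probe():
            return True
        sleep(min(base_delay * 2 ** attempt, cap))  # 1s, 2s, 4s, ... up to cap
    return False
```

In production, `sleep` would be `time.sleep` (with jitter added); on success, the engine would proceed to backfill and validate as described above.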
Automatic governance: Content analysis classifies sensitivity (PII, jurisdiction, business domain) → policies applied based on classification → continuous compliance monitoring → all classifications auditable and overrideable.
Workload-adaptive compute: Predict patterns from history → pre-provision ahead of demand → release idle immediately → optimization target configurable: minimize cost, latency, energy, or weighted combination.
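The configurable optimization target amounts to scoring candidate configurations against a weighted objective. The candidate values and weights below are invented; in practice each dimension would be normalized to a comparable scale before weighting.

```python
def pick_config(candidates, weights):
    """Choose the candidate minimizing a weighted sum of cost,
    latency, and energy (lower is better on every dimension)."""
    def score(c):
        return sum(weights[k] * c[k] for k in weights)
    return min(candidates, key=score)

# Illustrative candidate configurations (relative units).
candidates = [
    {"name": "small",  "cost": 1.0, "latency": 9.0, "energy": 1.0},
    {"name": "medium", "cost": 3.0, "latency": 3.0, "energy": 2.5},
    {"name": "large",  "cost": 8.0, "latency": 1.0, "energy": 6.0},
]
```

Setting all weight on cost selects the small configuration, all weight on latency selects the large one, and a blended objective lands in between.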
The Cambridge Report's vision of "database virtualization" and "declarative infrastructure as a service" maps directly to this layer.
Layer 5: Deployment Surface
Multi-target deployment: Same platform on AWS, Azure, GCP, sovereign cloud (OVHcloud, T-Systems, Scaleway), on-premises Kubernetes, and edge devices. Deployment target is configuration, not architecture.
Jurisdictional isolation: Data and compute pinned to regions with infrastructure-level enforcement. Cross-jurisdictional movement requires explicit policy authorization.
Federation: Multiple deployments share metadata, policies, and query capabilities while maintaining data locality. Queries decomposed so sub-plans execute where data resides — only aggregated, non-sensitive results cross boundaries.
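The locality rule can be illustrated with a federated aggregate. Each deployment is modeled as an opaque local engine that returns only partial aggregates (a hypothetical interface), so no row-level data ever crosses a jurisdiction boundary.

```python
def federated_mean(deployments, column, predicate):
    """Each deployment computes a local (sum, count); only those
    aggregates cross the boundary, never row-level data."""
    total, n = 0.0, 0
    for local in deployments:
        s, c = local(column, predicate)  # executes inside the jurisdiction
        total += s
        n += c
    return total / n if n else None

# Toy local engines standing in for two sovereign deployments.
eu = lambda col, pred: (300.0, 3)  # sum=300 over 3 matching rows
us = lambda col, pred: (500.0, 2)  # sum=500 over 2 matching rows
```

The global mean is assembled from per-region `(sum, count)` pairs, which is the non-sensitive result the federation layer is allowed to move.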
What this enables
An organization running this architecture can deploy AI agents with confidence, iterate on data rapidly, comply with regulations automatically, reduce data engineering overhead by 70–90%, reduce energy consumption by 50–70%, and maintain full data sovereignty.
The road from here
This proposal describes a destination, not a position. Building this architecture requires advances in content-aware query planning, record-level change tracking at scale, graph-native metadata management, action-level policy evaluation at machine speed, and autonomous operations that humans can trust.
No one has built this yet. The closest approximations cover parts of the architecture. But the whole is greater than the sum of parts — and the parts don't assemble cleanly.
We believe this architecture needs to be built natively — as a single system with a unified ontology, a single governance model, and an integrated operations engine. Not because monoliths are better than composable systems, but because the specific problems (shared ontology, end-to-end latency, unified governance) are fundamentally integration problems that cannot be solved at the integration layer.
This is what The Sciencer Company is building with Any Lab.
Acknowledgments
This series draws on the a16z reference architectures, the Cambridge Report on Database Research (Ailamaki et al., 2025), the HTAP Databases survey (Zhang et al., 2024), platform documentation from Snowflake, Databricks, lakeFS, Nessie, MLflow, and Apache Iceberg, analyst research from Gartner, Forrester, S&P Global, and RAND Corporation, and original comparative analysis.
Russ Cox's Go & Versioning series demonstrated that detailed technical writing can reshape how an industry thinks about a hard problem. We hope this series contributes to reshaping how the data infrastructure industry thinks about the AI readiness crisis.
The problem is real. The evidence is clear. The architecture is feasible. Now it needs to be built.
This concludes State of the Data Infrastructure 2026, an eight-part series by The Sciencer Company.
If you're interested in the architecture described here: hello@thesciencer.com
If you're a data engineer, ML engineer, or infrastructure builder who finds this compelling: the.sciencer.company/careers