Five Unsolved Problems

Q1 2026 · The Sciencer Company


Part 5 of State of the Data Infrastructure 2026


In Parts 1–4, we established the BI Assumption, traced the research community's shift, analyzed HTAP's limits, and quantified the AI readiness crisis. Now we get deeply technical. For each of five architectural capabilities that AI-native infrastructure requires, we examine what the leading platforms actually provide — including specific product GA dates, API limitations, and undocumented constraints — what external tools partially fill, and what remains genuinely unsolved.


Problem 1: Unified querying across structured and unstructured data

What agents need

An AI agent reasoning about a customer needs to JOIN structured data (account profile), semi-structured data (support tickets as JSON), unstructured data (contract PDFs, call recordings), and vector embeddings (semantic similarity) — in a single query operation with unified governance.

Snowflake's approach: FILE data type + Cortex AI

Snowflake's FILE data type (GA September 2025) stores metadata pointers as first-class column values:

-- Illustrative: joining relational columns with Cortex AI functions over FILE values
SELECT
  c.customer_name, c.account_value,
  AI_TRANSCRIBE(r.call_recording) AS transcript,
  AI_PARSE_DOCUMENT(c.contract_pdf, 'renewal_date') AS renewal
FROM customers c
JOIN call_recordings r ON c.customer_id = r.customer_id
WHERE c.region = 'EMEA';

Limitations revealing BI heritage:

  • FILE cannot appear in GROUP BY, ORDER BY, or CLUSTER BY.
  • Unsupported in Hybrid, Iceberg, and External tables.
  • References become stale if the underlying files change.
  • Cortex Search is a separate service object, not integrated into the SQL query planner.
  • Cost monitoring via CORTEX_AISQL_USAGE_HISTORY (GA December 2025) is tracked separately from warehouse compute credits.

Databricks' approach: Volumes + ai_query()

Databricks has no FILE-equivalent column type. Unstructured data enters SQL through READ_FILES(), a transient table-valued function. ai_query() (GA) is more flexible than Snowflake's task-specific functions — it can invoke any Model Serving endpoint — but ai_summarize, ai_classify, ai_extract remain in Public Preview. The typical pattern requires a medallion architecture: raw files in bronze Volumes → parsed content in silver Delta → embeddings in gold tables.
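The medallion pattern above can be sketched in miniature. This is a toy, pure-Python illustration of the bronze → silver → gold flow; the `parse_document()` and `embed()` functions are hypothetical stand-ins for what would be `ai_query()` or Model Serving calls on a real Databricks deployment.

```python
# Toy sketch of the bronze -> silver -> gold medallion flow described above.
# parse_document() and embed() are hypothetical stand-ins for ai_query() /
# model-serving calls; nothing here is the Databricks API.

def parse_document(raw_bytes: bytes) -> str:
    """Stand-in for a document-parsing model call: decode raw bytes to text."""
    return raw_bytes.decode("utf-8", errors="replace")

def embed(text: str, dim: int = 4) -> list:
    """Stand-in for an embedding endpoint: a deterministic toy hash embedding."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch)
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

# Bronze: raw files land untouched (in practice, files in a Volume).
bronze = {"contract_001.txt": b"Renewal date: 2026-09-01"}

# Silver: parsed content as structured rows (in practice, a Delta table).
silver = [{"path": p, "text": parse_document(b)} for p, b in bronze.items()]

# Gold: embeddings ready for retrieval (in practice, a vector-indexed table).
gold = [{"path": r["path"], "embedding": embed(r["text"])} for r in silver]
```

The point of the sketch is the shape of the pipeline: each stage is a separate materialization step that someone must build and operate, rather than a single query over a FILE-like column.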

What's genuinely unsolved

True unified querying would require content-aware indexing as DDL primitives, cross-modal query plan optimization, unified cost estimation across modalities, and single-policy governance regardless of data format. Neither platform delivers this. Both bolt AI onto SQL rather than making multi-modal reasoning a first-class query planning primitive. As the Cambridge Report notes, "new abstractions and query optimization techniques may be needed to manage these AI-driven pipelines effectively."


Problem 2: Data versioning, branching, and ML-driven merge

What agents need

Git-like primitives for data: branches, diffs, merges, and automated quality gates — including ML-model-driven validation.

Platform-native capabilities

Snowflake Time Travel: Point-in-time queries up to 90 days. Zero-Copy Cloning creates metadata-level copies with copy-on-write at micro-partition granularity. No native merge-back, no diff, no branching concept.

Delta Lake: Transaction log with 30-day default retention. RESTORE TABLE for rollback. No branching, no tagging, no merge. The log is explicitly linear.

Databricks Lakebase (GA February 2026): Database-level branching via the Neon acquisition. Critical distinction: these are PostgreSQL OLTP database branches, not lakehouse table branches. Merge not yet supported — branches can be created/discarded but not merged back.

Apache Iceberg: Table-level branches via Write-Audit-Publish. Merge limited to fast-forward only (CALL system.fast_forward()). No three-way merge, no cherry-picking, no conflict resolution. Schema shared across all branches.
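The branch-and-fast-forward model can be illustrated with a minimal in-memory version store. This is a conceptual toy assuming a linear commit log; it is not the Iceberg or lakeFS API, and the class and method names are invented for the sketch.

```python
# Minimal in-memory model of table-level branching with fast-forward-only
# merge, mirroring the Write-Audit-Publish behavior described above.
# Assumes a single linear commit log; not any real catalog's API.

class BranchStore:
    def __init__(self, initial_rows):
        self.commits = [list(initial_rows)]   # commit log: full snapshots
        self.branches = {"main": 0}           # branch name -> commit index

    def create_branch(self, name, from_branch="main"):
        self.branches[name] = self.branches[from_branch]

    def commit(self, branch, rows):
        self.commits.append(list(rows))
        self.branches[branch] = len(self.commits) - 1

    def fast_forward(self, target, source):
        # Allowed only if target has not advanced past the branch point;
        # otherwise a three-way merge would be required, which this model
        # (like fast-forward-only systems) does not support.
        if self.branches[target] > self.branches[source]:
            raise ValueError("non-fast-forward merge: three-way merge needed")
        self.branches[target] = self.branches[source]

store = BranchStore([{"id": 1, "v": "a"}])
store.create_branch("audit")                      # write: stage on a branch
store.commit("audit", [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}])
store.fast_forward("main", "audit")               # publish: fast-forward main
```

The `ValueError` path is the crux: the moment both branches advance, fast-forward-only systems have no answer, which is exactly the gap the next section describes.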

External tools

lakeFS (acquired DVC November 2025): Lake-wide branching, three-way merge using common ancestors, pre-merge webhooks for custom validation. The most capable data versioning tool available.

Project Nessie: Catalog-level branching with multi-table atomic transactions for Iceberg metadata. Read Committed, Repeatable Read, and Serializable isolation levels.

The ML-driven merge gap

No tool supports ML-model-driven merge decisions natively. The architecture is feasible: lakeFS pre-merge webhooks fire HTTP POST to an ML Gateway → gateway diffs branches → runs schema drift detection, distribution shift (KS test, PSI), anomaly detection (isolation forest), downstream model performance evaluation → returns accept/reject.

But this is entirely custom engineering. Academic research on ML-driven merge exists for code (DeepMerge, Gmerge), not data. This is a genuine research gap and a genuine product opportunity.
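The decision logic of such a gate can be prototyped in a few lines. This is a toy version of one check from the pipeline above, distribution shift via Population Stability Index (PSI), comparing a numeric column on the main branch against the candidate branch; the 0.2 threshold is a common rule of thumb, not a standard, and the function names are hypothetical.

```python
# Toy merge-gate decision: compare a column's distribution on main vs. the
# candidate branch using Population Stability Index (PSI). In the architecture
# described above, this logic would live behind a lakeFS pre-merge webhook.
import math

def psi(expected, actual, bins=10):
    """PSI over equal-width bins spanning both samples."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def frac(xs, i):
        count = sum(1 for x in xs if lo + i * width <= x < lo + (i + 1) * width)
        if i == bins - 1:                      # include the upper edge
            count += sum(1 for x in xs if x == hi)
        return max(count / len(xs), 1e-6)      # floor avoids log(0)

    return sum((frac(actual, i) - frac(expected, i))
               * math.log(frac(actual, i) / frac(expected, i))
               for i in range(bins))

def merge_gate(main_col, branch_col, threshold=0.2):
    score = psi(main_col, branch_col)
    return {"psi": score, "decision": "accept" if score < threshold else "reject"}

same = [i / 100 for i in range(100)]
shifted = [x + 5 for x in same]
```

A production gate would layer several such checks (schema drift, anomaly detection, downstream model evaluation) and return a single verdict to the webhook, which is exactly the custom engineering the text describes.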


Problem 3: Multi-layer context grounding

AI agents need six layers of context: metadata/lineage, query history, human annotations, code enrichment, access control, and evaluation loops. Layer-by-layer:

Metadata and lineage: Databricks Unity Catalog captures column-level lineage automatically from Spark execution plans with 1-year retention and External Lineage API. Snowflake Horizon provides ACCESS_HISTORY with per-query column tracking plus automatic PII classification (CLASSIFICATION_PROFILE). Databricks leads on breadth; Snowflake leads on PII detection.

Query history: Snowflake QUERY_HISTORY provides 365-day retention with parameterized query hash, bytes scanned, cache hit rate, partition pruning stats, and credit consumption. Databricks system.query.history is Public Preview with less granular cost attribution. Snowflake leads clearly.

Human annotations: MLflow 3.0 provides mlflow.log_feedback() with typed assessments (HUMAN, CODE, LLM_JUDGE), Labeling Sessions for domain experts, and revision tracking. Snowflake has no native human annotation system. Databricks leads by a massive margin.

Code-level enrichment: Databricks links notebooks → lineage via Unity Catalog. MLflow links model versions → Git commits → evaluation runs → serving endpoints. Snowflake's Git integration (GA April 2024) provides repository stages but code-to-data connection is indirect. Databricks leads.

Access control: Snowflake offers Tri-Secret Secure (dual-key encryption, unique in the industry). Databricks offers native ABAC in Unity Catalog with tag-driven policies. Concern: Snowflake Cortex Search bypasses dynamic data masking. Tie with different strengths.

Evaluation loops: MLflow 3.0 provides mlflow.genai.evaluate() with composable scorers, golden-set regression in CI/CD, and ALHF framework (MemAlign) for aligning LLM judges with human preferences. Snowflake DMFs handle table-level data quality; Cortex AI Observability (TruLens-powered, Public Preview) is far less integrated. Databricks leads decisively.

The integration gap

Databricks wins 4 of 6 layers. But no platform integrates all six into a coherent context engine. Unity Catalog lineage, MLflow tracing, MLflow feedback, and Mosaic AI Gateway evolved independently. The whole is less than the sum of its parts.


Problem 4: AI control planes

Policy enforcement: endpoint-level, not action-level

Databricks Mosaic AI Gateway: PII filtering (Presidio), safety guardrails (Llama Guard), rate limiting, audit logging. Operates at model endpoint level — inspects text payloads, not agent actions.

The propose → authorize → execute pattern: agent proposes structured action → policy engine evaluates against authorization context → only authorized actions execute. No platform implements this. It requires formal action ontologies and runtime policy evaluation at the business logic level — an open research problem.
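A minimal sketch of what this pattern would look like, assuming a hand-written action ontology and policy table. Both are hypothetical; as the text notes, no current platform ships these primitives.

```python
# Sketch of the propose -> authorize -> execute pattern described above.
# The action ontology (Action.kind) and policy table are invented for
# illustration; no platform provides them today.
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    kind: str          # entry in a formal action ontology, e.g. "refund"
    resource: str      # business object the action targets
    amount: float = 0.0

# Policy: per-role allow-list of action kinds, with per-kind limits.
POLICIES = {
    "support_agent": {"read": float("inf"), "refund": 100.0},
}

def authorize(role: str, action: Action) -> bool:
    """Evaluate the proposed action against the caller's authorization context."""
    limit = POLICIES.get(role, {}).get(action.kind)
    return limit is not None and action.amount <= limit

def execute(role: str, action: Action) -> str:
    # Agents never invoke effects directly; every proposal passes the gate.
    if not authorize(role, action):
        return f"DENIED: {action.kind} on {action.resource}"
    return f"EXECUTED: {action.kind} on {action.resource}"
```

The contrast with endpoint-level gateways is visible in the types: the policy engine sees a structured Action, not a text payload, so it can enforce business rules ("refunds capped at 100") that no prompt filter can express.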

Semantic observability: MLflow Tracing vs. nothing

MLflow Tracing captures OpenTelemetry-compatible hierarchical span trees with typed span_type (AGENT, LLM, TOOL, RETRIEVER, EMBEDDING). 20+ framework integrations via one-line mlflow.<library>.autolog(). Assessment system attaches feedback directly to traces.

Snowflake has no native equivalent. The gap between "semantic observability" (understanding why an agent chose action A over B) and "query logging" (knowing a query executed) is the gap between causal debugging and event auditing.
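The hierarchical, typed trace structure described above can be illustrated with a toy span tree. This shows the shape of the data only; it is not the MLflow API, and the class and method names are invented.

```python
# Toy illustration of a hierarchical, typed span tree like the one MLflow
# Tracing records. Structure only; not the MLflow API.
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    span_type: str                      # e.g. AGENT, LLM, TOOL, RETRIEVER
    children: list = field(default_factory=list)

    def child(self, name, span_type):
        s = Span(name, span_type)
        self.children.append(s)
        return s

    def why(self, indent=0):
        """Walk the tree. The nesting is what lets you ask *why* the agent
        chose an action (which retrieval fed which LLM call), not just
        *that* a query executed."""
        lines = ["  " * indent + f"{self.span_type}: {self.name}"]
        for c in self.children:
            lines.extend(c.why(indent + 1))
        return lines

root = Span("answer_customer", "AGENT")
retrieval = root.child("search_tickets", "RETRIEVER")
root.child("draft_reply", "LLM")
retrieval.child("embed_query", "EMBEDDING")
```

Flatten this tree into a query log and the causal structure (which spans fed which) is exactly what disappears, which is the gap between causal debugging and event auditing.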

Microsoft Foundry: most forward-looking vision

Foundry Control Plane (Public Preview, Ignite November 2025): Fleet Management, Entra Agent ID (agent-level identity with Zero Trust), AI Red Teaming Agent. The only implementation of agent identity in production. But Azure-locked, preview, and operates at prompt level not action level.


Problem 5: Autonomous infrastructure operations

What exists today

Every component of the modern data stack (orchestration, transformation, quality monitoring, experiment tracking, cataloging, versioning) requires human configuration.

Why assembly doesn't work

Autonomous operations require a unified model of the data estate. Each point solution has a partial, incompatible model:

  • Airflow: DAGs of tasks
  • dbt: SQL nodes in a DAG
  • Monte Carlo: tables with quality scores
  • MLflow: experiments, runs, models, traces
  • Unity Catalog: hierarchical asset catalog
  • lakeFS: repositories with branches

These are six different ontologies. You cannot reconstruct global understanding from locally optimized partial views connected by narrow interfaces. When Monte Carlo detects a quality anomaly, it can alert a human — but it cannot automatically determine that the anomaly was caused by a pipeline change in Airflow, affected a feature in MLflow, degraded a model endpoint, and should trigger a rollback. That reasoning chain crosses six tools with no shared ontology.
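To make the missing capability concrete, here is a toy version of what a shared ontology would enable: a single graph spanning tools, over which the reasoning chain above becomes a plain traversal. All node names and edges are hypothetical.

```python
# What a shared ontology would enable: one cross-tool dependency graph in
# which an anomaly can be traced from a pipeline change through quality
# checks to an affected model endpoint. All names are hypothetical.
EDGES = {
    # (tool, asset) -> downstream (tool, asset) dependencies
    ("airflow", "ingest_dag_v2"): [("dbt", "stg_orders")],
    ("dbt", "stg_orders"): [("monte_carlo", "orders_freshness_check")],
    ("monte_carlo", "orders_freshness_check"): [("mlflow", "churn_features")],
    ("mlflow", "churn_features"): [("serving", "churn_model_endpoint")],
}

def downstream_impact(node):
    """Walk the shared graph to find everything affected by one change."""
    seen, stack = [], [node]
    while stack:
        current = stack.pop()
        for nxt in EDGES.get(current, []):
            if nxt not in seen:
                seen.append(nxt)
                stack.append(nxt)
    return seen

impact = downstream_impact(("airflow", "ingest_dag_v2"))
```

The traversal itself is trivial; what does not exist today is the graph. Each of the six tools holds one or two of these edges in its own schema, with no shared identifiers to join them on.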


Summary: the state of the industry

Problem                                    | Maturity | Best available
Unified structured + unstructured querying | Emerging | Snowflake FILE + Cortex AI
Git-like data branching and merging        | Emerging | lakeFS + Iceberg or Nessie
ML-driven data merge decisions             | Nascent  | Custom build on lakeFS hooks
Multi-layer context grounding              | Emerging | Databricks (Unity + MLflow + Gateway)
Full AI control plane                      | Nascent  | Microsoft Foundry CP (preview)
Autonomous infrastructure operations       | Absent   | Nothing comprehensive

None of the six rises above "Emerging", and three sit at "Nascent" or "Absent". The best available solutions require assembling 4–8 tools. The most consequential capabilities don't exist as products at all.


Next: Part 6: The Autonomous DataOps Thesis

Previous: Part 4: The AI Readiness Crisis

This post is part of State of the Data Infrastructure 2026, an eight-part series by The Sciencer Company.