The HTAP Bridge and Its Limits

Q1 2026 · The Sciencer Company


Part 3 of State of the Data Infrastructure 2026


Before the modern data stack, before Fivetran and dbt and Snowflake, the enterprise data architecture had a different pain point: the ETL gap. Transactional systems (OLTP) recorded business events. Analytical systems (OLAP) answered business questions. Between them sat an Extract-Transform-Load pipeline that could take hours — sometimes days — to propagate changes from one world to the other.

In 2014, Gartner coined the term Hybrid Transactional/Analytical Processing (HTAP) to describe a new class of systems that would eliminate this gap: databases that could handle both transactions and analytics in a single system, with real-time data freshness [1]. The vision was compelling: no more ETL. No more stale dashboards. Business decisions informed by data that's seconds old, not hours old.

A decade later, HTAP is a mature field with production systems at every major database vendor. Oracle, SQL Server, SAP HANA, TiDB, MySQL Heatwave, Snowflake Unistore, AlloyDB — all claim HTAP capabilities. Academic research has produced comprehensive taxonomies, four distinct architectural patterns, and dedicated benchmarks [2].

The HTAP story matters for our argument because it represents the most sophisticated attempt to evolve database architecture beyond the BI Assumption — and it reveals, with technical precision, exactly where evolution reaches its limits.


The fundamental HTAP trade-off

Every HTAP system must confront a trade-off that has proven impossible to eliminate: data freshness versus performance isolation [2].

Data freshness measures how quickly changes from transactional operations become visible to analytical queries. In an ideal HTAP system, an INSERT committed by the OLTP engine would be immediately visible to an OLAP query — zero lag. In practice, freshness ranges from microseconds (HyPer's virtual memory snapshots) to seconds or minutes (TiDB's log-based replication).
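Freshness can be made measurable as visibility lag. The following is a trivial sketch using a definition of our own, not any vendor's metric:

```python
import time

# Freshness as "visibility lag": how far the analytical replica trails the
# newest committed OLTP write. A toy definition, not a vendor metric.
def freshness_lag(last_oltp_commit_ts: float, last_applied_ts: float) -> float:
    """Lag in seconds; 0.0 means the analytical side is fully caught up."""
    return max(0.0, last_oltp_commit_ts - last_applied_ts)

now = time.time()
print(freshness_lag(now, now - 2.5))  # replica applied its last change 2.5 s ago
```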

Performance isolation measures how well the system prevents OLTP and OLAP workloads from interfering with each other. In a fully isolated system, running a complex analytical scan has zero impact on transaction throughput. In practice, isolation degrades as freshness improves: systems that share memory for real-time freshness (like HyPer) see up to 40% OLTP throughput degradation under concurrent analytical load [2].

This trade-off is not a design flaw — it's a physical constraint. OLTP and OLAP workloads have fundamentally different computational profiles:

Property         | OLTP                                | OLAP
-----------------|-------------------------------------|------------------------------------------
Data access      | Point lookups, small ranges         | Full table scans, large aggregations
Duration         | Milliseconds                        | Seconds to minutes
Concurrency      | Thousands of short transactions     | Handful of long-running queries
Resource pattern | CPU-bound, memory-latency-sensitive | I/O-bound, memory-bandwidth-intensive
Optimal storage  | Row-oriented (locality per record)  | Column-oriented (locality per attribute)

No single storage format, memory layout, or execution strategy is optimal for both. HTAP systems manage this by maintaining two storage formats — a row store for OLTP and a column store for OLAP — and synchronizing between them. The architectural choices around how this synchronization works define the four major HTAP architectures.
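The row-versus-column distinction can be made concrete with a toy example (illustrative Python, not any engine's internals): the same three-row table in both layouts, and the access pattern each one favors.

```python
# Row store: one record per entry -> cheap point lookup of a whole record.
rows = [
    {"id": 1, "amount": 30.0, "region": "EU"},
    {"id": 2, "amount": 12.5, "region": "US"},
    {"id": 3, "amount": 99.9, "region": "EU"},
]

# Column store: one array per attribute -> cheap full-column scan.
columns = {
    "id": [1, 2, 3],
    "amount": [30.0, 12.5, 99.9],
    "region": ["EU", "US", "EU"],
}

# OLTP-style access: fetch one record (a row store keeps all its fields
# contiguous; a column store would gather them from three separate arrays).
record = next(r for r in rows if r["id"] == 2)

# OLAP-style access: aggregate one attribute (a column store scans one
# contiguous array; a row store would drag every other field through cache).
total = sum(columns["amount"])
print(record["amount"], total)
```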


The four HTAP architectures

The academic taxonomy, as formalized by Zhang et al. [2], classifies production HTAP systems into four architectures based on their storage strategy and processing paradigm.

Architecture A: Primary Row Store + In-Memory Column Store

Representatives: Oracle In-Memory [3], SQL Server [4], DB2 BLU [5]

This architecture maintains a primary row store on disk for OLTP durability and adds an in-memory column store (IMCS) for OLAP acceleration. The row store handles transactional reads and writes with ACID guarantees via MVCC. Recent data changes accumulate in an in-memory delta structure. The delta is periodically merged into the column store.
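The write path just described can be sketched as a deliberately minimal model; the `MERGE_THRESHOLD`, structures, and function names here are our own, not Oracle's or SQL Server's.

```python
MERGE_THRESHOLD = 3  # hypothetical tuning knob for the background merge

delta = []                               # row-wise, append-only: fast OLTP writes
column_store = {"id": [], "amount": []}  # columnar (plain lists stand in for
                                         # compressed column segments)

def insert(row):
    delta.append(row)
    if len(delta) >= MERGE_THRESHOLD:
        merge()

def merge():
    # background step: fold the in-memory delta into the column store
    for row in delta:
        column_store["id"].append(row["id"])
        column_store["amount"].append(row["amount"])
    delta.clear()

def analytic_sum():
    # Analytical reads must consult BOTH stores to see fresh data -- the
    # overhead the text describes as proportional to delta size.
    return sum(column_store["amount"]) + sum(r["amount"] for r in delta)

for i, amt in enumerate([10.0, 20.0, 30.0, 40.0]):
    insert({"id": i, "amount": amt})
print(analytic_sum())  # 100.0: three rows merged, one still in the delta
```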

Oracle's implementation combines a row-based buffer cache with an IMCS consisting of In-Memory Compression Units (IMCUs). Each IMCU is immutable — populated from the buffer cache and associated with a Snapshot Metadata Unit (SMU) that tracks changes. Merging updates requires creating a new IMCU incorporating the SMU's changes.

SQL Server integrates the Hekaton in-memory OLTP engine with a Column Store Index (CSI). Unlike Oracle's immutable IMCUs, SQL Server's CSI is updateable — changes are appended to a tail structure indexed by a B-tree, periodically merged into compressed column segments. Infrequently accessed columns are automatically compressed and spilled to disk.

Strengths: High analytical throughput (IMCS), high data freshness (in-memory delta access). Weaknesses: Low scalability (single-node memory bound), low isolation (shared resources).

Application fit: Banking and finance — high throughput and freshness required, but transaction volumes fit within single-node memory. The customer base is relatively fixed, so scalability pressure is low.

Architecture B: Distributed Row Store + Column Store Replica

Representatives: TiDB [6], F1 Lightning [7]

This architecture distributes the OLTP workload across a cluster of row-based nodes and asynchronously replicates data to column-store nodes for analytical processing.

TiDB is a Raft-based distributed HTAP database. Data is partitioned into row-based regions distributed across TiKV nodes. A leader node handles read-write transactions. Follower nodes serve read transactions. Critically, one or more learner nodes maintain columnar replicas (TiFlash) for analytical processing. Learner nodes are read-only and don't participate in Raft voting. Data synchronization happens through Raft log replication — the leader asynchronously ships logs to learner nodes, which replay them into a delta merge tree that periodically merges changes into columnar storage.

TiDB uses a global two-phase commit protocol for distributed transactions and a centralized cost-based optimizer that supports cross-engine query processing — routing sub-plans to either TiKV (row store) or TiFlash (column store).

F1 Lightning (Google) integrates a columnar analytical engine with Spanner, Google's globally-distributed OLTP database. A component called changepump uses change data capture to detect new writes, transforms them from row-wise to columnar format, and merges them via a log-structured merge (LSM) process with deduplication and version collapsing.
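The version-collapsing step of a changepump-style pipeline can be sketched like this (hypothetical code; F1 Lightning's actual pipeline is far richer): given a stream of CDC change records, keep only the newest version of each key before merging into columnar storage.

```python
# CDC change records as (key, commit_timestamp, row) tuples.
changes = [
    ("k1", 1, {"amount": 10}),
    ("k2", 1, {"amount": 5}),
    ("k1", 3, {"amount": 25}),   # supersedes the ts=1 version of k1
    ("k2", 2, {"amount": 7}),
]

def collapse(changes):
    # Deduplicate per key, keeping only the highest commit timestamp, so the
    # downstream merge writes each key once instead of once per version.
    latest = {}
    for key, ts, row in changes:
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, row)
    return {k: row for k, (_, row) in sorted(latest.items())}

print(collapse(changes))  # {'k1': {'amount': 25}, 'k2': {'amount': 7}}
```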

Strengths: High scalability (distributed OLTP), high isolation (physically separated OLTP and OLAP nodes). Weaknesses: Low freshness (log shipping latency), high merge cost (distributed delta merging).

Application fit: E-commerce at scale — massive concurrent transaction volumes require distributed OLTP, and slight analytical staleness (seconds to minutes) is acceptable for dashboards showing customer retention rates or inventory trends.

Architecture C: Primary Row Store + Distributed In-Memory C-Store

Representative: MySQL Heatwave [8]

This architecture couples a traditional single-node OLTP engine with a distributed in-memory column store cluster for analytics.

MySQL Heatwave tightly integrates the MySQL database with a distributed IMCS cluster. Transactions execute entirely in MySQL. Frequently accessed columns are loaded into the Heatwave cluster. Complex queries are pushed down to Heatwave for acceleration. Data synchronization occurs automatically in three cases: every 200 milliseconds, when the buffer reaches 64 MB, or when a query needs access to recently updated data.
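The three synchronization triggers can be modeled as a simple flush policy. This is our own reconstruction of the published behavior; the class and method names are invented, not Heatwave internals.

```python
import time

FLUSH_INTERVAL_S = 0.2            # time trigger: every 200 ms
FLUSH_BYTES = 64 * 1024 * 1024    # size trigger: 64 MB buffered

class ChangeBuffer:
    """Buffers OLTP changes bound for the column-store cluster (sketch)."""

    def __init__(self):
        self.bytes_pending = 0
        self.last_flush = time.monotonic()

    def should_flush(self, query_needs_fresh_data=False):
        # Any of the three triggers fires a flush: an analytical query
        # demanding fresh data, the size threshold, or the time interval.
        return (
            query_needs_fresh_data
            or self.bytes_pending >= FLUSH_BYTES
            or time.monotonic() - self.last_flush >= FLUSH_INTERVAL_S
        )

    def flush(self):
        # shipping the pending changes to the cluster is elided
        self.bytes_pending = 0
        self.last_flush = time.monotonic()

buf = ChangeBuffer()
buf.bytes_pending = 70 * 1024 * 1024
print(buf.should_flush())  # True: the size trigger fired
```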

Heatwave includes an "auto-pilot" service that automates data partitioning, query execution, and scheduling — one of the few production systems with significant autonomous operations capabilities.

Strengths: High isolation (different nodes for OLTP/OLAP), high OLAP throughput and scalability (distributed IMCS). Weaknesses: Medium-to-low freshness (synchronization latency), low OLTP horizontal scalability (single MySQL node).

Application fit: IoT analytics — high analytical throughput for sensor data aggregation, relatively low update frequency means the standalone OLTP node is sufficient.

Architecture D: Primary Column Store + Delta Row Store

Representatives: SAP HANA [9], HyPer [10]

This architecture inverts the typical approach: the primary store is columnar (for OLAP), and a delta row store handles OLTP writes.

SAP HANA organizes in-memory data into three layers: L1-delta (row-wise, recent writes), L2-delta (columnar, local dictionary), and Main (compressed column store, global dictionary). When L1-delta reaches a threshold (e.g., 100K tuples), data transforms from row-wise to columnar in L2. L2 then merges into the Main column store using dictionary reconciliation.
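The three-layer flow can be sketched as a toy pipeline, with invented thresholds and plain dicts standing in for HANA's dictionary encoding.

```python
L1_LIMIT = 2   # stand-in for the ~100K-tuple threshold mentioned above

l1_delta = []                             # row-wise, write-optimized
l2_delta = {"region": [], "dict": {}}     # columnar, local dictionary
main = {"region": [], "dict": {}}         # columnar, global dictionary

def encode(store, value):
    # dictionary encoding: map each distinct value to a small integer code
    d = store["dict"]
    if value not in d:
        d[value] = len(d)
    return d[value]

def write(row):
    l1_delta.append(row)
    if len(l1_delta) >= L1_LIMIT:
        # L1 -> L2: row-wise data becomes columnar, locally encoded
        for r in l1_delta:
            l2_delta["region"].append(encode(l2_delta, r["region"]))
        l1_delta.clear()

def merge_l2_into_main():
    # L2 -> Main: reconcile local codes against the global dictionary
    reverse = {code: v for v, code in l2_delta["dict"].items()}
    for code in l2_delta["region"]:
        main["region"].append(encode(main, reverse[code]))
    l2_delta["region"].clear()

for r in [{"region": "EU"}, {"region": "US"}, {"region": "EU"}]:
    write(r)
merge_l2_into_main()
print(main["region"], len(l1_delta))  # two global codes; one row still in L1
```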

HyPer (now part of Tableau/Salesforce) pioneered the copy-on-write approach: forking the OLTP process to create instant consistent snapshots for OLAP queries. The forked process operates on a consistent virtual memory snapshot, allowing analytical queries to read perfectly fresh data. Snapshot creation takes microseconds, but write-heavy workloads cause significant memory pressure from copy-on-write page duplication.
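The fork-based snapshot idea is easy to demonstrate on any Unix system. Below is a minimal sketch using Python's `os.fork`, not HyPer's implementation: the child process keeps seeing the state as of the fork, no matter what the parent writes afterward.

```python
import json
import os

# Toy copy-on-write snapshot in the style HyPer pioneered: fork the "OLTP
# process" so the child gets a frozen virtual-memory snapshot while the
# parent keeps applying writes. Requires a Unix-like OS (os.fork).
accounts = {"alice": 100, "bob": 50}

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child = analytical query: reads the consistent snapshot taken at fork.
    os.close(r)
    total = sum(accounts.values())            # 150, whatever the parent does
    os.write(w, json.dumps(total).encode())
    os.close(w)
    os._exit(0)
else:
    # Parent = OLTP engine: keeps mutating; COW pages shield the child.
    os.close(w)
    accounts["alice"] += 999                  # invisible to the snapshot
    snapshot_total = json.loads(os.read(r, 1024))
    os.waitpid(pid, 0)
    print(snapshot_total)  # 150: the snapshot predates the parent's write
```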

Strengths: Highest data freshness (microsecond snapshots or in-memory delta access), high OLAP throughput (primary column store). Weaknesses: Low OLTP scalability (delta-based transaction processing), low isolation (shared memory pool).

Application fit: Real-time fraud detection — zero tolerance for stale data (fraudulent transactions must be immediately visible to detection algorithms), but transaction volumes are manageable for single-node systems.


HTAP techniques: the engineering beneath the architecture

Beyond the high-level architecture, HTAP systems employ sophisticated techniques at every layer. Understanding these techniques reveals both the engineering depth of current systems and the specific boundaries where they reach their limits.

Data synchronization: the delta merge problem

The critical operational challenge in any dual-store HTAP system is synchronizing the delta — recent changes from the transactional store — into the analytical store. Three approaches exist:

Threshold-based merging (Oracle): Changes accumulate in a heap-table delta. When the delta reaches a size threshold, a background process migrates data to the column store. Fast for writes (append-only), but analytical queries must scan both the column store and the unordered delta to access fresh data, adding overhead proportional to delta size.

Delete-table merging (SQL Server): An index-organized delta (B-tree) maintains recent changes. A bitmap delete table maps row IDs to their column store locations. Merging assigns RIDs, transforms rows to columnar format at those locations, then removes RIDs from the delete table in an atomic transaction. Fast for reads (indexed delta), but writes pay B-tree insertion overhead.

Dictionary-based merging (SAP HANA): Both delta and main store use columnar format with dictionaries. Local dictionaries in the delta are reconciled with the global dictionary in the main store during merging. Efficient when both stores are columnar, but dictionary growth becomes problematic under high update rates.
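Delete-table merging in miniature, as our sketch of the scheme described above rather than SQL Server code: an indexed delta holds recent rows, and a delete table masks the column-store rows those versions supersede.

```python
column_store = {"amount": [10, 20, 30]}   # positions are row IDs (RIDs)
deleted = set()                            # "delete table": masked RIDs
delta = {}                                 # key -> (rid, row); stands in for
                                           # the B-tree-indexed delta

def update(key, rid, row):
    delta[key] = (rid, row)
    deleted.add(rid)                       # old version becomes invisible

def merge():
    # the atomic step from the text: write each delta row into its
    # column-store location, then drop that RID from the delete table
    for rid, row in delta.values():
        column_store["amount"][rid] = row["amount"]
        deleted.discard(rid)
    delta.clear()

def scan_sum():
    # analytical reads skip deleted RIDs and add still-unmerged delta rows
    live = sum(v for rid, v in enumerate(column_store["amount"])
               if rid not in deleted)
    return live + sum(row["amount"] for _, row in delta.values())

update("k1", 1, {"amount": 25})   # replaces the RID-1 value (20) with 25
before = scan_sum()               # 10 + 30 + 25 = 65
merge()
print(before, scan_sum())         # 65 65: merging changes layout, not data
```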

Query optimization: hybrid scan decisions

HTAP databases face a unique query optimization challenge: for any given query (or sub-plan), should the engine use the row store, the column store, or both? This decision — the hybrid scan — adds a dimension to the already complex query optimization search space.

Rule-based approaches (Oracle) follow simple heuristics: "column first, row later." If columns exist in the IMCS, scan them there; fall back to the row store for columns not loaded into memory.

Cost-based approaches (TiDB, PolarDB-IMCI) estimate the cost of row scans, index scans, and column scans, then select the minimum-cost access path. TiDB's optimizer makes this choice only at the scan level, however — it doesn't model the cost of cross-engine data exchange.
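A minimal cost-based access-path choice might look like the following. The cost constants are invented for illustration; real optimizers use calibrated cost models and table statistics.

```python
def choose_access_path(n_rows, selectivity, n_cols_needed, n_cols_total):
    """Pick the cheapest scan strategy under toy per-value cost constants."""
    costs = {
        # full row scan reads every column of every row
        "row_scan": n_rows * n_cols_total * 1.0,
        # index scan pays a per-matching-row lookup; wins when few qualify
        "index_scan": n_rows * selectivity * 20.0,
        # column scan reads only the needed columns, cheaply per value
        "column_scan": n_rows * n_cols_needed * 0.1,
    }
    return min(costs, key=costs.get)

# point lookup: 0.001% of 10M rows qualify -> the index wins
print(choose_access_path(10_000_000, 0.00001, 2, 20))   # index_scan
# aggregation over 2 of 20 columns, half the rows -> the column store wins
print(choose_access_path(10_000_000, 0.5, 2, 20))       # column_scan
```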

The recent Metis system [11] generates HTAP-aware hybrid plans by also modeling the cost of delta scanning and data synchronization — the first optimizer to explicitly account for freshness-performance trade-offs in plan selection.

Resource scheduling: the unsolved optimization

How should a system allocate CPU threads, memory, and I/O bandwidth between OLTP and OLAP workloads?

Workload-driven scheduling monitors throughput and adjusts parallelism: when OLAP is saturating CPU, reduce OLAP threads and increase OLTP threads. This maximizes throughput but ignores freshness.

Freshness-driven scheduling [12] switches between execution modes based on a freshness threshold. When freshness drops below an SLA-defined level, the system moves from isolated execution (high isolation, low freshness) to co-located execution (low isolation, high freshness). This balances freshness and throughput but causes performance fluctuations from mode switching.
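The mode-switching policy can be sketched as a small state machine. The SLA value and the hysteresis band are our own illustrative choices, added to show one way to damp the fluctuation that naive switching at a single threshold produces.

```python
FRESHNESS_SLA_S = 1.0   # hypothetical SLA: analytics may lag OLTP by <= 1 s

def next_mode(mode, lag_s):
    # Violating the SLA forces co-located execution so the sync backlog
    # drains; switching back requires the lag to fall well under the SLA
    # (hysteresis), so the scheduler doesn't flap at the boundary.
    if mode == "isolated" and lag_s > FRESHNESS_SLA_S:
        return "co-located"
    if mode == "co-located" and lag_s < FRESHNESS_SLA_S / 2:
        return "isolated"
    return mode

mode = "isolated"
for lag in [0.2, 0.6, 2.4, 1.3, 0.4]:   # simulated replication-lag samples
    mode = next_mode(mode, lag)
    print(f"lag={lag}s -> {mode}")
```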

No existing system implements holistic scheduling — simultaneously optimizing workload performance, resource allocation, data freshness, and cost. The HTAP survey identifies this as an open research problem.


Why HTAP is necessary but insufficient for AI

HTAP represents the most sophisticated attempt to evolve database architecture beyond the strict OLTP/OLAP separation. It's real engineering, deployed at scale, solving real problems. And for its intended use cases — real-time dashboards over transactional data, fraud detection, operational reporting — it works.

But HTAP was designed to serve the same consumer as BI: a human being, or a human-designed query, asking analytical questions about transactional data. The architectural innovations — dual stores, delta synchronization, hybrid scans, resource scheduling — all optimize for the interaction between structured transactions and structured analytics.

AI agents require at least five capabilities that HTAP architectures don't address:

Limitation 1: HTAP is structured-data-only

All four HTAP architectures — row stores, column stores, and their synchronization mechanisms — handle structured, tabular data. None addresses semi-structured (JSON, nested arrays), unstructured (documents, images, audio), or vector (embeddings) data.

An AI agent reasoning about a customer needs to JOIN transactional data (HTAP's domain) with unstructured documents, vector embeddings, and graph-structured metadata. HTAP's dual-store model (row + column) would need to become a multi-store model (row + column + document + vector + graph) with synchronization mechanisms across all formats. No HTAP system proposes this.

Limitation 2: HTAP is read-mostly for analytics

The HTAP analytical workload is overwhelmingly read-only: SQL queries that scan, aggregate, join, and filter. The transactional workload does the writing. This separation — transactions write, analytics read — is fundamental to how HTAP manages the freshness-isolation trade-off.

AI agents write back. An agent that processes a customer interaction doesn't just read the customer's data — it writes a recommendation, updates a case status, logs its reasoning, and stores evaluation metrics. These writes need transactional guarantees, governance, and lineage tracking. HTAP's analytical store is not designed for mixed read-write agent workloads.

Limitation 3: HTAP quality is freshness, not correctness

HTAP benchmarks (CH-benCHmark, HTAPBench, HyBench [13]) measure three things: OLTP throughput (tpmC), OLAP throughput (QphH), and data freshness. The question they answer is: "how fast can you analyze how-recent data?"

AI systems need a different quality model: "is the AI output correct, safe, and appropriate?" This requires evaluating model predictions against ground truth, assessing retrieval relevance, checking for hallucinations, monitoring for bias, and tracking end-to-end from data to decision. HTAP's freshness metric is necessary (stale data causes bad AI outputs) but radically insufficient.

Limitation 4: HTAP governance is RBAC

Every HTAP system uses standard database access control — roles, privileges, grants — for governance. This is the same RBAC model we critiqued in Part 1. HTAP doesn't introduce any new governance primitive for agent workloads: no capability-based access, no action-level authorization, no delegation models, no compositional safety reasoning.

Limitation 5: HTAP is manually configured

Deploying an HTAP system requires choosing an architecture (A, B, C, or D), configuring the column store (which columns to load, how much memory to allocate), tuning delta merge thresholds, setting freshness SLAs, configuring resource scheduling parameters, and setting up synchronization between stores. MySQL Heatwave's auto-pilot service is a notable exception — it automates some of these decisions — but most HTAP deployments require significant manual tuning by experienced DBAs.

AI-native infrastructure needs to be autonomous: self-configuring based on observed workload patterns, self-healing when components fail, self-tuning as workloads evolve. The HTAP survey identifies "holistic scheduling" as an open problem — and that's just one dimension of the autonomous operations challenge.


The HTAP lesson: necessary evolution, insufficient destination

HTAP teaches us that even the most sophisticated evolution of existing database architectures — driven by a decade of research, implemented by every major vendor, validated by production deployments — reaches a boundary when the workload fundamentally changes.

The boundary isn't capability — HTAP systems are impressively capable. The boundary is assumption. HTAP assumes structured data, human consumers, read-only analytics, RBAC governance, and manual configuration. AI agents violate every one of these assumptions.

The transition from BI to AI requires not just another HTAP-style bridge between two workload types, but a fundamentally new architecture designed for the workload patterns that AI agents actually generate. That architecture needs to handle multiple data modalities, support bidirectional agent data flow, provide semantic quality monitoring, implement agent-aware governance, and operate autonomously.

In Part 4, we'll quantify how badly the current infrastructure is failing AI — not with theoretical arguments, but with data from every major analyst firm and academic research institution that has studied the question.




Next: Part 4: The AI Readiness Crisis

Previous: Part 2: The Database Landscape Is Shifting

This post is part of State of the Data Infrastructure 2026, an eight-part series by The Sciencer Company.

Footnotes

  1. M. Pezzini, D. Feinberg, N. Rayner, and R. Edjlali. "Hybrid Transaction/Analytical Processing Will Foster Opportunities For Dramatic Business Innovation." Gartner, 2014.

  2. C. Zhang, G. Li, J. Zhang, X. Zhang, J. Feng. "HTAP Databases: A Survey." arXiv:2404.15670, April 2024.

  3. T. Lahiri et al. "Oracle Database In-Memory: A Dual Format In-Memory Database." ICDE, 2015.

  4. P.-A. Larson et al. "Real-Time Analytical Processing with SQL Server." VLDB, 2015.

  5. V. Raman et al. "DB2 with BLU Acceleration: So Much More Than Just A Column Store." VLDB, 2013.

  6. D. Huang et al. "TiDB: A Raft-based HTAP Database." VLDB, 2020.

  7. J. Yang et al. "F1 Lightning: HTAP as a Service." VLDB, 2020.

  8. "MySQL Heatwave: Real-time Analytics for MySQL Database Service." Oracle, 2024.

  9. V. Sikka et al. "Efficient Transaction Processing in SAP HANA Database: The End of A Column Store Myth." SIGMOD, 2012.

  10. T. Neumann, T. Mühlbauer, A. Kemper. "Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems." SIGMOD, 2015.

  11. H. Song et al. "Rethink Query Optimization in HTAP Databases." PACMMOD, 2023.

  12. A. Raza et al. "Adaptive HTAP Through Elastic Resource Scheduling." SIGMOD, 2020.

  13. C. Zhang, G. Li, T. Lv. "HyBench: A New Benchmark for HTAP Databases." VLDB, 2024.