Database Processing Models
Databases exist to serve workloads, and the most consequential architectural decisions in data infrastructure flow from which workloads a system was designed to handle. For decades, two workload types — transactions and analytics — were served by entirely different systems built on entirely different assumptions. The HTAP movement attempted to unify them. Embedded analytics challenges whether analytics even needs a server. And at the frontier, the database research community is questioning whether the workload taxonomy itself needs to expand.
Understanding these processing models — not as abstract categories, but as concrete engineering trade-offs with real systems making real decisions — reveals both the achievements and the limitations of current infrastructure. Every model described here encodes assumptions about who (or what) is consuming the data. Those assumptions determine what works and what breaks when AI agents arrive.
B-Tree Index
A self-balancing tree data structure that maintains sorted data and allows searches, insertions, and deletions in O(log n) time. B-trees (and their variant B+ trees, where all data lives in leaf nodes connected by pointers) are the default index structure in virtually every OLTP database: PostgreSQL, MySQL, Oracle, SQL Server, SQLite.
A B-tree of order m has nodes with up to m children and m-1 keys. Internal nodes store keys that act as signposts directing searches to the appropriate child. Leaf nodes store the actual data (or pointers to data rows). Because nodes are sized to match disk pages (typically 4–16 KB), a B-tree traversal requires one disk read per level — a tree with a branching factor of 500 and 4 levels can index 500⁴ = 62.5 billion records with only 4 disk reads per lookup.
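The descent described above can be sketched in a few lines. This is an illustrative minimal model, not a production implementation: the `Node` class and its fields are hypothetical, and each node visit stands in for one disk-page read.

```python
from bisect import bisect_right

class Node:
    """A toy B-tree node: sorted keys plus, for internal nodes, child pointers."""
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted signpost keys
        self.children = children    # None marks a leaf node
        self.values = values        # leaf payload (or pointers to rows)

def search(node, key):
    """Descend one node per level; each node visited models one page read."""
    while node.children is not None:
        # bisect picks which child subtree can contain the key
        node = node.children[bisect_right(node.keys, key)]
    if key in node.keys:
        return node.values[node.keys.index(key)]
    return None

# A 2-level tree: root signposts [10, 20] route searches to three leaves.
leaves = [
    Node([1, 5], values=["row1", "row5"]),
    Node([10, 15], values=["row10", "row15"]),
    Node([20, 25], values=["row20", "row25"]),
]
root = Node([10, 20], children=leaves)
```

A lookup of key 15 touches exactly two nodes — the root and one leaf — which is the per-level disk-read behavior the fanout arithmetic above describes.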
In the HTAP context, B-trees appear in delta store indexing. SQL Server's delete-table merge strategy uses a B-tree index on the delta to accelerate lookup of recent changes during analytical reads. TiDB's TiKV storage engine uses a variant where data is organized as sorted key-value pairs with B-tree-like range partitioning across distributed nodes (Regions).
The trade-off: B-trees optimize for point lookups and small range scans — the bread and butter of OLTP. For full-table analytical scans, B-trees provide no benefit (you're reading everything anyway) and their per-row indirection adds overhead compared to sequential columnar reads. This is why analytical engines use data skipping (file-level min/max statistics) rather than row-level indexes: the access granularity matches the workload.
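Data skipping via file-level min/max statistics, mentioned above as the analytical alternative to row-level indexes, can be sketched as follows. The file layout and statistics shape are illustrative, loosely modeled on how columnar formats carry zone-map metadata.

```python
# Each "file" carries min/max statistics for a column; a range predicate
# can skip any file whose statistics prove it cannot contain a match.
files = [
    {"name": "part-0", "min": 0,   "max": 99,  "rows": list(range(0, 100))},
    {"name": "part-1", "min": 100, "max": 199, "rows": list(range(100, 200))},
    {"name": "part-2", "min": 200, "max": 299, "rows": list(range(200, 300))},
]

def scan_between(files, lo, hi):
    hits, files_read = [], 0
    for f in files:
        if f["max"] < lo or f["min"] > hi:
            continue                  # skipped: stats prove no match
        files_read += 1               # only overlapping files are read
        hits.extend(v for v in f["rows"] if lo <= v <= hi)
    return hits, files_read

hits, files_read = scan_between(files, 150, 160)
```

The predicate `BETWEEN 150 AND 160` reads only one of three files — coarse, file-granularity pruning rather than per-row indirection.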
CH-benCHmark
A hybrid benchmark specifically designed to evaluate HTAP database systems by running TPC-C (transactional) and TPC-H-derived (analytical) workloads simultaneously against the same data. The "CH" stands for the combination of TPC-C and TPC-H.
Traditional benchmarks measure one workload type in isolation: TPC-C measures transaction throughput (tpmC — transactions per minute of type C), TPC-H measures analytical query throughput (QphH@Size — a composite queries-per-hour metric at a given scale factor). Neither captures the interaction effects that define HTAP performance: does running analytical queries degrade transaction throughput? How stale is analytical data during concurrent transactional load?
CH-benCHmark addresses this by running both workloads concurrently and measuring three dimensions: OLTP throughput (tpmC under concurrent OLAP load), OLAP throughput (analytical query completion rate under concurrent OLTP load), and data freshness (how quickly transactional changes become visible to analytical queries). Later extensions — HTAPBench, HyBench — added more sophisticated freshness metrics and workload variation, but CH-benCHmark established the paradigm.
For AI infrastructure evaluation, even CH-benCHmark is insufficient. It measures structured-data freshness between two well-defined workload types. AI agents introduce multi-modal data, mixed read-write analytical patterns, concurrent agent interactions, and quality dimensions (correctness, safety, relevance) that no existing benchmark captures. The absence of an AI-native data infrastructure benchmark is itself a signal of how new this problem space is.
Composable Database Building Blocks
An emerging architectural philosophy where database systems are assembled from reusable, interchangeable, high-performance components rather than built as monolithic systems. Instead of each database building its own query parser, optimizer, execution engine, and storage engine from scratch, composable building blocks provide standardized modules that can be mixed and matched.
Three projects exemplify this trend. Velox (Meta) is a C++ execution engine library designed to be embedded in any data processing system — it handles vectorized expression evaluation, memory management, and type systems. Apache DataFusion is a query processing framework written in Rust that provides SQL parsing, logical/physical planning, and columnar execution — used as the foundation for InfluxDB's query engine and Apache Arrow's ecosystem. Apache Calcite provides query planning and optimization as a standalone library — used by Apache Hive, Apache Drill, Apache Flink, and others as their optimizer without bundling it with a specific execution engine.
The Cambridge Report on Database Research discusses whether databases should aim for "POSIX-style specs for the interfaces between components" — standardized interfaces between optimizer, execution engine, storage engine, and storage format that would allow true plug-and-play assembly. The report also floats the radical idea that composable systems could use natural language interfaces between components: "a query optimizer might receive an NL description of a query, and translate it into an NL operator plan which an executor agent consumes." The authors are appropriately skeptical about ambiguity, but the proposal signals how fundamentally the field's thinking is shifting.
For AI-native infrastructure, composability is architecturally essential. An AI agent's query might require a columnar scan (DataFusion), a vector similarity search (specialized engine), a graph traversal (metadata engine), and a document parse (LLM inference) — all within a single logical operation. No monolithic database provides all of these. Composable building blocks make it feasible to build a system that routes each sub-plan to the optimal specialized engine, governed by a unified optimizer.
Concurrency Control
The set of mechanisms a database uses to ensure that concurrent transactions produce correct results — as if they had executed sequentially — without requiring actual sequential execution (which would destroy throughput).
Concurrency control is the core challenge of multi-user databases. Without it, concurrent transactions can produce anomalies: dirty reads (reading uncommitted data that may be rolled back), non-repeatable reads (the same query returning different results within one transaction), phantom reads (new rows appearing between two identical queries), and write skew (two transactions each making individually valid changes that together violate a constraint).
Two dominant paradigms exist. Lock-based concurrency control (traditional) acquires locks on data before accessing it. Shared locks allow concurrent reads; exclusive locks block all other access. Two-Phase Locking (2PL) guarantees serializability: transactions acquire all locks before releasing any. The cost is contention — transactions block each other, reducing throughput under high concurrency. Deadlocks (circular lock dependencies) require detection and victim selection.
Multi-Version Concurrency Control (MVCC) maintains multiple versions of each record. Readers see a consistent snapshot without acquiring locks; writers create new versions rather than overwriting. This eliminates read-write contention (readers never block writers, writers never block readers) at the cost of version management overhead (old versions must be garbage-collected, and long-running transactions can prevent cleanup).
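The MVCC mechanics above — versioned records, snapshot reads, writers that never block readers — can be illustrated with a minimal version store. The class and timestamp scheme are a deliberately simplified sketch, not any particular database's implementation.

```python
class MVCCStore:
    """Toy multi-version store: each key maps to an ascending list of
    (commit_ts, value) versions; a reader pins a snapshot timestamp."""
    def __init__(self):
        self.versions = {}   # key -> [(commit_ts, value), ...]
        self.clock = 0       # logical commit timestamp generator

    def write(self, key, value):
        self.clock += 1      # new version, never an in-place overwrite
        self.versions.setdefault(key, []).append((self.clock, value))

    def snapshot(self):
        return self.clock    # the timestamp a reader's snapshot pins

    def read(self, key, snapshot_ts):
        """Return the newest version committed at or before snapshot_ts."""
        for ts, value in reversed(self.versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None

store = MVCCStore()
store.write("balance", 100)
snap = store.snapshot()          # reader starts a snapshot here
store.write("balance", 250)      # concurrent writer; reader is unaffected
```

The reader holding `snap` continues to see 100 even after the concurrent write commits 250 — no locks acquired in either direction. The garbage-collection problem the text mentions is visible here too: old versions accumulate in `versions` until no snapshot can still reach them.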
Modern analytical platforms universally use MVCC or snapshot-based isolation. OLTP systems increasingly combine MVCC with optimistic concurrency (validate at commit time rather than locking at access time). For AI workloads where multiple agents read and write concurrently, the choice of concurrency control directly affects whether agents see consistent data, whether their writes conflict, and how those conflicts are resolved.
Copy-on-Write Snapshots
A technique for creating instantaneous, consistent database snapshots by exploiting the operating system's virtual memory copy-on-write mechanism. When the database process forks (creates a child process), the child inherits a bit-for-bit identical view of all memory pages. The OS doesn't physically copy any memory — both processes share the same physical pages. Only when one process modifies a page does the OS transparently create an independent copy of that specific page.
HyPer (now part of Tableau/Salesforce) pioneered this approach for HTAP databases. The OLTP process forks to create a consistent snapshot for OLAP queries in microseconds — orders of magnitude faster than any traditional snapshot mechanism (log-based snapshots take milliseconds to seconds). The OLAP process reads from its frozen snapshot while the OLTP process continues processing transactions, modifying pages, and triggering copy-on-write duplication only for changed pages.
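The page-sharing behavior can be simulated without an actual `fork()`. The sketch below is a pure-Python model of the OS mechanism HyPer exploits — page references are shared at snapshot time, and a physical copy happens only on the first write to a shared page.

```python
class PageStore:
    """Simulates page-level copy-on-write: a snapshot shares the parent's
    pages by reference until the parent writes to one of them."""
    def __init__(self, pages):
        self.pages = pages            # page_id -> page payload
        self.copied = 0               # physical page copies triggered

    def fork_snapshot(self):
        # O(1) "fork": the snapshot shares every page by reference.
        return dict(self.pages)

    def write(self, snapshot, page_id, payload):
        if self.pages[page_id] is snapshot.get(page_id):
            self.copied += 1          # first write to a shared page: copy
        self.pages[page_id] = payload # the snapshot keeps the old version

store = PageStore({0: "a", 1: "b", 2: "c"})
snap = store.fork_snapshot()          # OLAP view frozen at this instant
store.write(snap, 0, "a'")            # OLTP continues; page 0 duplicated
```

Only the written page is physically duplicated; the untouched pages remain shared — which is exactly why the cost scales with write intensity, as the next paragraph describes.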
The cost is directly proportional to write intensity. Under read-heavy workloads, few pages are modified and memory overhead is minimal. Under write-heavy workloads (the scenario HTAP is supposed to handle), many pages are modified, causing extensive physical duplication. HyPer's benchmarks show up to 40% OLTP throughput degradation under concurrent heavy analytical load — the OS spends significant time copying modified pages.
This technique illustrates a broader principle in HTAP design: achieving perfect freshness (the OLAP snapshot is exactly as fresh as the fork instant) comes at the direct cost of isolation (the OLTP process bears memory overhead from the concurrent snapshot). There is no configuration that avoids this trade-off; it's a physical constraint of shared-memory architectures.
Database Virtualization
A concept from the Cambridge Report on Database Research describing a system where "a single database front-end automatically provisions and routes queries to the best infrastructure." Instead of humans choosing which database engine, cluster size, or configuration to use for each workload, the virtualization layer observes query characteristics and automatically routes each query to the optimal backend.
A virtualized database front-end would receive a SQL query and determine: is this an OLTP point lookup? Route to the row-store engine on a small, low-latency instance. Is this an analytical aggregation? Route to the columnar engine on a large, high-throughput cluster. Is this a vector similarity search? Route to the embedding index. Is this a mixed query spanning multiple data types? Decompose into sub-plans and route each to the appropriate engine, then merge results.
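The routing decision described above reduces to a classification function over query characteristics. The sketch below is a toy policy with made-up feature names and backend labels — real virtualization would require learned cost models, not hand-written rules.

```python
def route(query_features):
    """Toy routing policy for a virtualized front-end. Feature names and
    backend names are illustrative, not from any real system."""
    if query_features.get("vector_search"):
        return "embedding-index"      # similarity search backend
    if query_features.get("point_lookup"):
        return "row-store"            # low-latency OLTP engine
    if query_features.get("aggregation"):
        return "column-store"         # high-throughput analytical engine
    return "general-engine"           # fallback for unclassified queries

backend = route({"aggregation": True})
```

A mixed query would be decomposed first, with `route` applied per sub-plan and the results merged — the step that pushes this from a dispatch table to a genuine cross-engine optimizer.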
This is the database research community's theoretical formulation of what practitioners call autonomous infrastructure provisioning. It represents a paradigm shift from "the DBA decides which system to use" to "the system decides how to serve each query." The practical implications are enormous: organizations currently maintain multiple database systems (OLTP database, analytical warehouse, vector database, document store) with manual routing logic. Database virtualization would collapse these into a single logical interface with automatic backend selection.
No production system fully implements database virtualization today. The closest approximations are TiDB's cross-engine optimizer (routing sub-plans to TiKV or TiFlash) and Snowflake's automatic warehouse scaling. But these operate within a single vendor's ecosystem, not across heterogeneous engines — a much harder problem.
Declarative Infrastructure as a Service
A concept from the Cambridge Report describing systems where "declarative interfaces are used for specifying more than queries but also the infrastructure upon which systems run, with search and optimization systems that attempt to allocate infrastructure in the most cost-effective way."
The declarative paradigm has a long history of success in data systems. SQL (1970s) declared what data you want, not how to get it — the query optimizer determines the execution plan. dbt (2016) declared what transformations to apply, not when or in what order — the DAG scheduler determines execution order. Terraform (2014) declared what infrastructure to provision, not how to provision it — the provider plugins handle the API calls.
Declarative Infrastructure as a Service extends this pattern to the full data operations stack: declare your workload characteristics (data volume, query patterns, latency requirements, cost constraints, compliance mandates), and the system determines the optimal infrastructure allocation. How much compute to provision, which storage tier to use, when to scale up or down, how to distribute data across nodes, when to trigger compaction — all determined by the system based on declared objectives.
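In miniature, the pattern is an optimization over a catalog of feasible allocations given declared objectives. The catalog, tiers, and numbers below are entirely hypothetical — the point is the inversion: constraints in, concrete infrastructure out.

```python
# Hypothetical catalog of infrastructure allocations the system may pick from.
CATALOG = [
    {"tier": "small",  "latency_ms": 900, "cost_per_hr": 1.0},
    {"tier": "medium", "latency_ms": 300, "cost_per_hr": 4.0},
    {"tier": "large",  "latency_ms": 80,  "cost_per_hr": 16.0},
]

def provision(max_latency_ms, budget_per_hr):
    """Declared objectives in, allocation out: cheapest tier that
    satisfies both the latency and the cost constraint."""
    feasible = [c for c in CATALOG
                if c["latency_ms"] <= max_latency_ms
                and c["cost_per_hr"] <= budget_per_hr]
    if not feasible:
        return None                   # declared objectives are unsatisfiable
    return min(feasible, key=lambda c: c["cost_per_hr"])["tier"]
```

The caller never names a tier; tightening the latency bound or loosening the budget changes the answer without changing any imperative provisioning code.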
This is conceptually distinct from auto-scaling (which reacts to current load) or Infrastructure as Code (which still requires humans to specify the infrastructure). Declarative infrastructure optimizes the infrastructure based on workload observation and objective functions. It's the research community's formulation of what the Autonomous DataOps thesis calls self-configuring infrastructure.
Delta Merge
The critical synchronization operation in HTAP systems that propagates recent changes from the transactional write buffer (the "delta") into the analytical column store. This is the mechanical core of every dual-store HTAP architecture — the operation that converts row-oriented transactional writes into column-oriented analytical storage.
Three primary strategies exist, each making different trade-offs:
Threshold-based merging (Oracle): Changes accumulate in an unordered heap-table delta. When the delta reaches a configured size or time threshold, a background process migrates data to the column store. Writes are fast (append-only, no index maintenance) but analytical queries must scan both the column store and the unordered delta to ensure freshness. Scan overhead grows linearly with delta size, creating tension between merge frequency (which competes for resources) and query performance (which degrades with larger deltas).
Delete-table merging (SQL Server): A B-tree-indexed delta maintains recent changes with fast lookup. A bitmap "delete table" maps row IDs (RIDs) to their locations in the column store. During merge, the system assigns RIDs to new rows, transforms them to columnar format, places them at the mapped locations, then atomically removes the RIDs from the delete table. Reads are faster (indexed delta enables direct lookup rather than full scan) but writes pay B-tree insertion overhead.
Dictionary-based merging (SAP HANA): Both the delta and the main store use columnar format, each with value dictionaries. The delta uses fast-build local dictionaries; the main store uses a globally optimized dictionary with better compression. During merge, local dictionary values are translated to global dictionary codes. Efficient when both stores are already columnar, but dictionary growth becomes problematic under sustained high update rates — each merge must reconcile an expanding local dictionary with the global one.
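The first of these strategies — threshold-based merging — is simple enough to sketch end to end. The class below is a toy model of the mechanism Oracle's approach is described with above: fast delta appends, dual-store scans for freshness, and a merge triggered when the delta crosses a threshold.

```python
class ThresholdHTAP:
    """Toy threshold-based delta merge: writes append to an unordered
    delta; fresh reads scan column store plus delta; a merge migrates
    the delta when it reaches the configured threshold."""
    def __init__(self, merge_threshold=3):
        self.column_store = []          # merged "columnar" data (simplified)
        self.delta = []                 # recent row-oriented writes
        self.merge_threshold = merge_threshold
        self.merges = 0

    def write(self, row):
        self.delta.append(row)          # fast append, no index maintenance
        if len(self.delta) >= self.merge_threshold:
            self.merge()

    def merge(self):
        self.column_store.extend(self.delta)   # migrate delta to main store
        self.delta.clear()
        self.merges += 1

    def scan(self):
        # Freshness requires reading both stores — the linear-overhead
        # term that grows with delta size.
        return self.column_store + self.delta

db = ThresholdHTAP(merge_threshold=3)
for i in range(5):
    db.write(i)
```

After five writes with a threshold of three, one merge has fired and two rows still sit in the delta — every fresh scan pays for them until the next merge, which is the tension the text describes.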
The delta merge problem is fundamentally about managing the impedance mismatch between two physical storage layouts. Every HTAP system that maintains dual stores must solve it, and every solution introduces overhead that doesn't exist in single-store systems.
DuckDB
An in-process analytical database management system (sometimes called "the SQLite for analytics") designed for fast analytical query execution without the overhead of a client-server architecture. Created by Mark Raasveldt and Hannes Mühleisen at CWI (Centrum Wiskunde & Informatica) — the same research institute that produced MonetDB, a pioneer of columnar database architecture.
DuckDB runs as a library embedded within the application process — no server to deploy, no network communication, no separate infrastructure to manage. It reads directly from Parquet files, CSV files, and even other databases (via its scanner extensions), executes columnar analytical queries using a fully vectorized engine, and returns results — all within the calling process's memory space.
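The in-process pattern looks like this in code. The example below uses Python's standard-library `sqlite3` purely to keep it self-contained; DuckDB's Python API has a similar shape (a `connect()` call, `execute()`, `fetchall()`), but its engine underneath is columnar and vectorized rather than row-oriented.

```python
import sqlite3

# Embedded, in-process analytics: no server, no network round-trip —
# the engine is a library running inside this process's memory space.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 50.0)])

rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

The entire round trip — DDL, load, aggregation — happens in-process. With DuckDB the same pattern extends to querying Parquet and CSV files in place, which is what makes the "bring the engine to the data" paradigm practical.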
Architecturally, DuckDB is notable for several innovations. Its morsel-driven parallelism divides data into fixed-size chunks ("morsels") that are dynamically distributed across threads, achieving near-linear scaling on multi-core machines. Its vectorized execution processes data in batches optimized for CPU cache utilization and SIMD instructions. Its zero-copy integration with Apache Arrow enables efficient data exchange with Python (pandas, polars), R, and other analytics ecosystems.
DuckDB represents a philosophically different approach to analytics infrastructure. Instead of sending data to a remote warehouse over the network, DuckDB brings the query engine to where the data already lives — on the user's machine, in the application's process, on the data lake. This "embedded analytics" paradigm challenges the assumption that analytical processing requires dedicated server infrastructure.
The Cambridge Report notes the broader significance of projects like DuckDB and Polars (a Rust-based DataFrame library with similar embedded philosophy): they suggest a future where the database landscape is not a small number of monolithic platforms but a heterogeneous ecosystem of specialized, composable components deployed where workloads demand.
F1 Lightning
Google's HTAP system that integrates a columnar analytical engine with Spanner, Google's globally distributed, strongly consistent OLTP database. F1 Lightning represents Architecture B (Distributed Row Store + Column Store Replica) at Google's scale.
The architecture centers on a component called changepump — a change data capture mechanism that detects new writes to Spanner, extracts them, transforms them from row-oriented to columnar format, and streams them to the analytical store. The analytical store uses a log-structured merge (LSM) process with two key optimizations: deduplication (removing superseded versions of the same record) and version collapsing (consolidating multiple updates to the same record into a single current-state representation).
F1 Lightning's design choices reflect Google's priorities: Spanner provides the strongest consistency guarantees of any distributed database (external consistency via TrueTime), and F1 Lightning preserves these guarantees in analytical reads at the cost of replication latency. The changepump introduces seconds of lag between a Spanner write and its availability in the analytical store — a conscious trade-off favoring isolation and consistency over minimal freshness.
For the broader HTAP taxonomy, F1 Lightning demonstrates that even the world's most sophisticated distributed database infrastructure accepts the freshness-isolation trade-off rather than eliminating it. If Google — with TrueTime, custom hardware, and a team that includes the inventors of distributed consensus — can't achieve both perfect freshness and perfect isolation, it's a strong signal that this trade-off is fundamental, not merely an engineering challenge.
Freshness-Isolation Trade-off
The fundamental, irreducible constraint in HTAP systems: improving data freshness (how quickly transactional changes become visible to analytical queries) degrades performance isolation (how effectively transactional and analytical workloads avoid interfering with each other), and vice versa.
This trade-off arises from physics, not from design failure. OLTP and OLAP workloads have incompatible resource profiles. OLTP needs low-latency CPU cycles for individual record operations. OLAP needs high-bandwidth memory and I/O for bulk scanning. Both compete for the same physical resources (CPU, memory, cache, I/O bandwidth). Sharing resources more tightly enables fresher data transfer but increases mutual interference. Separating resources more completely improves isolation but introduces synchronization latency.
The spectrum of production systems illustrates this precisely:
| System | Freshness | Isolation | Mechanism |
|---|---|---|---|
| HyPer | Microseconds | Low (up to 40% OLTP degradation) | Shared memory, copy-on-write snapshots |
| SAP HANA | Milliseconds | Low-medium (shared memory pool) | Three-layer delta merge |
| Oracle In-Memory | Milliseconds | Medium (IMCS contention) | IMCU + SMU synchronization |
| MySQL Heatwave | 200ms–seconds | High (separate cluster) | Buffer-based sync, auto-pilot |
| TiDB | 10ms–1000ms | High (separate nodes) | Raft log replication to learner nodes |
| F1 Lightning | Seconds | Very high (separate systems) | CDC changepump + LSM merge |
No system achieves both high freshness and high isolation. This is the HTAP lesson that matters most for AI infrastructure: even the most sophisticated evolution of database architecture encounters hard physical constraints when workloads have fundamentally different resource requirements. AI agents — with workload profiles that differ even more dramatically from both OLTP and OLAP than they differ from each other — will encounter the same class of constraint at a higher order of complexity.
HTAP (Hybrid Transactional/Analytical Processing)
A database architecture that handles both transactional (OLTP) and analytical (OLAP) workloads in a single system, with real-time or near-real-time data freshness between the two. The term was coined by Gartner analyst Donald Feinberg in 2014 to describe systems that would eliminate the ETL gap — the hours-long delay between a transaction occurring and that data becoming available for analytics.
HTAP is not a single architecture but a family of four distinct approaches, each making different trade-offs along the freshness-isolation spectrum:
Architecture A: Primary Row Store + In-Memory Column Store (Oracle In-Memory, SQL Server, DB2 BLU). Row store for OLTP durability, in-memory column store (IMCS) for OLAP acceleration. Changes accumulate in a delta and merge into the IMCS. High freshness, high analytical throughput, but single-node memory constrains scalability and shared resources reduce isolation. Best fit: banking and finance.
Architecture B: Distributed Row Store + Column Store Replica (TiDB, F1 Lightning). Distributed OLTP across row-based nodes, asynchronous replication to column-store nodes for analytics. High scalability and isolation (physically separated nodes), but replication latency (10–1000ms) reduces freshness and distributed delta merging adds cost. Best fit: e-commerce at scale.
Architecture C: Primary Row Store + Distributed In-Memory Column Store (MySQL Heatwave). Traditional single-node OLTP coupled with a distributed in-memory column store cluster. High isolation and OLAP throughput, but single-node OLTP limits horizontal scalability and synchronization adds latency. Best fit: IoT analytics.
Architecture D: Primary Column Store + Delta Row Store (SAP HANA, HyPer). Inverts the typical approach — the primary store is columnar for OLAP, with a delta row store absorbing OLTP writes. Achieves the highest data freshness (microsecond snapshots in HyPer) but the lowest isolation and OLTP scalability. Best fit: real-time fraud detection.
HTAP represents the most sophisticated evolution of database architecture for dual-workload support. Its insufficiency for AI lies not in capability but in assumption: HTAP assumes structured data only, human-speed consumers, read-only analytics, RBAC-based governance, and manual configuration. AI agents violate all five assumptions simultaneously.
Hybrid Scan
A query optimization technique specific to HTAP databases where the query planner determines, for each sub-plan within a query, whether to use the row store, the column store, or both. This adds a dimension to the already complex query optimization search space — beyond the traditional choices of table scan vs. index scan vs. hash join, the optimizer must also choose which physical store to read from.
Rule-based approaches (Oracle) follow heuristics: "column first, row later." If the requested columns exist in the In-Memory Column Store, scan them there; fall back to the row store for columns not loaded into memory. Simple, predictable, but unable to account for varying data distributions or workload characteristics.
Cost-based approaches (TiDB, PolarDB-IMCI) estimate the cost of each access path — row scan, index scan, and column scan — and choose the minimum-cost option. TiDB's optimizer considers the storage-specific costs for TiKV (row) and TiFlash (columnar) and routes each sub-plan accordingly. However, current cost models typically don't account for cross-engine data transfer costs, which can dominate for join-heavy queries.
Freshness-aware approaches (Metis) extend the cost model to include the cost of delta scanning and data synchronization. When the delta is large (many recent changes not yet merged into the column store), scanning the column store alone gives stale results while including the delta adds overhead. Metis is the first optimizer to explicitly model this freshness-performance trade-off in plan selection, allowing users to specify freshness requirements as optimization constraints.
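The freshness-aware idea reduces to adding a delta-scan term to the cost of the columnar path. The function below is a toy cost model with invented constants — not Metis's actual formulation — but it shows how a large unmerged delta can flip the optimizer's choice back to the row store.

```python
def choose_access_path(n_rows, delta_rows, freshness_required):
    """Toy freshness-aware access-path choice. The per-row cost
    constants are illustrative, not from any real optimizer."""
    ROW_COST, COL_COST, DELTA_COST = 1.0, 0.1, 1.5
    paths = {
        "row-scan": n_rows * ROW_COST,        # primary store: always fresh
        "column-scan": n_rows * COL_COST,     # fast, but misses the delta
    }
    if freshness_required:
        # A fresh columnar read must also scan the unmerged delta.
        paths["column-scan"] += delta_rows * DELTA_COST
    return min(paths, key=paths.get)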
Hybrid scan research extends naturally to the AI-native domain: a cross-modal query planner would need to route sub-plans not just between row and column stores, but between structured engines, vector indexes, document stores, and AI inference services — with cost estimation across all modalities.
In-Memory Column Store (IMCS)
A storage structure that maintains a columnar representation of data entirely in RAM for fast analytical processing. In HTAP systems, the IMCS serves as the analytical acceleration layer alongside a row-based transactional store, enabling analytical queries to scan data at memory bandwidth rather than disk bandwidth.
Oracle's IMCS uses In-Memory Compression Units (IMCUs) — immutable, compressed columnar segments populated from the row store's buffer cache. Each IMCU is associated with a Snapshot Metadata Unit (SMU) that tracks which rows have been modified since the IMCU was created. Analytical queries scan the IMCU for base data and consult the SMU for recent changes. Merging requires creating entirely new IMCUs that incorporate the changes — an expensive operation that trades write-path cost for read-path simplicity.
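The IMCU-plus-SMU read path can be modeled compactly: an immutable columnar segment, a mutable overlay of changed rows, and a "merge" that builds a fresh segment. The class below is a sketch modeled on the description above; the names and structure are illustrative.

```python
class IMCU:
    """Toy immutable columnar unit with an SMU-style overlay tracking
    rows modified since the unit was built."""
    def __init__(self, column_values):
        self.values = tuple(column_values)   # immutable columnar segment
        self.smu = {}                        # row_id -> fresh value overlay

    def update(self, row_id, value):
        self.smu[row_id] = value             # the IMCU itself never mutates

    def scan(self):
        # Base data from the segment, patched with SMU-tracked changes.
        return [self.smu.get(i, v) for i, v in enumerate(self.values)]

    def repopulate(self):
        # "Merge": build an entirely new unit incorporating the changes.
        return IMCU(self.scan())

seg = IMCU([10, 20, 30])
seg.update(1, 99)
```

Scans pay a per-row overlay check; repopulation removes that cost by rebuilding the segment — the write-path expense traded for read-path simplicity that the text describes.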
SQL Server's IMCS uses an updateable Column Store Index (CSI). Changes are appended to a "tail" structure indexed by a B-tree. When the tail grows beyond a threshold, a background "tuple mover" process compresses the tail data and merges it into the main CSI. Infrequently accessed column segments are automatically compressed and, under memory pressure, evicted to disk. This approach handles updates more efficiently than Oracle's immutable model but introduces scan complexity (queries must check both the main CSI and the tail).
The fundamental trade-off between these implementations mirrors the broader HTAP design space: immutable structures (Oracle) are simpler to scan and compress but expensive to update; updateable structures (SQL Server) handle writes more efficiently but add complexity to reads and compaction. Both are bounded by available memory — the IMCS must fit in RAM, limiting the analytical dataset size on single-node systems.
Lakebase
Databricks' operational database capability, launched as a preview feature, that adds PostgreSQL-protocol OLTP functionality to the Databricks platform. Lakebase aims to bring transactional workloads (application backends, operational stores) into the Databricks ecosystem, allowing operational data to be immediately available for analytics without ETL.
Lakebase supports PostgreSQL-compatible SQL, row-level CRUD operations, and ACID transactions — capabilities that the analytical-first Databricks platform previously lacked. Critically, Lakebase supports branching: creating isolated copies of a database for development, testing, or experimentation without affecting production.
However, Lakebase's branching applies to the PostgreSQL database layer, not to Delta Lake tables or the broader lakehouse. You can branch a Lakebase operational database; you cannot branch a Delta Lake analytical table. This distinction matters because the data versioning needs of AI teams span the entire data estate — feature tables, training datasets, model artifacts, evaluation results — not just the operational store.
Lakebase represents Databricks' attempt to converge toward full-stack data infrastructure from the analytical side (adding OLTP to their OLAP-first platform), mirroring Snowflake's Unistore/Hybrid Tables initiative (adding OLTP to their warehouse). Both are building the missing half of their platforms, constrained by the architectural decisions made when the primary workload was BI analytics.
Learner Node
A specialized node type in Raft-based distributed databases that receives the replication log and maintains a replica of the data, but does not participate in the consensus voting protocol. Learner nodes are read-only observers — they follow the leader's log but cannot become leaders or affect the quorum required for write commits.
TiDB uses learner nodes as the architectural bridge between OLTP and OLAP processing. The standard Raft group (leader + followers) handles transactional workloads on row-oriented TiKV nodes. One or more learner nodes receive the same Raft log but replay it into TiFlash — a columnar storage engine optimized for analytical queries. Because learner nodes don't participate in voting, adding OLAP capacity doesn't affect OLTP consensus latency or availability. And because they receive the complete Raft log, they maintain a consistent (though slightly delayed) view of transactional data.
The learner node concept elegantly solves several HTAP challenges simultaneously. Isolation: OLAP processing on learner nodes doesn't compete for resources with OLTP on leader/follower nodes. Consistency: the Raft log guarantees that learner replicas are consistent snapshots (at their replication point) of the transactional data. Scalability: learner nodes can be added independently to scale analytical capacity. The trade-off, as always, is freshness: the asynchronous log shipping introduces 10–1000ms of replication lag between a transaction committing on the leader and its visibility on the learner.
Log-Structured Merge Tree (LSM Tree)
A data structure optimized for write-heavy workloads, used as the storage engine in many modern databases (RocksDB, LevelDB, Cassandra, HBase, TiKV, CockroachDB). LSM trees achieve high write throughput by buffering writes in memory and periodically flushing them to disk as sorted, immutable files.
The architecture has two components. An in-memory memtable (typically a sorted data structure like a red-black tree or skip list) absorbs all writes. When the memtable reaches a size threshold, it's flushed to disk as an immutable SSTable (Sorted String Table) — a file of sorted key-value pairs. Over time, multiple SSTables accumulate at different "levels" on disk. A background compaction process periodically merges SSTables from one level into the next, combining sorted runs, removing deleted entries (tombstones), and producing larger, more compact files.
The key insight: writes are always sequential (append memtable, flush sequentially, merge sequentially), never random — achieving write throughput close to raw disk bandwidth. Reads are slower because a query might need to check the memtable plus multiple SSTables across levels before finding the most recent version of a key. Bloom filters at each level mitigate this by quickly ruling out SSTables that don't contain the target key.
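The memtable-flush-compact lifecycle can be sketched in miniature. This is a deliberately tiny model — no write-ahead log, no levels, no Bloom filters — but it exercises the three behaviors above: buffered writes, newest-first reads across sorted runs, and a compaction that collapses runs and drops superseded versions.

```python
import bisect

class TinyLSM:
    """Sketch of an LSM tree: an in-memory memtable flushed to sorted
    immutable runs; reads check newest data first; compaction merges runs."""
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []            # newest run first; each a sorted list
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            run = sorted(self.memtable.items())   # sequential flush
            self.sstables.insert(0, run)
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:                 # newest version wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        merged = {}
        for run in reversed(self.sstables):       # oldest first...
            merged.update(dict(run))              # ...so newer overwrites
        self.sstables = [sorted(merged.items())]  # one compacted run

db = TinyLSM()
db.put("a", 1); db.put("b", 2)    # triggers flush #1
db.put("a", 3); db.put("c", 4)    # triggers flush #2 (newer "a")
```

Before compaction, reading `"a"` must find the newest of two on-disk versions; after compaction, the superseded version is gone — the rewrite during compaction is exactly where write amplification, discussed below, accrues.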
In the HTAP context, F1 Lightning uses LSM-based storage for its analytical column store, with CDC-driven ingestion from Spanner flowing through a changepump into an LSM merge process with deduplication and version collapsing. TiDB's TiKV storage engine uses RocksDB (an LSM-tree implementation) as its on-disk storage, providing the write performance needed for distributed OLTP.
The trade-off is called write amplification: each record is written once to the memtable, once when flushed as an SSTable, and potentially multiple times during compaction as it moves through levels. A single logical write can result in 10–30x physical writes — acceptable when write throughput is the bottleneck (OLTP workloads), problematic for SSD longevity and energy consumption at scale.
Massively Parallel Processing (MPP)
A database architecture where queries are distributed across many independent processors, each with its own memory and (optionally) storage. Each processor handles a portion of the data and computation; results are combined to produce the final answer. MPP is the execution model behind virtually every modern analytical database.
The classic MPP architecture is shared-nothing: each node owns a portion of the data (horizontal partitioning) and processes queries against its local partition. Nodes communicate only to exchange intermediate results (shuffles) or produce final aggregations. Teradata, Greenplum, and early Redshift exemplified shared-nothing MPP — scaling by adding nodes, each with co-located storage and compute.
Modern cloud MPP systems have evolved toward shared-storage (disaggregated) architectures where all nodes read from shared object storage. Snowflake, BigQuery, and Databricks' serverless SQL use this model: ephemeral compute nodes read data from S3/GCS/ADLS, process their assigned portions, and produce results. The advantage over shared-nothing is elastic scaling — nodes can be added or removed without redistributing data.
MPP systems achieve their speed through partition parallelism (each node processes a data subset), pipeline parallelism (different stages of the query plan execute concurrently), and SIMD parallelism (vectorized execution within each node). A query scanning 1 TB across 100 nodes effectively reads 10 GB per node — completing in seconds rather than the minutes a single node would require.
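The scatter/gather shape of partition parallelism can be sketched as follows. AVG is not directly mergeable, so each "node" returns a partial (SUM, COUNT) pair that the coordinator combines; the function names and the thread-per-partition stand-in for real nodes are illustrative only:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_agg(partition):
    """Runs on one 'node': aggregate only the local partition."""
    total, count = 0, 0
    for row in partition:
        if row["region"] == "EU":        # local predicate, no data movement
            total += row["amount"]
            count += 1
    return total, count

def mpp_avg(partitions):
    """Coordinator: scatter the query, then gather and merge partials."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_agg, partitions))
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None
```

Only the small (SUM, COUNT) pairs cross the "network" — the rows themselves never leave their partition, which is the essence of the 1 TB / 100 nodes arithmetic above.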
The limitation: MPP architectures are optimized for read-heavy analytical queries with predictable parallelism. They struggle with write-heavy workloads (coordinating distributed writes requires expensive protocols like 2PC), point lookups (the overhead of distributing a single-row query across many nodes exceeds the benefit), and highly skewed data (where one node has disproportionately more work, creating a bottleneck). These are precisely the patterns AI agent workloads introduce.
MVCC (Multi-Version Concurrency Control)
A concurrency control method where the database maintains multiple physical versions of each data record simultaneously, enabling readers to access consistent snapshots without blocking or being blocked by concurrent writers.
When a transaction writes a row, it creates a new version tagged with the transaction's ID, rather than overwriting the existing version. When a transaction reads a row, it sees the most recent version that was committed before the reading transaction started — not any in-progress changes from concurrent transactions. This provides snapshot isolation naturally: each transaction sees a consistent point-in-time view of the database.
MVCC eliminates the most painful concurrency bottleneck in lock-based systems: read-write contention. In a locking system, a long-running analytical query that reads a table blocks all writes to that table (or vice versa). In MVCC, the analytical query reads from a frozen snapshot while transactions continue writing new versions — neither blocks the other.
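A toy version store makes the visibility rule concrete. The timestamp scheme and method names are invented for teaching; production systems use transaction IDs, per-tuple headers, and far more careful garbage collection:

```python
class MVCCStore:
    """Toy MVCC: each key maps to a list of (commit_ts, value) versions.
    A transaction reads the newest version committed at or before its
    snapshot timestamp; writers append versions and never block readers."""

    def __init__(self):
        self.versions = {}   # key -> [(commit_ts, value), ...] ascending
        self.clock = 0

    def begin(self):
        return self.clock    # snapshot timestamp for a new transaction

    def commit_write(self, key, value):
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def read(self, key, snapshot_ts):
        # Visibility check: newest version with commit_ts <= snapshot_ts.
        for ts, value in reversed(self.versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None

    def vacuum(self, oldest_active_ts):
        # GC: keep the newest version visible to the oldest active
        # snapshot, drop everything older.
        for key, vs in self.versions.items():
            keep = [v for v in vs if v[0] > oldest_active_ts]
            visible = [v for v in vs if v[0] <= oldest_active_ts]
            if visible:
                keep.insert(0, visible[-1])
            self.versions[key] = keep
```

The `vacuum` method shows why long-running transactions are expensive: a low `oldest_active_ts` pins every version written since that snapshot.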
The costs of MVCC are version storage (old versions consume space until garbage-collected), version visibility checks (each read must determine which version is visible to the current transaction), and garbage collection overhead (a background process must identify and remove versions no longer visible to any active transaction). Long-running transactions are particularly expensive because they prevent garbage collection of all versions created after the transaction started.
MVCC is the dominant concurrency model in modern databases: PostgreSQL, Oracle, MySQL (InnoDB), Snowflake, CockroachDB, TiDB. In HTAP systems, MVCC is central — it's the mechanism that enables analytical queries to read consistent snapshots while transactions continue writing without mutual interference.
MySQL Heatwave
Oracle's HTAP extension to MySQL that couples the traditional MySQL single-node OLTP engine with a distributed in-memory column store cluster for analytical processing. Heatwave represents Architecture C in the HTAP taxonomy: Primary Row Store + Distributed In-Memory Column Store.
Transactions execute entirely within the MySQL server. Frequently accessed columns are loaded into the Heatwave cluster — a set of in-memory nodes that store data in columnar format and execute analytical queries in parallel. Data synchronization between MySQL and Heatwave fires automatically on any of three triggers: a 200-millisecond timer, the synchronization buffer reaching 64 MB, or on demand when a query needs data changed since the last sync.
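The three propagation conditions amount to a simple predicate. The constants below restate the figures from the text; the function name and signature are hypothetical, not part of any MySQL API:

```python
import time

SYNC_INTERVAL_S = 0.2               # timer trigger: every 200 ms
SYNC_BUFFER_LIMIT = 64 * 2**20      # buffer trigger: 64 MB

def should_sync(last_sync, buffer_bytes, query_needs_fresh_data, now=None):
    """Return True when any of the three propagation triggers fires."""
    now = time.monotonic() if now is None else now
    return (now - last_sync >= SYNC_INTERVAL_S      # periodic timer
            or buffer_bytes >= SYNC_BUFFER_LIMIT    # buffer pressure
            or query_needs_fresh_data)              # on-demand pull
```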
Heatwave's most distinctive feature is its auto-pilot service — one of the few production systems with significant autonomous operations capabilities. Auto-pilot uses ML models to automate data placement decisions (which columns to load into the IMCS, how to partition them across nodes), query execution optimization (choosing between MySQL and Heatwave execution paths), and resource scheduling (allocating memory and compute across concurrent workloads). Oracle claims Heatwave achieves performance competitive with or exceeding Snowflake and Redshift on standard analytical benchmarks, at lower cost.
For the Autonomous DataOps thesis, Heatwave's auto-pilot is significant as a production proof point that ML-driven infrastructure automation is feasible and effective — albeit limited to a single system's internal configuration rather than the full data estate automation that autonomous DataOps envisions.
OLAP (Online Analytical Processing)
A database workload pattern characterized by complex queries over large datasets: aggregations, joins, window functions, group-by operations, and scans across millions or billions of rows. OLAP queries typically touch many rows but few columns, execute for seconds to minutes, tolerate latency, are overwhelmingly read-only, and are run by relatively few concurrent users (analysts, BI tools, scheduled reports).
OLAP systems — Snowflake, BigQuery, Redshift, ClickHouse, DuckDB, Databricks SQL — are optimized for this pattern through a reinforcing stack of design choices: columnar storage (skip irrelevant columns), data skipping via partition/file statistics (skip irrelevant row groups), vectorized execution (process columns in CPU-cache-friendly batches with SIMD), massive parallelism (distribute scans across many nodes), and disaggregated storage (scale compute independently of data volume).
The BI Assumption is, at its core, the assumption that OLAP is the primary analytical workload. This assumption cascades through every layer: storage is optimized for OLAP access patterns (column-oriented, aggregation-friendly), governance assumes OLAP consumers (human analysts with organizational roles), quality models target OLAP correctness (schema conformance, not semantic accuracy), and infrastructure is sized for OLAP concurrency (tens of complex queries, not thousands of simple agent requests).
AI agents generate workloads that violate OLAP assumptions: they mix reads and writes, require point lookups alongside scans, need sub-second latency for interactive reasoning, generate thousands of concurrent requests, and access multi-modal data (structured, unstructured, vector) — none of which OLAP systems were designed to handle efficiently.
OLTP (Online Transaction Processing)
A database workload pattern characterized by short, frequent operations that read and write individual records: INSERT a new order, UPDATE a customer address, SELECT a single account balance, DELETE a canceled reservation. OLTP queries touch one or a few rows across all (or most) columns, complete in single-digit milliseconds, require strict ACID guarantees, and serve thousands to millions of concurrent users.
OLTP systems — PostgreSQL, MySQL, Oracle, SQL Server, CockroachDB, TiDB — are optimized for this pattern: row-oriented storage (all columns of a row stored together for fast single-record access), B-tree indexes (O(log n) point lookups), write-ahead logging (durability — every write is logged before it is applied, enabling crash recovery), MVCC or lock-based concurrency (correctness under thousands of concurrent transactions), and buffer pools (caching frequently accessed pages in memory to avoid disk I/O).
The OLTP/OLAP separation is the most fundamental architectural divide in data infrastructure. It exists because the optimal physical design for each workload is incompatible: OLTP needs row storage, low-latency point access, write optimization, and fine-grained concurrency control; OLAP needs columnar storage, high-bandwidth sequential scans, read optimization, and coarse-grained parallelism. No single physical design serves both well — which is why the industry maintains separate systems, connects them with ETL/ELT pipelines, and why HTAP's attempt to bridge the divide requires dual stores with complex synchronization.
AI agents blur the OLTP/OLAP boundary. An agent processing a customer request might: read a customer record (OLTP-style point lookup), scan historical interactions (OLAP-style aggregation), retrieve similar cases (vector similarity search), generate a response (LLM inference), write back a recommendation (OLTP-style insert), and log its reasoning chain (append-only write) — all within a single logical operation. No existing processing model is designed for this heterogeneous workload profile.
Query Compilation
A query execution strategy where the database compiles a SQL query into native machine code (via LLVM or similar) before executing it, rather than interpreting the query plan at runtime. Compiled queries run as optimized machine instructions — eliminating the interpretation overhead (function call dispatch, type checking, virtual method resolution) that dominates execution time for simple operations.
HyPer pioneered query compilation for analytical databases, demonstrating that compiled queries could execute 10–100x faster than interpreted queries for CPU-bound operations. The approach generates a single tight loop of machine code for each query pipeline, with no per-tuple function call overhead. This is particularly effective for OLTP-style operations (where per-tuple overhead is a large fraction of total work) and for analytical queries on cached data (where I/O is not the bottleneck).
The trade-off is compilation latency: generating and optimizing machine code takes milliseconds to seconds — negligible for long-running analytical queries but significant for OLTP transactions that should complete in single-digit milliseconds. Systems that use query compilation typically maintain a code cache (reusing compiled plans for repeated query patterns) and may fall back to interpretation for one-shot queries.
Modern systems often take a hybrid approach. DuckDB uses vectorized interpretation (processing batches of values through pre-compiled operators) rather than per-query compilation — achieving most of the performance benefit without compilation latency. Databricks Photon uses a vectorized native engine written in C++ that combines vectorized execution with some compilation techniques. The debate between full compilation (HyPer-style) and vectorized interpretation (DuckDB-style) remains active in the database research community.
Raft Consensus Protocol
A distributed consensus algorithm — designed by Diego Ongaro and John Ousterhout at Stanford as a more understandable alternative to Paxos — that ensures a cluster of servers agrees on the same ordered sequence of operations, even when some servers fail.
Raft operates through leader election. One node is elected leader; all others are followers. The leader handles all client writes, appends them to its local log, and replicates each entry to follower nodes. A write is considered committed once a majority (quorum) of nodes have acknowledged it — meaning the cluster tolerates the failure of up to ⌊(n-1)/2⌋ nodes (a 5-node cluster survives 2 failures). If the leader fails, followers detect the absence of heartbeats and trigger a new election. The new leader's log is guaranteed to contain all committed entries.
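The quorum arithmetic and commit rule can be sketched as follows. Elections, terms, and log matching are omitted, and the class is an invented teaching aid, not Raft itself:

```python
def majority(n):
    """Quorum size for an n-node cluster: tolerates (n - 1) // 2 failures."""
    return n // 2 + 1

class RaftLeaderSketch:
    """Toy commit rule only: an entry is committed once a majority of the
    cluster (leader included) holds it. Elections and terms are omitted."""

    def __init__(self, cluster_size):
        self.n = cluster_size
        self.log = []                 # [entry, ack_count] pairs
        self.commit_index = -1

    def append(self, entry):
        self.log.append([entry, 1])   # the leader's own copy counts
        return len(self.log) - 1

    def on_ack(self, index):
        # A follower acknowledged replication of log[index].
        self.log[index][1] += 1
        if self.log[index][1] >= majority(self.n):
            self.commit_index = max(self.commit_index, index)
```

`majority(5)` is 3, matching the ⌊(n-1)/2⌋ = 2 failure tolerance stated above.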
In data infrastructure, Raft is the consensus protocol behind TiDB's distributed OLTP layer. Data is partitioned into Regions (contiguous key ranges), each managed by an independent Raft group. The Region's leader handles read-write transactions; followers serve stale reads. This per-Region consensus enables TiDB to scale horizontally — adding nodes adds capacity for more Regions.
TiDB's HTAP innovation is the learner node: a Raft participant that receives the log but doesn't vote. Learner nodes maintain TiFlash columnar replicas for analytical processing without affecting OLTP consensus latency. This architectural pattern — using a consensus protocol's replication mechanism for HTAP data synchronization — is unique to TiDB and represents one of the more elegant solutions to the freshness-isolation trade-off.
Raft's guarantees are strong (linearizable reads from the leader, consistent replication) but come at the cost of latency: every write requires a network round-trip to a majority of nodes. For geographically distributed clusters, this latency can reach tens of milliseconds per write — which is why Google's Spanner pairs its Paxos-based replication with GPS/atomic-clock-based TrueTime, bounding clock uncertainty so that globally consistent reads don't require additional consensus round-trips.
Resource Scheduling (HTAP)
The problem of allocating CPU threads, memory bandwidth, I/O capacity, and cache space between OLTP and OLAP workloads running in the same HTAP system. This is one of the most challenging operational problems in HTAP and remains an open research question.
Workload-driven scheduling monitors throughput metrics and adjusts resource allocation dynamically. When OLAP queries saturate CPU, the scheduler reduces OLAP thread parallelism and redirects threads to OLTP processing. When OLTP transaction queues grow, analytical query admission is throttled. This approach maximizes aggregate throughput but ignores data freshness — under sustained OLAP load, delta merges may be starved of resources, causing analytical queries to read increasingly stale data.
Freshness-driven scheduling defines freshness SLAs and switches execution modes to meet them. When data freshness drops below a threshold (e.g., delta age exceeds 5 seconds), the system shifts from isolated execution (OLTP and OLAP on separate resource pools) to co-located execution (shared resources enabling faster delta propagation). This maintains freshness guarantees but causes performance fluctuations as the system oscillates between execution modes.
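A freshness-driven policy with a workload-driven escape hatch might look like the sketch below. The thresholds, mode names, and signature are invented for illustration, not taken from any production system:

```python
def choose_mode(delta_age_s, oltp_queue_depth,
                freshness_sla_s=5.0, queue_limit=100):
    """Toy HTAP scheduling policy: run isolated until the delta grows too
    stale, then co-locate to drain it -- unless OLTP is already backed up,
    in which case shed analytical work instead."""
    if delta_age_s > freshness_sla_s:
        if oltp_queue_depth > queue_limit:
            return "throttle-olap"   # workload-driven override
        return "co-located"          # share resources, drain the delta
    return "isolated"                # separate pools, best isolation
```

Even this toy version exhibits the oscillation problem described above: as `delta_age_s` crosses the SLA back and forth, the system flips between modes with very different performance profiles.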
Holistic scheduling — simultaneously optimizing workload throughput, resource utilization, data freshness, energy consumption, and cost — does not exist in any production system. The HTAP survey by Zhang et al. specifically identifies this as an open research problem. The difficulty is that these objectives are partially conflicting: maximizing throughput may waste energy, minimizing freshness lag may degrade isolation, and minimizing cost may sacrifice performance.
The unsolved nature of holistic HTAP scheduling is a microcosm of the broader autonomous operations challenge. If scheduling resources between two well-characterized workloads (OLTP and OLAP) within a single system is still an open problem, scheduling across the full heterogeneous data estate — multiple engines, multiple data modalities, multiple consumer types (humans, agents, models, pipelines) — requires a fundamentally different approach.
SAP HANA
SAP's in-memory database platform and one of the most important HTAP implementations (Architecture D: Primary Column Store + Delta Row Store). HANA inverts the typical HTAP model by making the columnar store primary and using row-oriented delta stores for transaction absorption.
HANA organizes in-memory data into three layers:
L1-delta (row-oriented): The write buffer. New inserts and updates land here as row-oriented records, optimized for fast individual writes. When L1-delta reaches a configured threshold (e.g., 100,000 tuples), data is transformed from row-wise to columnar format and moved to L2-delta.
L2-delta (columnar, local dictionary): An intermediate columnar buffer. Data here uses column-oriented storage with locally-scoped value dictionaries — fast to build but not globally optimal. As L2-delta grows, a merge operation reconciles local dictionaries with the global dictionary and migrates data to the main store.
Main store (compressed columnar, global dictionary): The primary analytical store. Data is heavily compressed using a globally-optimized dictionary encoding, where each distinct value is assigned a single integer code used across the entire column. Global dictionaries enable integer-based comparisons during query execution (comparing dictionary codes rather than string values), dramatically accelerating scans and joins.
The three-layer architecture manages the HTAP trade-off by decoupling write ingestion speed (L1-delta accepts writes at OLTP speed) from analytical read performance (main store provides maximum scan throughput) with an intermediate reconciliation layer (L2-delta batches the format conversion). The cost is memory consumption — all three layers reside in RAM — and merge overhead, which increases under sustained high update rates as the dictionaries at each layer diverge.
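Global dictionary encoding — the main-store technique that turns string comparisons into integer comparisons — can be sketched in a few lines. The function names are ours, not SAP's:

```python
def dictionary_encode(values):
    """Global dictionary encoding: each distinct value gets one integer
    code; the column itself becomes an array of codes."""
    dictionary = sorted(set(values))    # sorted -> order-preserving codes
    code_of = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [code_of[v] for v in values]

def scan_equals(codes, dictionary, needle):
    """Predicate evaluation on integer codes, not strings."""
    try:
        target = dictionary.index(needle)   # one string lookup...
    except ValueError:
        return []                           # value absent from the column
    return [i for i, c in enumerate(codes)  # ...then pure integer compares
            if c == target]
```

The L2-delta-to-main merge described above is essentially the cost of rebuilding `dictionary` and re-deriving `codes` so that every value in the column shares one global code space.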
HANA's commercial significance extends beyond its technical architecture: as the foundation of SAP's enterprise platform (S/4HANA), it runs the transaction processing and analytics for many of the world's largest corporations, making its architectural decisions — and limitations — practically relevant at enormous scale.
Shared-Nothing vs. Shared-Disk Architecture
Two fundamental distributed database architectures that define how nodes in a cluster share (or don't share) storage resources.
In a shared-nothing architecture, each node has its own processor, memory, and storage. Data is horizontally partitioned across nodes — each node "owns" a subset. Nodes communicate only via network messages to exchange intermediate results during distributed queries. Teradata, Greenplum, and early Hadoop exemplified shared-nothing: scaling meant adding nodes, each with its own storage. The advantage is elimination of storage contention. The disadvantages are data redistribution costs when scaling (rebalancing partitions) and the need to ship data between nodes for joins across partitions.
In a shared-disk (or shared-storage) architecture, all nodes access the same storage layer — typically cloud object storage. Nodes are stateless (or have local caches but no persistent local data). Snowflake pioneered shared-storage for analytical databases: all compute nodes read from S3, with local SSD caching for hot data. Scaling compute is instant (add nodes without data movement); scaling storage is independent (data grows in S3 without touching compute).
The modern cloud data stack has overwhelmingly converged on shared-storage architecture — often called "disaggregated storage and compute." Snowflake, BigQuery, Databricks serverless, ClickHouse Cloud, and most new analytical databases use this model. The shared-nothing model persists primarily in on-premises deployments and in OLTP databases where co-located storage provides the lowest latency for single-record operations.
For HTAP, the architecture choice has direct implications: shared-nothing HTAP (TiDB) partitions data across nodes with Raft-based replication; shared-storage HTAP would allow OLTP and OLAP engines to access the same data without replication, but introduces object storage latency into the transaction path — a trade-off most OLTP workloads can't accept.
Snowflake Unistore / Hybrid Tables
Snowflake's approach to adding OLTP capabilities to its analytical platform. Hybrid Tables are a table type that supports low-latency, single-row operations (point lookups, inserts, updates, deletes) alongside Snowflake's standard analytical query capabilities — effectively bringing transactional workload support into a system designed from the ground up for analytics.
Hybrid Tables use a different storage engine than standard Snowflake tables. Where standard tables use immutable, columnar micro-partitions optimized for bulk analytical scans, Hybrid Tables use a row-oriented storage engine that supports individual record mutations with ACID guarantees, secondary indexes for fast lookups, and referential integrity constraints (unique keys, foreign keys). The data is replicated into Snowflake's standard columnar storage for analytical access.
Unistore is the broader initiative: a "unified" transactional and analytical platform. The vision is that operational applications (web backends, APIs, real-time services) write to Hybrid Tables with OLTP-grade latency, while analytical queries and dashboards read from the same data through Snowflake's standard analytical engine.
In the HTAP taxonomy, Unistore approximates Architecture A (Primary Row Store + In-Memory Column Store) in a cloud-native context. The same freshness-isolation trade-offs apply: tight coupling between the OLTP and OLAP stores enables fresher analytics but introduces potential resource contention.
Unistore represents Snowflake's convergence from the warehouse side toward full-stack infrastructure, mirroring Databricks' convergence from the lakehouse side via Lakebase. Both are building the OLTP half they were missing — with the inherited constraints of platforms designed when the primary workload was analytical BI.
TiDB
A distributed NewSQL database that provides HTAP capabilities through a Raft-based architecture (Architecture B: Distributed Row Store + Column Store Replica). Developed by PingCAP, TiDB is one of the most architecturally sophisticated open-source HTAP systems and the most prominent implementation of the learner-node pattern.
TiDB's architecture has three components. TiDB Server is a stateless SQL layer that parses queries, generates execution plans, and routes operations to the appropriate storage engine. TiKV is a distributed key-value store using Raft consensus, providing row-oriented OLTP storage with strong consistency. Data is partitioned into Regions (default 96 MB key ranges), each managed by an independent Raft group. TiFlash is a columnar storage engine that maintains read-only analytical replicas via Raft learner nodes.
The cross-engine optimizer is TiDB's key innovation for HTAP. When a query arrives, the optimizer evaluates the cost of executing each sub-plan against either TiKV (row store, optimal for point lookups and small range scans) or TiFlash (column store, optimal for full scans and aggregations). Sub-plans are routed independently — a single query might read some tables from TiKV and others from TiFlash, with results merged in the TiDB SQL layer. A global Two-Phase Commit protocol coordinates distributed transactions across TiKV nodes.
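A caricature of that routing decision: estimate a per-engine cost for the sub-plan and send it to the cheaper engine. The constants below are invented and bear no relation to TiDB's actual cost model; they exist only to show the shape of the choice:

```python
def route_subplan(est_rows_scanned, est_rows_returned, has_index):
    """Toy cost-based routing in the spirit of a cross-engine optimizer:
    point lookups and narrow index reads favor the row store; wide scans
    and aggregations favor the column store. Constants are illustrative."""
    # Row store: pay per row touched (index narrows that to rows returned).
    row_store_cost = (est_rows_returned if has_index
                      else est_rows_scanned) * 1.0
    # Column store: cheap per-row scan, but a fixed startup overhead.
    col_store_cost = est_rows_scanned * 0.01 + 50.0
    return "tikv" if row_store_cost <= col_store_cost else "tiflash"
```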
TiDB's freshness trade-off is explicit: Raft log replication from leader to learner nodes introduces 10–1000ms of latency. Analytical queries on TiFlash see data that is milliseconds to seconds behind the transactional state. For most HTAP use cases (dashboards, reporting, trend analysis), this staleness is acceptable. For AI agents that need to act on the most recent data, it may not be.
Two-Phase Commit (2PC)
A distributed transaction protocol that ensures atomicity across multiple independent participants — either all participants commit a transaction, or all abort it. 2PC is the standard protocol for coordinating distributed writes in systems where data spans multiple nodes, databases, or services.
Phase 1 — Prepare: The coordinator sends a prepare request to all participants. Each participant acquires locks, writes preliminary changes to its log, and responds with either "vote-commit" (ready to commit) or "vote-abort" (unable to proceed).
Phase 2 — Commit/Abort: If all participants voted to commit, the coordinator sends a commit message and each participant finalizes the transaction. If any participant voted to abort, the coordinator sends abort to all participants, and each rolls back its changes.
The protocol guarantees atomicity — no partial commits — but introduces costs. Latency: two network round-trips (prepare + commit) are required, with participants holding locks between phases. For geographically distributed participants, this latency can be significant. Blocking: if the coordinator fails between phases, participants are stuck holding locks until the coordinator recovers (the "blocking problem" — solved by Three-Phase Commit at the cost of additional round-trips, though 3PC is rarely used in practice).
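The two phases reduce to a short sketch — with the caveat that real implementations must persist coordinator and participant state at every step precisely so they can survive the failures discussed above. Class and method names are invented:

```python
class Participant:
    """Toy 2PC participant: votes during prepare, then commits or aborts."""

    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self.state = "init"

    def prepare(self):
        # Real systems acquire locks and force a log record here.
        self.state = "prepared" if self.healthy else "aborted"
        return self.healthy              # True = vote-commit

    def finish(self, commit):
        self.state = "committed" if commit else "aborted"

def two_phase_commit(participants):
    # Phase 1: prepare -- the decision requires a unanimous vote-commit.
    decision = all(p.prepare() for p in participants)
    # Phase 2: broadcast the decision; everyone finalizes the same way.
    for p in participants:
        p.finish(decision)
    return decision
```

The blocking problem is visible in the gap between the two phases: a participant that has returned `True` from `prepare` holds its locks until `finish` arrives, and in this sketch nothing else can release them.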
TiDB uses a Percolator-inspired 2PC protocol for distributed transactions across TiKV nodes. BigQuery uses 2PC for multi-statement transactions. Any distributed database that supports cross-partition writes uses some form of 2PC or an equivalent protocol — Spanner, for example, layers 2PC over Paxos-replicated participant groups.
For AI infrastructure, 2PC becomes relevant when agent operations span multiple data stores — writing a result to one table, logging a reasoning chain to another, and updating a governance record in a third, atomically. The latency overhead of 2PC is acceptable for human-speed operations but becomes a bottleneck at agent-speed concurrency.
Vectorized Execution
A query processing technique where the execution engine processes data in batches (vectors) of typically 1,024–10,000 values at a time, rather than one row at a time (the volcano/iterator model) or materializing entire columns in memory (full materialization).
The traditional volcano model processes one tuple per function call: the SUM operator calls the filter operator, which calls the scan operator, each returning a single row. Each call involves function dispatch, type checking, and branch prediction overhead. For analytical queries processing billions of rows, this per-tuple overhead dominates execution time — the engine spends more time on control flow than on actual computation.
Vectorized execution amortizes this overhead across batches. The scan operator returns a vector of 1,024 values. The filter operator applies a predicate to all 1,024 values in a tight loop. The SUM operator aggregates all 1,024 values. Each function call processes a batch, reducing the per-value overhead by 1,000x.
Vectorized execution also exploits modern CPU architecture. SIMD instructions (Single Instruction, Multiple Data) process 4, 8, or 16 values in a single CPU cycle — but only on contiguous, homogeneous data arrays, which vectors provide naturally. Cache efficiency: vectors are sized to fit in L1/L2 CPU cache (1,024 × 8 bytes = 8 KB per column vector), avoiding the cache misses that plague row-at-a-time processing. Branch prediction: tight loops over homogeneous data have highly predictable branches, avoiding the pipeline stalls that irregular tuple-at-a-time access patterns cause.
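The batching idea can be shown even in plain Python, which gains none of the SIMD or cache benefits — the point is only the shape of the operators and where the per-value overhead goes:

```python
def filtered_sum_tuple_at_a_time(rows):
    """Volcano style: control flow touches every row individually."""
    total = 0
    for row in rows:            # each iteration = one per-tuple dispatch
        if row > 100:
            total += row
    return total

def filtered_sum_vectorized(column, batch_size=1024):
    """Vectorized style: each operator consumes a whole batch. Real
    engines run these tight loops over contiguous typed arrays so the
    compiler can emit SIMD; this sketch shows only the batching shape."""
    total = 0
    for start in range(0, len(column), batch_size):
        batch = column[start:start + batch_size]   # scan: one batch
        selected = [v for v in batch if v > 100]   # filter: tight loop
        total += sum(selected)                     # sum: tight loop
    return total
```

Both functions compute the same answer; the difference in a real engine is that the vectorized version pays its function-call and dispatch costs once per 1,024 values instead of once per value.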
DuckDB's entire architecture is built around vectorized execution. Snowflake, Databricks Photon, ClickHouse, and Velox all use vectorized engines for their analytical query paths. The technique is one of the primary reasons modern analytical databases can process billions of rows per second on commodity hardware — a 100x improvement over the row-at-a-time engines of the previous generation.
Write Amplification
A phenomenon where a single logical write operation results in multiple physical writes to storage — amplifying the total I/O, wear on solid-state drives, and energy consumption beyond what the application-level write suggests.
Write amplification appears throughout data infrastructure at multiple levels:
LSM trees: A record written to the memtable is flushed to L0 as an SSTable, then compacted to L1, then L2, and so on. Each compaction level rewrites the data. Total amplification of 10–30x is common in write-heavy workloads: a 1 GB logical write may produce 10–30 GB of physical I/O.
Copy-on-write table formats: In Iceberg, Delta Lake, and Hudi (CoW mode), updating a single record requires rewriting the entire data file (typically 50–500 MB) containing that record. If a file has 1 million rows and you update one, the write amplification is 1,000,000x at the record level — though the operation is framed as a file-level rewrite, not a record-level one.
HTAP delta merge: Every record written to the OLTP delta must eventually be merged into the analytical column store — effectively writing each record twice (once to the delta, once to the column store), plus the overhead of the merge operation itself (dictionary reconciliation, compression, file rewriting).
Snowflake micro-partitions: An UPDATE to one row in a micro-partition creates an entirely new micro-partition containing all rows (modified and unmodified). The old micro-partition is marked for garbage collection. For a 100 MB micro-partition with 100,000 rows, updating one row produces ~100 MB of write I/O.
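The arithmetic behind these numbers fits in two deliberately crude helpers. The LSM formula is a back-of-envelope model (compaction-style dependent in reality), not a property of any particular engine:

```python
def lsm_write_amplification(levels, fanout=10):
    """Rough LSM estimate: one write to the WAL, one flush to L0, and on
    average ~fanout/2 rewrites per level during compaction into it.
    A crude model -- real factors depend on the compaction strategy."""
    return 1 + 1 + levels * fanout / 2

def cow_file_amplification(file_rows, rows_updated=1):
    """Copy-on-write: updating k rows rewrites the whole file, so the
    record-level amplification is rows-per-file / rows-updated."""
    return file_rows / rows_updated
```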
Write amplification matters for AI workloads because agents write frequently: logging reasoning chains, updating state, storing evaluation results, modifying feature values. Each logical agent write triggers potentially massive physical I/O cascades through the storage stack. Understanding and minimizing write amplification is essential for both performance and energy efficiency in AI-native infrastructure.
Part of The Sciencer Wiki. Maintained by The Sciencer Company.