Data Storage Architectures

Technical definitions for data storage concepts: columnar and row-oriented storage, micro-partitions, open table formats, lakehouses, warehouses, ACID transactions, data skipping, schema evolution, and the storage primitives that shape how infrastructure performs under BI and AI workloads.

How data is physically stored determines what you can do with it efficiently — and what you cannot do at all. This is not a theoretical concern. Every query your system runs, every model it trains, every agent action it takes is constrained by decisions made at the storage layer: how bytes are arranged on disk, what metadata is maintained, which access patterns are optimized, and which are penalized.

The modern data stack is built on storage architectures optimized for business intelligence: columnar formats, aggregation-friendly layouts, metadata designed for human-driven query patterns, and governance models that assume read-dominant access by human analysts. These architectures are genuinely excellent at what they were designed for. Understanding them — in detail, at the level of physical data organization — is essential to understanding both why they work so well for BI and why they struggle with AI workloads that violate every assumption they encode.

This cluster covers the storage primitives: the formats, layouts, table formats, platform-specific implementations, and architectural patterns that form the physical foundation of modern data infrastructure.


ACID Transactions

A set of four properties — Atomicity, Consistency, Isolation, Durability — that guarantee reliable processing of database operations, even in the presence of errors, crashes, or concurrent access.

Atomicity means a transaction either completes entirely or has no effect. If a pipeline writes 10,000 records and fails on record 9,001, atomicity guarantees that none of the 10,000 appear in the table — there are no partial writes. Consistency means every transaction moves the database from one valid state to another, enforcing all defined constraints (data types, uniqueness, foreign keys). Isolation means concurrent transactions don't interfere with each other — two pipelines writing to the same table simultaneously produce the same result as if they ran sequentially. Durability means once a transaction is committed, it survives system failures — the data is persisted to stable storage.
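Atomicity in file-based systems usually reduces to a single atomic pointer or rename operation. A minimal sketch (not any real table-format API — the function name and file layout are illustrative) showing how a batch write can be made all-or-nothing:

```python
import json
import os
import tempfile

def atomic_commit(table_dir, records):
    """Make a batch of writes visible all-at-once or not at all.

    Sketch: rows land in a temp file first; os.replace() -- atomic on
    POSIX and Windows -- is the single step that commits. A crash before
    os.replace() leaves the previous table state untouched.
    """
    fd, tmp_path = tempfile.mkstemp(dir=table_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        for rec in records:  # a failure mid-loop leaves only the temp file
            f.write(json.dumps(rec) + "\n")
        f.flush()
        os.fsync(f.fileno())  # durability: bytes reach stable storage first
    os.replace(tmp_path, os.path.join(table_dir, "data.jsonl"))  # atomic commit
```

Lakehouse transaction logs apply the same idea one level up: the data files are written first, and a single metadata commit makes them visible.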

ACID transactions were long considered a property of databases, not data lakes. The data lake era (2012–2018) largely abandoned ACID guarantees in favor of raw file dumps on HDFS or S3 — which is why data lakes became "data swamps." The lakehouse revolution was fundamentally about bringing ACID back to lake storage. Both Delta Lake and Apache Iceberg achieve ACID semantics through metadata-layer transaction logs, enabling atomic commits, snapshot isolation, and conflict detection on top of object storage (S3, GCS, ADLS) that offers no native multi-object transaction primitives.

For AI workloads, ACID matters more than it did for BI. When multiple agents write concurrently, when models depend on feature tables being consistent, and when pipelines must be auditable, ACID guarantees prevent the silent data corruption that currently plagues 70–85% of AI projects.


Apache Hudi

An open-source data lake storage framework (originally "Hadoop Upserts Deletes and Incrementals") that provides ACID transactions, record-level upserts, incremental processing, and data versioning on data lakes. Developed at Uber to solve the specific problem of keeping analytical tables in sync with rapidly changing operational databases — a use case where traditional batch ETL produced tables that were always hours stale.

Hudi distinguishes itself from Delta Lake and Iceberg with two table types. Copy-on-Write (CoW) tables rewrite entire Parquet files on every update — optimizing for read performance at the cost of write amplification. Merge-on-Read (MoR) tables write updates to row-based delta logs and merge them with the columnar base during reads — optimizing for write performance at the cost of read-time merge overhead. This mirrors the HTAP challenge of balancing write efficiency against read performance, but at the table format level.

Hudi also provides native incremental queries: the ability to read only records that changed since a given timestamp, without scanning the full table. Combined with a timeline-based metadata system (every commit, compaction, and cleaning is recorded on the timeline), Hudi supports change-stream consumption patterns that neither Iceberg nor Delta Lake natively replicate with the same efficiency.
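An incremental query is, logically, a filter on commit instants recorded in the timeline. A toy sketch of the idea (the `_hoodie_commit_time` metadata column is real Hudi; the filtering here is illustrative — real incremental reads use timeline metadata to avoid scanning at all):

```python
def incremental_read(records, since_instant):
    """Return only records committed after a given timeline instant.

    Sketch of the consumption pattern: Hudi stamps every record with
    _hoodie_commit_time, so a downstream job can ask for 'everything
    since my last checkpoint' instead of re-reading the full table.
    """
    return [r for r in records if r["_hoodie_commit_time"] > since_instant]
```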

In practice, Hudi's adoption is strongest at companies with high-throughput CDC workloads (Uber, Amazon, ByteDance). It has less ecosystem traction than Iceberg or Delta Lake in the broader lakehouse market, partly because its design choices prioritize ingestion-heavy use cases over the general-purpose analytics that dominate enterprise data teams.


Apache Iceberg

An open table format for large-scale analytical datasets, originally developed at Netflix and donated to the Apache Foundation. Iceberg adds a metadata layer on top of file formats like Parquet or ORC, providing schema evolution, hidden partitioning, time travel, snapshot isolation, and engine-agnostic access — without locking users into a specific compute engine.

Iceberg organizes data through a three-level metadata hierarchy. A metadata file (JSON) captures the table's current schema, partition spec, and properties. It points to a manifest list (Avro) that enumerates all manifests for a given snapshot. Each manifest (Avro) lists individual data files with per-file statistics: record counts, column-level min/max values, null counts, and size in bytes. This indirection enables atomic commits (swapping one metadata file pointer), concurrent writes (optimistic concurrency on the metadata file), and efficient partition pruning (evaluating predicates against manifest-level statistics without listing files in object storage).
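The partition-pruning step can be sketched in a few lines. Real manifests are Avro files with much richer per-file statistics; here each entry models only one column's min/max (paths and field names are illustrative):

```python
# Hypothetical manifest entries: per-data-file min/max for one column.
manifest = [
    {"path": "f1.parquet", "min": "2025-01-01", "max": "2025-01-31"},
    {"path": "f2.parquet", "min": "2025-03-01", "max": "2025-03-31"},
    {"path": "f3.parquet", "min": "2025-02-01", "max": "2025-02-28"},
]

def prune(manifest, lo, hi):
    """Keep only data files whose [min, max] range can overlap [lo, hi].

    Everything else is skipped using metadata alone -- no listing or
    reading of files in object storage.
    """
    return [e["path"] for e in manifest if e["max"] >= lo and e["min"] <= hi]
```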

Hidden partitioning is an underappreciated design decision. Unlike Hive-style partitioning — where partition columns are encoded in directory paths (/year=2025/month=03/) and the partition scheme is visible to and must be known by query authors — Iceberg partitioning is defined in metadata. Users write WHERE event_time > '2025-03-01' and the engine automatically maps the predicate to the correct partitions. The partition scheme can evolve (e.g., from monthly to daily) without rewriting data; new data uses the new scheme and Iceberg's metadata tracks both.
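The essence of hidden partitioning is that the partition value is a *function* of a source column, applied by the engine. A sketch (not Iceberg code — function names are illustrative) of a month transform and the predicate-to-partition mapping the engine performs:

```python
from datetime import date

def month_partition(d):
    """Iceberg-style partition transform: derive the partition value
    from the source column, so query authors never reference it."""
    return f"{d.year:04d}-{d.month:02d}"

def partitions_for_range(start, end):
    """Map WHERE event_time BETWEEN start AND end to the partitions
    that can contain matching rows (illustrative sketch)."""
    parts, y, m = [], start.year, start.month
    while (y, m) <= (end.year, end.month):
        parts.append(f"{y:04d}-{m:02d}")
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return parts
```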

Iceberg's Write-Audit-Publish (WAP) workflow enables table-level branching: write data to a branch, run quality checks, then atomically publish the branch to main. This is the closest any production table format comes to Git-like versioning, though it remains limited to single-table scope and fast-forward merges.

Iceberg's original creators (Ryan Blue et al.) founded Tabular, which Databricks acquired in 2024 — a signal of the market's convergence on Iceberg as a de facto interoperability standard. Snowflake, Databricks (via UniForm), Trino, Flink, Spark, Dremio, and Starburst all support reading and writing Iceberg tables.


Apache ORC

The Optimized Row Columnar file format — a columnar storage format originally developed within the Apache Hive project for Hadoop-based analytics. ORC emerged alongside Parquet in 2013 and was long the default format for the Hive/Hadoop ecosystem.

ORC files are organized into stripes (typically 250 MB, analogous to Parquet's row groups), each containing column data, an index with min/max/sum statistics per column, and a stripe footer with encoding information. A file-level footer stores schema information and stripe-level statistics. ORC supports three levels of data skipping: file-level statistics (skip entire files), stripe-level statistics (skip stripes), and row-level indexes (skip groups of 10,000 rows within a stripe).

ORC natively supports ACID transactions through a delta-file mechanism — something Parquet never attempted. When rows are updated or deleted, ORC writes delta files that record the changes; a reader merges the base data with deltas at read time. This made ORC the only file format that could support Hive ACID tables (INSERT, UPDATE, DELETE) without a separate metadata layer.

In the modern lakehouse ecosystem, Parquet has largely won the format war due to broader engine support and Iceberg/Delta Lake's choice of Parquet as their default data format. ORC remains significant in legacy Hive deployments and as an option for Iceberg tables (Iceberg is format-agnostic and supports both ORC and Parquet as underlying storage).


Apache Parquet

A columnar file format that has become the standard on-disk storage format for analytical data across the modern data stack. Originally developed as a joint project between Twitter and Cloudera, Parquet is now the default data format for Delta Lake and the most common format for Iceberg tables; even Snowflake's proprietary micro-partition format follows a similar columnar, statistics-rich design.

Parquet files are organized into row groups (horizontal partitions of 64–512 MB, typically ~128 MB by default), containing column chunks (one per column per row group), containing pages (the unit of compression and encoding, typically 1 MB). Each column chunk stores statistics — min value, max value, null count, distinct count — used for predicate pushdown and data skipping. Each page can be independently compressed and encoded.

The format supports multiple encoding schemes optimized for different data distributions. Dictionary encoding maps repeated values to integer codes (highly effective for low-cardinality columns like status codes or country names). Run-length encoding compresses sequences of identical values. Delta encoding stores the difference between consecutive values (efficient for sorted or nearly-sorted numeric columns like timestamps). Bit-packing uses the minimum number of bits per value. These encodings compound: a column of 10 million country codes might compress from 200 MB to 2 MB through dictionary encoding alone.
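How the encodings compound can be shown with a toy two-pass encoder (this is the idea, not Parquet's actual implementation): dictionary encoding first, then run-length encoding over the resulting codes — which is why sorted, low-cardinality columns compress so well.

```python
def dictionary_encode(values):
    """First pass: map each distinct value to a small integer code."""
    dictionary = sorted(set(values))
    code = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [code[v] for v in values]

def run_length_encode(codes):
    """Second pass: collapse runs of identical codes into [code, count]."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1   # extend the current run
        else:
            runs.append([c, 1])
    return runs
```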

Parquet supports nested data structures (maps, lists, structs) via the Dremel encoding described in Google's 2010 paper "Dremel: Interactive Analysis of Web-Scale Datasets." This encoding uses definition and repetition levels to flatten arbitrary nesting into columnar format without losing structural information — enabling columnar storage of JSON-like semi-structured data.

Multiple compression codecs can be applied on top of encoding: Snappy (fast compression/decompression, moderate ratio), ZSTD (high ratio, reasonable speed), GZIP (high ratio, slower), and LZ4 (fastest, lowest ratio). The choice of codec is a latency-vs-storage trade-off configured per column or per table.


Bloom Filter

A probabilistic data structure used for set membership testing. A Bloom filter can tell you with certainty that an element is not in a set, or with high probability that it is — but never with certainty (false positives are possible, false negatives are not).

A Bloom filter is a bit array of m bits, initially all zeros, with k independent hash functions. To add an element, hash it with all k functions and set the corresponding bits to 1. To query, hash the element and check if all k bits are set — if any bit is 0, the element is definitely absent; if all are 1, the element is probably present. The false positive rate depends on the ratio of m (bits) to n (elements) and the number of hash functions k. For a 1% false positive rate with one million elements, a Bloom filter requires approximately 1.2 MB — compared to storing the actual values, which might require 50+ MB.
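The mechanics fit in a few lines. A minimal sketch (production filters use much faster hash functions than SHA-256; this derives k hashes by salting one):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit array and k salted hash functions."""

    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # Any unset bit -> definitely absent; all set -> probably present.
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))
```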

In data infrastructure, Bloom filters appear at multiple levels. Parquet files can include column-level Bloom filters for high-cardinality columns where min/max statistics are ineffective (e.g., user IDs, UUIDs). Snowflake uses Bloom filters within micro-partition metadata to accelerate point lookups on high-cardinality columns. Delta Lake supports Bloom filter indexes as an optional optimization. HBase and other LSM-tree-based stores use Bloom filters to avoid unnecessary disk reads during point lookups.

The key insight is that Bloom filters trade a small, bounded probability of wasted work (false positives cause unnecessary reads) for the elimination of much larger quantities of wasted work (true negatives skip entire partitions, files, or disk pages). In a system with thousands of partitions, a Bloom filter check that takes microseconds can save seconds of I/O.


Columnar Storage

A data storage layout where values from the same column are stored contiguously on disk, as opposed to row-oriented storage where all values from the same row are stored together. Columnar storage is the foundational design decision underlying modern analytical data infrastructure.

Consider a table with 200 columns and 100 million rows. An analytical query SELECT SUM(revenue) FROM sales WHERE region = 'EMEA' touches only 2 of 200 columns. In a row-oriented store, the engine reads all 200 columns for every row because they're physically adjacent — discarding 99% of the data after reading it. In a columnar store, the engine reads only the revenue and region columns, skipping the other 198 entirely. For a typical analytical query touching 5–10 columns, columnar storage reduces I/O by 95–97%.
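The I/O arithmetic above can be made explicit with a back-of-envelope model (column widths are hypothetical and compression is ignored):

```python
def bytes_scanned_row_store(n_rows, col_bytes):
    """A row store reads every column of every row it scans,
    because all columns of a row are physically adjacent."""
    return n_rows * sum(col_bytes.values())

def bytes_scanned_column_store(n_rows, col_bytes, needed):
    """A column store reads only the columns the query references."""
    return n_rows * sum(col_bytes[c] for c in needed)
```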

Columnar layouts also compress dramatically better than row storage. Adjacent values in a column tend to be of the same type and similar distribution: a column of dates contains only dates, a column of country codes contains only country codes. This homogeneity enables encoding techniques (dictionary, run-length, delta, bit-packing) that achieve 5–20x compression ratios. Row storage, by contrast, interleaves heterogeneous types (string, integer, float, boolean, timestamp) within each row, limiting compression effectiveness.

The fundamental trade-off: row-oriented storage is faster for point lookups (fetching all columns of a single record is one sequential read) and inserts (writing a complete row is one sequential write). Columnar storage is faster for analytical scans (reading few columns across many rows) and aggregations (column-level operations like SUM, AVG, COUNT benefit from memory locality and SIMD vectorization).

This storage-level trade-off is the physical origin of the OLTP/OLAP split. It's why transactional databases (PostgreSQL, MySQL, Oracle) use row storage while analytical systems (Snowflake, BigQuery, Redshift, ClickHouse) use columnar storage. It's also why HTAP systems must maintain both layouts and synchronize between them — there is no single physical layout that is optimal for both workload types.


Copy-on-Write

A resource management technique where copies of data are not physically duplicated until one copy is modified. The "copy" is initially just a metadata pointer that shares all physical storage with the original. Only when data in the copy (or original) is modified does the system create independent physical copies of the affected segments.
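A minimal sketch of the mechanism (class and method names are illustrative, not any vendor's API): a "table" is a list of references to immutable segments, so a clone copies only the reference list, and physical divergence happens per modified segment.

```python
class CowTable:
    """Copy-on-write sketch: cloning duplicates metadata (the segment
    reference list), never the segment data itself."""

    def __init__(self, segments):
        self.segments = list(segments)   # references to shared segments

    def clone(self):
        return CowTable(self.segments)   # metadata-only copy, O(metadata)

    def replace_segment(self, index, new_segment):
        # Divergence: only the modified segment gets new physical storage;
        # every other segment stays shared with existing clones.
        self.segments[index] = new_segment
```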

In data infrastructure, copy-on-write appears at multiple levels. Snowflake's Zero-Copy Cloning creates table, schema, or database clones by duplicating metadata pointers to micro-partitions. Cloning a 10 TB table is nearly instantaneous and costs zero additional storage — until modifications cause divergence. HyPer (now part of Tableau/Salesforce) uses OS-level copy-on-write via fork() to create consistent OLAP snapshots from the OLTP process in microseconds. Apache Hudi's CoW table type rewrites entire Parquet files when records within them are updated — a different application of the same principle at the file level.

The trade-off depends on the write pattern. For read-heavy workloads with infrequent modifications, copy-on-write is nearly free — the metadata-only copy delivers massive space savings. For write-heavy workloads, copy-on-write can create memory pressure (HyPer sees up to 40% OLTP degradation under concurrent heavy writes) or write amplification (Hudi's CoW tables rewrite entire files for single-record updates).

Copy-on-write also underlies the branching semantics of modern data versioning systems: lakeFS creates branches as zero-copy metadata operations, with physical data duplication occurring only when branches diverge.


Data Lake

A centralized storage repository that holds raw data in its native format — structured, semi-structured, and unstructured — at any scale. Unlike a data warehouse, which enforces schema at write time, a data lake stores data first and applies schema at read time (schema-on-read).

Data lakes emerged in the early 2010s around Hadoop's HDFS, with the promise that storing all organizational data in one place — regardless of format or immediate use — would unlock analytical value that purpose-built data warehouses couldn't capture. As Hadoop declined, data lakes migrated to cloud object storage (S3, GCS, ADLS), which offered functionally unlimited capacity at roughly $0.023/GB/month (S3 Standard) — an order of magnitude cheaper than warehouse storage.

The core benefit is flexibility: any data type, any volume, any structure, low cost. The core risk is becoming a "data swamp" — a repository where data is stored but effectively unusable because no one can find it, trust it, or understand it. The swamp problem arises from the absence of the governance, metadata, quality, and transactional guarantees that warehouses provide by default.

Modern data lakes are rarely "just lakes" anymore. Open table formats (Iceberg, Delta Lake, Hudi) overlay ACID transactions, schema enforcement, and metadata management on lake storage. Data catalogs (Unity Catalog, AWS Glue, Atlan) provide discovery and governance. This convergence — adding warehouse capabilities to lake storage — is the lakehouse architecture.


Data Lakehouse

An architecture that combines the low-cost, flexible, multi-format storage of a data lake with the ACID transactions, governance, schema enforcement, and query performance of a data warehouse. The term was popularized by Databricks in 2020 and has since become an industry-standard category.

The architecture works by storing data in open file formats (primarily Parquet) on object storage (S3, ADLS, GCS), overlaid with a table format (Delta Lake, Apache Iceberg, Apache Hudi) that adds a transaction log, schema evolution, time travel, and metadata management. Compute engines read and write through the table format layer, which provides warehouse-like guarantees without the warehouse's proprietary storage format and associated vendor lock-in.

The lakehouse thesis is that organizations shouldn't need separate systems — and separate copies of data — for data warehousing (structured analytics), data science (ML workloads), and real-time analytics (streaming). In a traditional architecture, data flows from sources → data lake (raw storage) → data warehouse (curated analytics) → ML feature stores → model training — with each hop creating latency, cost, and potential inconsistency. The lakehouse collapses this into a single copy of data served through different compute engines.

In practice, both Snowflake and Databricks have converged toward lakehouse-like architectures from opposite directions. Snowflake — born as a cloud data warehouse — has added Iceberg table support, unstructured data handling (FILE type, Cortex AI), and Snowpark for data science workloads. Databricks — born as a Spark-based lakehouse — has added SQL Analytics, serverless SQL warehouses, and Lakebase (a PostgreSQL-protocol operational store). Each is building the missing half of the lakehouse, constrained by architectural decisions made when the primary consumer was a BI analyst, not an AI agent.


Data Skipping

An optimization technique where the query engine uses stored metadata to avoid reading data that cannot possibly match the query predicate — "skipping" irrelevant data files, partitions, or row groups entirely.

Data skipping relies on statistics collected at write time and stored in metadata. When a query includes WHERE date = '2025-03-15', the engine checks each file's (or partition's or row group's) metadata to see if the date column's value range includes March 15. If the min/max range for a given file is 2025-01-01 to 2025-01-31, that file is skipped — no bytes are read from storage.

The effectiveness of data skipping depends on data layout. If records are sorted or clustered by the filter column, each file contains a narrow value range and data skipping eliminates 95%+ of files. If records are randomly distributed, every file's range spans the entire value domain and nothing can be skipped. This is why clustering (Snowflake's automatic micro-partition clustering, Databricks' Liquid Clustering, Iceberg's sort orders) matters so much for query performance: it's not about sorting data for its own sake, it's about enabling data skipping.
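The clustering effect is easy to demonstrate. A sketch with toy per-file min/max statistics: the same column split across ten files, first sorted, then shuffled.

```python
def file_stats(values, rows_per_file):
    """Split a column across files and record per-file min/max metadata."""
    return [{"min": min(values[i:i + rows_per_file]),
             "max": max(values[i:i + rows_per_file])}
            for i in range(0, len(values), rows_per_file)]

def files_scanned(stats, point):
    """Count files whose [min, max] range could contain the lookup value."""
    return sum(1 for s in stats if s["min"] <= point <= s["max"])
```

With a sorted layout each file covers a narrow range and a point lookup touches one file; after shuffling, every file's range spans nearly the whole domain and nothing can be skipped.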

Multiple statistics support data skipping at different levels: column-level min/max (range filters), null counts (IS NULL/IS NOT NULL), Bloom filters (exact match on high-cardinality columns), and dictionary membership (equality predicates on encoded columns). The more sophisticated the statistics, the more queries can benefit from skipping.

Data skipping is the primary reason analytical query engines can process petabyte-scale tables in seconds: they read only the fraction of data relevant to the query, not the full dataset.


Data Warehouse

A centralized analytical database designed for complex queries across large volumes of structured data. Data warehouses enforce schema at write time, optimize for read-heavy analytical workloads, and provide ACID transactions, governance, and SQL-based querying.

The data warehouse model originated in the late 1980s (IBM's term) and was formalized by Bill Inmon (enterprise data warehouse) and Ralph Kimball (dimensional modeling, star schemas). For three decades, warehouses were the authoritative system for business reporting and analytics. The original on-premises generation (Teradata, Oracle Exadata, IBM Netezza) was replaced by the cloud generation (Snowflake, BigQuery, Amazon Redshift) starting around 2012, with Snowflake's separation of storage and compute becoming the defining architectural innovation.

Cloud data warehouses are columnar, disaggregated (separate storage and compute), elastically scalable, and fully managed. They excel at the workloads they were designed for: dashboards, reports, ad-hoc SQL queries, and BI tool integration. Their governance models — RBAC, row-level security, column masking, data sharing — assume human consumers who access data through structured interfaces.

The challenge is that data warehouses are now being asked to serve workloads they weren't designed for: ML training (which requires full-table scans, not selective queries), AI agent access (which requires machine-speed concurrent reads and writes, not human-latency batch queries), and multi-modal data (which requires documents and images alongside structured tables, not just structured columns). Both Snowflake and Databricks are extending their platforms toward these use cases, but the extensions (Snowflake's FILE type, Databricks' Mosaic AI) sit on top of foundations optimized for BI.


Delta Lake

An open-source storage framework developed by Databricks that brings ACID transactions, scalable metadata handling, and time travel to data lakes. Delta Lake stores data as Parquet files with an additional JSON-based transaction log (_delta_log/) that records every operation as an ordered sequence of actions.

The transaction log is Delta Lake's defining innovation. Every write (INSERT, UPDATE, DELETE, MERGE) creates a new JSON file in the _delta_log/ directory. The log records which Parquet files were added and removed, enabling atomic operations (a commit either fully appears or doesn't), time travel (read the table as of any historical version), and concurrent write conflict detection (optimistic concurrency). Periodically, checkpoint files (Parquet format) consolidate the JSON logs for faster reads.
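The log's semantics can be captured in a toy model (a sketch of the protocol's idea, not Delta Lake's actual on-disk format): each commit is an ordered list of add/remove file actions, and the table at version v is the set of files added but not yet removed through v — which is also exactly how time travel works.

```python
import json

class TransactionLogSketch:
    """Toy Delta-style log: commit index == table version."""

    def __init__(self):
        self.commits = []

    def commit(self, actions):
        self.commits.append([json.dumps(a) for a in actions])
        return len(self.commits) - 1   # new version number

    def snapshot(self, version):
        """Replay actions through `version` to get the live file set."""
        live = set()
        for commit in self.commits[: version + 1]:
            for raw in commit:
                action = json.loads(raw)
                if action["op"] == "add":
                    live.add(action["path"])
                else:  # "remove"
                    live.discard(action["path"])
        return live
```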

Key capabilities beyond ACID: Time Travel enables querying historical table versions by version number or timestamp (default 30-day retention, configurable). MERGE operations support record-level upserts (update if exists, insert if not) — critical for CDC ingestion patterns. Schema enforcement rejects writes that don't match the table schema. Schema evolution allows adding columns without rewriting data. Liquid Clustering (replacing the earlier Z-ordering) incrementally reorders data to improve data skipping effectiveness.

Delta Lake's lineage model is a linear sequence of versions — there's no native branching or merging. You can read version N or roll back to version N, but you can't create a branch at version N, make independent changes, and merge them back. This is adequate for BI (where data flows in one direction) but insufficient for the experimentation-heavy workflows AI teams need.

Delta Lake is Databricks' native table format, but Iceberg interoperability via UniForm allows Delta tables to be read as Iceberg tables by external engines — an acknowledgment that the market is converging on Iceberg as the interoperability standard.


Dictionary Encoding

A compression technique where each distinct value in a column is assigned a compact integer code, and the stored data consists of those integer codes rather than the original values. A column of 10 million rows containing 50 distinct country names stores 10 million small integers (a few bits each) plus a 50-entry dictionary, rather than 10 million variable-length strings.

Dictionary encoding is one of the most impactful compression techniques in columnar storage because analytical data is full of low-to-medium cardinality columns: status codes, category labels, region names, department names, currency codes, boolean flags. For a column with 100 distinct values across 100 million rows, dictionary encoding reduces storage by 90–99% depending on the average string length.
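The savings claim is simple arithmetic. A back-of-envelope model (ignoring page headers and any compression applied on top of the codes):

```python
import math

def dict_encoding_sizes(n_rows, n_distinct, avg_value_bytes):
    """Return (plain_bits, encoded_bits) for a column stored as raw
    values vs. as fixed-width integer codes plus a dictionary."""
    plain = n_rows * avg_value_bytes * 8
    bits_per_code = max(1, math.ceil(math.log2(n_distinct)))
    encoded = n_rows * bits_per_code + n_distinct * avg_value_bytes * 8
    return plain, encoded
```

For the 10-million-row, 50-country example above (assuming ~8-byte country names), codes need only 6 bits each, so the encoded column is roughly a tenth of the plain one before any further compression.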

Parquet uses dictionary encoding as the default first pass for all column types, falling back to plain encoding only when the dictionary exceeds a size threshold (typically 1 MB per page). This means most columns in a well-structured analytical table are dictionary-encoded by default, with no user intervention.

HTAP systems use dictionary encoding with additional sophistication. SAP HANA's main column store uses a global dictionary — a single dictionary shared across all rows for a column, enabling integer comparisons instead of string comparisons during query execution. HANA's delta store uses local dictionaries per column segment, which are merged into the global dictionary during the merge process. This multi-level dictionary approach balances write performance (local dictionaries are fast to build) against read performance (global dictionaries enable faster scans and joins).


Disaggregated Storage and Compute

An architectural pattern where data storage and query processing are separated into independent, elastically scalable layers connected by network. Storage lives in cloud object storage (S3, GCS, ADLS). Compute lives in ephemeral clusters or warehouses that spin up on demand, process queries, and shut down when idle.

Before disaggregation, analytical databases co-located storage and compute on the same physical servers — a model called "shared-nothing" architecture (Teradata, early Hadoop, Redshift). Scaling storage required adding servers, which also added unused compute; scaling compute also added unnecessary storage. The two resources were coupled, which meant over-provisioning was the norm — organizations paid for compute to store cold data, and paid for storage to have enough query capacity.

Snowflake pioneered disaggregation for cloud data warehousing by placing all data in S3 and running stateless compute clusters that cached hot data locally. This enabled: independent scaling (double storage without touching compute), multi-tenant isolation (different workloads run on separate compute clusters against shared data), and pay-per-query economics (compute charges only when running).

The Cambridge Report on Database Research notes that the industry has "widely adopted the concept of disaggregated storage and compute, enabling a high degree of scalability and flexibility." The trade-off is network latency: accessing data from S3 is ~10–100x slower than local SSD. This makes metadata-driven data skipping, aggressive caching, and intelligent data placement critical — without these optimizations, disaggregated architectures would be prohibitively slow.

Disaggregation is now the consensus architecture for cloud-native analytical systems (Snowflake, BigQuery, Databricks' serverless, ClickHouse Cloud, Firebolt) and is extending into transactional systems (Aurora, Neon, TiDB's compute layer).


FILE Data Type

A column type introduced by Snowflake (GA September 2025) that stores metadata pointers to files in Snowflake stages as first-class column values. FILE columns can appear alongside structured columns in the same table, enabling queries that JOIN structured data with unstructured files.

Combined with Cortex AI functions, FILE columns enable queries like:

SELECT customer_name,
       AI_PARSE_DOCUMENT(contract_pdf, 'renewal_date')
FROM customers
WHERE region = 'EMEA'

This is the first production implementation of treating unstructured data as a peer to structured columns within a warehouse. It represents Snowflake's approach to the multi-modal data problem: extend the relational model with pointers to unstructured content, and use AI functions to bridge the gap at query time.

The limitations reveal the BI-era storage architecture underneath. FILE values cannot appear in GROUP BY, ORDER BY, or CLUSTER BY clauses — because the micro-partition architecture has no concept of ordering, grouping, or clustering non-scalar values. FILE columns are unsupported in Hybrid tables (Snowflake's OLTP-capable tables), Iceberg tables, and External tables. File references become stale if underlying files are moved or modified, with no built-in reference integrity mechanism.

These aren't bugs — they're architectural constraints. The micro-partition storage model was designed for scalar values that can be aggregated, sorted, and compared. Extending it to file references is a bridge, not a native capability. Databricks takes a different approach: Unity Catalog Volumes provide governed file storage, but there is no FILE-equivalent column type in Delta tables — unstructured data enters SQL queries through READ_FILES(), a table-valued function that creates transient DataFrames.


Hot/Cold Tiering

A data management strategy that places frequently accessed ("hot") data on fast, expensive storage (local SSD, high-performance object storage tiers) and infrequently accessed ("cold") data on slower, cheaper storage (archival tiers like S3 Glacier, S3 Infrequent Access, Azure Cool/Archive).

Most data follows a sharp recency curve: the most recent data is queried orders of magnitude more frequently than historical data. In a typical enterprise table, last week's data might receive 80% of queries while data older than 90 days receives less than 5%. Without tiering, all data sits on the same storage tier — paying premium prices for cold data that almost no one accesses.

Cloud data warehouses handle tiering implicitly. Snowflake stores all active data in the same tier and relies on its query engine's caching hierarchy (local SSD cache, remote disk cache) for performance stratification. Databricks relies on Delta Lake's data skipping and caching to avoid reading cold data during hot queries. Neither provides explicit user-controlled tiering within the lakehouse — though Iceberg's metadata layer could theoretically support it.

For AI-native infrastructure, tiering becomes more complex because AI workloads access data differently. A model training run might need the full historical dataset (accessing cold data for extended periods), while an agent serving predictions needs only the latest features (accessing only hot data). Workload-adaptive tiering — where the system observes access patterns across heterogeneous workloads and automatically moves data between tiers — is a capability of the proposed Autonomous DataOps architecture.


Immutable Storage

A storage design where data, once written, is never modified in place. Updates and deletes are handled by writing new records (or new versions of files) and tracking which records are current through metadata.

Immutability is the foundational principle of modern data lake storage. Parquet files are written once and never modified. Iceberg and Delta Lake implement updates by writing new data files and recording the old files as removed in the transaction log — the old files are eventually garbage collected, but they are never overwritten. Snowflake's micro-partitions are immutable: an UPDATE creates new micro-partitions containing the modified data and marks the old partitions as obsolete.
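The mechanics can be sketched in a few lines of Python — toy data structures, not any real format's layout. An update never touches an existing "file"; it writes a replacement and records the swap in an append-only log:

```python
# Toy immutable file store: files are never modified in place.
files = {"f1": [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]}
log = [{"add": ["f1"], "remove": []}]

def live_files(log):
    """Replay the log to find the currently visible files."""
    live = set()
    for entry in log:
        live |= set(entry["add"])
        live -= set(entry["remove"])
    return live

def update(file_store, log, target_id, new_value):
    for name in sorted(live_files(log)):
        rows = file_store[name]
        if any(r["id"] == target_id for r in rows):
            # Write a NEW file with the modified rows; old file is untouched.
            new_name = name + ".v2"
            file_store[new_name] = [
                {**r, "v": new_value} if r["id"] == target_id else dict(r)
                for r in rows
            ]
            log.append({"add": [new_name], "remove": [name]})
            return

update(files, log, 2, "B")
print(live_files(log))      # {'f1.v2'}
print(files["f1"][1]["v"])  # 'b' -- the old file survives unmodified
```

Because the old file is retained, reading a historical snapshot is just replaying a prefix of the log — this is the mechanism behind time travel.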

The benefits of immutable storage are significant. Concurrency is simplified: readers never compete with writers for access to the same physical data (no locks required). Time travel is natural: since old versions are retained, reading a historical snapshot is just reading the data files that were active at that point. Failure recovery is straightforward: a failed write leaves the old data intact; only a metadata pointer update commits the change. Auditing is inherent: the sequence of immutable writes creates a complete change history.

The cost is write amplification: updating one record in a 100 MB Parquet file requires rewriting the entire file (or at minimum, the row group containing the record). For write-heavy workloads, this amplification can be substantial. HTAP systems address this with delta stores (row-oriented buffers for recent writes that are periodically merged into the immutable columnar main store) — a compromise between pure immutability and efficient mutation.


Liquid Clustering

A Databricks Delta Lake feature that automatically and incrementally redistributes rows across data files based on specified clustering columns to improve query performance. Liquid Clustering replaced the earlier Z-ordering approach, which required periodic manual OPTIMIZE ZORDER BY commands that rewrote entire tables.

When queries filter on clustering columns, the query engine uses file-level min/max statistics to determine which files might contain matching rows (data skipping). Liquid Clustering improves data skipping effectiveness by reducing the value range overlap between files. In an unclustered table, a filter on date = '2025-03-15' might match 80% of files (because each file contains a wide date range). After clustering on date, the same filter might match only 2% of files.

The "liquid" aspect means clustering is incremental and adaptive. New data is clustered during ingestion and compaction; the system doesn't need to rewrite the entire table. Clustering columns can be changed without rebuilding — the new clustering takes effect on new writes and gradually propagates during background compaction. This is a significant improvement over Z-ordering, which was a full-table rewrite operation.

Liquid Clustering is analogous in purpose to Snowflake's automatic micro-partition clustering (which is transparent and requires no user configuration) and Iceberg's sort orders (which are declarative but require periodic compaction). The difference is in control: Snowflake fully automates clustering, Databricks requires the user to specify clustering columns but automates the ongoing maintenance, and Iceberg requires explicit compaction jobs.


Micro-Partition

Snowflake's fundamental storage unit. Data in Snowflake tables is automatically divided into immutable, compressed, columnar segments of 50–500 MB (uncompressed), typically containing 50,000 to 500,000 rows. Users have no direct control over micro-partition boundaries — Snowflake determines how to distribute rows across partitions based on ingestion order and, optionally, clustering keys.

Each micro-partition maintains a comprehensive metadata profile: min/max values per column, null counts, distinct value counts, Bloom filter membership for high-cardinality columns, and the total number of rows. This metadata enables partition pruning — the single most impactful optimization in Snowflake's query engine. When a query includes WHERE date BETWEEN '2025-01-01' AND '2025-03-31', the engine evaluates this predicate against each micro-partition's date column metadata. Partitions whose date range doesn't overlap the filter are skipped entirely — often eliminating 90–99% of data without reading a single byte of actual content.
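A toy version of this pruning logic in Python — partition names and ranges are invented, and ISO date strings compare correctly as plain text:

```python
# Per-partition min/max metadata for a date column (illustrative values).
partitions = [
    {"name": "p1", "min": "2024-10-01", "max": "2024-12-31"},
    {"name": "p2", "min": "2025-01-01", "max": "2025-02-15"},
    {"name": "p3", "min": "2025-02-10", "max": "2025-04-30"},
]

def prune(parts, lo, hi):
    """Keep only partitions whose [min, max] range overlaps [lo, hi]."""
    return [p["name"] for p in parts if p["max"] >= lo and p["min"] <= hi]

# WHERE date BETWEEN '2025-01-01' AND '2025-03-31'
print(prune(partitions, "2025-01-01", "2025-03-31"))  # ['p2', 'p3']
```

p1 is eliminated without reading a byte of its content — only its metadata was consulted. At warehouse scale this same check runs against thousands of partitions per query.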

Micro-partitions are automatically created during data loading and never modified after creation (immutable storage). Updates and deletes write new micro-partitions and mark old ones as obsolete. Snowflake's automatic clustering service periodically reorganizes micro-partitions to improve the alignment between natural query patterns and partition boundaries.

The automatic management is both a strength and a constraint. Users don't manage partition sizes, layouts, or statistics — eliminating a major operational burden. But the system's clustering and organization decisions are optimized for BI query patterns: range filters, equality predicates, aggregations. The multi-modal, unpredictable access patterns generated by AI agents — combining point lookups, range scans, similarity searches, and cross-modal joins — fall outside the optimization space Snowflake's automatic clustering targets.


Object Storage

Cloud-based storage services (Amazon S3, Google Cloud Storage, Azure Data Lake Storage) that store data as discrete objects — files with associated metadata — in a flat namespace. Object storage is the physical foundation of every modern data lakehouse.

Unlike block storage (fixed-size blocks on disk, used by traditional databases) or file storage (hierarchical directory trees, used by POSIX filesystems), object storage uses a flat key-value model where each object has a unique key, binary content, and user-defined metadata. "Directory" paths (/data/raw/events/2025/03/) are an illusion — they're just key prefixes. This flat model enables massive horizontal scaling (S3 reportedly serves trillions of objects) and extreme durability (99.999999999% / "eleven nines" for S3 Standard).

Pricing operates on two dimensions: storage (typically $0.023/GB/month for S3 Standard, $0.004/GB/month for Glacier) and operations (per-request charges for PUT, GET, LIST operations). This pricing model fundamentally shapes data architecture: storing petabytes is cheap, but accessing data frequently or listing millions of files is expensive. This is why table formats (Iceberg, Delta Lake) maintain their own file manifests rather than relying on object storage LIST operations, which are both slow and costly at scale.
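A back-of-envelope model using the prices quoted above (actual rates vary by region and tier; the LIST page size of 1,000 keys matches S3's ListObjectsV2, the rest is arithmetic):

```python
STORAGE_STD = 0.023      # $/GB/month, S3 Standard (quoted above)
STORAGE_GLACIER = 0.004  # $/GB/month, Glacier-class

pb_in_gb = 1_000_000     # 1 PB in GB (decimal)
print(f"1 PB on S3 Standard: ${pb_in_gb * STORAGE_STD:,.0f}/month")     # $23,000
print(f"1 PB on Glacier:     ${pb_in_gb * STORAGE_GLACIER:,.0f}/month") # $4,000

# A LIST request returns at most 1,000 keys, so enumerating 10 million
# files costs 10,000 requests per listing -- this per-request cost is why
# table formats keep their own manifests instead of LISTing object storage.
lists_needed = 10_000_000 // 1_000
print(lists_needed)  # 10000
```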

The trade-off vs. local storage is latency. Reading from local SSD takes ~0.1 ms; reading from S3 takes ~10–100 ms (first-byte latency). This 100–1000x latency gap makes metadata-driven data skipping and multi-level caching essential for acceptable query performance on disaggregated architectures.


Open Table Format

A category of metadata standards — Apache Iceberg, Delta Lake, and Apache Hudi are the three major implementations — that add database-like capabilities (ACID transactions, schema evolution, time travel, partition management) to data files stored in open formats (Parquet, ORC) on object storage. Open table formats are the enabling technology of the lakehouse architecture.

Before open table formats, using Parquet files on S3 as an analytical table required external systems (Hive Metastore) to track partition locations, offered no transactional guarantees (concurrent writes could corrupt data), provided no schema enforcement (schema was defined per-reader, not per-table), and had no time travel capability. Open table formats solved these problems by introducing a metadata layer that sits between the compute engine and the data files.

The three formats differ in design priorities. Iceberg prioritizes engine independence, hidden partitioning, and rich metadata for query planning — it's the most "standards-oriented" format with the broadest engine support. Delta Lake prioritizes tight integration with Spark and the Databricks ecosystem, pioneered transaction log-based ACID, and leads in features like Liquid Clustering and UniForm interoperability. Hudi prioritizes high-throughput ingestion, record-level upserts, and incremental consumption — strongest for near-real-time CDC workloads.

The market is converging: Databricks supports Iceberg reads via UniForm, Snowflake added native Iceberg support, and all three formats use Parquet as their primary underlying data format. The competition is increasingly about ecosystem and compute engine optimization rather than format-level differences.


Row-Oriented Storage

A data storage layout where all values from a single row (record) are stored contiguously on disk. When the engine reads a row, all columns come together in a single sequential I/O operation.

Traditional relational databases — PostgreSQL, MySQL, Oracle, SQL Server (OLTP instances) — use row-oriented storage because transactional workloads are row-centric: read a customer record, insert an order, update a status. These operations touch all (or most) columns of a small number of rows. Row-oriented storage is optimal for this pattern: fetching all columns of WHERE id = 12345 is one sequential disk read, and inserting a new row is one sequential write.

Row storage performs poorly for analytical queries that scan many rows but need few columns. A query SELECT AVG(price) FROM orders reads and discards every column except price from every row — wasting 90%+ of I/O bandwidth on data the query doesn't need. This inefficiency grows linearly with the number of columns per row.
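The arithmetic behind that claim, sketched in Python (row width and value size are illustrative):

```python
# SELECT AVG(price) over a 20-column table of fixed-width values.
n_rows, n_cols, bytes_per_value = 1_000_000, 20, 8

row_scan_bytes = n_rows * n_cols * bytes_per_value  # row layout: all columns
col_scan_bytes = n_rows * 1 * bytes_per_value       # columnar: price only

print(row_scan_bytes // col_scan_bytes)             # 20x more I/O for rows
wasted = 1 - col_scan_bytes / row_scan_bytes
print(f"{wasted:.0%} of row-store I/O is discarded")  # 95%
```

The ratio is simply the column count: a 100-column table wastes 99% of its row-store scan bandwidth on this query.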

The row vs. column storage trade-off is the physical origin of the OLTP/OLAP divide. Transactional systems (row storage) and analytical systems (columnar storage) exist as separate products because no single physical layout is optimal for both. HTAP databases attempt to bridge this by maintaining synchronized row and column stores — but the underlying physical constraint remains. This trade-off also explains why the HTAP survey identifies the freshness-isolation problem as fundamental: synchronizing between two storage layouts inevitably introduces either staleness (async replication) or interference (shared resources).


Schema Evolution

The ability to change a table's schema — adding columns, dropping columns, renaming columns, changing data types — without rewriting the underlying data files or interrupting active workloads. Schema evolution is a core feature of all three open table formats (Iceberg, Delta Lake, Hudi) and of cloud data warehouses.

In the era before open table formats, schema changes to data lake tables were disruptive operations. Adding a column to a Hive table might require rewriting all Parquet files (to include the new column in every file's metadata), or accepting that old files had a different schema than new files (leading to errors or null-filled results depending on the reader).

Iceberg handles schema evolution through metadata: the schema is stored in the metadata file, and each data file records which schema version it was written with. Old files with the old schema are read by mapping old column IDs to current schema positions — no data rewriting needed. Delta Lake handles schema evolution through the transaction log: ALTER TABLE ADD COLUMN adds a log entry, and existing Parquet files are read with the new column filled as null. Both approaches achieve the same result (non-disruptive schema changes) through different mechanisms.
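A toy sketch of the Delta-style read-time reconciliation — rows as Python dicts, column names invented:

```python
# Current table schema after ALTER TABLE ADD COLUMN channel.
table_schema = ["id", "amount", "channel"]

old_file_rows = [{"id": 1, "amount": 10.0}]                     # pre-ALTER file
new_file_rows = [{"id": 2, "amount": 5.0, "channel": "web"}]    # post-ALTER file

def read_with_schema(rows, schema):
    """Project each row onto the current schema, null-filling missing columns."""
    return [{col: row.get(col) for col in schema} for row in rows]

for row in read_with_schema(old_file_rows + new_file_rows, table_schema):
    print(row)
# {'id': 1, 'amount': 10.0, 'channel': None}
# {'id': 2, 'amount': 5.0, 'channel': 'web'}
```

The old file was never rewritten; the missing column is synthesized at read time.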

For AI workloads, schema evolution is critical because data shapes change frequently. Feature engineering adds new columns. Model iterations introduce new data types. Agent interactions generate new output fields. A storage layer that requires downtime or full rewrites for schema changes creates an unacceptable bottleneck in the rapid experimentation cycles AI development demands.


Schema-on-Read vs. Schema-on-Write

Two opposing philosophies for when and where data structure is enforced, representing the fundamental design tension between data lakes (schema-on-read) and data warehouses (schema-on-write).

Schema-on-write enforces structure at ingestion time. Data must conform to a predefined schema before it enters the system. If a record doesn't match the schema (wrong type, missing required column, constraint violation), it's rejected. Data warehouses use schema-on-write: you CREATE TABLE with a schema, and every INSERT must match. This guarantees data quality and consistency but reduces flexibility — you must know the schema before you have the data, and changing the schema is disruptive.

Schema-on-read applies structure at query time. Data is stored in its raw form (JSON, CSV, log files, documents), and the schema is interpreted by the reader when the data is queried. Data lakes traditionally use schema-on-read: dump anything into S3, then let the query engine figure out the structure. This maximizes flexibility — any data can be stored immediately, and multiple interpretations of the same data are possible — but provides no quality guarantees. A downstream query might fail because the data doesn't have the expected structure, and the failure happens at read time rather than write time.
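A contrast sketch in Python (toy validation, invented column names) showing where the failure surfaces under each philosophy:

```python
SCHEMA = {"id": int, "price": float}

def write_validated(store, record):
    """Schema-on-write: reject non-conforming records before they land."""
    for col, typ in SCHEMA.items():
        if not isinstance(record.get(col), typ):
            raise ValueError(f"rejected: {col} must be {typ.__name__}")
    store.append(record)

def read_interpreted(raw_lines):
    """Schema-on-read: raw CSV-ish strings, structure imposed per query."""
    for line in raw_lines:
        id_, price = line.split(",")
        yield {"id": int(id_), "price": float(price)}  # may blow up here

warehouse = []
write_validated(warehouse, {"id": 1, "price": 9.5})      # accepted
try:
    write_validated(warehouse, {"id": "x", "price": 9.5})
except ValueError as e:
    print(e)  # failure surfaces at WRITE time

lake = ["1,9.5", "oops,not-a-price"]
rows = read_interpreted(lake)
print(next(rows))  # fine
# next(rows) would raise ValueError here -- failure surfaces at READ time
```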

The lakehouse architecture attempts to occupy the middle ground: schema is enforced at write time (by the table format's schema enforcement) but can evolve without rewriting data (via schema evolution). This provides the safety of schema-on-write with much of the flexibility of schema-on-read. However, schema enforcement in lakehouses applies only to tabular structure — it doesn't validate semantic correctness, data quality, or content integrity, which is the deeper challenge for AI workloads.


Snapshot Isolation

A concurrency control mechanism where each transaction reads a consistent "snapshot" of the database as it existed at the time the transaction began — regardless of changes made by concurrent transactions. Snapshot isolation provides the illusion that each transaction runs alone, without the performance penalty of serializable isolation (full locking).

In practical terms: if Transaction A starts at time T1 and reads a table, it sees exactly the data that existed at T1 — even if Transaction B commits changes to that table at T2 (where T2 > T1, while A is still running). A never sees B's changes until A starts a new transaction.
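A minimal multi-version sketch of this behavior (timestamps and values invented): each committed version of a value carries a commit timestamp, and a transaction reads the newest version committed at or before its own start time.

```python
# key -> list of (commit_ts, value), in commit order.
versions = {
    "balance": [(5, 100), (20, 70)],  # committed at T5, then updated at T20
}

def snapshot_read(versions, key, start_ts):
    """Return the newest value committed at or before start_ts."""
    visible = [(ts, v) for ts, v in versions[key] if ts <= start_ts]
    return max(visible)[1] if visible else None

# A transaction that started at T10 sees 100, even after T20's commit lands.
print(snapshot_read(versions, "balance", start_ts=10))  # 100
print(snapshot_read(versions, "balance", start_ts=25))  # 70
```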

Snapshot isolation is the default concurrency model for all major analytical platforms. Snowflake provides snapshot isolation through micro-partition versioning: each query sees a consistent set of micro-partitions determined by the transaction's start time. Iceberg provides snapshot isolation through its snapshot-based metadata hierarchy: each query resolves to a specific snapshot, which points to a fixed set of manifest lists, manifests, and data files. Delta Lake provides snapshot isolation through its transaction log: each query reads a specific log version that determines the visible set of Parquet files.

The limitation: snapshot isolation doesn't prevent all concurrency anomalies. Write skew — where two concurrent transactions each read data, make decisions, and write results that together violate a constraint neither individually violated — is possible under snapshot isolation but not under serializable isolation. For BI workloads (overwhelmingly read-only), this rarely matters. For AI workloads where multiple agents read and write concurrently based on data-driven reasoning, write skew can produce logically inconsistent states.


Transaction Log

A persistent, ordered record of every write operation performed on a table or database, enabling ACID transactions, crash recovery, time travel, and conflict detection. Transaction logs are the mechanism by which open table formats bring database-like guarantees to object storage.

Delta Lake's transaction log (_delta_log/) stores one JSON file per commit, each recording which data files were added and which were removed. Commit files are named by zero-padded commit number (00000000000000000000.json, 00000000000000000001.json, ...), and periodic checkpoint files (in Parquet format) consolidate the log for faster reads. To determine the current state of a table, a reader finds the latest checkpoint and applies the subsequent JSON commits. Conflict detection works by checking whether concurrent commits touched overlapping data.
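Log replay can be sketched in a few lines of Python — commit contents reduced to add/remove lists, file names invented:

```python
# A simplified commit log: each commit adds and/or removes data files.
commits = [
    {"add": ["part-000.parquet", "part-001.parquet"], "remove": []},
    {"add": ["part-002.parquet"], "remove": []},
    {"add": ["part-003.parquet"], "remove": ["part-001.parquet"]},  # rewrite
]

def current_files(log):
    """Replay commits in order to compute the live file set."""
    live = []
    for commit in log:
        live = [f for f in live if f not in commit["remove"]]
        live += commit["add"]
    return live

print(current_files(commits))
# ['part-000.parquet', 'part-002.parquet', 'part-003.parquet']

# Time travel: replaying only a prefix of the log reconstructs an old version.
print(current_files(commits[:2]))
# ['part-000.parquet', 'part-001.parquet', 'part-002.parquet']
```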

Iceberg's metadata serves the same role through a different structure. Instead of a sequential log, Iceberg maintains a metadata file (JSON) that points to the current snapshot. Each snapshot references a manifest list, which references manifests, which reference data files. A new commit atomically swaps the metadata file pointer — via an atomic rename on HDFS or, on object storage, an atomic compare-and-swap performed through the catalog (Hive Metastore, Nessie, or a REST catalog). The metadata chain provides time travel by following historical snapshot references.

Apache Hudi's timeline records every action (commits, compactions, cleanings, rollbacks) as files in a .hoodie/ metadata directory. Each action has a timestamp, action type, and state (requested, inflight, completed). The timeline enables incremental consumption: a reader can request "all changes since timestamp X" — a capability neither Delta nor Iceberg provides natively with equivalent efficiency.

For AI-native infrastructure, the transaction log is the foundation of data versioning. Every proposed extension — branching, merging, ML-driven quality gates — builds on the ability to record, replay, and reason about the history of changes to a data estate.


Unity Catalog Volumes

Databricks' mechanism for managing unstructured and non-tabular files (PDFs, images, videos, audio, model artifacts) within the Unity Catalog governance framework. Volumes provide governed file storage with a FUSE filesystem interface — meaning files can be accessed through standard filesystem paths (/Volumes/catalog/schema/volume_name/file.pdf) from notebooks, jobs, and ML workloads.

Volumes exist in two types. Managed Volumes store files in Databricks-managed cloud storage, with full lifecycle management (deletion of the volume deletes the files). External Volumes reference files in existing cloud storage locations, providing governance and access control without moving data.

Volumes serve a similar role to Snowflake's stages (external storage locations accessible from the platform) but with tighter governance integration: files inherit Unity Catalog's access controls, audit logging, and lineage tracking. However, there is no FILE-equivalent column type in Delta tables. Unstructured data in Volumes enters SQL queries through READ_FILES(), a table-valued function that creates transient DataFrames, or through ai_query() which can process file content via AI models.

The architectural gap is clear when comparing approaches: Snowflake treats files as first-class column values (FILE type) but with limited operations; Databricks treats files as governed filesystem objects but without native table integration. Neither achieves true multi-modal storage where structured rows and unstructured files participate equally in queries, joins, aggregations, and governance.


Z-Ordering

A multi-dimensional data organization technique that interleaves the bits of multiple column values to create a single sort key, enabling data skipping across multiple filter dimensions simultaneously. Z-ordering (also called Morton ordering) was the primary data organization strategy in Delta Lake before Liquid Clustering replaced it.

The problem Z-ordering solves: traditional sorting on a single column (e.g., date) optimizes data skipping for queries filtered on that column, but doesn't help for queries filtered on other columns (e.g., region or customer_id). Z-ordering interleaves bits from multiple columns to create a space-filling curve that preserves locality across all specified dimensions. The result is moderate data skipping improvement across all Z-ordered columns, rather than optimal skipping on one column and no skipping on others.
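A minimal Morton encoder for two columns, as a concrete illustration of the bit interleaving:

```python
def morton2(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single Z-value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # even bit positions carry x
        z |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions carry y
    return z

# Sorting rows by morton2(col_a, col_b) keeps rows that are close in
# BOTH dimensions physically close on disk, tightening per-file min/max
# ranges for both columns at once.
print(morton2(3, 5))  # 39  (0b100111: y bits 101 and x bits 011 interleaved)
print(morton2(2, 4))  # 36
```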

In practice, Z-ordering required manual OPTIMIZE ZORDER BY (col1, col2, col3) commands that rewrote entire tables — an expensive batch operation that could take hours on large tables and interfere with ongoing workloads. Liquid Clustering replaced Z-ordering by providing incremental, background multi-column data organization that achieves similar (often better) data skipping without the manual maintenance burden.

Z-ordering remains historically significant as the first widely adopted technique for multi-dimensional data organization in data lakes, and the concept (interleaved multi-dimensional sorting for improved data locality) appears in various forms across storage systems, spatial databases, and even cache line optimization.


Zero-Copy Cloning

A technique — pioneered in production by Snowflake — that creates a logical copy of a table, schema, or entire database by duplicating only the metadata pointers, not the underlying data. The clone initially shares 100% of physical storage with the source.

CREATE TABLE sales_staging CLONE sales;
-- Instantaneous, regardless of table size
-- Zero additional storage cost until modifications

As either the source or clone is modified, only the affected micro-partitions are independently stored. After modifying 5% of rows in the clone, the clone shares 95% of its storage with the source and independently stores only the changed 5%. This is copy-on-write semantics applied at the platform level.
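The copy-on-write bookkeeping can be sketched in Python — toy pointer lists, not Snowflake's actual metadata:

```python
# Physical micro-partitions, shared by all tables that reference them.
storage = {"mp1": [1, 2], "mp2": [3, 4]}
source = ["mp1", "mp2"]        # a table is an ordered list of pointers

clone = list(source)           # CLONE: instantaneous, metadata copy only

# UPDATE one row in the clone: write a new micro-partition, repoint.
storage["mp3"] = [3, 99]
clone[clone.index("mp2")] = "mp3"

shared = set(source) & set(clone)
print(sorted(shared))  # ['mp1'] -- still physically shared with the source
print(source)          # ['mp1', 'mp2'] -- the source is unaffected
```

Only the modified partition was duplicated; everything else remains a shared physical object, which is why clones cost near-zero storage until they diverge.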

Zero-Copy Cloning enables workflows that would otherwise be prohibitively expensive: creating complete production replicas for development, testing schema changes against real data, generating training dataset snapshots for ML experiments, or creating isolated environments for different teams — all in seconds at near-zero cost.

However, Zero-Copy Cloning is not data versioning in the Git sense. There's no diff between the clone and source. No merge operation to reconcile changes. No branching history tracking how the clone evolved. It's a single point-in-time copy with independent forward evolution. For the experimentation workflows AI development demands — branch, experiment, evaluate, merge if better, discard if worse — full data versioning systems (lakeFS, Nessie) are required.


Part of The Sciencer Wiki. Maintained by The Sciencer Company.