Data Versioning & Branching
Code has Git. Data has… almost nothing comparable. This is not for lack of trying — it's because data versioning is fundamentally harder than code versioning. A source code repository measures in megabytes; a data lake measures in petabytes. A code diff operates on text lines; a data diff operates on billions of typed records across hundreds of columns. A code merge resolves conflicting text edits; a data merge must evaluate whether combining two statistical distributions preserves analytical validity.
Yet version control for data is one of the most consequential gaps in modern data infrastructure. The ability to branch, experiment, diff, merge, and roll back data — with automated quality gates at every merge point — is essential for reliable AI development. Without it, every data change is a one-way door: irreversible, untestable in isolation, and impossible to reproduce.
Current tools offer fragments of this vision. Time travel provides read-only access to historical states. Write-Audit-Publish provides single-table staging. lakeFS provides lake-wide branching with file-level granularity. None delivers the complete picture — row-level versioning, cross-table branching, ML-driven merge gates, and integrated governance — that production AI workloads demand.
Append-Only Commit Log
A storage pattern where every write operation is recorded as a new entry in a sequential log, and existing entries are never modified or deleted. The current state of the data is the result of replaying all log entries from the beginning — or, in practice, reading the latest snapshot plus any subsequent log entries.
Delta Lake uses an append-only JSON-based transaction log (_delta_log/) where each commit appends a new JSON file (numbered sequentially: 000000.json, 000001.json, ...) recording which data files were added or removed and what metadata changed. Every 10 commits, a checkpoint file consolidates the cumulative state so readers don't have to replay the full history. Apache Iceberg takes a similar approach: each commit produces a new metadata file pointing to updated manifest lists and manifest files that track data file additions and removals. Apache Hudi records commits on a timeline, with instant metadata capturing each operation's type (commit, delta_commit, compaction, clean).
The append-only property enables three critical capabilities. Time travel: read the state at any historical commit by replaying log entries up to that point. Audit trails: every change is recorded with timestamp, operation type, and (in some implementations) user identity and operation parameters. Concurrent writes: new commits can be validated against the log using optimistic concurrency control — checking for conflicts only at commit time rather than acquiring locks upfront.
In the AI-native infrastructure proposal, the append-only commit log extends to lake-wide scope — every write to any table creates a versioned commit, all commits are retained (with configurable compaction), and the log becomes the single source of truth for the data estate's complete history. This enables lake-wide time travel, cross-table rollback, and the foundation for branching operations.
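The replay mechanics can be sketched in a few lines of Python. This is a toy model, not Delta Lake's actual on-disk format: each commit is a dict of add/remove actions, and any historical state is recovered by replaying commits up to that version.

```python
# Toy append-only commit log: each commit records file-level actions,
# and table state is the result of replaying them in order.
# Illustrative sketch only -- not Delta Lake's actual implementation.

commits = [
    {"version": 0, "add": ["part-000.parquet"], "remove": []},
    {"version": 1, "add": ["part-001.parquet"], "remove": []},
    {"version": 2, "add": ["part-002.parquet"], "remove": ["part-000.parquet"]},
]

def state_at(log, version):
    """Replay log entries up to `version` (inclusive) -- this is time travel."""
    files = set()
    for commit in log:
        if commit["version"] > version:
            break
        files |= set(commit["add"])
        files -= set(commit["remove"])
    return files

current = state_at(commits, 2)      # {'part-001.parquet', 'part-002.parquet'}
historical = state_at(commits, 0)   # {'part-000.parquet'}
```

In practice, the checkpoint optimization means readers load the latest consolidated snapshot and replay only the commits after it, rather than scanning from version 0.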
Cherry-Pick
Applying a specific commit from one branch to another without merging the entire branch — selecting individual changes rather than incorporating all divergent history. Analogous to git cherry-pick, which applies a single commit's changes to the current branch.
In data versioning, cherry-picking is useful when a branch contains multiple independent changes and only some should be promoted to production. For example, a data engineering branch might contain both a schema migration (ready for production) and an experimental transformation (still being validated). Cherry-picking allows promoting the schema migration while leaving the experimental transformation on the branch.
lakeFS supports cherry-pick operations at the commit level. Iceberg's WAP workflow approximates cherry-picking at the table level — specific staged snapshots can be promoted individually. However, data cherry-picking is more complex than code cherry-picking because data changes have statistical properties: cherry-picking a subset of row additions changes the distribution of the resulting dataset, potentially invalidating downstream model assumptions.
No current tool evaluates the statistical implications of cherry-picking data changes — this is a quality assessment that requires the ML-driven merge framework applied at the individual commit level rather than the branch level.
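The statistical risk can be made concrete with a toy example. Here each commit is modeled as a batch of added rows (a simplification; the region values and commit contents are invented for illustration): promoting only one of a branch's two commits changes the regional mix that downstream models see.

```python
from collections import Counter

# Toy illustration of why cherry-picking data is statistically risky.
# A branch added rows in two commits; promoting only one of them skews
# the regional mix relative to merging the whole branch.

base = ["EU"] * 50 + ["US"] * 50
commit_a = ["EU"] * 30       # say, a backfill that is ready for production
commit_b = ["APAC"] * 40     # an experimental new-region load, still unvalidated

def region_share(rows, region):
    """Fraction of rows belonging to the given region."""
    return Counter(rows)[region] / len(rows)

full_merge = base + commit_a + commit_b
cherry_pick = base + commit_a   # commit_b left behind on the branch

# The APAC share differs between the two outcomes: the full merge contains
# APAC rows, the cherry-picked result contains none.
```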
Commit
A discrete, atomic unit of change to a versioned dataset — a snapshot of the data estate's state at a specific point in time, with metadata recording what changed, when, and (ideally) by whom and why.
In Delta Lake, a commit is a JSON file in _delta_log/ recording file-level actions: add (new data files), remove (deleted data files), metaData (schema or property changes), and protocol (format version changes). Each commit has a monotonically increasing version number.
In Apache Iceberg, a commit creates a new snapshot entry in the metadata file, pointing to updated manifest lists. Snapshots are immutable and form a linked chain — each snapshot references its parent, creating a version history.
In lakeFS, a commit creates a point-in-time snapshot of the entire lake at branch scope — all tables, all files, all metadata. Commits have SHA-256 identifiers, commit messages, timestamps, and user metadata. The commit is the atomic unit of branching and merging: you branch from a commit, you merge commits, you diff commits.
For AI workloads, the commit granularity matters: if commits are too coarse (daily full-table snapshots), you can't isolate the specific change that caused a model degradation. If commits are too fine (every individual row write), the commit log becomes unwieldy. The right granularity depends on the workflow — typically, one commit per logical data operation (a pipeline run, an ingestion batch, a transformation step).
Conflict Resolution
The process of reconciling incompatible changes when merging two branches that have both modified the same data. Conflict resolution is the hardest problem in data versioning — and the reason "Git for data" remains aspirational.
In code, conflicts are textual: two developers edited the same lines of a file. Resolution is manual — a human reads both changes and decides what the merged version should look like. In data, conflicts operate at multiple levels:
File-level conflicts: the same Parquet file was modified on both branches. lakeFS detects these and requires programmatic or manual resolution — choosing one version, the other, or a custom merge function. This is the only level most tools support.
Row-level conflicts: the same row (identified by primary key) was modified differently on each branch. Resolving row-level conflicts requires comparing column values and applying merge rules — last-writer-wins, source-priority, custom business logic, or human review. No production tool automates this for data at scale.
Semantic conflicts: both branches made non-overlapping changes that are individually valid but produce invalid combined results. For example, Branch A adds customer records from a new region, and Branch B recalibrates a pricing model assuming the existing regional distribution. Merging both creates a dataset where the pricing model is incorrectly calibrated for the expanded customer base. These conflicts are invisible at the file or row level — detecting them requires understanding the data's statistical and business semantics.
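As a sketch of what automated row-level resolution might look like, under the loud assumptions that rows fit in memory, are keyed by a primary key, and carry a modification timestamp, a last-writer-wins rule can be expressed directly. The `_modified_at` field is a hypothetical convention, not a feature of any tool named above.

```python
# Hypothetical row-level conflict resolution: both branches modified the
# same primary key, and a last-writer-wins rule picks the newer version.
# Real systems would also need source-priority and business-logic rules.

def resolve_last_writer_wins(row_a, row_b):
    """Each row carries an ISO-8601 timestamp under '_modified_at';
    such timestamps compare correctly as plain strings."""
    return row_a if row_a["_modified_at"] >= row_b["_modified_at"] else row_b

branch_a = {"id": 7, "email": "a@example.com", "_modified_at": "2025-06-01T10:00:00"}
branch_b = {"id": 7, "email": "b@example.com", "_modified_at": "2025-06-02T09:00:00"}

winner = resolve_last_writer_wins(branch_a, branch_b)   # branch_b is newer
```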
The ML-driven merge framework addresses the semantic conflict problem by evaluating branch quality holistically — running distribution analysis, anomaly detection, and downstream model evaluation on the proposed merge result before committing it.
Data Branching
Creating a divergent copy of a dataset that can be modified independently without affecting the original — analogous to creating a Git branch for code. The branch starts as an identical copy of the main line and evolves independently as changes are made. When the branch's changes are validated, they can be merged back.
Data branching operates at different scopes with significant trade-offs:
Table-level branching (Iceberg WAP): changes are written to a named branch of a single table and only become visible on the main branch after explicit promotion. Scope is narrow: each table is branched independently, so cross-table consistency (referential integrity across a fact table and its dimensions, for example) is not guaranteed across branch operations.
Catalog-level branching (Nessie): branches the metadata of multiple Iceberg tables together, ensuring cross-table consistency for the branched tables. Scope is the set of tables managed by the Nessie catalog. Provides ACID guarantees for multi-table branch operations.
Lake-wide branching (lakeFS): a single branch operation creates a consistent point-in-time copy of the entire data lake — all tables, all files, all directories. Changes to any object on the branch are isolated from main. Uses copy-on-write at the object storage level: the branch initially shares all data with main (zero copy cost), and only objects that are modified on the branch incur additional storage.
Database-level branching (Lakebase/Neon): Databricks Lakebase (GA February 2026), built on Neon's PostgreSQL technology, offers database-level branching — but these are OLTP database branches (PostgreSQL forks), not lakehouse table branches. The branching mechanism uses copy-on-write at the storage page level, optimized for transactional workloads rather than analytical data versioning.
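The zero-copy mechanism behind lake-wide branching can be sketched as pure metadata manipulation, which is why branch creation is instant regardless of lake size. This is a toy model, not lakeFS's actual internals:

```python
# Toy copy-on-write branch: a "branch" is just a mapping from logical path
# to physical object ID. Creating a branch copies pointers, not data;
# only writes on the branch introduce new objects.

main = {
    "tables/orders/part-0.parquet": "obj-111",
    "tables/customers/part-0.parquet": "obj-222",
}

branch = dict(main)   # zero-copy: initially shares every object with main

# A write on the branch re-points one path at a new object.
branch["tables/orders/part-0.parquet"] = "obj-333"

# main is untouched; the branch still shares 'customers' with main and
# has diverged only on 'orders'.
```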
The key question for any branching implementation: what is the isolation boundary? Table-level branching risks cross-table inconsistency. Lake-wide branching provides consistency, but at the cost of managing a logical copy of the entire lake — every branch operation now spans the whole estate. The right scope depends on the workflow — but for AI workloads that consume data across many tables, lake-wide branching is typically necessary to guarantee reproducible training environments.
Data Diffing
Identifying the specific differences between two versions of a dataset — analogous to git diff for code. A meaningful data diff answers: what rows were added, modified, or deleted? Which columns changed? What are the old and new values? How did the statistical distribution shift?
Data diffing operates at increasing levels of sophistication:
Object-level diff: identifies which files (Parquet files, JSON files) changed between two versions. lakeFS provides this: it knows which objects were added, modified, or deleted between two commits or branches. Fast and scalable — it's metadata comparison, not data comparison.
Row-level diff: identifies which individual rows changed, with old and new values for each modified column. Requires reading and comparing actual data content, not just file metadata. Computationally expensive at scale — diffing two versions of a billion-row table requires comparing every row. Can be optimized using primary keys (only compare rows with matching keys) and columnar checksums (skip columns that haven't changed).
Statistical diff: identifies how aggregate properties changed between versions — mean, median, distribution shape, cardinality, null rates, correlation structures. Useful for understanding whether a change is "safe" (the statistical properties of the data are preserved) or "dangerous" (a significant distribution shift occurred). Statistical diffing is cheaper than row-level diffing because it operates on aggregates, but it misses individual-row changes that don't affect aggregates.
Semantic diff: identifies changes in the data's meaning or relationships — a new column represents a different business concept, a categorical value's meaning shifted, or a foreign key relationship was broken. This is the hardest level of diffing because it requires understanding business semantics, not just data values.
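The statistical level can be sketched with a minimal profile comparison, assuming a single numeric column that fits in memory. A real implementation would push aggregates down to the engine and add distribution tests (KS, PSI) rather than compare in-memory lists:

```python
import statistics

# Minimal statistical diff between two versions of one numeric column.
# Illustrative sketch: compares a handful of aggregate properties.

def stat_profile(values):
    non_null = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(non_null),
        "median": statistics.median(non_null),
        "null_rate": 1 - len(non_null) / len(values),
        "cardinality": len(set(non_null)),
    }

def stat_diff(old, new):
    """Return only the aggregates that changed, as (before, after) pairs."""
    before, after = stat_profile(old), stat_profile(new)
    return {k: (before[k], after[k]) for k in before if before[k] != after[k]}

v1 = [10, 12, 11, None, 13]
v2 = [10, 12, 11, None, 13, 95]   # one new outlier shifts several aggregates

changed = stat_diff(v1, v2)
```

Note the blind spot the text describes: swapping two rows' values in a way that preserves every aggregate would produce an empty diff here, even though a row-level diff would flag both rows.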
Effective data diffing is a prerequisite for meaningful data merging: you can't merge two branches if you can't identify what changed on each. It's also essential for data review — the data equivalent of code review, where changes are examined before being merged to production.
DVC (Data Version Control)
An open-source tool that extends Git to handle large files and datasets by storing data externally (in S3, GCS, Azure Blob, or local storage) while tracking lightweight metadata files (.dvc files) in Git. DVC enables versioning of datasets, ML models, and pipeline artifacts using familiar Git workflows — dvc push, dvc pull, dvc checkout.
DVC pipelines define multi-stage data processing workflows as DAGs (in dvc.yaml), with automatic tracking of inputs, outputs, parameters, and metrics for each stage. Running dvc repro re-executes only the stages whose inputs have changed — providing Makefile-like incremental computation for data pipelines.
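A minimal dvc.yaml with two stages might look like the following; the script names, paths, and parameter keys here are illustrative, not from any particular project:

```yaml
stages:
  prepare:
    cmd: python prepare.py data/raw.csv data/clean.csv
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv model.pkl
    deps:
      - train.py
      - data/clean.csv
    params:
      - train.learning_rate
    outs:
      - model.pkl
    metrics:
      - metrics.json:
          cache: false
```

With this definition, dvc repro would re-run train only when train.py, data/clean.csv, or train.learning_rate changes — prepare is skipped if its inputs are untouched.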
lakeFS acquired DVC in November 2025, consolidating the two most prominent data versioning projects. The strategic logic: DVC excels at ML experiment versioning (tracking which dataset version + which model version + which hyperparameters produced which results) while lakeFS excels at lake-level data operations (branching, merging, pre-merge hooks). The combined platform would cover both the ML practitioner's workflow (experiment tracking) and the data engineer's workflow (production data management).
DVC's limitation is that it operates alongside the data platform rather than within it — DVC files are tracked in Git, but the actual data sits in object storage without the governance, quality monitoring, or query integration that a warehouse or lakehouse provides. Data versioned with DVC is versioned but not governed.
Fast-Forward Merge
A merge strategy where the target branch is simply updated to point to the source branch's latest commit — possible only when the target branch has had no changes since the source branch diverged. The history is linear: the source branch's commits extend directly from the target branch's last commit with no divergence.
In Git, fast-forward is the default merge strategy when the histories haven't diverged — no merge commit is created, and the branch pointer simply advances. In data versioning, Apache Iceberg supports fast-forward merge via CALL system.fast_forward('table_name', 'source_branch', 'target_branch'), promoting a WAP branch's changes to the main branch.
Fast-forward merge's simplicity is both its strength and limitation. It's atomic, deterministic, and requires no conflict resolution. But it fails whenever the target branch has received any changes since the source branch was created — which, in a production environment with multiple concurrent data producers, is common. If Branch A is created from main, main receives an ingestion update, and then Branch A tries to fast-forward merge, the merge fails because main has diverged.
When fast-forward merge fails, Iceberg provides no built-in fallback. There's no native three-way merge, no automatic conflict resolution, and no merge commit capability. The user must either rebase the branch (recreate it from the current main and reapply changes — which is complex for data) or use an external tool like lakeFS that supports three-way merge.
Immutable Snapshots
Point-in-time captures of a dataset's state that, once created, can never be modified — only new snapshots can be created. Immutability guarantees that any reference to a snapshot will always return the same data, regardless of subsequent changes.
Apache Iceberg's snapshots are the canonical implementation: each snapshot records the complete set of data files that constitute the table at that point. Snapshots are identified by unique IDs, linked in a parent chain, and are truly immutable — the metadata is append-only, and data files referenced by a snapshot are protected from deletion (until the snapshot itself is expired by retention policies).
Delta Lake's versions serve a similar function: each version number corresponds to a specific state of the transaction log, and reading VERSION AS OF 42 always returns the same data (within the retention window).
Immutable snapshots are the foundation of data versioning because they provide referential stability — you can record "this model was trained on snapshot X" and be confident that snapshot X will always contain the same data. Without immutability, recording a dataset version number is meaningless if the underlying data can be modified retroactively.
For AI reproducibility, immutable snapshots solve the training data provenance problem: every model can be linked to the exact snapshot of training data used, enabling reproduction of training runs, investigation of model behavior, and compliance with regulations (like the EU AI Act) that require documentation of training data.
lakeFS
An open-source platform providing Git-like version control for data lakes. lakeFS sits as a compatibility layer between compute engines and object storage, intercepting all reads and writes to manage branching, committing, merging, and diffing at lake-wide scope.
Architecture: lakeFS exposes an S3-compatible API. Compute engines (Spark, Trino, dbt, Airflow) access data through lakeFS URIs (lakefs://repo/branch/path) instead of direct S3 URIs. lakeFS translates these requests to the underlying object storage, managing version metadata in a PostgreSQL or DynamoDB-backed metadata store. This transparency means any tool that reads from S3 can read from lakeFS with only a URI change.
Key capabilities: Zero-copy branching — creating a branch copies only metadata pointers, not data files, making branching instant regardless of lake size. Three-way merge — using common ancestor identification for conflict detection and automatic resolution of non-conflicting changes. Pre-merge hooks — webhooks fired before a merge is committed, enabling custom validation (quality checks, compliance gates, ML evaluation) with the ability to reject the merge. Garbage collection — identifying and deleting data files no longer referenced by any branch or commit.
lakeFS acquired DVC in November 2025, consolidating the two most prominent open-source data versioning projects. The combined platform covers ML experiment versioning (DVC's strength) and production data operations (lakeFS's strength).
Limitations: lakeFS operates at the file level — it tracks Parquet files, not individual rows. A "diff" shows which files changed, not which rows within those files changed. It doesn't integrate natively with warehouse governance (Snowflake access controls, Unity Catalog policies) or quality monitoring frameworks. And the ML-driven merge vision — where pre-merge hooks invoke ML models to evaluate branch quality — is architecturally feasible but requires entirely custom engineering with no reference implementation.
ML-Driven Merge
A proposed capability where the decision to accept or reject a data merge is informed or determined by machine learning models evaluating the quality, safety, and downstream impact of the proposed merge. Instead of merging data based solely on structural compatibility (no file conflicts), ML-driven merge evaluates whether the merged dataset is statistically sound, governable, and safe for downstream consumption.
The evaluation pipeline for an ML-driven merge would assess multiple dimensions:
Schema compatibility: structural validation that the branches' schemas are compatible — column types, nullability, constraints. This is the only dimension current tools check.
Distribution consistency: statistical tests (KS test, PSI, JS divergence) comparing the merged dataset's distributions against the pre-branch baseline. A merge that introduces a significant distribution shift would trigger review or rejection.
Anomaly detection: ML-based identification of outlier records, impossible values, or unusual patterns introduced by the branch's changes.
Downstream model impact: running the merged dataset through downstream models (or a sample thereof) to evaluate whether model performance improves, degrades, or remains stable. This is the most powerful evaluation — it directly measures whether the data change helps or hurts the system's outputs.
Governance compliance: automated verification that the merged dataset maintains compliance with applicable regulations — PII handling, data residency, classification requirements.
No platform implements ML-driven merge natively. The architecture is feasible using lakeFS pre-merge webhooks: the webhook fires an HTTP request to an evaluation service, which diffs the branches, runs evaluation models, and returns accept/reject with explanations. Academic research exists for ML-driven code merge (DeepMerge, MergeGen) but not for data merge — making this a genuine research gap and a compelling product opportunity for the Autonomous DataOps thesis.
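The distribution-consistency dimension can be sketched as a standalone gate: compute the Population Stability Index between the pre-branch baseline and the proposed merge result over a categorical column, and reject above a threshold (0.2 is a commonly used cutoff). The helper below is illustrative, not a lakeFS or webhook API:

```python
import math
from collections import Counter

# Sketch of a distribution-consistency merge gate.
# PSI = sum((p_i - q_i) * ln(p_i / q_i)) over category shares, where p is
# the baseline share and q is the post-merge share of each category.

def psi(baseline, merged, eps=1e-6):
    categories = set(baseline) | set(merged)
    n_base, n_merged = sum(baseline.values()), sum(merged.values())
    score = 0.0
    for cat in categories:
        p = max(baseline.get(cat, 0) / n_base, eps)   # eps avoids log(0)
        q = max(merged.get(cat, 0) / n_merged, eps)
        score += (p - q) * math.log(p / q)
    return score

def merge_gate(baseline_rows, merged_rows, threshold=0.2):
    score = psi(Counter(baseline_rows), Counter(merged_rows))
    return {"psi": score, "accept": score < threshold}

ok = merge_gate(["EU"] * 50 + ["US"] * 50, ["EU"] * 52 + ["US"] * 48)
bad = merge_gate(["EU"] * 50 + ["US"] * 50, ["EU"] * 20 + ["US"] * 80)
```

In the architecture described above, this check would run inside the evaluation service invoked by the pre-merge webhook, alongside the anomaly, model-impact, and governance checks.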
Point-in-Time Query
A query that returns data as it existed at a specific historical moment — reading from a past version or snapshot rather than the current state. Point-in-time queries are the read-side mechanism enabled by time travel: the system retains historical versions, and queries specify which version to read.
Snowflake syntax: SELECT * FROM orders AT(TIMESTAMP => '2025-06-15 09:30:00') or SELECT * FROM orders AT(OFFSET => -3600) (one hour ago). Delta Lake syntax: SELECT * FROM orders VERSION AS OF 42 or SELECT * FROM orders TIMESTAMP AS OF '2025-06-15'. Iceberg syntax varies by engine but supports both snapshot ID and timestamp-based queries.
Point-in-time queries serve distinct use cases: Debugging (what did the data look like before the bad pipeline run?), Auditing (what data did this report use when it was generated last Tuesday?), Reproducibility (what training data produced this model?), Recovery (restore the table to yesterday's state after a bad write).
For AI workloads, point-in-time queries are essential for training data reproducibility — the ability to reconstruct the exact dataset used to train a model, even months after training. This is a regulatory requirement under the EU AI Act for high-risk AI systems and a practical necessity for debugging model behavior.
The limitation: point-in-time queries provide read access to historical states but not the ability to create new timelines from those states. You can see what the data looked like yesterday, but you can't branch from yesterday's state and evolve it independently. That requires data branching.
Project Nessie
An open-source transactional catalog for Apache Iceberg that provides Git-like branching with multi-table atomic transactions. Nessie manages Iceberg table metadata — not the data files themselves — enabling branching, tagging, and merging operations across multiple tables with ACID guarantees.
Architecture: Nessie maintains a version store (backed by DynamoDB, MongoDB, PostgreSQL, or RocksDB) that tracks the state of all managed Iceberg tables at each commit. A branch in Nessie is a named pointer to a commit, and a commit records the metadata state of all tables in the catalog at that point.
Key capabilities: Multi-table atomic commits — modify the metadata of several tables in a single atomic transaction, ensuring cross-table consistency. Branch isolation levels — Nessie supports Read Committed, Repeatable Read, and Serializable isolation for branch operations, controlling what concurrent branch readers see during modifications. Tags — named, immutable references to specific commits, useful for marking release points or training data snapshots. Merge — combining changes from a source branch into a target branch with conflict detection at the table-metadata level.
Nessie's scope is narrower than lakeFS's, but with stronger guarantees within that scope. Nessie branches table metadata (schemas, partition specs, snapshot pointers) with ACID properties, while lakeFS branches the entire lake (including non-table files) with eventual consistency semantics. Nessie doesn't manage non-table objects (raw files, model artifacts, documentation); lakeFS manages everything in object storage but doesn't provide the same transactional guarantees for metadata operations.
Dremio (creators of Nessie) and the broader Iceberg community envision Nessie as the transactional foundation for a lakehouse catalog that combines data governance with version control — the "Git for data" vision implemented at the catalog layer.
Reproducibility
The ability to recreate the exact conditions and inputs that produced a specific output — essential for debugging, auditing, regulatory compliance, and scientific validity. In data infrastructure, reproducibility means being able to answer: "Given the same data, the same code, and the same configuration, will we get the same result?"
Full ML reproducibility requires versioning across four dimensions:
Data versioning: the exact training dataset, feature values, and evaluation data used for a model training run. Addressed by time travel, snapshots, and branching — but only if the versioning captures all data sources, not just the final training table.
Code versioning: the exact transformation logic, feature engineering code, and model training code. Addressed by Git for application code and dbt for SQL transformations.
Environment versioning: the exact software dependencies, library versions, hardware configuration, and random seeds. Addressed by Docker containers, Conda environments, and infrastructure-as-code.
Configuration versioning: the exact hyperparameters, feature selections, model architecture choices, and pipeline parameters. Addressed by experiment trackers like MLflow (which logs parameters, metrics, and artifacts for each run).
The gap: no single platform versions all four dimensions together. Data versioning tools (lakeFS, DVC) don't track code or environment. Code versioning tools (Git) don't track data or environment. Experiment trackers (MLflow) log parameters and metrics but rely on external systems for data and environment versioning. Achieving full reproducibility requires manually coordinating across these systems — recording which Git commit, which data snapshot, which Docker image, and which MLflow run ID together constitute a reproducible experiment.
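The manual coordination the text describes often reduces to writing a manifest that pins all four dimensions together. The field names and identifiers below are hypothetical, not a standard:

```python
import json

# Hypothetical reproducibility manifest pinning all four dimensions.
# Nothing enforces this today -- it is the glue teams build themselves.

manifest = {
    "data":   {"tool": "lakeFS", "commit": "7a3f2b1c", "tag": "training-2025-Q3"},
    "code":   {"tool": "git", "commit": "9e107d9d", "dbt_version": "1.8.0"},
    "env":    {"tool": "docker", "image": "ml-train@sha256:abc123", "seed": 42},
    "config": {"tool": "mlflow", "run_id": "run-481a", "params": {"lr": 0.001}},
}

def is_complete(m):
    """An experiment is reproducible only if all four dimensions are pinned."""
    return all(dim in m for dim in ("data", "code", "env", "config"))

serialized = json.dumps(manifest, sort_keys=True)   # store alongside the model
```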
The EU AI Act's documentation requirements for high-risk AI systems effectively mandate this level of reproducibility — but no production toolchain delivers it without significant custom integration.
Retention Policy
The rules governing how long historical data versions are preserved before being eligible for deletion. Retention policies balance storage costs (keeping every version forever is expensive) against the need for time travel, auditability, and reproducibility.
Snowflake's Time Travel retention is configurable from 0 to 90 days (Enterprise Edition; 1 day for Standard). After the Time Travel period, data enters Fail-safe — a 7-day Snowflake-managed recovery window that is not user-accessible. Delta Lake's default retention is 30 days for the transaction log and 7 days for vacuum-eligible data files, configurable via delta.logRetentionDuration and delta.deletedFileRetentionDuration. Iceberg's snapshot retention is managed by expire_snapshots procedures that remove snapshots older than a configurable threshold.
The tension: AI reproducibility often requires access to data versions months or years old (to reproduce a model training run from 6 months ago), while storage costs pressure organizations to minimize retention. Tiered storage strategies (recent versions in hot storage, older versions in cold/archive storage) partially address this, but most platforms don't support automatic tiering of historical versions — all versions within the retention window are stored at the same storage tier.
For regulatory compliance, retention policies must account for mandatory minimum retention periods — HIPAA requires 6 years, BCBS 239 requires sufficient history for trend analysis and stress testing, and the EU AI Act requires documentation of training data for the lifetime of high-risk AI systems. These requirements can conflict with storage cost optimization.
Rollback
Reverting a dataset or database to a previous known-good state — undoing recent changes to recover from errors. Rollback is the simplest and most critical versioning operation: the ability to say "go back to how it was before."
Snowflake supports rollback through multiple mechanisms: SELECT ... AT(TIMESTAMP => ...) reads historical data that can be used to overwrite the current table. UNDROP TABLE recovers accidentally dropped tables. CREATE TABLE ... CLONE ... AT(TIMESTAMP => ...) creates a clone from a historical point for comparison or recovery.
Delta Lake provides RESTORE TABLE table_name TO VERSION AS OF 42 (or TO TIMESTAMP AS OF '2025-06-15'), which atomically reverts the table to a historical version. Under the hood, this creates a new commit in the transaction log that sets the table's file list to match the historical version's file list — the old data files are still present (they were never deleted within the retention window), and the restore simply re-references them.
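The restore mechanics can be sketched as a toy commit log where each version records its full file set: restoring appends a new commit that re-references the historical file list, rather than rewriting history. Illustrative only, not Delta Lake's implementation:

```python
# Toy RESTORE: revert a table by appending a commit whose file set equals
# a historical version's file set. History is preserved, not rewritten.

log = [
    {"version": 0, "files": {"part-0.parquet"}},
    {"version": 1, "files": {"part-0.parquet", "part-1.parquet"}},
    {"version": 2, "files": {"part-2.parquet"}},   # a bad overwrite
]

def restore(log, to_version):
    """Append a new commit re-referencing the target version's files."""
    target_files = next(c["files"] for c in log if c["version"] == to_version)
    log.append({"version": log[-1]["version"] + 1, "files": set(target_files)})

restore(log, 1)
# The log now has a version 3 matching version 1's files; the bad version 2
# remains in history and stays queryable within the retention window.
```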
Iceberg supports rollback via CALL system.rollback_to_snapshot('table_name', snapshot_id) or CALL system.rollback_to_timestamp('table_name', timestamp).
Rollback is powerful but limited: it's a one-dimensional operation along the time axis. You can go back, but you can't create parallel timelines (branches) or selectively apply changes from different time points (cherry-pick). And rollback at the table level doesn't guarantee cross-table consistency — rolling back a fact table without rolling back its dimension tables can create referential integrity violations. Lake-wide rollback (reverting the entire lake to a consistent historical state) requires lake-wide versioning, which only lakeFS provides.
Tag
A named, immutable reference to a specific version or commit in a versioned dataset — a permanent bookmark that says "this exact state of the data has a name." Tags are the versioning equivalent of release versions in software: v1.0, training-data-2025-Q3, pre-migration-baseline.
Nessie supports tags as first-class objects: CREATE TAG training_v1 AT main creates a named reference to main's current commit. lakeFS supports tags similarly. Delta Lake doesn't have native tag support — the closest equivalent is recording version numbers or timestamps externally (in MLflow, in documentation, in a metadata table).
Tags serve distinct purposes from branches. A branch is mutable — it evolves as new commits are added. A tag is immutable — it always references the same commit. This immutability makes tags essential for:
Training data snapshots: tagging the exact data version used to train a production model, ensuring reproducibility. Compliance checkpoints: tagging the data state at regulatory reporting dates. Release coordination: tagging the data version that corresponds to a software release. Safe points: tagging known-good states before risky operations (migrations, schema changes, bulk updates).
For AI workloads, the combination of tags and immutable snapshots provides the foundation for training data provenance: "Model v3.2 was trained on data tagged training-2025-Q3, which references snapshot snap-7a3f2b1c containing 47.3M records across 12 tables." This chain — from model version to tag to snapshot to data files — is the minimum provenance required for regulatory compliance and debugging.
Three-Way Merge
A merge strategy that uses three reference points — the source branch, the target branch, and their common ancestor (the commit from which both branches diverged) — to intelligently combine changes from both branches. By comparing each branch against the common ancestor rather than directly against each other, the algorithm can distinguish "changed on source only," "changed on target only," and "changed on both" — enabling automatic resolution of non-overlapping changes.
In Git, three-way merge is the standard algorithm for combining divergent branches, and it works well because code changes are localized — editing function A doesn't usually conflict with editing function B. In data versioning, lakeFS is the primary tool implementing three-way merge for data: it identifies the common ancestor commit, diffs both branches against the ancestor at the file level, and merges non-conflicting changes automatically. Files modified on only one branch are accepted directly. Files modified on both branches are flagged as conflicts requiring resolution.
Three-way merge is substantially more powerful than fast-forward merge (which requires one branch to be completely unchanged) and is essential for collaborative data development where multiple teams modify different parts of the data estate concurrently. Without three-way merge, parallel data development is impractical — every concurrent modification creates a conflict that can only be resolved by discarding one branch's changes entirely.
The limitation of current implementations: three-way merge operates at the file level (which Parquet files changed?), not the row level (which rows within those files changed?). A file modified on both branches is flagged as a conflict even if the branches changed completely different rows within that file. Row-level three-way merge would resolve many more conflicts automatically, but requires reading and comparing actual data content — a computationally expensive operation at scale.
Time Travel
The ability to query data as it existed at a previous point in time — reading from a historical version or snapshot rather than the current state. Time travel is the foundation capability that makes data versioning possible: without the ability to read historical states, there's nothing to branch from, diff against, or roll back to.
Snowflake: Time Travel supports queries up to 90 days into the past (Enterprise Edition; 1 day for Standard Edition). Syntax: SELECT * FROM table AT(TIMESTAMP => '2025-01-15 10:00:00') or AT(OFFSET => -3600) or AT(STATEMENT => 'query-id'). Powered by micro-partition versioning — when a DML operation modifies data, new micro-partitions are created while old ones are retained for the Time Travel window. After Time Travel expiration, data enters a 7-day Fail-safe period (Snowflake-managed, not user-accessible).
Delta Lake: Version-based time travel with configurable retention (default 30 days). Syntax: SELECT * FROM table VERSION AS OF 42 or TIMESTAMP AS OF '2025-06-15'. Each version corresponds to a commit in the transaction log. Versions beyond the retention window are removed by the VACUUM command.
Apache Iceberg: Snapshot-based time travel. Each write creates an immutable snapshot with a unique ID. Syntax varies by engine — Spark: spark.read.option("snapshot-id", id).table("t"). Snapshots are managed by expire_snapshots procedures.
Time travel is invaluable but is not branching: you can read historical states but can't create parallel timelines that evolve independently, diff two versions structurally, or merge changes from different time points back together. Time travel is the read-only foundation; branching, merging, and diffing are the read-write operations that make data version control complete.
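The version-based flavor of time travel can be sketched with the append-only commit log described earlier. The class below is illustrative only (not any engine's API): each commit records the keys it added or removed, and reading "version as of n" replays the log up to that commit.

```python
# Toy version-based time travel over an append-only commit log.
# Real engines avoid full replay by reading checkpoints/snapshots,
# but the reconstruction logic is conceptually the same.

class VersionedTable:
    def __init__(self):
        self.log = []                        # append-only list of commits

    def commit(self, adds=(), removes=()):
        self.log.append({"adds": dict(adds), "removes": set(removes)})
        return len(self.log) - 1             # this commit's version number

    def as_of(self, version):
        """Replay commits 0..version to reconstruct that historical state."""
        state = {}
        for entry in self.log[: version + 1]:
            state.update(entry["adds"])
            for key in entry["removes"]:
                state.pop(key, None)
        return state
```

A query like Delta's `VERSION AS OF 0` corresponds to `table.as_of(0)` here: old versions remain readable because commits never mutate earlier log entries.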
Version Lineage
The graph of relationships between data versions — which version was derived from which, through what operations, by what actors. Version lineage extends data lineage (which tracks how data flows between tables and systems) with temporal depth: not just "table B was derived from table A," but "version 47 of table B was derived from version 23 of table A using transformation commit abc123."
Version lineage is critical for four use cases:
- Root cause analysis: when model quality degrades, trace back through version lineage to identify which data change introduced the problem.
- Impact analysis: before modifying a data source, understand which downstream versions (and which models trained on those versions) will be affected.
- Compliance: demonstrate to regulators the complete provenance chain from source data through every transformation to the final output.
- Reproducibility: reconstruct the exact data state that produced a specific output by following the version lineage graph backward.
Current tools provide fragments of version lineage: Delta Lake's transaction log records per-table version history. Unity Catalog tracks column-level lineage across tables. MLflow logs the data version used for each experiment. But connecting these fragments — linking a Delta version to the Unity Catalog lineage to the MLflow experiment to the model serving endpoint to the business outcome — requires cross-system integration that no single platform provides.
The metadata knowledge graph proposed in the AI control plane architecture would serve as the unified version lineage store — connecting data versions, transformation commits, model training runs, and deployment events in a single queryable graph.
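A version lineage graph and the backward traversal that root cause analysis relies on can be sketched as follows. The structure and version identifiers here are hypothetical, chosen to mirror the "version 47 of table B from version 23 of table A" example above.

```python
# Hypothetical version lineage graph: edges map a derived version to the
# versions it was computed from. Walking edges backward yields the full
# upstream provenance of any version.

derived_from = {
    "tableB@47": ["tableA@23"],                 # B v47 built from A v23
    "tableC@12": ["tableB@47", "tableA@23"],    # C v12 joins B v47 and A v23
}

def upstream(version: str, graph: dict) -> set[str]:
    """All ancestor versions reachable backward from `version`."""
    seen, stack = set(), list(graph.get(version, []))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(graph.get(v, []))      # keep walking upstream
    return seen
```

Impact analysis is the same traversal run in the opposite direction, over the reversed edge set.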
Write-Audit-Publish (WAP)
An Apache Iceberg workflow pattern that provides structured data staging: changes are written to a named branch (write), validated against quality rules (audit), and only then promoted to the main table (publish). WAP creates an isolated environment where new data can be tested without affecting production readers.
The workflow:
- Write: direct data writes to a named branch rather than the main table (in Spark, for example, by setting the spark.wap.branch session property). The branch is an isolated snapshot that receives the new data while the main table remains unchanged.
- Audit: run quality checks, validation rules, and governance policies against the branch. This might include schema conformance tests, data profiling, distribution analysis, or business rule validation.
- Publish: if all checks pass, promote the branch to main via fast-forward merge. If checks fail, the branch is discarded with no impact on production.
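The three-step cycle can be sketched in miniature. This is an illustration of the pattern, not Iceberg's actual API: audit checks are user-supplied callables, and publish enforces the fast-forward rule by rejecting the branch if main has advanced since the branch was created.

```python
# Toy Write-Audit-Publish cycle. A branch records the main-table version
# it was created from; publish succeeds only if main hasn't moved since
# (the fast-forward constraint discussed below).

class Table:
    def __init__(self):
        self.main_version = 0
        self.branches = {}                   # name -> (base_version, rows)

    def write(self, branch, rows):
        self.branches[branch] = (self.main_version, rows)

    def audit(self, branch, checks):
        _, rows = self.branches[branch]
        return all(check(rows) for check in checks)

    def publish(self, branch):
        base, rows = self.branches.pop(branch)
        if base != self.main_version:        # main changed: not a fast-forward
            raise RuntimeError("non-fast-forward publish rejected")
        self.main_version += 1
        return rows
```

The audit step is deliberately just a list of callables here, reflecting the point below that WAP supplies the staging mechanism while the quality checks themselves are entirely user-implemented.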
WAP is the most widely available form of data branching in the lakehouse ecosystem because it's built into Iceberg's table specification. It's supported by Spark, Trino, and other Iceberg-compatible engines without requiring external tools.
Limitations: WAP is single-table-scoped — each table is branched independently, so cross-table consistency isn't guaranteed. It supports only fast-forward merge — if main has changed since the branch was created, the publish step fails. It doesn't support three-way merge, cherry-pick, or conflict resolution. And the audit step is entirely user-implemented — WAP provides the staging mechanism but no built-in quality evaluation framework.
For simple data staging workflows (validate new data before it hits production), WAP is effective. For complex multi-table experimentation with concurrent branches, ML-driven quality gates, and lake-wide consistency, tools like lakeFS or Nessie are needed.
Part of The Sciencer Wiki. Maintained by The Sciencer Company.