Data Quality & Observability

Technical definitions for data quality and observability concepts: the six quality dimensions, observability platforms, anomaly detection, schema drift, distribution shift, data profiling, the AI quality chain, hallucination detection, golden-set regression, and why monitoring tables isn't enough when the consumer is an AI agent.

Data quality and observability answer a deceptively simple question: can you trust your data?

In the BI era, "trust" meant: does the table have the right schema, the expected row count, reasonable value distributions, and acceptable freshness? These are checkable properties of tabular data — and a mature ecosystem of tools (Monte Carlo, Great Expectations, Soda, dbt tests, Snowflake DMFs) monitors them effectively.

For AI, the question transforms. "Trust" means: will this data produce correct, safe, and useful AI outputs? That requires monitoring not just the data itself but the entire chain from raw data through features, embeddings, model predictions, agent actions, and business outcomes. Each link introduces failure modes that table-level quality monitoring can't detect. A feature table with perfect schema conformance can contain embeddings from a deprecated model version. A RAG pipeline can retrieve structurally valid documents that are semantically irrelevant. A prediction can be statistically accurate but operationally dangerous.

This cluster covers quality and observability from the BI baseline through the AI frontier — what works today, what's missing, and what "quality" needs to become.


AI Quality Chain

The end-to-end chain of quality that determines whether an AI system produces correct, safe, and useful outputs. The chain extends far beyond traditional data quality through six links:

Data quality → Feature quality → Model quality → Prediction quality → Action quality → Outcome quality

Each link introduces failure modes invisible to the link before it. Data quality tools verify that source tables have correct schemas and expected distributions — but can't detect that a feature engineering pipeline produced embeddings from a deprecated model version. Feature quality checks verify that features are fresh and statistically stable — but can't assess whether a model trained on those features generalizes to production data. Model quality metrics (accuracy, F1, AUC) capture average performance — but miss the long-tail failures where an agent's decision is catastrophically wrong for a specific customer segment.

The critical insight: errors compound through the chain. A subtle distribution shift in data quality (link 1) may cause a minor feature drift (link 2), which produces a model performance regression (link 3), which generates an incorrect prediction (link 4), which triggers an inappropriate agent action (link 5), which causes a measurable business harm (link 6). Current tools monitor link 1 only. No production platform monitors the complete chain end-to-end — because the chain crosses system boundaries (warehouse → feature store → model registry → model serving → agent framework → operational system) that no single tool spans.

The AI quality chain is the most important unsolved monitoring problem in AI infrastructure. The 80%+ AI project failure rate is not primarily a modeling problem — it's a quality chain monitoring problem.


Anomaly Detection (Data)

Automated identification of unusual patterns in data metrics — unexpected changes in row counts, null rates, value distributions, freshness, or schema. Data observability tools use statistical models to establish baseline behavior for each metric, then flag deviations that exceed a confidence threshold.

The detection approaches vary in sophistication. Static thresholds: alert when a metric exceeds a fixed value (null rate > 5%, row count < 1000). Simple, interpretable, but requires manual threshold setting per metric per table — doesn't scale. Statistical thresholds: compute mean and standard deviation from historical values, alert when the current value falls outside k standard deviations (typically k=3). Adapts to the metric's natural variability but assumes normal distribution and stationary behavior. ML-based detection: train time-series models (Prophet, ARIMA, isolation forests) on metric history to learn seasonal patterns, trends, and normal variability. Alert on deviations from the learned pattern. Monte Carlo and Anomalo use ML-based approaches.
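The statistical-threshold approach is simple enough to sketch directly. This minimal version (function name and sample data illustrative) assumes the metric history is roughly normal and stationary — exactly the assumption the ML-based approaches relax:

```python
import statistics

def is_anomalous(history, current, k=3.0):
    """Flag `current` if it falls outside k standard deviations of history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) > k * stdev

# Daily row counts that hover around 100K
history = [98_500, 101_200, 99_800, 100_400, 102_100, 99_300, 100_900]
print(is_anomalous(history, 100_700))  # False: within normal variation
print(is_anomalous(history, 55_000))   # True: far outside the baseline
```

With k=3 this flags roughly 1-in-370 normal observations as false positives per metric, which is why platforms monitoring thousands of metrics tune k or learn seasonality instead.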

Snowflake's Data Metric Functions support anomaly detection by comparing current DMF values against historical baselines using built-in statistical methods. Databricks' Lakehouse Monitoring provides automated statistical profiling and drift detection on Delta tables.

The limitation for AI: statistical anomaly detection operates at the table level — it detects when a column's null rate spikes or a distribution shifts, but not when that shift causes a downstream model to produce worse predictions. The causal connection between a data anomaly and an AI quality regression requires cross-system reasoning across the quality chain — reasoning that no current tool provides.


Circuit Breaker (Data Pipeline)

A pattern borrowed from electrical engineering and microservice architecture: when a data quality check fails, the pipeline stops processing rather than propagating bad data downstream. The circuit "opens" (halts data flow), preventing corrupted or invalid data from reaching consumers, and "closes" (resumes flow) only when the quality issue is resolved.

In practice, circuit breakers are implemented through quality gates at pipeline stage boundaries. dbt tests can be configured to fail a run when assertions are violated, blocking downstream models from executing against bad data. Lakeflow Expectations can quarantine records that fail validation, preventing them from entering downstream tables. Great Expectations checkpoints can halt pipeline execution when expectation suites fail.
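The gate pattern itself is tool-independent and can be sketched in a few lines — `quality_gate`, its modes, and the checks below are illustrative, not any framework's API:

```python
class QualityGateError(Exception):
    pass

def quality_gate(batch, checks, mode="fail"):
    """Evaluate (name, predicate) pairs over a batch at a stage boundary."""
    failures = [name for name, check in checks if not check(batch)]
    if failures and mode == "fail":
        # Circuit opens: halt the pipeline instead of propagating bad data.
        raise QualityGateError(f"gate open, failed checks: {failures}")
    return failures  # "alert but don't block" mode just reports

batch = [{"id": 1, "revenue": 120.0}, {"id": 2, "revenue": -5.0}]
checks = [
    ("non_null_id", lambda rows: all(r["id"] is not None for r in rows)),
    ("non_negative_revenue", lambda rows: all(r["revenue"] >= 0 for r in rows)),
]
print(quality_gate(batch, checks, mode="alert"))  # ['non_negative_revenue']
```

The `mode` parameter is the sensitivity trade-off discussed below in miniature: "fail" blocks the pipeline, "alert" lets bad data through but records the violation.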

The challenge is calibrating sensitivity. Too sensitive: the circuit breaker triggers on normal data variation, blocking legitimate data flow and creating false alarms. Too lenient: bad data passes through and corrupts downstream systems before anyone notices. Most teams err on the side of leniency (alert but don't block) because the cost of unnecessary pipeline stoppage — stale dashboards, delayed reports, broken SLAs — is immediately visible, while the cost of bad data propagation is delayed and diffuse.

For AI pipelines, circuit breakers need to evaluate quality dimensions beyond schema and statistics: embedding quality (are vectors from the correct model version?), semantic consistency (do extracted entities match expected ontologies?), and downstream impact (does this data change degrade model performance on the golden set?). No current circuit breaker implementation evaluates these AI-specific quality dimensions.


Data Completeness

A quality dimension measuring whether all expected data is present — no missing records, no missing fields, no missing time periods. A table that should have one record per customer per day is incomplete if any customer-day combinations are absent.

Completeness has multiple facets. Row completeness: are all expected records present? (Compare actual row count against expected count based on source system activity.) Column completeness: are all expected fields populated? (Null rate per column within acceptable bounds.) Temporal completeness: are all expected time periods represented? (No gaps in a daily partitioned table.) Referential completeness: do all foreign key references resolve? (Every order has a matching customer record.)
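Temporal completeness, for example, reduces to set arithmetic over expected partitions. A minimal sketch for a daily-partitioned table (function name hypothetical):

```python
from datetime import date, timedelta

def missing_days(present, start, end):
    """Return expected daily partitions absent from `present`."""
    expected = {start + timedelta(days=i) for i in range((end - start).days + 1)}
    return sorted(expected - set(present))

present = [date(2025, 1, 1), date(2025, 1, 2), date(2025, 1, 4)]
print(missing_days(present, date(2025, 1, 1), date(2025, 1, 5)))
# -> [datetime.date(2025, 1, 3), datetime.date(2025, 1, 5)]
```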

Monitoring completeness requires knowing what "complete" means — which requires either explicit expectations (data contracts, dbt tests) or statistical baselines (historical row counts, typical null rates). Monte Carlo monitors completeness as part of its volume and freshness pillars. dbt's not_null, unique, and relationships tests check specific completeness aspects.

For AI, completeness has a temporal dimension that BI tools don't address well. A model training on historical data needs complete feature coverage for every training example at the time of prediction — not at the current time. If a feature was added in March but the model trains on January data, January examples have missing features. This temporal completeness problem is particularly acute for feature stores where features are added incrementally over time.


Data Downtime

A metric — popularized by Monte Carlo — measuring the period during which data is missing, inaccurate, or otherwise unusable. Analogous to application downtime (the period a service is unavailable), data downtime quantifies the business impact of data quality issues in terms of time.

Data downtime includes: time with missing data (ingestion failure), time with stale data (pipeline delay beyond SLA), time with incorrect data (bad transformation, schema drift, upstream error), and time until detection (the gap between when the issue occurred and when someone noticed). Monte Carlo's research found that the average organization experiences 4–16 hours of data downtime per week, with detection latency accounting for a significant portion — teams often discover data issues hours or days after they occurred.

Data downtime is a useful business-level metric because it translates technical quality issues into a dimension executives understand: time. "Our revenue dashboard had incorrect data for 6 hours on Tuesday" is more actionable than "the null rate in the revenue column exceeded 3 standard deviations from the 30-day moving average."

For AI, data downtime takes on additional urgency because agents make autonomous decisions based on data. Six hours of stale data in a BI dashboard means analysts see old numbers. Six hours of stale data for an agent means six hours of potentially wrong autonomous actions — recommendations, approvals, classifications, or trades based on outdated information.


Data Freshness

A measure of how recently data was updated relative to its source — the time delta between the latest change in the source system and its reflection in the analytical or AI platform.

Freshness operates at every pipeline stage and the total freshness lag is cumulative. If a customer updates their address in the CRM (source), and the data must travel through CDC ingestion (seconds), warehouse loading (minutes), dbt transformation (scheduled hourly), quality validation (minutes after dbt), and materialized view refresh (on-demand) — the total freshness lag is the sum of all stage latencies: potentially 1–3 hours even with CDC-based real-time ingestion at the first stage.
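The cumulative arithmetic is worth making explicit — a toy end-to-end calculation against a per-consumer SLA (stage names and latencies illustrative, with the hourly dbt schedule counted at its worst-case full interval):

```python
stage_lag_seconds = {
    "cdc_ingestion": 5,
    "warehouse_load": 180,
    "dbt_transform": 3600,   # hourly schedule: worst case one full interval
    "quality_checks": 120,
    "view_refresh": 300,
}
total = sum(stage_lag_seconds.values())
sla = 1800  # e.g. a 30-minute freshness SLA for this consumer
print(total, total <= sla)  # 4205 False
```

Even with near-instant CDC at the first stage, a single hourly batch stage dominates the end-to-end lag and blows the SLA — which is why freshness must be measured across the full pipeline, not per stage.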

In HTAP systems, freshness is the primary quality metric. HTAP benchmarks (CH-benCHmark, HyBench) explicitly measure the delay between a transaction committing and that data becoming visible to analytical queries — ranging from microseconds (HyPer) to seconds (TiDB) to minutes (log-shipping architectures).

Monte Carlo monitors freshness by tracking when tables were last updated and alerting when the gap exceeds learned baselines. Snowflake's FRESHNESS DMF provides a built-in freshness check. dbt's freshness source configuration alerts when source tables haven't been updated within expected intervals.

For AI agents, freshness requirements depend on the use case. A fraud detection agent needs sub-second freshness. A customer recommendation agent might tolerate minutes. A financial reporting agent might accept hours. But the key insight is that the freshness requirement should be declared per use case (an SLA) and monitored end-to-end across the full pipeline — not measured per stage in isolation.


Data Metric Functions (DMFs)

Snowflake's native mechanism for defining and scheduling data quality checks. DMFs are SQL-based functions that evaluate a table and return a numeric value — any quality assertion expressible as a SQL query returning a number can be implemented as a DMF.

Built-in DMFs include: NULL_COUNT (number of null values in a column), ROW_COUNT (total rows in a table), ACCEPTED_VALUES (rows with values in a specified set), DUPLICATE_COUNT (number of duplicate values), and FRESHNESS (time since the table was last updated).

Custom DMFs extend this to arbitrary business rules:

CREATE DATA METRIC FUNCTION revenue_outlier_count(
  ref TABLE(revenue NUMBER)
) RETURNS NUMBER AS
  'SELECT COUNT(*) FROM ref WHERE revenue < 0 OR revenue > 10000000';

DMFs are scheduled on a configurable interval (down to every 5 minutes for Enterprise Edition), and Snowflake's anomaly detection compares current DMF values against historical baselines to identify deviations automatically.

DMFs represent the state of the art for BI-era data quality: SQL-defined, table-scoped, schema-and-statistics-focused. They can monitor any property expressible as a SQL aggregation over table data. They cannot evaluate AI-specific quality dimensions — embedding quality, retrieval relevance, model performance, semantic correctness — because these concepts exist outside the SQL domain. A DMF can count null values in an embedding column; it cannot assess whether the embeddings are from the correct model version or produce relevant similarity search results.


Data Observability

A discipline — and a market category coined by Monte Carlo in 2019 — focused on understanding the health of data in an organization by continuously monitoring five pillars: freshness (is data current?), volume (are record counts expected?), schema (has the structure changed?), distribution (are values within expected ranges?), and lineage (where did data come from and where does it go?).

Data observability is to data pipelines what Application Performance Monitoring (APM) is to web services. Just as Datadog monitors service latency, error rates, and throughput to detect application issues, Monte Carlo monitors data freshness, volume changes, and distribution shifts to detect data issues. Monte Carlo's self-positioning as "the New Relic for data teams" created the category and built a $1.6B valuation.

The five pillars map directly to the BI quality model: freshness matters because dashboard consumers need current data. Volume matters because missing or duplicated data corrupts aggregations. Schema matters because BI tools break on unexpected columns. Distribution matters because anomalous values distort metrics. Lineage matters because root-cause analysis requires tracing issues upstream.

Monte Carlo's platform connects to data warehouses and lakehouses (Snowflake, Databricks, BigQuery), ingests metadata and query logs, builds ML-based baselines for each table's normal behavior, and alerts when deviations occur. In 2025, Monte Carlo launched AI monitoring agents — extending observability to AI/ML pipeline metadata.

The limitation is structural: data observability monitors data pipeline health, not AI output correctness. A table can be perfectly healthy (fresh, complete, schema-stable, normally distributed) while producing terrible AI outputs because the quality issue lies downstream — in feature engineering, model performance, retrieval relevance, or action appropriateness. Observability is necessary but not sufficient for AI quality.


Data Profiling

The process of examining a dataset to collect summary statistics, understand its structure, and identify potential quality issues — before applying any transformations or quality rules. Profiling answers: what columns exist, what are their types, what are the value distributions, how many nulls, how many distinct values, what are the min/max/mean, are there obvious outliers?

Profiling is typically the first step in a data quality workflow. Tools like Great Expectations, Soda, and pandas-profiling generate comprehensive statistical profiles of datasets: per-column statistics (mean, median, standard deviation, percentiles, null count, distinct count), cross-column correlations, value frequency distributions, and pattern analysis (detecting date formats, email patterns, phone numbers in string columns).

Databricks' Lakehouse Monitoring provides automatic profiling for Delta tables — computing and tracking statistical profiles over time, enabling drift detection between profile snapshots. Snowflake offers comparable statistical profiling through its data quality and DMF-based monitoring capabilities.

Profiling serves multiple purposes: Discovery (understanding an unfamiliar dataset), Quality assessment (identifying issues before they affect downstream consumers), Baseline establishment (creating the statistical reference against which anomaly detection operates), and Schema inference (determining appropriate data types and constraints for tables being modeled).

For AI, profiling must extend beyond tabular statistics. Profiling an embedding column requires computing: dimensionality, magnitude distribution, nearest-neighbor distance distributions, cluster structure, and alignment with expected embedding model behavior. Profiling a document column requires: format distribution, language detection, length statistics, and content classification. These multi-modal profiling capabilities don't exist in current tools.
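A sketch of what embedding profiling could look like — the chosen statistics are illustrative, not any existing tool's profile:

```python
import numpy as np

def profile_embeddings(vectors):
    """Summary statistics for an embedding column (illustrative choices)."""
    v = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(v, axis=1)
    return {
        "count": v.shape[0],
        "dimensionality": v.shape[1],
        "norm_mean": float(norms.mean()),
        "norm_std": float(norms.std()),
        "zero_vectors": int((norms == 0).sum()),  # often a failed-encoding signal
    }

profile = profile_embeddings([[0.6, 0.8], [0.0, 1.0], [0.0, 0.0]])
print(profile["dimensionality"], profile["zero_vectors"])  # 2 1
```

A sudden change in dimensionality or norm distribution between profile snapshots is one of the few cheap signals that an embedding column silently switched model versions.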


Data Validation

The process of checking data against explicit, deterministic rules before it's accepted into a system. Validation is rule-based: "this column must be non-null," "this value must be between 0 and 100," "this foreign key must resolve to an existing record," "this date must be in the past." Data that fails validation is rejected, quarantined, or flagged.

Great Expectations is the dominant open-source data validation framework. It defines Expectations — composable, declarative assertions about data properties — organized into Expectation Suites (collections of expectations for a dataset) evaluated by Checkpoints (execution contexts that connect suites to data sources). Example expectations: expect_column_values_to_not_be_null, expect_column_values_to_be_between, expect_column_pair_values_A_to_be_greater_than_B, expect_column_values_to_match_regex.
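The expectation pattern — declarative, composable assertions grouped into a suite — can be sketched in plain Python. The function names below echo Great Expectations' naming but are standalone illustrations, not its actual API:

```python
def expect_column_values_to_not_be_null(rows, column):
    return all(r[column] is not None for r in rows)

def expect_column_values_to_be_between(rows, column, low, high):
    return all(low <= r[column] <= high for r in rows)

# A "suite": expectations plus their parameters for one dataset
suite = [
    (expect_column_values_to_not_be_null, {"column": "id"}),
    (expect_column_values_to_be_between, {"column": "score", "low": 0, "high": 100}),
]

rows = [{"id": 1, "score": 97}, {"id": 2, "score": 135}]
results = {fn.__name__: fn(rows, **kwargs) for fn, kwargs in suite}
print(results)
```

Running the suite yields a pass/fail per expectation — here the range check fails on the out-of-bounds score — which is the result shape a checkpoint reports.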

Soda provides a YAML-based quality definition language (SodaCL) that's more concise than Great Expectations for common checks. dbt tests embed validation in the transformation layer: unique, not_null, accepted_values, and relationships are built-in tests, with custom tests for arbitrary SQL assertions.

Databricks' Lakeflow Expectations evaluate SQL boolean expressions against Delta table rows at write time — operating inline with data processing rather than as separate scheduled checks. Records violating expectations can be flagged, quarantined to an error table, or used to fail the writing job.

Validation is complementary to observability. Validation catches known failure modes (rule violations you anticipated). Observability catches unknown failure modes (statistical anomalies you didn't anticipate). A comprehensive quality strategy uses both: validation gates at pipeline boundaries catch expected issues, while observability monitoring detects the unexpected.


Distribution Shift

A statistical change in the probability distribution of a data column or feature over time. Distribution shift is one of the most insidious quality issues for ML systems because data may remain structurally valid — correct types, no nulls, proper value ranges — while its statistical properties have changed enough to degrade model performance significantly.

Distribution shift types: Covariate shift — the distribution of input features changes while the relationship between features and target remains stable. Example: a loan model trained primarily on urban applicants starts receiving rural applications with different income distributions. Label shift — the distribution of target values changes. Example: fraud rates increase during holiday seasons, shifting the class balance. Concept drift — the relationship between features and target changes. Example: the features that predict customer churn today are different from those that predicted churn a year ago because the product changed.

Detection methods: Kolmogorov-Smirnov (KS) test — compares the cumulative distribution functions of two samples and produces a statistic measuring the maximum difference. A KS statistic above a threshold (typically with p-value < 0.05) indicates statistically significant distribution change. Population Stability Index (PSI) — measures the shift between a reference distribution and a current distribution by summing the percentage-of-total differences weighted by the log ratio. PSI values above 0.1–0.2 indicate meaningful shift. Jensen-Shannon divergence — a symmetric measure of the similarity between two probability distributions, bounded between 0 (identical) and 1 (completely different).
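The KS and Jensen-Shannon checks are available off the shelf in SciPy. A minimal sketch on synthetic data exhibiting a covariate shift — note that `scipy.spatial.distance.jensenshannon` returns the distance (the square root of the divergence), which with base 2 is also bounded in [0, 1]:

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=5000)
shifted = rng.normal(loc=0.8, scale=1.0, size=5000)  # mean shifted by 0.8

# KS test: small p-value -> the two samples differ significantly
stat, p_value = stats.ks_2samp(baseline, shifted)
print(p_value < 0.05)  # True

# Jensen-Shannon: compare binned histograms (normalized internally)
bins = np.linspace(-4, 4, 21)
p, _ = np.histogram(baseline, bins=bins, density=True)
q, _ = np.histogram(shifted, bins=bins, density=True)
js = jensenshannon(p, q, base=2)
print(js)  # 0 would mean identical distributions
```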

Distribution shift is a critical concern for ML-driven merge decisions in the data versioning context: when merging a data branch back to main, a distribution shift test can flag whether the branch's data is statistically consistent with the production baseline. This is an automated quality gate that prevents experimental data with unexpected distributions from contaminating production — but no current platform provides it natively.


Five Pillars of Data Observability

The framework — established by Monte Carlo — for categorizing what data observability monitors: Freshness, Volume, Schema, Distribution, and Lineage. Each pillar addresses a distinct dimension of data health.

Freshness: When was the table last updated? Is the update frequency within expected bounds? A daily table that hasn't been updated in 48 hours is a freshness violation.

Volume: How many records does the table contain? Is the count within expected bounds? A table that normally receives 100K records per day receiving only 500 indicates a potential upstream issue.

Schema: Has the table's structure changed? New columns, dropped columns, type changes, renamed columns — any structural change that downstream consumers may not expect.

Distribution: Are the statistical properties of column values within expected ranges? A revenue column that normally has a mean of $500 suddenly showing a mean of $50,000 indicates a potential data error, unit change, or upstream issue.

Lineage: Where did this data come from? What downstream tables, dashboards, or models depend on it? When an issue is detected in one table, lineage traces the impact to all affected consumers and traces the root cause to the originating source.

The five pillars were designed for BI data quality — they monitor the properties that matter for dashboard correctness. For AI, additional pillars are needed: Embedding quality (are vector representations generated by the expected model version, with expected dimensional properties?), Semantic consistency (does the meaning of data fields match expectations, not just their statistical profile?), Model performance (are downstream models degrading?), and Action safety (are agent actions within acceptable bounds?). These additional dimensions constitute the AI quality chain.


Golden-Set Regression Testing

A quality assurance practice where a curated set of known-correct input-output pairs (the "golden set") is used to continuously validate that system behavior hasn't regressed. If a system that previously produced the correct output for golden-set inputs starts producing different outputs, a regression has occurred.

In data infrastructure, golden-set regression applies to transformations and models. A dbt model has a golden set: a set of input records with known expected outputs. After any model change, the golden set is re-evaluated. If outputs change unexpectedly, the change is flagged for review. MLflow's evaluation framework supports golden-set regression for ML models: a curated evaluation dataset is scored against every model version, and performance metrics are tracked over time.
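Mechanically, the practice is a loop over curated cases. In this sketch, `system` stands in for the transformation, model, or agent under test (all names and data illustrative):

```python
golden_set = [
    {"input": "order total 3 * 4", "expected": "12"},
    {"input": "order total 10 - 2", "expected": "8"},
]

def system(text):
    # Toy stand-in: evaluate the arithmetic after "order total".
    expr = text.split("order total")[1]
    return str(eval(expr))  # illustration only; never eval untrusted input

regressions = [case["input"] for case in golden_set
               if system(case["input"]) != case["expected"]]
print(regressions)  # [] -> no regression; any entry blocks deployment
```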

For AI agents, golden-set regression is particularly valuable. A set of representative scenarios (customer queries, document processing tasks, decision situations) with known-correct agent responses serves as a continuous regression test. After any change to the data, features, models, prompts, or governance policies, the golden set is re-evaluated. If agent behavior changes for any golden-set scenario, the change is caught before production deployment.

The challenge is maintaining the golden set: it must be representative (covering the distribution of real-world inputs), current (reflecting the current data and business context), and curated (with verified correct outputs — which requires human judgment for complex AI outputs). MLflow 3.0's Automated LLM-Human Feedback (ALHF) system addresses part of this by collecting human evaluations that can seed and maintain golden sets.


Great Expectations

The most widely adopted open-source data validation framework. Great Expectations provides a Python-based framework for defining, executing, and managing data quality assertions ("expectations") against any data source accessible via Python.

The architecture has three core concepts. Expectations: individual assertions about data properties, composable and parameterized. The library includes 300+ built-in expectations (e.g., expect_column_values_to_be_unique, expect_column_mean_to_be_between, expect_table_row_count_to_be_between) and supports custom expectations. Expectation Suites: collections of expectations that together define the quality profile of a dataset. Checkpoints: executable configurations that connect suites to data sources, run validations, and produce results (pass/fail per expectation, with details on failing records).

Great Expectations generates Data Docs — HTML documentation of expectations and validation results — providing human-readable quality reports. It integrates with Airflow (checkpoint as a task), dbt (validation after transformation), and most data platforms (via SQLAlchemy connections or Pandas/Spark DataFrames).

The framework requires manual configuration: someone must define expectations for each dataset. This "expectations-as-code" approach brings software engineering discipline to data quality (version-controlled, testable, reviewable) but doesn't scale without automation. Defining expectations for 500 tables × 20 columns × 5 expectations each = 50,000 expectations — a significant ongoing maintenance burden.

For autonomous data quality, expectations would be auto-generated from data profiling (statistical baselines become automatic expectations) and auto-evolved as data patterns change. No current tool fully automates this loop.


Ground Truth

The known-correct answer against which a system's outputs are evaluated. In ML, ground truth is the labeled dataset: the "correct" classification, the actual outcome, the verified value that predictions are compared against.

Ground truth is expensive to produce because it requires human judgment, domain expertise, or verified real-world outcomes. For a sentiment classification model, ground truth means human annotators labeling thousands of text samples as positive, negative, or neutral. For a fraud detection model, ground truth means verified fraud investigations that confirm whether flagged transactions were actually fraudulent. For an agent action evaluation, ground truth means human assessment of whether the agent's decision was correct, safe, and appropriate.

The ground truth challenge intensifies with generative AI. For a RAG pipeline that generates answers from documents, "ground truth" is ambiguous — there may be multiple correct answers, varying levels of completeness, and context-dependent quality standards. Evaluation frameworks (MLflow's genai.evaluate(), RAGAS, TruLens) use proxy metrics — faithfulness, relevance, harmlessness — that approximate ground truth without requiring it for every evaluation.

For Autonomous DataOps, ground truth feeds the continuous evaluation loop. Agent actions are compared against ground truth (when available) and human feedback (when ground truth isn't available) to maintain a running quality assessment. MLflow 3.0's ALHF system collects the human feedback that generates ground truth for agent quality evaluation.


Hallucination Detection

The process of identifying cases where an AI model generates information that is not supported by its input data or retrieved context — producing plausible-sounding but factually incorrect output. Hallucination is the defining failure mode of generative AI and a first-order quality concern for any AI system that produces text, recommendations, or decisions.

Hallucination detection approaches: Factual verification — checking generated claims against a knowledge base or the retrieved context. If the model claims "the contract expires in March 2026" and the source document says "March 2027," the claim is a hallucination. Faithfulness scoring — measuring whether the generated output is entailed by (logically follows from) the provided context, without introducing unsupported claims. RAGAS and TruLens compute faithfulness scores using LLM-as-judge evaluation. Confidence calibration — assessing whether the model's expressed confidence aligns with its actual accuracy. Poorly calibrated models state incorrect facts with high confidence. Provenance tracking — requiring the model to cite specific source passages for each claim, enabling verification. Database provenance tools (as the Cambridge Report notes) can play "a crucial role in validating outputs."
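A deliberately simplistic faithfulness proxy illustrates the shape of the check — a claim counts as "supported" only if its content words appear in the retrieved context. Production systems use entailment models or LLM-as-judge scoring rather than word overlap:

```python
def support_score(claim, context):
    """Fraction of the claim's content words found in the context (crude proxy)."""
    claim_words = {w for w in claim.lower().split() if len(w) > 3}
    context_words = set(context.lower().split())
    if not claim_words:
        return 1.0
    return len(claim_words & context_words) / len(claim_words)

context = "the contract expires in march 2027 and renews annually"
print(support_score("the contract expires in march 2027", context))        # 1.0
print(support_score("the contract expires in march 2026", context) < 1.0)  # True
```

The second claim scores below 1.0 because "2026" never appears in the source — the wrong-expiry-year example from above, caught by even this crude check.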

Hallucination detection is part of the AI quality chain at the prediction quality and action quality links. A data table can have perfect quality (link 1), features can be correctly computed (link 2), and the model can have high benchmark accuracy (link 3) — but the model still hallucinates on specific production inputs (link 4), causing an agent to take an incorrect action (link 5).

No data infrastructure platform includes native hallucination detection. Snowflake's Cortex AI functions don't evaluate output faithfulness. Databricks' MLflow provides evaluation framework hooks (scorers) but the hallucination detection logic must be implemented externally. This gap — quality evaluation that spans from data through model output — is one of the most important unsolved problems in AI infrastructure.


Lakeflow Expectations

Databricks' mechanism for defining and enforcing data quality rules inline with data processing. Expectations are SQL boolean expressions attached to Delta table writes — evaluated as data flows through pipelines, not as separate scheduled checks.

import dlt

@dlt.table
@dlt.expect("valid_revenue", "revenue >= 0")
@dlt.expect_or_drop("non_null_id", "id IS NOT NULL")
@dlt.expect_or_fail("valid_date", "order_date <= current_date()")
def cleaned_orders():
    return dlt.read("raw_orders")

Three enforcement modes: Expect (log the violation but allow the record through), Expect or Drop (silently remove violating records), Expect or Fail (halt the pipeline on any violation — a circuit breaker). Violation metrics are tracked and visible in the Delta Live Tables UI.

Lakeflow Expectations are architecturally closer to continuous quality monitoring than scheduled checks like DMFs. Quality is evaluated as data moves through the pipeline, catching issues at write time rather than after the fact. The limitations mirror DMFs: expectations are SQL-based and table-scoped. They can validate that a revenue column is non-negative but can't evaluate whether an embedding column contains vectors from the correct model version.


Monte Carlo

The company that created the "Data Observability" category in 2019, providing an automated monitoring platform that detects, alerts on, and helps resolve data quality issues across data warehouses, lakehouses, and data lakes. Monte Carlo's positioning — "the Datadog for data" or "the New Relic for data teams" — anchored the category to the well-understood application monitoring paradigm.

Monte Carlo connects to data platforms (Snowflake, Databricks, BigQuery, Redshift), ingests metadata (table schemas, query logs, freshness timestamps, row counts), builds ML-based baselines of normal behavior per table, and generates automated alerts when metrics deviate from learned patterns. The platform provides incident management (grouping related alerts), root cause analysis (tracing issues through lineage), and impact analysis (identifying downstream tables, dashboards, and models affected by an issue).

In 2025, Monte Carlo launched AI monitoring agents — extending observability to AI/ML pipeline metadata: model performance metrics, feature drift, embedding quality, and inference latency. This represents the observability category's recognition that monitoring tables alone is insufficient for AI workloads.

Monte Carlo's competitive positioning matters for the Autonomous DataOps thesis: Monte Carlo tells you what broke. It doesn't tell you why it broke across multiple systems, and it doesn't fix it. The gap between detection and remediation — where data engineers spend 30–50% of their time — is the gap Autonomous DataOps aims to close.


Population Stability Index (PSI)

A statistical metric that quantifies the shift between two distributions — typically a reference (baseline) distribution and a current distribution. PSI is widely used in credit risk modeling (required by regulators to monitor scorecard stability) and increasingly applied to ML feature monitoring.

PSI is calculated by: (1) binning both distributions into the same set of intervals, (2) computing the percentage of observations in each bin for both distributions, (3) for each bin, computing (actual% - expected%) × ln(actual% / expected%), and (4) summing across all bins.

PSI interpretation: < 0.1 indicates insignificant change (no action needed), 0.1–0.25 indicates moderate change (investigate), > 0.25 indicates significant change (likely model retraining or pipeline investigation required).
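The calculation steps and interpretation thresholds above can be sketched in a few lines of standard-library Python (function names and the bin count are illustrative choices, not a normative implementation):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a current one.

    Bin edges come from the baseline's range so both samples share intervals;
    a small epsilon guards against log/division by zero in empty bins.
    """
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def bin_pcts(values):
        counts = [0] * bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1  # bin index for v
        eps = 1e-6
        return [max(c / len(values), eps) for c in counts]

    exp_pct, act_pct = bin_pcts(expected), bin_pcts(actual)
    # Step 3 and 4: per-bin (actual% - expected%) * ln(actual% / expected%), summed
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_pct, act_pct))

def interpret(value):
    """Map a PSI value to the standard action bands."""
    if value < 0.1:
        return "insignificant"
    if value <= 0.25:
        return "moderate"
    return "significant"
```

Comparing a distribution against itself yields a PSI of zero; a shifted distribution pushes mass into bins where the baseline had little, and the log term amplifies that asymmetry.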

PSI's advantage over the KS test for production monitoring: it's a single number with intuitive thresholds and a well-established interpretation framework (from decades of financial regulation). Its advantage over raw distributional statistics: it captures the overall shape change in a distribution, not just mean or variance shifts.

For ML-driven merge decisions in data versioning, PSI provides a quantitative quality gate. When merging a data branch back to main, compute PSI between the branch's feature distributions and the production baseline. If PSI > 0.25 for any critical feature, the merge is flagged for review. This is architecturally feasible (lakeFS webhooks can trigger PSI computation) but requires custom implementation — no platform provides it natively.


Quality Dimensions

The standard framework for categorizing aspects of data quality, typically comprising six dimensions. Different frameworks use slightly different definitions, but the core six are widely accepted:

Accuracy: Do data values correctly represent the real-world entity they describe? Is the customer's address actually their current address?

Completeness: Is all expected data present? Are there missing records, missing fields, or missing time periods?

Consistency: Are data values consistent across different representations and systems? Does the customer's name match between the CRM and the billing system?

Timeliness/Freshness: Is data available when it's needed, with acceptable recency? Is the data fresh enough for its intended use?

Uniqueness: Is each real-world entity represented exactly once? Are there duplicate customer records?

Validity: Do data values conform to defined formats, ranges, and business rules? Are all dates in the expected format? Are all prices positive?

These six dimensions were defined for human-consumed BI data. For AI workloads, additional dimensions are needed:

Representativeness: Does the dataset represent the full distribution of production inputs, including edge cases?

Temporal consistency: Are features computed using only information available at prediction time, without data leakage?

Semantic correctness: Do values mean what downstream systems assume they mean — not just structurally valid but semantically appropriate?

Provenance: Can each value be traced to its source, through every transformation, with confidence?
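Several of the BI dimensions reduce to simple ratios over a dataset. A minimal sketch over in-memory records (the records, field names, and rules below are hypothetical):

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "email": "a@x.com", "price": 19.0},
    {"id": 2, "email": None,      "price": 29.0},
    {"id": 2, "email": "b@x.com", "price": 29.0},  # duplicate id
    {"id": 3, "email": "c@x.com", "price": -5.0},  # invalid price
]

def completeness(rows, field):
    """Share of rows where the field is present."""
    return sum(r[field] is not None for r in rows) / len(rows)

def uniqueness(rows, key):
    """Share of rows carrying a distinct key value."""
    return len({r[key] for r in rows}) / len(rows)

def validity(rows, field, rule):
    """Share of non-null values passing a business rule."""
    vals = [r[field] for r in rows if r[field] is not None]
    return sum(rule(v) for v in vals) / len(vals)
```

Accuracy and consistency, by contrast, cannot be computed from the table alone: both require a second source of truth (the real-world entity, or another system) to compare against.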


Root Cause Analysis (Data Quality)

The process of tracing a detected data quality issue back to its originating cause — which may be several pipeline stages upstream from where the issue was detected. Root cause analysis is the most time-consuming phase of data incident resolution, often requiring manual investigation across multiple tools and systems.

When a data quality alert fires — say, the null rate in a revenue column spiked from 1% to 15% — the investigation traces backward: Was the null rate already high in the Silver layer? Yes → was it high in the Bronze layer? Yes → was the source system sending nulls? No, the source looks fine → was the ingestion connector handling a schema change incorrectly? Yes — a source system renamed revenue to total_revenue, Fivetran mapped the new column but left the old column's pipeline writing nulls.

This investigation requires querying across Fivetran (ingestion logs), Snowflake (table history, query history), dbt (model run logs, test results), and Monte Carlo (lineage graph, historical metrics). Each system has its own interface, its own log format, and its own understanding of the data estate.

Monte Carlo provides automated root cause analysis by combining lineage graphs (tracing data flow across tables) with anomaly correlation (identifying which upstream tables show related anomalies at similar times). This significantly reduces investigation time but still relies on metadata available to Monte Carlo — it can't trace into transformation logic (dbt model code) or external system behavior (source API changes).
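The lineage-plus-anomaly-correlation approach can be sketched as an upstream walk over a lineage graph, stopping at the furthest-upstream table that still shows a correlated anomaly (the graph, table names, and timestamps below are hypothetical):

```python
# Hypothetical lineage: table -> upstream tables it reads from.
lineage = {
    "gold.revenue": ["silver.orders"],
    "silver.orders": ["bronze.orders"],
    "bronze.orders": ["src.orders_api"],
}

# Tables with an active anomaly, keyed to the minute it was first observed.
anomalies = {"gold.revenue": 64, "silver.orders": 61, "bronze.orders": 60}

def root_cause(table, lineage, anomalies):
    """Walk upstream from the alerting table, following correlated anomalies.

    The furthest-upstream anomalous table is the likely root cause; its clean
    parents mark where the investigation hands off to the source team.
    """
    cause = table
    frontier = [table]
    while frontier:
        nxt = []
        for t in frontier:
            for parent in lineage.get(t, []):
                if parent in anomalies:
                    cause = parent
                    nxt.append(parent)
        frontier = nxt
    return cause
```

Here the walk ends at bronze.orders because src.orders_api shows no anomaly, pointing the investigation at the ingestion step rather than the source system.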

For Autonomous DataOps, root cause analysis is automated by the metadata knowledge graph — a unified model that spans the entire data estate (sources, ingestion, transformation, quality, governance, consumption) with enough context to reason about causal chains across system boundaries.


Schema Conformance

The most basic level of data quality: does the data match the expected structure? Correct column names, correct data types, expected constraints (non-null, unique, foreign key). Schema conformance is the first checkpoint in any data quality pipeline because it's the cheapest to verify and the most consequential to miss.
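A minimal record-level conformance check, assuming an expected schema expressed as a column-to-type mapping (the schema and column names are illustrative):

```python
# Hypothetical expected schema for an orders table.
EXPECTED = {"order_id": int, "revenue": float, "region": str}

def conformance_errors(row, expected=EXPECTED):
    """Return a list of schema violations for one record."""
    errors = []
    for col, typ in expected.items():
        if col not in row:
            errors.append(f"missing column: {col}")
        elif not isinstance(row[col], typ):
            errors.append(f"wrong type for {col}: got {type(row[col]).__name__}")
    for col in row:
        if col not in expected:
            errors.append(f"unexpected column: {col}")
    return errors
```

Warehouses and table formats run the equivalent of this check on every write; the point of the sketch is how little it inspects, since every assertion is about structure, none about meaning.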

Schema-on-write (enforced by warehouses and table formats at write time) prevents non-conforming data from entering the system. A Snowflake table rejects an INSERT where a column type doesn't match. Delta Lake's schema enforcement rejects writes that don't match the table's defined schema. This catches the most egregious errors immediately.

Schema-on-read (data lake pattern) allows any data into storage and applies schema interpretation at query time. This is flexible but dangerous — issues are discovered late, when a query or pipeline fails, rather than early, when bad data arrives.

Schema conformance is necessary but radically insufficient for AI quality. A table with perfect schema conformance — every column the correct type, no nulls where they shouldn't be, all foreign keys valid — can still produce terrible AI outputs. The embedding column might contain vectors from a deprecated model version (structurally valid float arrays, semantically useless). The text column might contain garbled OCR output (valid strings, meaningless content). The label column might contain outdated classifications (valid category values, incorrect assignments). Schema checks verify the container; AI quality requires verifying the content.


Schema Drift

Unannounced or unexpected changes to a data source's schema — new columns, renamed columns, changed data types, dropped columns, altered constraints. Schema drift is a leading cause of pipeline failures in the modern data stack because downstream transformations, tests, and dashboards assume a stable schema.

Schema drift originates at the source: a SaaS application adds a new field to its API response, a database administrator renames a column, an application team changes a column from INTEGER to BIGINT, a source system deprecates a field. These changes propagate through the pipeline as breaking changes — unless each layer handles drift gracefully.

How current tools handle drift: Fivetran automatically adds new columns and handles certain type changes, propagating schema changes to the destination. dbt does not handle drift automatically — if an upstream table adds or renames a column, dbt models referencing the old schema break on the next run, requiring manual model updates. Snowflake streams can detect schema changes through metadata inspection. Delta Lake provides schema evolution (automatic column addition) and schema enforcement (rejection of non-conforming writes), configurable per table.

In the context of data versioning, schema drift between branches creates merge conflicts. If Branch A adds a middle_name column and Branch B adds a phone_type column, merging both into main requires combining the schema changes — analogous to code merge conflicts but for table structures. No current data versioning tool resolves schema-level merge conflicts automatically.
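The branch-merge scenario above can be sketched as a three-way merge over schemas, treating each schema as a column-to-type mapping (a simplification; real table schemas also carry constraints, nullability, and ordering):

```python
def merge_schemas(base, branch_a, branch_b):
    """Three-way merge of column -> type mappings, like a code merge.

    Columns added on only one branch merge cleanly; the same column added or
    retyped differently on both branches is a conflict needing human review.
    """
    merged, conflicts = dict(base), []
    for col in set(branch_a) | set(branch_b):
        a, b = branch_a.get(col), branch_b.get(col)
        if a is not None and b is not None and a != b:
            conflicts.append(col)  # the branches disagree on this column
        else:
            merged[col] = a if a is not None else b
    # A column dropped on either branch stays dropped in the merge.
    for col in base:
        if col not in branch_a or col not in branch_b:
            merged.pop(col, None)
    return merged, conflicts
```

With Branch A adding middle_name and Branch B adding phone_type, this merges cleanly; if both branches added a column with the same name but different types, the conflict would surface for review, which is exactly the step no current data versioning tool automates.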


Self-Healing Pipeline

A data pipeline that detects failures, diagnoses root causes, and remediates automatically — without human intervention. Self-healing goes beyond monitoring (detecting the problem) and alerting (notifying a human) to include the remediation step that currently consumes a large share of data engineering time.

Self-healing behaviors for common failure modes:

Source unavailable: Switch downstream consumers to cached/stale state with staleness metadata (so consumers know the data is stale) → attempt reconnection with exponential backoff → on recovery, compute the delta between cache and current source state → backfill the gap → validate consistency → resume normal operation.

Schema change at source: Detect the change (new column, type change, rename) → generate a migration plan (add column to downstream tables, update transformations, adjust quality rules) → apply the migration atomically → backfill affected records → validate that downstream outputs match expected results.

Quality regression detected: Identify the specific quality dimension that degraded → trace root cause through the pipeline → if the cause is upstream and outside the pipeline's control, quarantine affected data and alert the source team → if the cause is internal (transformation bug, stale model), roll back to the last known-good version and trigger a re-evaluation.
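The source-unavailable behavior can be sketched as a cache-fallback loop with exponential backoff (`fetch`, `cache`, and the return shape are illustrative, not a real API; the delta computation and backfill steps are omitted):

```python
import time

def fetch_with_fallback(fetch, cache, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Serve the last good snapshot with staleness metadata while
    reconnecting to the source with exponential backoff.

    `fetch` pulls from the source; `cache["snapshot"]` holds the last good
    result as (data, fetched_at). `sleep` is injectable for testing.
    """
    for attempt in range(max_retries):
        try:
            data = fetch()
            cache["snapshot"] = (data, time.time())  # refresh the cache
            return {"data": data, "stale": False}
        except ConnectionError:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, 8s, 16s
    data, fetched_at = cache["snapshot"]
    return {"data": data, "stale": True, "as_of": fetched_at}
```

The essential detail is the staleness metadata: downstream consumers keep running, but they know they are reading a cached snapshot rather than live data.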

Current tools provide the monitoring layer: Monte Carlo detects the problem, PagerDuty notifies the engineer, and the engineer spends 2–8 hours diagnosing and fixing. Self-healing automates the 2–8 hour remediation loop. The gap between detection and remediation is where data engineers spend 30–50% of their time — and closing this gap is a core capability of Autonomous DataOps.


Service Level Agreement (SLA) / Service Level Objective (SLO)

Formal or operational commitments defining expected data quality levels — freshness targets, completeness thresholds, error rate bounds, and availability guarantees.

An SLO is an internal target: "the revenue table will be refreshed within 30 minutes of source changes, with less than 0.1% null rate in the revenue column, available for querying 99.9% of the time." An SLA is a contractual commitment with consequences for violations (contractual in the case of vendor SLAs, or operational in the case of internal team agreements).
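The example SLO above can be expressed as a checkable structure, which is the form data contracts push toward (field names and the return shape are illustrative):

```python
from dataclasses import dataclass

@dataclass
class TableSLO:
    """The revenue-table SLO from the text, expressed as data."""
    max_staleness_min: float = 30.0   # refreshed within 30 minutes
    max_null_rate: float = 0.001      # less than 0.1% nulls
    min_availability: float = 0.999   # queryable 99.9% of the time

def slo_violations(slo, staleness_min, null_rate, availability):
    """Return the list of objectives the current measurements violate."""
    violations = []
    if staleness_min > slo.max_staleness_min:
        violations.append("freshness")
    if null_rate > slo.max_null_rate:
        violations.append("null_rate")
    if availability < slo.min_availability:
        violations.append("availability")
    return violations
```

A non-empty result is what would feed the escalation workflow; an empty one means the objective is being met.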

Data SLOs are increasingly formalized through data contracts — explicit agreements between data producers and consumers specifying quality expectations. SLO violations trigger escalation workflows: alert the data team, escalate to the engineering manager if unresolved within 30 minutes, escalate to the VP if unresolved within 2 hours.

For AI, SLOs need to extend beyond table-level metrics to chain-level guarantees: "end-to-end latency from source change to agent action must be under 5 minutes" (spanning ingestion, transformation, feature computation, model inference, and action execution). "Model prediction accuracy on the golden set must remain above 92% after any data change." "Agent actions must pass safety evaluation 99.5% of the time." These cross-system SLOs require monitoring that spans the full AI quality chain — which no current tool provides.


Threshold Tuning

The process of configuring the sensitivity parameters for anomaly detection — determining how much deviation from baseline constitutes an alert-worthy issue. Threshold tuning is one of the most operationally painful aspects of data quality monitoring.

Set thresholds too tight: the system alerts on normal data variation, producing false positives that overwhelm the team and cause alert fatigue. Set thresholds too loose: real quality issues pass undetected until they cause downstream damage. The optimal threshold depends on the specific metric, the table's natural variability, the cost of false positives (wasted investigation time) vs. false negatives (undetected data corruption), and the downstream impact of quality issues.

Static thresholds (fixed numbers) require per-table, per-metric manual configuration and constant adjustment. ML-based approaches (Monte Carlo, Anomalo) learn dynamic thresholds from historical data, adapting to seasonal patterns and trend changes. But even ML-based thresholds require tuning: the confidence level for alerting, the training window length, the sensitivity to rare events, and the handling of concept drift in the metric itself.
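A minimal dynamic-threshold sketch: a mean-plus-k-sigma band learned from recent history, where `k` is the sensitivity knob being tuned. This is far simpler than what Monte Carlo or Anomalo learn (no seasonality, no trend), but the trade-off it exposes is the same:

```python
import math

def dynamic_threshold(history, k=3.0):
    """Alert band of mean +/- k standard deviations of recent history."""
    n = len(history)
    mean = sum(history) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in history) / n)
    return mean - k * std, mean + k * std

def is_anomalous(value, history, k=3.0):
    """True when the new observation falls outside the learned band."""
    lo, hi = dynamic_threshold(history, k)
    return not (lo <= value <= hi)
```

Shrinking `k` tightens the band and flags ordinary variation (false positives, alert fatigue); growing it loosens the band and lets real regressions through (false negatives), which is precisely the tension threshold tuning manages.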

For Autonomous DataOps, threshold tuning is automated through feedback loops: the system proposes thresholds, humans (or downstream quality evaluations) confirm or adjust them, and the system learns from the feedback to improve future threshold proposals. This supervised-then-autonomous approach mirrors how human engineers currently develop intuition about their data — but encodes that intuition in the system rather than in individual engineers' heads.


Part of The Sciencer Wiki. Maintained by The Sciencer Company.