Query Processing & Optimization

Technical definitions for query processing concepts: how databases parse, optimize, and execute queries — from partition pruning and join algorithms to federated query decomposition, AI-integrated SQL functions, serverless execution models, and the cross-modal query planning that AI workloads demand.


Between your SQL statement and the result set lies an extraordinarily sophisticated pipeline: lexical parsing, semantic analysis, logical plan generation, cost-based optimization, physical plan selection, parallel execution, and result assembly. This pipeline — refined over five decades of database research — is one of the most impressive engineering achievements in computer science.

It is also, almost entirely, optimized for one consumer: a human being writing SQL queries about structured data.

Understanding how query processing works — in detail, at the level of join algorithms, cost models, and execution strategies — reveals both why current systems are so effective for BI and exactly where they fail for AI workloads. The optimizer that brilliantly chooses between a hash join and a merge join has no concept of an embedding function, a vector search, or an LLM inference call. The cost model that accurately estimates the I/O of scanning a columnar partition has no basis for estimating the latency of a document parsing operation. Every gap maps to a concrete limitation that AI agents encounter today.


ACID Properties

The four guarantees that a database transaction must satisfy for reliable, correct operation — even in the presence of concurrent access, system failures, and partial execution.

Atomicity: A transaction is indivisible. Either all operations within it complete successfully, or none do. If a multi-step pipeline write fails on step 3 of 5, steps 1 and 2 are rolled back — the database returns to its pre-transaction state. Atomicity prevents partial writes, which are among the most insidious sources of data corruption in distributed systems.

Consistency: Every transaction moves the database from one valid state to another, enforcing all defined constraints — data types, uniqueness, foreign keys, check constraints. A transaction that would violate a constraint is rejected. Consistency ensures that the database's invariants (business rules encoded as constraints) always hold.

Isolation: Concurrent transactions don't observe each other's intermediate states. Under the strictest level (serializability), transactions running simultaneously produce the same result as if they had executed one after another. Different isolation levels (Read Uncommitted, Read Committed, Repeatable Read, Serializable, Snapshot Isolation) trade strictness for performance — weaker isolation allows more concurrency but permits more anomalies.

Durability: Once a transaction is committed, it survives any subsequent failure — power outages, crashes, hardware faults. This is achieved through write-ahead logging: the transaction's changes are written to a persistent log before the commit is acknowledged. Recovery replays the log to restore committed changes that hadn't been flushed to data files.
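The write-ahead-logging principle can be sketched in a few lines. This is an illustrative toy, not a real recovery protocol: the log lives in a Python list rather than a durable file, and the class and record shapes are invented for the example.

```python
# Minimal write-ahead-logging sketch (illustrative): changes are appended to
# the log before the commit is acknowledged; recovery replays the log and
# applies only writes belonging to committed transactions.

class MiniWAL:
    def __init__(self):
        self.log = []   # stands in for the durable log file
        self.data = {}  # stands in for the data files

    def commit(self, txn_id, writes):
        # 1. Append every change plus a COMMIT marker to the log first.
        for key, value in writes.items():
            self.log.append(("write", txn_id, key, value))
        self.log.append(("commit", txn_id))
        # 2. Only now may the commit be acknowledged; flushing into
        #    self.data can happen lazily (and may be lost in a crash).

    def recover(self):
        # Replay: apply writes only for transactions with a COMMIT record.
        committed = {rec[1] for rec in self.log if rec[0] == "commit"}
        for rec in self.log:
            if rec[0] == "write" and rec[1] in committed:
                self.data[rec[2]] = rec[3]
        return self.data

wal = MiniWAL()
wal.commit("t1", {"balance:alice": 90, "balance:bob": 110})
wal.log.append(("write", "t2", "balance:alice", 0))  # t2 never committed
print(wal.recover())  # t1's writes survive; t2's uncommitted write is dropped
```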

ACID was originally a property of OLTP databases, but the lakehouse revolution extended ACID to analytical workloads through table format transaction logs (Delta Lake, Iceberg). For AI agents composing operations across multiple systems — reading from a warehouse, writing to an operational store, logging to an audit trail — ACID guarantees must span the distributed stack, not just a single database. This requires distributed transaction protocols such as two-phase commit (2PC) or application-level compensating transactions, both of which add latency and complexity.


Adaptive Query Execution (AQE)

A query optimization technique where the execution engine collects runtime statistics during query execution and adjusts the remaining execution plan accordingly. Unlike traditional optimization, which commits to a complete plan before execution begins, AQE makes plan decisions incrementally as actual data distributions are observed.

Spark's AQE (available since Spark 3.0, used by Databricks) implements three key optimizations. Coalescing shuffle partitions: after a shuffle, AQE observes the actual data distribution and merges small partitions to reduce task overhead. Converting sort-merge joins to broadcast hash joins: if the optimizer overestimated a table's size and the shuffled data fits in memory, AQE switches to the faster broadcast join mid-execution. Skew join optimization: if a partition is disproportionately large (skewed), AQE splits it into smaller sub-partitions for parallel processing.

The significance of AQE is philosophical as much as technical: it demonstrates that query optimization doesn't have to be a one-shot decision. The optimizer can observe, learn, and adapt during execution — a principle that extends naturally to autonomous infrastructure. If a query engine can adapt its join strategy based on observed data distribution, a data platform can adapt its storage layout, compute allocation, and quality thresholds based on observed workload patterns. AQE is a microcosm of the autonomous operations principle applied to query execution.
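The first of AQE's three optimizations, coalescing shuffle partitions, can be sketched directly. The merging rule and the 64 MB target are simplified stand-ins for the real Spark heuristics; the point is that the decision uses observed sizes, not estimates.

```python
# Sketch of AQE-style shuffle partition coalescing: after a shuffle, merge
# adjacent small partitions until each group approaches a target size,
# reducing the number of downstream tasks.

def coalesce_partitions(sizes_mb, target_mb=64):
    groups, current, current_size = [], [], 0
    for i, size in enumerate(sizes_mb):
        current.append(i)
        current_size += size
        if current_size >= target_mb:   # group is big enough: close it
            groups.append(current)
            current, current_size = [], 0
    if current:
        groups.append(current)          # leftover tail of small partitions
    return groups

# 8 observed post-shuffle partitions, mostly tiny: 8 tasks become 3.
observed = [5, 3, 70, 2, 4, 60, 50, 1]
print(coalesce_partitions(observed))
```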


ai_query() / Cortex AI Functions

SQL-callable AI functions that embed machine learning inference — document parsing, text classification, embedding generation, summarization — directly within SQL query execution. These functions represent the first production integration of AI capabilities into the query processing pipeline, though their integration remains shallow.

Snowflake's approach provides task-specific functions: AI_PARSE_DOCUMENT() extracts structured fields from documents, AI_COMPLETE() generates text, AI_CLASSIFY() categorizes text, AI_EXTRACT_ANSWER() answers questions from context, and AI_SENTIMENT() scores sentiment. These are invoked within standard SQL:

SELECT customer_name,
       AI_PARSE_DOCUMENT(contract_pdf, 'renewal_date') AS renewal
FROM customers
WHERE region = 'EMEA'

Cost is tracked separately from warehouse compute via CORTEX_AISQL_USAGE_HISTORY. Cortex Search — Snowflake's RAG-optimized retrieval service — is a separate service object, not integrated into the SQL query planner.

Databricks' approach uses ai_query(), a more flexible function that can invoke any Model Serving endpoint (including custom fine-tuned models), and task-specific convenience functions (ai_summarize, ai_classify, ai_extract) that remain in Public Preview. Databricks' ai_query() is more extensible than Snowflake's fixed function set but less optimized for specific tasks.

The critical limitation of both approaches: AI functions are opaque to the query optimizer. The cost-based optimizer has no model for AI function execution cost (latency, compute, token consumption). It can't decide "filter structured columns first to reduce the candidate set, then call AI_PARSE_DOCUMENT only on the matching rows" — a decision that could reduce AI inference costs by orders of magnitude. AI functions are bolted onto SQL, not woven into query planning.


Broadcast Join

A distributed join strategy where the smaller of two tables is copied ("broadcast") in its entirety to every node in the cluster, then each node performs a local hash join between the broadcast table and its local partition of the larger table. No data shuffling of the large table is required.

Broadcast joins are optimal when one table is small enough to fit in each node's memory — a common pattern in analytical queries where a large fact table (billions of rows) is joined to a small dimension table (thousands to millions of rows). The cost is the network transfer and memory consumption of broadcasting the small table; the savings are the elimination of shuffling the large table.

The optimizer decides between broadcast and shuffle-based joins using table size estimates. If the small table is below a configurable threshold (default 10 MB in Spark, adaptive in Snowflake), the optimizer chooses a broadcast join. AQE can override this decision at runtime if the actual shuffled data is smaller than the estimate.
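The decision reduces to a threshold comparison. A sketch, using Spark's 10 MB default and a hypothetical AQE-style runtime override (the function names are invented for illustration):

```python
# Broadcast-vs-shuffle join selection. The 10 MB threshold mirrors Spark's
# spark.sql.autoBroadcastJoinThreshold default; the override parameter
# mimics AQE substituting an observed size for the optimizer's estimate.

BROADCAST_THRESHOLD_MB = 10

def choose_join_strategy(estimated_mb, observed_mb=None):
    # AQE-style override: trust the runtime-observed size over the estimate.
    size = estimated_mb if observed_mb is None else observed_mb
    if size <= BROADCAST_THRESHOLD_MB:
        return "broadcast_hash_join"
    return "shuffle_sort_merge_join"

print(choose_join_strategy(500))       # estimate too big: shuffle join
print(choose_join_strategy(500, 8))    # runtime shows 8 MB: switch to broadcast
```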

For AI workloads, the broadcast join pattern extends to a concept not yet implemented: broadcasting a model or embedding function to data rather than moving data to a centralized inference service. If an AI function (document parser, classifier, embedder) could be distributed to data-local nodes, inference would execute where data resides — eliminating the network transfer of potentially large unstructured content. This is the "push computation to data" principle that the Cambridge Report's federation research explores.


Caching Hierarchy

The multi-level system of data caches that modern disaggregated databases use to compensate for the latency gap between cloud object storage (~10–100ms) and local SSD or memory (~0.01–0.1ms). Effective caching is the primary reason disaggregated architectures achieve acceptable query performance despite reading from remote storage.

Snowflake implements a three-level cache. Result cache (shared, cross-warehouse): if the exact same query has been executed in the last 24 hours and the underlying data hasn't changed, the cached result is returned instantly — zero compute cost. Local disk cache (per-warehouse SSD): micro-partitions read from S3 are cached on the warehouse's local SSD. Subsequent queries reading the same data hit the local cache instead of S3. Metadata cache: partition-level statistics (min/max, Bloom filters) are cached in memory, enabling partition pruning without accessing storage at all.

Databricks implements similar layers: Delta Cache (local SSD caching of Parquet data on worker nodes), Photon's in-memory cache (columnar data retained in memory across queries), and result caching (reuse of previously computed query results).

For AI workloads, caching behavior changes because access patterns change. BI queries exhibit temporal locality (the same dashboards query the same recent data) and column locality (aggregation queries touch the same few columns). AI agent queries exhibit less predictable patterns — an agent might scan different columns, access historical data for context, or traverse relationships that span the table. Cache hit rates optimized for BI access patterns may degrade significantly under AI workload profiles, requiring different caching strategies (semantic caching, embedding caching, context caching) that no current system implements.


Cardinality Estimation

The process by which the query optimizer predicts how many rows each operation in a query plan will produce — the single most important (and most error-prone) component of cost-based optimization.

Consider a query joining three tables with filters: SELECT ... FROM orders o JOIN customers c ON o.cust_id = c.id JOIN products p ON o.prod_id = p.id WHERE c.region = 'EMEA' AND p.category = 'Electronics'. The optimizer must estimate: how many rows pass the region = 'EMEA' filter? How many rows result from the orders-customers join after filtering? How many survive the final product join and category filter? Each estimate depends on the previous one, and errors compound multiplicatively — a 2x overestimate at each of three stages produces an 8x error in the final estimate.

Estimation techniques: Histogram-based (look up the selectivity of region = 'EMEA' in the region column's histogram), Uniformity assumption (if no histogram exists, assume values are uniformly distributed — a 10-value column has 10% selectivity per value), Independence assumption (assume filter predicates are independent — P(region='EMEA' AND category='Electronics') = P(region='EMEA') × P(category='Electronics')). The independence assumption is the most common source of catastrophic errors, because real data is correlated (EMEA regions may have different product category distributions than APAC).

Cardinality estimation errors are the primary cause of poor query plans. An underestimate might cause the optimizer to choose a nested-loop join (efficient for small inputs) for a large intermediate result, or to broadcast a table that doesn't fit in memory. Modern research addresses this with learned cardinality estimation — using ML models trained on query execution history to predict cardinalities — but no production analytical database uses learned estimators as the primary estimation method yet.
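The independence assumption's failure mode is easy to reproduce. A sketch over a small synthetic (made-up) dataset where region and category are correlated:

```python
# The independence assumption vs. correlated data. The optimizer multiplies
# per-column selectivities; on correlated data the product can be badly off.

rows = (
    [("EMEA", "Electronics")] * 30      # EMEA buys mostly Electronics...
    + [("EMEA", "Apparel")] * 10
    + [("APAC", "Electronics")] * 5     # ...APAC mostly Apparel.
    + [("APAC", "Apparel")] * 55
)

def selectivity(pred):
    return sum(1 for r in rows if pred(r)) / len(rows)

sel_region = selectivity(lambda r: r[0] == "EMEA")            # 0.40
sel_category = selectivity(lambda r: r[1] == "Electronics")   # 0.35
independent_estimate = sel_region * sel_category              # 0.14
actual = selectivity(lambda r: r == ("EMEA", "Electronics"))  # 0.30

print(independent_estimate, actual)  # the optimizer underestimates by >2x
```

A 2x error on one filter is survivable; the same error repeated at every join stage compounds multiplicatively, as described above.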


Correlated Subquery

A SQL subquery that references columns from the outer query, causing it to be re-evaluated for every row of the outer query. Correlated subqueries are a common source of performance problems because they transform what looks like a single query into a nested-loop execution: the inner query runs once per outer row.

-- Correlated subquery: inner query references outer's customer_id
SELECT c.name, c.total_spend
FROM customers c
WHERE c.total_spend > (
    SELECT AVG(total_spend)
    FROM customers c2
    WHERE c2.region = c.region  -- correlated reference
)

Modern optimizers aggressively decorrelate subqueries — transforming them into equivalent joins or aggregations that execute more efficiently. The query above would typically be rewritten as a join against a pre-computed per-region average. Snowflake, Databricks, and PostgreSQL all perform automatic decorrelation.

For AI infrastructure, correlated subqueries interact with AI functions in problematic ways. A query like WHERE AI_CLASSIFY(doc) = 'high_risk' AND AI_EXTRACT(doc, 'amount') > threshold invokes two AI functions per row. If the query is correlated with an outer table, the AI functions execute once per outer row per inner row — potentially invoking millions of inference calls. Current optimizers can't reason about this cost or optimize the invocation order (e.g., "run the cheaper classification first to filter, then run extraction only on classified documents").


Cost-Based Optimization (CBO)

The query optimization strategy used by all major analytical databases, where the optimizer generates multiple candidate execution plans for a SQL statement, estimates the cost of each, and selects the plan with the lowest estimated cost.

The optimizer operates in three phases. Logical optimization applies algebraic transformations that are always beneficial: push predicates down, eliminate redundant projections, simplify expressions, fold constants, decorrelate subqueries. Physical optimization maps logical operations to physical implementations: a logical JOIN becomes a hash join, merge join, or nested-loop join; a logical SCAN becomes an index scan, full scan, or partition-pruned scan. Cost estimation evaluates each candidate physical plan using statistical metadata: table row counts, column cardinality (number of distinct values), value distributions (histograms), null fractions, and average column widths.

The cost model estimates three dimensions: I/O cost (how much data must be read from storage — the dominant factor for large analytical queries), CPU cost (computation for filtering, aggregating, hashing, sorting), and network cost (data transfer between nodes in distributed queries). Each physical operation has a cost formula: a hash join's cost depends on the build-side size (must fit in memory) and the probe-side size (sequential scan). A sort-merge join's cost depends on both sides being sorted (or the cost of sorting them).

CBO's effectiveness depends critically on the accuracy of statistics. Stale or missing statistics lead to wildly inaccurate cardinality estimates, which cascade into poor plan choices — the "cardinality estimation problem" is considered one of the hardest open problems in database research.

For AI workloads, CBO has a fundamental gap: no cost model exists for AI operations. The optimizer can estimate the cost of scanning a 10 GB partition (I/O time based on throughput) but cannot estimate the cost of calling AI_PARSE_DOCUMENT() on 10,000 files (depends on document complexity, model size, GPU availability, token count). AI-integrated queries are planned without cost awareness of their most expensive operations.


Cross-Modal Query Planning

A proposed query optimization capability where the planner decomposes a query spanning multiple data modalities — structured tables, semi-structured JSON, unstructured documents, vector embeddings, graph relationships — into sub-plans routed to specialized execution engines, with unified cost estimation across all modalities.

Consider an agent query: "Find customers in EMEA whose contracts mention price increases above 5%, whose support tickets show negative sentiment, and whose behavior is similar to customers who churned last quarter." This query spans: structured columns (region, churn status), unstructured documents (contract PDFs), semi-structured data (support ticket JSON), AI inference (sentiment analysis, document parsing), and vector search (behavioral similarity).

A cross-modal query planner would: (1) filter structured columns first (cheapest — standard partition pruning), (2) apply vector similarity search to the filtered set (medium cost — index-assisted), (3) invoke AI sentiment analysis on the reduced candidate set (expensive — model inference), (4) parse contract documents only for sentiment-positive candidates (most expensive — document AI). This order minimizes total cost by reducing the candidate set at each stage before invoking more expensive operations.

No production system implements this. TiDB's optimizer routes between row and column stores (two modalities). Snowflake's FILE type + Cortex AI handles structured + unstructured in limited cases. But true cross-modal planning requires cost models for AI operations, statistics for unstructured content (e.g., average document complexity), and optimization strategies that span radically different execution paradigms — SQL execution, vector index traversal, and LLM inference — within a single plan. The Cambridge Report identifies this as a frontier research challenge.
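The cost argument behind stage ordering can be made concrete. This sketch uses invented per-row costs and a uniform 20% selectivity per stage — both loud simplifications — but it shows why cheapest-first cascades dominate:

```python
# Cost-ordered filter cascade across modalities. Per-row costs are made-up
# illustrative figures; each stage is assumed to keep 20% of its input.

stages = [
    ("parse_contract_pdf", 500.0),   # document AI: most expensive per row
    ("sentiment_analysis", 50.0),    # model inference
    ("vector_similarity", 1.0),      # index-assisted search
    ("region_filter", 0.001),        # structured partition pruning: cheapest
]

def cascade_cost(ordered_stages, initial_rows, selectivity=0.2):
    # Each stage pays its per-row cost on the rows reaching it, then keeps
    # `selectivity` of them (a simplifying uniform assumption).
    total, rows = 0.0, float(initial_rows)
    for _name, per_row_cost in ordered_stages:
        total += rows * per_row_cost
        rows *= selectivity
    return total

cheap_first = sorted(stages, key=lambda s: s[1])
expensive_first = sorted(stages, key=lambda s: s[1], reverse=True)
print(cascade_cost(cheap_first, 1_000_000))      # cheapest-first cascade
print(cascade_cost(expensive_first, 1_000_000))  # naive order: ~80x costlier
```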


Data Skipping

A query optimization technique where the execution engine uses stored metadata to avoid reading data that cannot possibly match the query's filter predicate. Data skipping is the umbrella term for a family of techniques — partition pruning, zone maps, min/max filtering, Bloom filter checks — that all share the same principle: use cheap metadata checks to avoid expensive data reads.

Data skipping relies on statistics collected at write time and stored alongside data. When a Parquet row group records that its date column has min=2025-01-01 and max=2025-01-31, a query filtering on WHERE date = '2025-06-15' skips the entire row group without reading any actual values. When a Delta Lake file's zone map shows that its user_id column ranges from 'A' to 'M', a query for user_id = 'Z' skips the file.

The layers of data skipping, from coarsest to finest: File-level (skip entire files based on per-file statistics — Delta Lake, Iceberg), Partition-level (skip micro-partitions based on per-partition metadata — Snowflake), Row-group-level (skip Parquet row groups based on per-group column statistics), Page-level (skip individual pages within a column chunk), and Bloom-filter-level (test specific values against a probabilistic membership structure for high-cardinality columns).

Effective data skipping can reduce I/O by 90–99% for selective queries. It's the primary reason analytical databases can process petabyte-scale tables in seconds: they read only the fraction of data relevant to the query. Data skipping effectiveness depends entirely on data layout — which is why clustering (Liquid Clustering, automatic micro-partition organization, Iceberg sort orders) exists as a feature category: it organizes data to maximize skipping effectiveness for common query patterns.
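The core min/max mechanism fits in a few lines. The file names and statistics below are hypothetical:

```python
# Minimal zone-map sketch: per-file min/max statistics let the scanner skip
# files whose value range cannot contain the predicate value.

files = [
    {"name": "part-00.parquet", "min": "2025-01-01", "max": "2025-01-31"},
    {"name": "part-01.parquet", "min": "2025-02-01", "max": "2025-02-28"},
    {"name": "part-02.parquet", "min": "2025-06-01", "max": "2025-06-30"},
]

def files_to_scan(files, value):
    # Cheap metadata check: read a file only if min <= value <= max.
    # (ISO-8601 date strings compare correctly as strings.)
    return [f["name"] for f in files if f["min"] <= value <= f["max"]]

print(files_to_scan(files, "2025-06-15"))  # only part-02 is read; rest skipped
```

The same check, applied at partition, row-group, and page granularity, is what the layered list above describes.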


Execution Engine

The component of a database system that physically executes a query plan — reading data from storage, applying filters, performing joins, computing aggregations, and producing results. The execution engine is where the optimizer's plan becomes actual computation.

Three execution paradigms exist. Volcano/Iterator model: each operator (scan, filter, join, aggregate) implements an open()/next()/close() interface. The top-level operator calls next() on its child, which calls next() on its child, recursively. Each next() returns one row (tuple-at-a-time). Simple, composable, but slow due to per-tuple function call overhead. Used by PostgreSQL and most traditional databases.

Vectorized execution: operators process batches (vectors) of 1,024–10,000 values. A single next() call returns a batch of rows, amortizing function call overhead and enabling SIMD parallelism. Used by DuckDB, Snowflake, Databricks Photon, ClickHouse, and Velox.

Compiled execution: the query is compiled to machine code (via LLVM or similar) or bytecode before execution, eliminating interpretation overhead. Used by HyPer and by Spark's whole-stage code generation. It achieves the highest per-core throughput but incurs compilation latency.

Modern high-performance engines typically combine approaches. Photon is a vectorized engine written in C++ whose designers explicitly chose vectorization over code generation. DuckDB likewise uses pure vectorized execution without compilation. The choice affects AI workload performance: vectorized engines process homogeneous columnar data efficiently, but AI functions (which process heterogeneous unstructured content per-row) don't benefit from vectorization in the same way — each document or image requires independent processing.


Federated Query / Query Federation

A query processing architecture where a single query accesses data across multiple heterogeneous sources — different databases, file systems, APIs, or compute engines — without the user manually moving data between them. The federation layer decomposes the query into source-specific sub-plans, routes each to the appropriate source, executes them in parallel where possible, and assembles the results.

Current federation implementations: Trino (formerly PrestoSQL) connects to 40+ data sources and executes SQL across them. Snowflake External Tables provide read access to data in external object storage. Databricks Lakehouse Federation queries PostgreSQL, MySQL, SQL Server directly from Spark SQL. BigQuery Omni extends BigQuery to data in AWS and Azure.

The primary optimization for federated queries is predicate pushdown: sending filter conditions to the remote source to minimize network transfer. More sophisticated pushdown includes aggregation pushdown (compute partial aggregates at the source), projection pushdown (request only needed columns), and join pushdown (execute joins at the source if both tables are co-located).

The Cambridge Report identifies federation as increasingly important for regulated environments: "data residency laws require data not to leave specific jurisdictions." In this model, you federate the query (send computation to where data lives), not the data (move data to where computation lives).

For AI agents, federation needs to extend beyond SQL: pushing AI functions to remote data sources, executing vector searches against remote stores, and maintaining governance and lineage across federated boundaries. No current federation engine handles AI-native operations.


Hash Join

A join algorithm optimized for equi-joins (joins where the condition is equality: ON a.id = b.id) that works in two phases. The build phase reads the smaller table and constructs an in-memory hash table keyed on the join column. The probe phase scans the larger table and looks up each row's join key in the hash table — a constant-time operation.

Hash joins are the most common join algorithm in analytical databases because most analytical joins are equi-joins, and the hash table enables O(n + m) performance. The constraint: the hash table must fit in memory. If the build-side exceeds available memory, the engine spills to disk — partitioning both tables by hash value, writing partitions to disk, and joining matching partitions. Spilling dramatically increases I/O.

For distributed queries (MPP systems), hash joins require shuffling: redistributing both tables by the join key across nodes so matching keys are co-located. Shuffle I/O is often the dominant cost. AQE can convert shuffle-hash joins to broadcast joins when one side is small enough, eliminating the shuffle.
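The two phases are simple enough to show in full. A textbook sketch (dict-of-lists hash table, inner-join semantics):

```python
# Two-phase hash join: build a hash table on the smaller side, then probe
# it with the larger side. O(n + m) for equi-joins when the build side fits
# in memory.

def hash_join(build_rows, probe_rows, build_key, probe_key):
    # Build phase: hash table keyed on the join column of the small side.
    table = {}
    for row in build_rows:
        table.setdefault(row[build_key], []).append(row)
    # Probe phase: constant-time lookup per row of the large side.
    out = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            out.append({**match, **row})
    return out

customers = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]
orders = [{"cust_id": 1, "total": 50}, {"cust_id": 1, "total": 70},
          {"cust_id": 3, "total": 10}]  # cust 3 unmatched: dropped (inner join)
print(hash_join(customers, orders, "id", "cust_id"))
```

Spilling, when the build side exceeds memory, partitions both inputs by hash value and runs this same algorithm per matching partition pair.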


Join Ordering

The problem of determining the sequence in which multiple tables are joined in a multi-table query — one of the most important and computationally expensive decisions in query optimization.

For a query joining N tables, the number of possible join orderings grows super-exponentially. A 5-table join has 1,680 possible orderings; a 10-table join has over 17 billion. The differences between orderings can be enormous (100x or more) because intermediate result sizes vary dramatically.

The optimizer uses two strategies. Dynamic programming (PostgreSQL, Snowflake for small joins): enumerate all orderings bottom-up, pruning suboptimal partial plans. Guarantees optimality but is exponential — impractical beyond ~15 tables. Greedy/heuristic approaches (most engines for large joins): start with the most selective join and greedily extend. Fast but may miss the global optimum.
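The super-exponential counts quoted above follow the classic formula for the number of bushy join trees over n relations, (2(n-1))!/(n-1)!; a quick check reproduces both figures.

```python
# Number of possible (bushy) join orderings for an n-table join:
# (2(n-1))! / (n-1)!  — the classic join-tree counting formula.

from math import factorial

def join_orderings(n):
    return factorial(2 * (n - 1)) // factorial(n - 1)

print(join_orderings(5))   # 1,680 orderings for a 5-table join
print(join_orderings(10))  # over 17 billion for a 10-table join
```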

For AI queries, join ordering becomes more complex when AI operations are involved. A query joining structured data with AI-derived results must decide: apply the AI function first (reducing the candidate set for the join) or join first (reducing the input to the AI function)? This depends on the AI function's cost and selectivity — statistics the optimizer doesn't have.


Materialized View

A database object that stores the pre-computed results of a query, enabling the database to serve results directly rather than re-executing the underlying query. Materialized views trade storage space and maintenance cost for query latency.

Snowflake supports automatic materialized views (AMV) maintained transparently as underlying data changes. Databricks supports materialized views in Unity Catalog with automatic refresh. When a new query matches a materialized view's definition, the optimizer rewrites the query to read from the view — "materialized view rewriting."

The AI-native extension is incremental materialization: views maintained by processing only changed records. If a source with 100 million rows receives 1,000 new records, the view updates by processing only those 1,000. This enables near-real-time freshness without full recomputation.

For AI workloads, materialized views could pre-compute expensive operations: embeddings over documents (materialize once, update incrementally), feature aggregations, and classification results. No current system materializes AI function results incrementally.
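For an additive aggregate like SUM, incremental maintenance is straightforward; this toy sketch (invented class, insert-only for brevity) shows the principle of processing only the change set:

```python
# Incremental materialization of a per-group SUM: the view is updated from
# the newly arrived rows alone, never rescanning the full source table.
# (Real systems also handle updates/deletes; this sketch is insert-only.)

class IncrementalSumView:
    def __init__(self):
        self.totals = {}  # group key -> running SUM

    def apply_inserts(self, new_rows):
        # Process only the delta, not the base table.
        for group, amount in new_rows:
            self.totals[group] = self.totals.get(group, 0) + amount

view = IncrementalSumView()
view.apply_inserts([("EMEA", 100), ("APAC", 40)])  # initial load
view.apply_inserts([("EMEA", 25)])                 # one new record arrives
print(view.totals)  # {'EMEA': 125, 'APAC': 40}
```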


Merge Join (Sort-Merge Join)

A join algorithm that sorts both input tables on the join key and merges them in a single linear pass. Optimal when both inputs are already sorted (from an index, a prior sort, or physical layout). Cost after sorting: O(n + m) for the merge itself.

Merge joins produce sorted output (useful for downstream ORDER BY or window functions), handle non-equi joins (range conditions, inequality) more naturally than hash joins, and have a lower memory footprint (sequential access to sorted streams, no hash table). In distributed systems, merge joins avoid shuffles if tables are co-partitioned on the join key.

The trade-off vs. hash joins: if sorting is required, its O(n log n + m log m) cost often exceeds the hash table construction cost. Hash joins are therefore the default in analytical databases, with merge joins chosen when data is pre-sorted or when the query requires sorted output.
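The merge phase itself is a single coordinated scan of both sorted inputs. A sketch over plain sorted lists, handling duplicate keys on both sides:

```python
# Merge phase of a sort-merge join: one linear pass over two inputs already
# sorted on the join key. Runs of equal keys on both sides emit their
# cross product, as a join must.

def merge_join(left, right, key=lambda x: x):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Gather the full run of equal keys on each side.
            i2 = i
            while i2 < len(left) and key(left[i2]) == kl:
                i2 += 1
            j2 = j
            while j2 < len(right) and key(right[j2]) == kr:
                j2 += 1
            out.extend((l, r) for l in left[i:i2] for r in right[j:j2])
            i, j = i2, j2
    return out

print(merge_join([1, 2, 2, 4], [2, 2, 3, 4]))
```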


Nested-Loop Join

The simplest and most fundamental join algorithm: for each row in the outer table, scan the inner table for matching rows. The theoretical cost is O(n × m) — proportional to the product of both table sizes — making it prohibitively expensive for joining two large tables.

Despite this worst-case complexity, nested-loop joins are the optimal choice in specific scenarios. Indexed lookups: if the inner table has a B-tree or hash index on the join column, each inner lookup takes O(log m) or O(1) rather than O(m), reducing total cost to O(n × log m) or O(n). This is the primary join strategy in OLTP databases where one side is typically a single row (e.g., SELECT * FROM orders WHERE customer_id = 12345 joining against the customers table via an index). Tiny tables: if either table has only a handful of rows, the simplicity of nested-loop (no hash table construction, no sorting) outweighs the O(n × m) cost.

For AI queries, nested-loop patterns emerge naturally and unavoidably. A query like SELECT *, AI_CLASSIFY(document) FROM documents WHERE region = 'EMEA' is effectively a nested loop where each row triggers an independent AI inference call. Current engines execute this sequentially because AI functions are opaque and can't be vectorized — each document has unique content requiring independent processing. Optimizing this pattern (batching documents into a single inference call, caching repeated classifications, parallelizing across GPU instances) requires AI-aware execution strategies that no current engine provides.


Optimistic Concurrency Control (OCC)

A concurrency strategy where transactions proceed without acquiring locks, perform their work optimistically, and detect conflicts only at commit time. If no conflicting writes occurred during execution, the commit succeeds. If conflicts are detected, the transaction is aborted and retried.

Delta Lake and Apache Iceberg both use OCC for concurrent writes. When two writers add files to the same table concurrently, each writes independently, then attempts to commit by updating the transaction log. The commit succeeds if the new files don't conflict (appends rarely conflict; updates to the same partition may conflict).

OCC works well when write conflicts are rare — the common case for analytics where most writes are append-only. For AI agents that may write concurrently to shared datasets, update overlapping records, or modify the same metadata, conflict rates increase and retry overhead degrades throughput. No current table format provides conflict resolution strategies optimized for agent write patterns.
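The commit protocol reduces to a version check against the transaction log. This toy sketch (invented class) uses whole-table versioning; real table formats detect conflicts at finer granularity, which is why concurrent appends rarely conflict:

```python
# Optimistic concurrency over a table-format-style transaction log: a commit
# succeeds only if the log version the writer read is still current;
# otherwise the writer must re-validate its work and retry.

class TableLog:
    def __init__(self):
        self.version = 0
        self.commits = []

    def try_commit(self, read_version, new_files):
        if read_version != self.version:
            return False          # someone committed in between: conflict
        self.commits.append(new_files)
        self.version += 1
        return True

log = TableLog()
v = log.version                                     # both writers read v0
assert log.try_commit(v, ["a.parquet"])             # writer 1 wins
assert not log.try_commit(v, ["b.parquet"])         # writer 2 conflicts
assert log.try_commit(log.version, ["b.parquet"])   # retry against v1 succeeds
```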


Partition Pruning

A query optimization where the engine uses stored metadata to eliminate entire partitions from a scan based on query predicates. In Snowflake, each micro-partition stores per-column min/max values, null counts, and Bloom filter membership. A query filtering on WHERE date BETWEEN '2025-01-01' AND '2025-03-31' checks each partition's date range — partitions outside this range are pruned without reading any data.

Pruning effectiveness depends on data layout. Clustered data (date values concentrated within partitions) enables 95%+ pruning. Random data (every partition spans the full value range) enables zero pruning. This is why Snowflake's automatic clustering and Databricks' Liquid Clustering exist — they optimize data layout to maximize pruning effectiveness.

Multiple mechanisms layer: range pruning (min/max for range and equality predicates), Bloom filter pruning (membership testing for high-cardinality equality predicates), and zone map pruning (file-level statistics in Delta Lake). Pruning is maximally effective for BI queries (date ranges, category filters). It's less effective for AI patterns: vector similarity (no meaningful min/max for embeddings), multi-column combinations, and unstructured content predicates.
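A toy Bloom filter illustrates the second mechanism: a "definitely not present" answer lets the scanner skip a partition, while "maybe present" forces a read. This sketch uses Python's built-in hash for brevity; real engines use fixed hash functions so filters are stable across processes.

```python
# Toy Bloom filter for partition pruning on high-cardinality equality
# predicates. No false negatives; false positives possible but rare at
# this sizing (1024 bits, 3 hashes, 3 stored values).

class TinyBloom:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = 0  # bit array packed into one int

    def _positions(self, value):
        return [hash((seed, value)) % self.bits for seed in range(self.hashes)]

    def add(self, value):
        for p in self._positions(value):
            self.array |= 1 << p

    def might_contain(self, value):
        return all(self.array >> p & 1 for p in self._positions(value))

partition_filter = TinyBloom()
for user_id in ["u-104", "u-381", "u-777"]:
    partition_filter.add(user_id)

print(partition_filter.might_contain("u-381"))  # True: partition must be read
print(partition_filter.might_contain("u-999"))  # almost surely False: skip it
```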


Predicate Pushdown

An optimization where filter conditions are pushed as deep as possible into the execution tree — to the storage layer, within-file statistics, or remote sources in federated queries. The principle: filter early, transfer and process less.

Pushdown operates at multiple levels: Storage-level (predicates against file metadata — skip files), Scan-level (evaluate predicates during data decoding — emit only matching rows), Federation-level (send predicates to remote sources — minimize network transfer).

More sophisticated pushdown includes aggregation pushdown, projection pushdown (request only needed columns), join pushdown (execute co-located joins remotely), and limit pushdown (stop after N rows at the source).
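Federation-level pushdown can be illustrated with a toy remote source, contrasting filter-after-transfer with filter-at-source (the `RemoteSource` class and its row-count cost proxy are hypothetical):

```python
class RemoteSource:
    """Stand-in for a remote source in a federated query."""
    def __init__(self, rows):
        self.rows = rows
        self.rows_transferred = 0  # proxy for network transfer cost

    def scan(self, predicate=None):
        # With pushdown, the predicate runs at the source, so only
        # matching rows cross the network boundary.
        out = [r for r in self.rows if predicate is None or predicate(r)]
        self.rows_transferred += len(out)
        return out

source = RemoteSource([{"region": r, "rev": i} for i, r in
                       enumerate(["EMEA", "APAC", "EMEA", "AMER"])])

# No pushdown: transfer all 4 rows, filter locally.
local = [r for r in source.scan() if r["region"] == "EMEA"]

# Pushdown: the source evaluates the filter; only 2 rows transferred.
pushed = source.scan(lambda r: r["region"] == "EMEA")
print(len(local), len(pushed))  # 2 2
```

Both plans return identical results; the difference is entirely in how many rows crossed the (simulated) network.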

For AI federation, pushdown would need to extend to AI operations: pushing an embedding function to a remote document store (embed at source, transmit vectors), pushing classification to a remote database (classify at source, transmit labels). This "AI function pushdown" requires standardized interfaces for deploying AI models at heterogeneous sources — a capability that doesn't exist today.


Query Plan

The complete execution strategy a database generates for a SQL statement — a tree of physical operations specifying how data will be read, filtered, joined, aggregated, sorted, and returned.

Production stages: Parsing (SQL text → abstract syntax tree), Semantic analysis (resolve references, check types), Logical plan (tree of logical operators — Scan, Filter, Join, Aggregate), Optimization (predicate pushdown, join reordering, physical operator selection with CBO), Physical plan (final operator tree with parallelism, distribution, and execution order).

In distributed systems, the plan includes exchange operators (shuffles, broadcasts) between stages. Snowflake's Query Profile and Databricks' SQL execution plan expose plans graphically.
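The same pipeline can be observed in miniature with SQLite's `EXPLAIN QUERY PLAN` — a far simpler planner than a cloud warehouse's, but the parse → optimize → physical-plan sequence is the same, and the output shows the physical access path it chose:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, revenue REAL)"
)
conn.execute("CREATE INDEX idx_region ON orders (region)")

# EXPLAIN QUERY PLAN returns the chosen strategy as rows; the detail
# column reports an index search rather than a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT SUM(revenue) FROM orders WHERE region = 'EMEA'"
).fetchall()
for _, _, _, detail in plan:
    print(detail)  # e.g. "SEARCH orders USING INDEX idx_region (region=?)"
```

The exact wording of the detail string varies by SQLite version, but the SCAN-vs-SEARCH distinction is the planner's index-selection decision made visible.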

For AI infrastructure, current planners only generate plans for SQL operations. AI operations appear as opaque function calls. The planner can't reorder AI operations based on cost, can't parallelize AI calls across partitions, and can't cache or reuse AI results — the single biggest query processing limitation for AI workloads.


Query Rewriting

The optimizer's process of transforming a SQL query into an equivalent but more efficient form before generating an execution plan. Rewrites preserve correctness while improving performance.

Common rewrites: predicate pushdown, constant folding (1+1 → 2), expression simplification (x AND TRUE → x), subquery decorrelation, view merging (inlining views for global optimization), materialized view substitution, and predicate inference (deriving additional predicates from join conditions).

For AI queries, novel rewrites would be valuable: "if an AI function follows a selective filter, apply the filter first" (reducing invocations), "if the same document is processed by multiple AI functions, batch them" (reducing inference overhead), "if AI function output feeds a simple predicate, push the predicate into the AI function" (early termination). These require the optimizer to understand AI function semantics — currently unexplored territory.
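The first of those rewrites — run the cheap, selective filter before the expensive function — can be sketched as follows, with `classify` as a hypothetical stand-in for an LLM invocation:

```python
calls = 0

def classify(text):
    # Hypothetical stand-in for an expensive AI/LLM inference call.
    global calls
    calls += 1
    return "positive" if "good" in text else "negative"

docs = [{"lang": "en", "text": "good product"},
        {"lang": "de", "text": "sehr gut"},
        {"lang": "en", "text": "bad support"}]

# Naive plan: classify every row, then filter on language.
calls = 0
naive = [(d, classify(d["text"])) for d in docs]
naive = [(d, c) for d, c in naive if d["lang"] == "en"]
naive_calls = calls

# Rewritten plan: the selective filter runs first, so the expensive
# function is invoked only on surviving rows.
calls = 0
rewritten = [(d, classify(d["text"])) for d in docs if d["lang"] == "en"]
print(naive_calls, calls)  # 3 2
```

The rewrite is valid only because the filter does not depend on the function's output — exactly the kind of semantic knowledge the optimizer would need about AI functions.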


Result Cache

An optimization where the database stores complete results of previously executed queries and returns cached results for identical subsequent queries — zero compute cost, millisecond latency.

Snowflake's result cache persists for 24 hours, shared across all warehouses. A hit requires exact query match and unchanged underlying data. Hit rates of 30–70% are common for dashboard-heavy BI workloads.
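A minimal sketch of exact-match result caching: the cache key combines the literal query text with a data version, so any change to the underlying table invalidates the entry (all names and the version-numbering scheme here are illustrative):

```python
cache = {}
executions = 0

def run_query(sql, table_version, execute):
    """Return a cached result only if the identical query text ran
    against identical data; otherwise execute and store."""
    global executions
    key = (sql, table_version)
    if key not in cache:
        executions += 1
        cache[key] = execute(sql)
    return cache[key]

execute_fn = lambda sql: [("EMEA", 1200)]  # stand-in for real execution

q = "SELECT region, SUM(rev) FROM sales GROUP BY region"
r1 = run_query(q, table_version=7, execute=execute_fn)
r2 = run_query(q, table_version=7, execute=execute_fn)  # cache hit
r3 = run_query(q, table_version=8, execute=execute_fn)  # data changed: miss
print(executions)  # 2
```

The exact-match requirement is what hurts agents: a one-character difference in the query text produces a different key and a cache miss.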

For AI agents, result cache effectiveness drops because agents generate diverse queries with unique parameters. Semantic result caching (matching semantically equivalent queries) and partial result caching (caching intermediate results like embeddings) could restore benefits — neither exists in production.


Serverless Query Execution

A compute model where the database provisions resources per-query automatically, without requiring users to configure warehouses or clusters. Billing is per-query (data scanned or compute consumed) rather than per-hour (cluster uptime).

Databricks' Serverless SQL Warehouses auto-provision per-query. BigQuery has been serverless since inception. Snowflake's Snowpark Container Services provides serverless non-SQL compute.

Serverless eliminates sizing (choosing warehouse size) and scheduling (starting/stopping compute). For AI agents — generating unpredictable query volumes with bursty patterns — serverless is a significantly better fit than pre-provisioned warehouses. The remaining gap: current serverless models optimize for SQL queries, not heterogeneous compute (SQL + AI inference + vector search).


SQL (Structured Query Language)

The standard interface for relational database operations: querying (SELECT), modifying (INSERT, UPDATE, DELETE, MERGE), defining (CREATE, ALTER, DROP), and access control (GRANT, REVOKE). SQL has been the dominant data access language since IBM's System R prototype in the 1970s and remains the interface for virtually every analytical database in production today.

SQL's enduring success comes from its declarative nature: you specify what data you want (SELECT revenue FROM sales WHERE region = 'EMEA' GROUP BY quarter), not how to get it. The optimizer determines the execution strategy — which indexes to use, which join algorithms to apply, how to parallelize the scan. This separation of intent from implementation enables the database to optimize execution as data volumes, distributions, and hardware change, without the user modifying the query. A SQL statement written in 1990 can run on a 2025 cloud warehouse without changes, potentially executing 1000x faster thanks to optimizer improvements.

SQL is also standardized (ISO/IEC 9075), though every major database extends the standard with proprietary features: Snowflake's QUALIFY, Databricks' TRANSFORM, PostgreSQL's LATERAL, BigQuery's STRUCT operations. The practical implication: SQL is portable in principle but vendor-specific in practice.

The Cambridge Report explicitly acknowledges SQL's limits for AI workloads: "the idea of a single, universal language or paradigm (e.g., extending SQL) covering all data programming needs is unlikely, due to the diversity and specialization of data science tasks." This is a remarkable statement from the researchers who built SQL-based systems. SQL was designed for structured, tabular data with well-defined schemas. AI agents need vector similarity search (no SQL equivalent), graph traversal (SQL can express recursion but not efficiently), document parsing (requires UDF wrappers), and ML inference (external function calls). Each of these operations has its own natural query paradigm — similarity thresholds for vectors, pattern matching for graphs, natural language for documents.

The future likely involves SQL as one interface within a multi-modal query system — SQL for structured analytics, vector query syntax for similarity search, graph query languages (Cypher, SPARQL, GQL) for relationship traversal, and natural language for unstructured reasoning — unified by a cross-modal query planner that decomposes requests into the optimal combination of specialized sub-plans.


Statistics / Table Statistics

The metadata about data distribution that the query optimizer uses to estimate the cost of different execution plans. Accurate statistics are the foundation of effective cost-based optimization — without them, the optimizer is making educated guesses that can be wildly wrong.

Key statistics maintained by modern databases:

Row count: total rows in a table or partition — the starting point for all cardinality estimates.
Column cardinality: number of distinct values — critical for join selectivity; joining on a column with 10 distinct values produces very different intermediate sizes than one with 10 million.
Null fraction: percentage of null values — affects filter selectivity for IS NULL / IS NOT NULL predicates.
Histograms: distribution of values within a column — equi-depth or equi-width buckets that enable accurate selectivity estimation for range predicates like WHERE price BETWEEN 100 AND 200.
Min/max values: per column, per file or partition — enables partition pruning and data skipping.
Average column width: bytes per value — affects I/O cost estimation for projection and transfer.

Snowflake maintains statistics automatically as part of micro-partition metadata — no manual ANALYZE needed. Each micro-partition's header contains per-column min/max, null counts, and distinct value estimates. Databricks requires explicit ANALYZE TABLE ... COMPUTE STATISTICS for detailed column statistics (histograms, cardinality) beyond the file-level min/max automatically maintained in the Delta transaction log. PostgreSQL's autovacuum process collects statistics periodically, including most-common-values lists and correlation statistics.

The accuracy of statistics degrades in predictable ways. Stale statistics (collected before significant data changes) lead to cardinality underestimates or overestimates that cascade through join ordering and algorithm selection. Missing statistics (on columns not yet analyzed) force the optimizer to use default assumptions (often 10% selectivity for equality predicates, 33% for range predicates) that can be wildly wrong. Correlated columns (where knowing one column's value constrains another's — e.g., city and state) defeat independent-column assumptions that most optimizers rely on.
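How a histogram turns into a cardinality estimate can be shown with a small equi-depth example (bucket boundaries and row counts invented for illustration; real engines also track boundary frequencies and distinct counts per bucket):

```python
# Equi-depth histogram over a `price` column: each bucket holds the
# same number of rows, so skewed data gets narrow buckets where dense.
buckets = [(0, 20), (20, 45), (45, 100), (100, 180), (180, 1000)]
rows_per_bucket = 200_000  # 5 buckets x 200k = 1M rows total

def estimate_rows(lo, hi):
    """Estimate rows matching lo <= price <= hi, assuming values are
    uniformly distributed *within* each bucket."""
    total = 0.0
    for b_lo, b_hi in buckets:
        overlap = max(0, min(hi, b_hi) - max(lo, b_lo))
        total += rows_per_bucket * overlap / (b_hi - b_lo)
    return round(total)

# WHERE price BETWEEN 100 AND 200: all of bucket (100,180) plus a
# thin slice of the wide (180,1000) bucket.
print(estimate_rows(100, 200))  # ~205k of 1M rows
```

This estimate feeds directly into join ordering and operator selection — which is why a stale or missing histogram can cascade into a badly mis-shaped plan.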

For AI queries, statistics have a fundamental gap: no database collects statistics on unstructured content (average document length, text complexity distribution, language mix, entity density) or on AI function behavior (average inference latency by document type, output cardinality, cost per invocation, result distribution). Without these, the optimizer can't make informed decisions about AI operation placement, ordering, or parallelism — which is why AI-integrated queries execute with naive, unoptimized plans.


UDF (User-Defined Function)

A function written by the user (rather than built into the database) that can be invoked within SQL queries to extend SQL's capabilities beyond its built-in operations. UDFs enable custom logic — text parsing, business rules, mathematical functions, and increasingly, AI/ML inference — to execute within the query engine.

UDFs come in several varieties. Scalar UDFs accept a single row and return a single value (like a built-in function: MY_HASH(column) returns one value per row). Table-valued UDFs accept parameters and return a table (like Databricks' READ_FILES() or ai_query()). Aggregate UDFs accumulate state across multiple rows and return a single summary value (custom aggregations beyond SUM/AVG/COUNT). Window UDFs compute values across a sliding window of rows.

Implementation languages vary by platform: Snowflake supports JavaScript, Python, Java, Scala, and SQL UDFs. Databricks supports Python and Scala via Spark UDFs, including the higher-performance pandas_udf decorator for vectorized UDFs that process Arrow batches rather than individual rows (reducing serialization overhead by 10–100x for Python UDFs). Both platforms support external functions that call remote HTTP endpoints — enabling ML model invocation from SQL without bringing model code into the database.

The performance trade-off is significant. A Snowflake JavaScript UDF cannot be inlined into the vectorized execution engine — each row requires a context switch to the JavaScript sandbox. Standard PySpark UDFs serialize each row from the JVM to Python and back, adding milliseconds per row. Pandas UDFs mitigate this by processing Arrow batches but still incur JVM-Python boundary crossing.

For AI infrastructure, UDFs are the current mechanism for embedding AI into SQL — ai_query(), Cortex AI functions, and custom model-calling UDFs are all implemented as UDF variants. But UDFs are a workaround, not a solution. The optimizer treats them as opaque black boxes: it can't estimate their cost, can't reorder them relative to other operations for efficiency, can't push them down to remote data sources, and can't parallelize them beyond row-level distribution. Making AI a first-class query planning primitive requires moving beyond the UDF model to native AI operators with cost models, statistics, and deep optimizer integration.


Virtual Warehouse

Snowflake's compute abstraction: an independently scalable cluster of nodes processing queries against shared storage. Warehouses are sized from X-Small (1 credit/hour) through 6X-Large (512 credits/hour), each doubling resources. Multi-cluster warehouses auto-scale by adding clusters when queuing is detected.

The model was designed for BI: relatively few concurrent complex queries. AI agents generate the opposite pattern — many small fast queries at high concurrency. A warehouse sized for agent throughput wastes money on per-query compute; one sized for per-query cost can't handle concurrency. Serverless billing models are a better fit, but the underlying compute still optimizes for SQL analytics, not heterogeneous AI workloads.


Window Function

A SQL operation computing a value for each row based on a "window" of related rows — without collapsing rows (unlike GROUP BY). Window functions enable running totals, rankings, moving averages, and lag/lead analysis.

SELECT customer_id, order_date, revenue,
       SUM(revenue) OVER (PARTITION BY customer_id
                          ORDER BY order_date) AS running_total,
       RANK() OVER (ORDER BY revenue DESC) AS revenue_rank
FROM orders

Windows are defined by PARTITION BY (row grouping), ORDER BY (ordering within partitions), and frame specification (which rows relative to current row are included). Execution requires sorting and potentially large in-memory state.

For AI feature engineering, window functions compute temporal features (rolling averages, trends), behavioral sequences (lag/lead patterns), and rankings. The limitation: they operate on structured columns only. Computing "rolling sentiment" over interactions requires materializing sentiment scores first — the window can't invoke AI functions as part of its computation.


Write-Ahead Logging (WAL)

A durability mechanism where all database modifications are written to a sequential log on persistent storage before being applied to data pages. On crash, the log is replayed to reconstruct committed changes.

WAL operates as: (1) write operation → (2) log entry to WAL → (3) flush WAL to disk (durability point) → (4) modify data page in memory → (5) periodically checkpoint dirty pages to disk → (6) on crash, replay WAL from last checkpoint.
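The steps above can be sketched as a toy log-then-apply-then-replay loop, using a JSON-lines file as the log (the record format and file layout are invented for illustration):

```python
import json
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "wal.log")

def write(key, value, pages):
    # Steps (2)-(3): append the change to the WAL and fsync it to disk
    # before touching any data page. This flush is the durability point.
    with open(log_path, "a") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")
        f.flush()
        os.fsync(f.fileno())
    # Step (4): only now modify the in-memory page.
    pages[key] = value

pages = {}
write("a", 1, pages)
write("b", 2, pages)

# Simulated crash: in-memory pages are lost before any checkpoint.
pages = {}

# Step (6): recovery replays the log to reconstruct committed state.
with open(log_path) as f:
    for line in f:
        record = json.loads(line)
        pages[record["key"]] = record["value"]
print(pages)  # {'a': 1, 'b': 2}
```

Because the log is append-only and flushed first, every committed write survives the crash even though the data pages themselves were never persisted.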

WAL is universal in OLTP databases. In HTAP, the WAL serves double duty: durability for OLTP and change source for the analytical store. TiDB's Raft log is simultaneously a consensus mechanism and WAL. CDC systems (Debezium) read the WAL to capture changes.

For AI-native infrastructure, WAL extends to the append-only commit log proposed for data versioning: every change to any data asset is recorded immutably, enabling full history, branching, and rollback — WAL semantics applied to the entire data estate.


Part of The Sciencer Wiki. Maintained by The Sciencer Company.