AI & Machine Learning Infrastructure
AI and ML workloads demand infrastructure capabilities that the BI-era data stack was never designed to provide. Vector storage and retrieval at millisecond latency. Real-time feature serving with training-serving consistency. Model lifecycle management across experiment, staging, and production. Agent orchestration with governance at machine speed. Evaluation loops that measure quality from data through model to action to outcome.
These aren't incremental extensions of BI infrastructure — they're a different category of system requirements. The BI stack was designed for humans running SQL queries against structured tables. AI infrastructure must support models consuming heterogeneous data at training time, serving predictions at inference time, powering agents that reason and act autonomously, and maintaining quality across a chain that no single tool spans.
This cluster covers the technical concepts at the intersection of AI and data infrastructure — the systems, architectures, and capabilities that define what's possible (and what's missing) for production AI.
Agentic AI
AI systems where autonomous agents plan, reason, use tools, and take actions to accomplish goals — as opposed to passive AI that simply responds to prompts with text output. An agentic AI system might receive the goal "resolve this customer complaint," then autonomously retrieve customer history, analyze sentiment patterns, query product documentation, generate a resolution plan, execute it by updating CRM records, send a notification, and monitor the outcome — all without human intervention at each step.
The distinction between "AI" and "agentic AI" is the action loop. Standard AI takes input and produces output (text, classification, prediction). Agentic AI takes a goal and produces actions — reading data, calling APIs, writing to systems, invoking other agents — in a self-directed loop that continues until the goal is achieved or the agent determines it cannot proceed.
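The action loop can be sketched as a minimal control structure. Everything here is an illustrative stand-in: `plan_next_action` and `execute` would be LLM planning calls and tool integrations in a real agent, and the action names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    goal: str
    max_steps: int = 10
    history: list = field(default_factory=list)

    def plan_next_action(self):
        # Stand-in for an LLM planning call: choose the next tool invocation.
        remaining = [a for a in ("retrieve_history", "analyze_sentiment", "update_crm")
                     if a not in self.history]
        return remaining[0] if remaining else None

    def execute(self, action):
        # Stand-in for a tool call (API request, database write, notification).
        self.history.append(action)
        return f"{action}: ok"

    def run(self):
        # The self-directed loop: plan, act, observe, repeat until done
        # (or until the agent determines it cannot proceed).
        for _ in range(self.max_steps):
            action = self.plan_next_action()
            if action is None:
                break
            self.execute(action)
        return self.history

agent = Agent(goal="resolve customer complaint")
print(agent.run())  # → ['retrieve_history', 'analyze_sentiment', 'update_crm']
```

The `max_steps` bound is the simplest guardrail against runaway loops; production agents layer authorization checks on every `execute` call, which is the subject of the AI control plane entry below.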
Agentic AI is the workload that most severely stresses current data infrastructure. Databricks reported a 327% increase in multi-agent system deployments over a 4-month period in 2025. Gartner predicts over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear value, and inadequate infrastructure. The gap between agentic AI ambitions and data infrastructure readiness is the central thesis of the AI data infrastructure problem.
For data infrastructure, agentic AI introduces five demands that BI infrastructure cannot satisfy: bidirectional data flow (agents read and write), multi-modal access (agents need structured, unstructured, and vector data), machine-speed governance (authorization decisions in milliseconds), compositional safety (evaluating action sequences, not individual accesses), and semantic observability (understanding why an agent acted, not just what it accessed).
AI Control Plane
A proposed infrastructure layer providing unified governance, observability, and coordination for AI systems operating on data infrastructure. The AI control plane is to AI agents what Kubernetes is to containers — the management layer that orchestrates, governs, and observes autonomous components.
Four components define the AI control plane:
Metadata knowledge graph: A queryable graph connecting data assets, compute resources, actors (humans and agents), actions, and outcomes. Unlike traditional catalogs (which model data assets in hierarchies), the knowledge graph captures relationships: this agent accessed this table to produce this prediction, which triggered this action, which resulted in this outcome. Queryable via graph languages (Cypher, SPARQL) or SQL for administrative use.
Action-level policy enforcement: Implements the propose → authorize → execute pattern for agent actions. Every agent action is a structured proposal evaluated against multi-dimensional policies (delegation authority, capability scope, data sensitivity, business rules, regulatory constraints) before execution is permitted.
Semantic observability: Captures structured traces — not just log entries — that record the agent's goal, data accessed, reasoning chain, tools invoked, action proposed, authorization decision, execution result, and downstream impact. Built on OpenTelemetry's distributed tracing model but with AI-specific span types and typed metadata.
Continuous evaluation: Automated quality assessment running scorers against agent outputs. Golden-set regression testing after any change. Human feedback collection and integration (via ALHF). Model performance monitoring with automatic alerting on degradation.
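The propose → authorize → execute pattern can be sketched in a few lines. The policies here (capability scope, a business rule on refund size) are hypothetical examples; a real control plane would also evaluate delegation authority, data sensitivity, and regulatory constraints.

```python
from dataclasses import dataclass

@dataclass
class ActionProposal:
    agent_id: str
    action: str            # e.g. "issue_refund"
    target: str            # e.g. "customer:4812"
    amount: float = 0.0

# Illustrative multi-dimensional policies; each must pass before execution.
POLICIES = [
    lambda p: p.action in {"read_history", "send_notification", "issue_refund"},  # capability scope
    lambda p: not (p.action == "issue_refund" and p.amount > 500),                # business rule
]

def authorize(proposal: ActionProposal) -> bool:
    # Propose -> authorize: the structured proposal is evaluated, not the payload.
    return all(policy(proposal) for policy in POLICIES)

def execute(proposal: ActionProposal) -> str:
    # Authorize -> execute: execution only proceeds on approval.
    if not authorize(proposal):
        return "DENIED"
    return f"EXECUTED {proposal.action} on {proposal.target}"

print(execute(ActionProposal("agent-7", "issue_refund", "customer:4812", 120.0)))
print(execute(ActionProposal("agent-7", "issue_refund", "customer:4812", 9000.0)))
```

The key design point: policies evaluate a structured proposal (who, what, on which target, how much), which is what distinguishes action-level enforcement from the payload-level content filtering discussed under Mosaic AI Gateway.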
No complete AI control plane exists in production. Databricks' combination of Unity Catalog + MLflow 3.0 + Mosaic AI Gateway covers approximately 40% of the architecture. Microsoft's Foundry Control Plane (Public Preview) provides Entra Agent ID — the only production agent identity implementation — but operates at the prompt level, not the business action level, and is Azure-locked.
ALHF (Automated LLM-Human Feedback)
A framework — implemented in MLflow 3.0 — for aligning automated LLM-based evaluation with human preferences. ALHF addresses a critical scaling problem: human evaluation of AI outputs is the gold standard but costs $1–10+ per evaluation. LLM-as-judge evaluation scales to millions of evaluations at cents per evaluation — but LLM judges can disagree with human preferences systematically.
ALHF closes this gap through a three-step process. Collection: gather human evaluations on a representative sample of AI outputs (typically hundreds to low thousands). Alignment: train a calibration layer that adjusts the LLM judge's scoring to match human preference patterns — learning which dimensions humans weight differently from the base LLM judge. Scaling: apply the aligned judge to the full volume of production outputs, with periodic human re-calibration to detect judge drift.
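The alignment step can be as simple as a fitted linear calibration, sketched below with hypothetical scores (real calibration layers may learn per-dimension weights rather than a single linear map):

```python
import numpy as np

# Hypothetical calibration sample: LLM-judge scores vs. human scores
# on the same outputs. The judge systematically scores high.
judge_scores = np.array([4.8, 4.5, 3.9, 2.2, 4.7, 1.5, 3.0, 4.9])
human_scores = np.array([4.0, 3.8, 3.5, 2.0, 4.1, 1.4, 2.6, 4.2])

# Alignment: least-squares fit of a (slope, intercept) map from the
# judge's scale onto the human preference scale.
A = np.vstack([judge_scores, np.ones_like(judge_scores)]).T
slope, intercept = np.linalg.lstsq(A, human_scores, rcond=None)[0]

def aligned_judge(raw_score: float) -> float:
    # Scaling: apply the calibrated judge to the full production volume.
    return float(np.clip(slope * raw_score + intercept, 1.0, 5.0))

print(round(aligned_judge(4.8), 2))  # pulled down toward the human scale
```

Periodic re-calibration against fresh human samples catches judge drift: if the fitted slope or intercept moves materially between calibration rounds, the judge's behavior has changed.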
The result is a scalable evaluation loop: expensive human feedback calibrates cheap automated evaluation, which enables continuous quality monitoring without requiring human review of every output. This loop is the practical mechanism for maintaining the "continuous evaluation" component of the AI control plane — you can't have humans evaluate every agent action, but you can't trust uncalibrated automated evaluation either. ALHF bridges the gap.
Approximate Nearest Neighbor (ANN) Search
A family of algorithms for finding vectors in a high-dimensional space that are "close" to a query vector, with the critical relaxation that results are approximately correct rather than exactly correct. This relaxation enables searches over billions of vectors in milliseconds — whereas exact nearest neighbor search requires comparing the query against every vector in the dataset.
The key ANN index types:
HNSW (Hierarchical Navigable Small World): Builds a multi-layer graph where each vector is connected to its neighbors. Search starts at the top layer (sparse, long-range connections) and descends through layers (progressively denser, shorter-range connections) to converge on the nearest neighbors. HNSW achieves high recall (95–99% of exact nearest neighbors found) with sub-millisecond query latency on millions of vectors. It's the default index type in most vector databases (Pinecone, Weaviate, Qdrant, pgvector).
IVF (Inverted File Index): Partitions the vector space into clusters using k-means. At query time, the search probes only the nearest clusters, reducing the comparison set. IVF is faster to build than HNSW but typically achieves lower recall at equivalent latency. Often combined with Product Quantization (PQ) for memory efficiency.
Product Quantization (PQ): Compresses vectors by dividing each vector into sub-vectors and quantizing each sub-vector using a codebook. Reduces memory by 4–32x at the cost of some recall accuracy. PQ is typically used in combination with IVF for large-scale deployments where memory is the constraint.
The recall-latency-memory trade-off in ANN search is fundamental: higher recall requires larger indexes (more memory) and more comparisons (higher latency). Production systems tune these trade-offs based on use case requirements — a recommendation engine might accept 90% recall for sub-millisecond latency, while a legal document search might require 99% recall at the cost of higher latency.
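The recall trade-off can be demonstrated with a toy IVF index: cluster the vectors, probe only the nearest clusters at query time, and measure recall against exact search. The data, cluster count, and probe count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 32)).astype(np.float32)   # 2,000 vectors, 32 dims
query = rng.normal(size=(32,)).astype(np.float32)

# Exact nearest neighbors: compare the query against every vector.
exact = np.argsort(np.linalg.norm(data - query, axis=1))[:10]

# Toy IVF: partition vectors into k clusters (a few k-means iterations),
# then probe only the nprobe clusters nearest the query.
k, nprobe = 16, 4
centroids = data[rng.choice(len(data), k, replace=False)]
for _ in range(5):
    assign = np.argmin(((data[:, None] - centroids) ** 2).sum(-1), axis=1)
    for c in range(k):
        if (assign == c).any():
            centroids[c] = data[assign == c].mean(axis=0)

probe = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
candidates = np.where(np.isin(assign, probe))[0]
approx = candidates[np.argsort(np.linalg.norm(data[candidates] - query, axis=1))[:10]]

recall = len(set(exact) & set(approx)) / 10
print(f"recall@10 probing {nprobe}/{k} clusters, "
      f"{len(candidates)}/{len(data)} vectors compared: {recall:.0%}")
```

Raising `nprobe` raises recall and latency together, which is exactly the tuning knob production systems expose; HNSW makes the same trade through `efSearch` instead of cluster probing.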
Compound AI Systems
AI systems that combine multiple components — LLMs, retrieval systems, code execution, tool use, classifiers, and traditional software logic — to accomplish tasks that no single model can handle alone. Most production AI applications are compound systems, not standalone model calls.
Examples: A RAG pipeline combines a retriever (vector search), a reranker (cross-encoder), and a generator (LLM). A coding agent combines a planner (LLM), code generator (LLM), code executor (sandbox), and test runner (deterministic). A customer service system combines a router (classifier), knowledge retriever (RAG), response generator (LLM), action executor (API calls), and quality evaluator (LLM-as-judge).
The Cambridge Report identifies compound AI systems as introducing "new trade-offs in latency and accuracy" and proposes that "drawing inspiration from traditional database transaction models, new abstractions and query optimization techniques may be needed to manage these AI-driven pipelines effectively." This is the key architectural insight: compound AI pipelines are, at their core, data processing pipelines — they read data, process it through a sequence of operations, and produce outputs. They need the same optimization (choosing the cheapest execution plan), governance (authorizing each operation), and observability (tracing execution for debugging) that database query pipelines receive.
For data infrastructure, compound AI systems create requirements at every layer: multi-modal data access (structured + unstructured + vector), cross-component lineage (tracing from source data through retrieval through generation to output), unified governance (applying consistent policies across retrieval, inference, and action), and end-to-end quality monitoring (evaluating the chain, not individual components).
Context Window
The maximum amount of text (measured in tokens) that a language model can process in a single forward pass — both the input prompt and the generated output must fit within the context window. The context window determines how much information the model can "see" at once.
Context window sizes have expanded dramatically: GPT-3 (2020) had 4,096 tokens (~3,000 words). GPT-4 (2023) offered 8K and 32K variants. Contexts of 128K+ tokens are now standard: Claude 3.5 supports 200K tokens, and Google's Gemini 1.5 Pro handles up to 2 million.
For data infrastructure, context windows create a direct connection between data systems and model behavior. A larger context window means the model can process more retrieved documents in a RAG pipeline, more conversation history in a chat agent, or more data records in an analytical AI task. This has infrastructure implications: retrieving and formatting 100K tokens of context requires fast access to document stores, efficient chunking and ranking, and governance over what data enters the context.
The trade-off: larger context windows increase inference cost (attention computation scales quadratically with context length in standard transformers, though efficient attention variants reduce this) and may decrease accuracy for information positioned in the "middle" of long contexts (the "lost in the middle" phenomenon). Effective RAG architectures often achieve better results with a smaller context window and more precise retrieval than with a massive context window and loose retrieval — making retrieval quality (a data infrastructure problem) as important as model capability.
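The "precise retrieval over loose retrieval" point reduces to a budgeting problem: pack the highest-ranked chunks into a fixed token budget. A minimal sketch, with a word-count stand-in for a real tokenizer and hypothetical relevance scores:

```python
def fit_context(chunks, budget_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily pack the highest-ranked retrieved chunks into the model's
    context budget. count_tokens is a crude word-count stand-in; real
    systems use the model's own tokenizer."""
    selected, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= budget_tokens:
            selected.append(text)
            used += cost
    return selected, used

chunks = [
    (0.91, "Q3 revenue exceeded projections by twelve percent"),
    (0.35, "the office relocated to the fourth floor last spring"),
    (0.78, "gross margin improved on lower cloud spend"),
]
ctx, used = fit_context(chunks, budget_tokens=14)
print(used, ctx)  # the low-relevance chunk is dropped, not the budget raised
```

A tighter budget forces the retrieval layer to rank well, which is why retrieval quality (a data infrastructure problem) carries as much weight as raw context size.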
Cortex AI
Snowflake's integrated suite of AI and ML capabilities accessible via SQL functions within the Snowflake query engine. Cortex AI brings AI operations into SQL: users call AI functions in SELECT statements alongside standard SQL operations.
Task-specific functions: AI_TRANSCRIBE (speech to text), AI_PARSE_DOCUMENT (extract structured data from documents), AI_CLASSIFY (text classification), AI_SUMMARIZE (text summarization), AI_TRANSLATE (language translation). General-purpose functions: COMPLETE (LLM text generation with model selection), EMBED_TEXT (generate vector embeddings). Search capabilities: Cortex Search provides semantic retrieval over document collections.
Cortex AI's architectural significance: it's the most complete attempt to bring AI operations into a SQL-first analytical platform. Users can write queries like SELECT customer_name, AI_PARSE_DOCUMENT(contract_pdf, 'renewal_date') FROM customers WHERE region = 'EMEA' — parsing unstructured documents inline with structured analytics, governed by the same access controls.
The limitations reveal the BI-era architecture underneath. The query optimizer doesn't understand AI function costs — it can't determine whether filtering structured columns first (reducing the document set) or parsing all documents first (enabling semantic filtering) is cheaper. AI functions are treated as opaque scalar functions, not as optimizable operations within the query plan. Cost monitoring (CORTEX_AISQL_USAGE_HISTORY) is separate from warehouse compute credits. And Cortex Search is a separate service object, not integrated into the SQL query planner — you can't seamlessly combine vector search with structured predicates in a single optimized query.
Embeddings / Vector Embeddings
Dense numerical representations of data (text, images, audio, code, structured records) in a high-dimensional vector space, produced by neural networks trained to encode semantic meaning into geometric relationships. The core property: semantically similar inputs produce vectors that are geometrically close in the embedding space.
An embedding model maps input data to a fixed-dimensional vector: "The quarterly revenue exceeded projections" → [0.023, -0.156, 0.891, ...] (1,536 dimensions for OpenAI's text-embedding-3-small). Two texts about financial performance will produce vectors with high cosine similarity (>0.8), even if they use completely different vocabulary. Two texts about unrelated topics will have low similarity (<0.3).
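The geometric property is easy to verify directly. A minimal sketch with toy 4-dimensional vectors standing in for real embeddings (which have hundreds to thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product of the vectors divided by their norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings": two financial-performance texts point in a similar
# direction; an unrelated text points elsewhere.
revenue_a = np.array([0.9, 0.1, 0.8, 0.2])
revenue_b = np.array([0.8, 0.2, 0.9, 0.1])
weather   = np.array([0.1, 0.9, 0.1, 0.8])

print(cosine_similarity(revenue_a, revenue_b))  # high: same topic (~0.99)
print(cosine_similarity(revenue_a, weather))    # low: unrelated (~0.28)
```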
Embedding models vary in dimensionality, quality, and specialization: OpenAI's text-embedding-3-small (1,536 dimensions), Cohere's embed-v3 (1,024 dimensions), open-source models like BGE and E5 (768–1,024 dimensions), and domain-specific models trained for medical, legal, or financial text.
For data infrastructure, embeddings create challenges the BI stack wasn't designed for. Storage: a table of 10 million documents with 1,536-dimensional embeddings requires ~57 GB just for the embedding column. Indexing: vector columns don't have meaningful min/max statistics, so columnar data skipping doesn't apply. Specialized indexes (HNSW, IVF) with fundamentally different properties from B-trees or columnar indexes are required. Quality: embeddings become stale when the generating model is updated — a new model version produces embeddings in a different vector space, making old and new embeddings incomparable. This "embedding model drift" is invisible to schema-based quality monitoring because the column type (array of floats) doesn't change.
Feature Store
An infrastructure component that manages the computation, storage, and serving of ML features — the input variables that models use for predictions. Feature stores solve the training-serving skew problem: features must be computed identically for both batch training (using historical data) and real-time inference (using current data). Divergence between training and serving feature logic is a common and insidious source of model degradation.
A feature store manages four concerns: Feature computation (SQL or Python transformations that derive features from raw data — e.g., "average order value over last 30 days"). Feature storage (a dual-store architecture: an offline store for batch training data, typically in a lakehouse, and an online store for low-latency serving, typically in Redis or DynamoDB). Feature serving (sub-millisecond feature retrieval for real-time model inference — the online store returns a feature vector for a given entity ID within 1–5ms). Feature consistency (the feature computation logic is defined once and used for both training and serving — eliminating the risk of different code paths producing different feature values).
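The consistency concern is the essential one, and the mechanism is simple: a single feature definition feeds both paths. A minimal sketch with hypothetical order data (real feature stores persist the offline values to a lakehouse and materialize the online values to a key-value store):

```python
from datetime import datetime, timedelta

def avg_order_value_30d(orders, as_of: datetime) -> float:
    """One feature definition used by BOTH the offline (training) and
    online (serving) paths, eliminating training-serving skew."""
    window = [o["amount"] for o in orders
              if as_of - timedelta(days=30) <= o["ts"] < as_of]
    return sum(window) / len(window) if window else 0.0

orders = [
    {"ts": datetime(2025, 1, 5),  "amount": 40.0},
    {"ts": datetime(2025, 1, 20), "amount": 60.0},
    {"ts": datetime(2024, 11, 1), "amount": 999.0},  # outside the window below
]

# Offline path: backfill historical feature values for training rows.
training_value = avg_order_value_30d(orders, as_of=datetime(2025, 2, 1))
# Online path: the same logic produces the value served at inference time.
serving_value = avg_order_value_30d(orders, as_of=datetime(2025, 2, 1))
assert training_value == serving_value == 50.0
```

The `as_of` parameter is what makes offline backfills point-in-time correct: each training row is computed as if the query ran at that row's timestamp, never leaking future data.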
Tecton, Feast (open source), and Databricks Feature Store are the primary implementations. Databricks Feature Store integrates with Unity Catalog for governance and MLflow for experiment tracking — the tightest data-platform-to-ML-platform integration available.
Feature stores sit at the boundary between data infrastructure and ML infrastructure — the exact boundary that current systems handle poorly. Features are derived from warehouse data (data infrastructure domain) but consumed by model serving (ML infrastructure domain). Governance, lineage, and quality monitoring must span both domains, which typically run on different platforms with different governance models.
Fine-Tuning
The process of further training a pre-trained model on a domain-specific or task-specific dataset to improve its performance on a particular use case. Fine-tuning adapts a general-purpose model (trained on broad internet data) to a specific domain (legal, medical, financial) or task (classification, extraction, summarization) using a much smaller, curated dataset.
Fine-tuning approaches span a spectrum of cost and effectiveness. Full fine-tuning: update all model parameters using domain-specific training data. Produces the highest-quality specialization but requires significant GPU compute and risks catastrophic forgetting (the model loses general capabilities). LoRA (Low-Rank Adaptation): freeze the original model weights and train small adapter matrices that modify the model's behavior. Dramatically reduces training compute (10–100x less) and memory while achieving 90–95% of full fine-tuning quality. Instruction tuning: train on input-output pairs formatted as instructions, teaching the model to follow specific task formats and output structures.
For data infrastructure, fine-tuning creates data management requirements: curated training datasets must be versioned (to reproduce training runs), quality-controlled (to prevent training on corrupted data), and governed (to prevent training on sensitive data without authorization). The training dataset's lineage — from source data through cleaning, labeling, and formatting — must be tracked end-to-end for auditability. These requirements sit squarely in the data infrastructure domain but are typically managed by ML teams using ad-hoc file management rather than proper data governance.
HNSW (Hierarchical Navigable Small World)
The dominant index structure for approximate nearest neighbor search in production vector databases. HNSW builds a multi-layer navigable graph over the vector space, enabling efficient search that starts with coarse, long-range navigation and progressively refines to local, fine-grained search.
The structure: vectors are organized into multiple layers. The top layer contains very few vectors with long-range connections (each vector is connected to distant neighbors). Lower layers contain progressively more vectors with shorter-range connections. The bottom layer contains all vectors with connections to their nearest neighbors. Search starts at the top layer, finds the approximately closest vector using long-range jumps, descends to the next layer, refines the search with medium-range connections, and continues descending until reaching the bottom layer where fine-grained nearest neighbors are identified.
HNSW's key parameters: M (maximum connections per node per layer — higher M improves recall but increases memory and build time), efConstruction (search effort during index building — higher values produce better graph quality but slower construction), and efSearch (search effort during queries — higher values improve recall but increase latency). Typical production settings achieve 95–99% recall with sub-millisecond query latency on millions of vectors.
The trade-off: HNSW indexes are memory-intensive. Each vector requires storage for the vector itself plus graph connectivity data. A 1-million-vector index with 1,536-dimensional vectors and M=16 requires approximately 7–8 GB of memory. For billion-vector collections, this pushes into terabytes of memory — driving either distributed index architectures or compressed variants (HNSW with Product Quantization).
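The memory arithmetic can be sketched as a back-of-envelope estimator. The link-count assumptions (2M links at layer 0, roughly M more across upper layers) are typical of HNSW implementations but vary in practice, so this is a lower bound; real indexes add per-node overhead, which is consistent with the 7–8 GB figure above.

```python
def hnsw_memory_gb(n_vectors: int, dim: int, M: int = 16,
                   bytes_per_float: int = 4, bytes_per_link: int = 4) -> float:
    """Lower-bound HNSW memory estimate: raw vectors plus graph links.

    Layer 0 typically holds 2*M links per node; upper layers add roughly
    another M links per node on average (assumption for this sketch).
    """
    vector_bytes = n_vectors * dim * bytes_per_float
    link_bytes = n_vectors * (2 * M + M) * bytes_per_link
    return (vector_bytes + link_bytes) / 1e9

# 1M vectors, 1,536 dims, M=16: the vectors alone dominate at ~6.1 GB.
print(f"{hnsw_memory_gb(1_000_000, 1536, M=16):.1f} GB")  # → 6.3 GB
```

Scaling `n_vectors` to 1 billion puts the same index in the multi-terabyte range, which is what drives the distributed and PQ-compressed variants mentioned above.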
Inference
The process of running a trained model to produce predictions, classifications, embeddings, or generated text from new input data. Inference is the runtime phase of ML — the point where the model is deployed and producing value (as opposed to training, where the model is learning from data).
Inference has fundamentally different computational characteristics from training. Training is throughput-optimized: process as much data as possible per second, batching inputs heavily, running for hours to days. Inference is latency-optimized: respond to each request as quickly as possible, often with strict latency SLAs (sub-100ms for real-time applications). Training is done periodically; inference runs continuously.
For LLMs, inference has two phases. Prefill: process the entire input prompt through the model's layers, computing attention across all input tokens. This phase is compute-bound and parallelizable. Decode: generate output tokens one at a time, each requiring a full model forward pass with attention over all previous tokens (via the KV-cache). This phase is memory-bandwidth-bound and inherently sequential.
Inference optimization techniques: Batching (processing multiple requests through the model simultaneously — amortizing fixed costs across requests). KV-cache (caching key-value pairs from the attention layers to avoid recomputation during decode). Quantization (reducing model precision from 32-bit to 16-bit, 8-bit, or 4-bit — reducing memory and increasing throughput at slight accuracy cost). Speculative decoding (using a smaller, faster model to draft candidate tokens that the larger model verifies in parallel).
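The KV-cache's memory cost explains why decode is memory-bandwidth-bound. A back-of-envelope estimator, using an assumed 7B-class model shape (32 layers, 32 KV heads, head dimension 128, fp16) that is illustrative rather than any specific model's exact configuration:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> float:
    """KV-cache size: 2 tensors (keys + values) per layer, per KV head,
    per token, per sequence in the batch."""
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return elems * bytes_per_elem / 1e9

# Assumed 7B-class shape at a 4K context, fp16 (2 bytes per element):
print(f"{kv_cache_gb(32, 32, 128, seq_len=4096):.2f} GB per sequence")  # → 2.15 GB per sequence
```

At batch 16 that single cache exceeds 34 GB, which is why techniques like grouped-query attention (fewer KV heads) and vLLM's PagedAttention exist: the cache, not the weights, often caps serving throughput.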
For data infrastructure, inference creates a critical boundary: data flows from governed storage through feature computation into the inference engine, where it exits the data platform's governance perimeter. The predictions flowing back may contain derived PII, hallucinated content, or inappropriate outputs — quality issues that originate in the inference boundary but must be governed by the data infrastructure.
LLM (Large Language Model)
A neural network with billions of parameters trained on massive text corpora to understand and generate human language. LLMs use the transformer architecture and are trained using self-supervised learning (predicting the next token in a sequence) on datasets comprising significant fractions of publicly available internet text.
LLMs exhibit emergent capabilities that increase non-linearly with scale: reasoning, instruction following, few-shot learning, code generation, and multi-step problem solving. GPT-4, Claude, Gemini, Llama, and Mistral represent the current frontier, with parameter counts ranging from 7 billion (efficient models suitable for deployment on consumer hardware) to estimated hundreds of billions (frontier models requiring dedicated GPU clusters).
For data infrastructure, LLMs are significant in three roles. As consumers: LLMs power agents that read, reason about, and act on data — creating the agentic workload that drives the AI infrastructure thesis. As tools: LLM functions within SQL (Snowflake Cortex AI, Databricks ai_query) transform unstructured data into structured insights — parsing documents, classifying text, extracting entities. As infrastructure components: LLMs serve as evaluators (LLM-as-judge for quality assessment), classifiers (routing queries to appropriate systems), and optimizers (the Cambridge Report discusses LLMs as potential query optimizer components).
The key infrastructure challenge: LLMs operate at the boundary between data systems (which manage the data LLMs consume and produce) and compute systems (which run the inference). Governance, quality, and lineage must span this boundary — but current platforms govern data and inference separately.
LLM-as-Judge
An evaluation methodology where a language model is used to assess the quality of another AI system's outputs — scoring dimensions like relevance, faithfulness, helpfulness, harmlessness, and correctness. LLM-as-judge enables scalable evaluation of generative AI outputs that are too nuanced for rule-based metrics but too numerous for human review.
The approach: define evaluation criteria (e.g., "Rate the faithfulness of this answer to the provided context on a 1–5 scale"), construct a prompt with the criteria, the input, the output, and (optionally) a reference answer, and have the judge LLM produce a score with reasoning.
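A minimal sketch of that approach, generic rather than tied to any framework; the template wording and the `SCORE:` output convention are illustrative choices:

```python
import re

JUDGE_TEMPLATE = """You are an impartial evaluator.
Criterion: rate the faithfulness of the ANSWER to the CONTEXT on a 1-5 scale.

CONTEXT: {context}
QUESTION: {question}
ANSWER: {answer}

Respond as:
SCORE: <1-5>
REASONING: <one sentence>"""

def build_judge_prompt(context: str, question: str, answer: str) -> str:
    return JUDGE_TEMPLATE.format(context=context, question=question, answer=answer)

def parse_judge_response(response: str) -> int:
    # Extract the numeric score from the judge model's structured reply.
    match = re.search(r"SCORE:\s*([1-5])", response)
    if match is None:
        raise ValueError("judge response missing SCORE field")
    return int(match.group(1))

prompt = build_judge_prompt("Q3 revenue grew 12%.", "How did revenue change?",
                            "Revenue grew 12% in Q3.")
# The prompt would be sent to a judge LLM; parsing a sample reply:
print(parse_judge_response("SCORE: 5\nREASONING: fully supported by context."))
```

Forcing a structured reply format (and rejecting replies that lack it) is what makes judge outputs aggregatable at scale; asking for reasoning alongside the score also tends to improve judge reliability.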
MLflow's genai.evaluate() framework implements LLM-as-judge evaluation with composable scorers: functions that take an AI system's output and return a quality assessment. Built-in scorers include faithfulness (does the output match the provided context?), relevance (does the output address the question?), and safety (does the output contain harmful content?). Custom scorers can evaluate domain-specific quality dimensions.
Limitations: LLM judges have known biases — preference for longer outputs, sensitivity to output ordering, and systematic disagreement with human preferences on specific quality dimensions. ALHF addresses these biases by calibrating LLM judges against human evaluations. Without calibration, LLM-as-judge can give organizations false confidence in AI quality — the evaluation says the system is performing well while users experience poor quality.
MLflow
The most widely adopted open-source platform for ML lifecycle management, originally created by Databricks. MLflow provides the instrumentation layer for tracking, managing, and evaluating ML systems from experimentation through production.
Core components: Tracking (logging parameters, metrics, artifacts, and code versions for each experiment run — enabling comparison across hundreds of experimental configurations). Model Registry (versioning models with stage transitions: None → Staging → Production → Archived, with approval workflows). Model Serving (deploying models as REST API endpoints with automatic scaling). Projects (packaging ML code for reproducible execution across environments).
MLflow 3.0's GenAI additions are architecturally significant for AI infrastructure: mlflow.genai.evaluate() provides composable scorers for LLM output quality, enabling automated evaluation of RAG pipelines, agents, and generative applications. mlflow.log_feedback() enables typed human assessments (thumbs up/down, Likert scales, free-text corrections) attached to specific traces or outputs. MLflow Tracing captures OpenTelemetry-compatible hierarchical span trees with typed span_type values (AGENT, LLM, TOOL, RETRIEVER, EMBEDDING, RERANKER, PARSER, CHAT_MODEL). Tracing auto-integrates with 20+ frameworks via mlflow.<library>.autolog(), making it the closest production implementation of semantic observability.
For the Autonomous DataOps thesis, MLflow is the strongest available building block for the AI control plane's evaluation and observability components — but it operates at the model/agent layer, not the data infrastructure layer. Connecting MLflow's model-level observations (this model's accuracy degraded) to data-level root causes (because a feature distribution shifted, because a source schema changed) requires the cross-system reasoning that the metadata knowledge graph provides.
MLOps
The set of practices for deploying and maintaining ML models in production reliably and efficiently. MLOps adapts DevOps and software engineering principles — CI/CD, monitoring, versioning, infrastructure as code, automated testing — to the ML lifecycle, which has unique challenges: models degrade over time (concept drift), training data changes (data drift), reproducibility requires capturing data + code + environment + hyperparameters, and model behavior is probabilistic rather than deterministic.
The MLOps lifecycle: Experimentation (data scientists iterate on features, models, and hyperparameters — tracked in experiment management tools). Training pipeline (automated, reproducible model training that runs on schedule or trigger — orchestrated by Kubeflow, Vertex AI Pipelines, or SageMaker Pipelines). Validation (automated testing of model quality against holdout sets, golden sets, and business metrics — gating promotion to production). Deployment (model serving with canary releases, A/B testing, and rollback capability). Monitoring (continuous tracking of prediction quality, data drift, model latency, and business impact — with alerting on degradation).
The MLOps ecosystem is fragmented: experiment trackers (MLflow, Weights & Biases, Neptune), feature stores (Tecton, Feast), model registries (MLflow, SageMaker), model monitors (Evidently, Arize, WhyLabs, Fiddler), and pipeline orchestrators (Kubeflow, Vertex AI, SageMaker). This fragmentation creates the same integration and shared-ontology challenges that plague the broader data stack.
Model Serving
The infrastructure for deploying trained ML models as production services that accept requests and return predictions in real-time. Model serving handles: loading model artifacts into memory (or GPU memory), managing compute resources (CPU for traditional ML, GPU for deep learning and LLMs), auto-scaling to match traffic patterns, batching requests for throughput efficiency, and delivering responses within latency SLAs.
Platform-native serving: Databricks Mosaic AI Model Serving provides serverless endpoints for models registered in MLflow, with auto-scaling, GPU support, and integration with the Mosaic AI Gateway for governance. Snowflake Cortex AI serves AI capabilities as SQL functions — the model is abstracted behind a function call, and serving infrastructure is fully managed. Amazon SageMaker Endpoints, Google Vertex AI Endpoints, and Azure ML Managed Endpoints provide cloud-native serving with platform-specific integrations.
For LLM serving specifically, specialized engines like vLLM (PagedAttention for efficient KV-cache memory management), TGI (Hugging Face's Text Generation Inference), and TensorRT-LLM (NVIDIA's optimized inference engine) achieve 2–10x throughput improvements over generic serving frameworks through attention optimization, continuous batching, and model parallelism.
Model serving creates a governance boundary: data flows from the governed data platform through feature computation into the serving layer, where it exits the data platform's governance perimeter. A model endpoint that receives governed features and returns predictions operates under different governance rules than the warehouse that computed those features. Unifying governance across this boundary — so that the same policies, audit trails, and quality monitoring apply to data access, feature computation, inference, and action execution — is a core architectural requirement for the AI control plane.
Mosaic AI Gateway
Databricks' runtime policy enforcement layer for model serving endpoints. The Gateway sits between clients (applications, agents) and model endpoints, inspecting and filtering traffic based on configurable policies.
Capabilities: PII filtering (via Microsoft Presidio — detecting and redacting personally identifiable information in model inputs and outputs). Safety guardrails (via Llama Guard or custom classifiers — blocking harmful content). Rate limiting (per user, per application, or per endpoint — preventing resource exhaustion). Payload logging (recording request/response pairs for compliance, debugging, and evaluation). Cost tracking (monitoring token usage and associated costs per endpoint).
The Gateway's architectural significance and limitation define the current state of AI governance in production. It operates at the endpoint level — inspecting text payloads flowing to and from models. This is content inspection, not action authorization. The Gateway can detect PII in a model's response text, but it can't evaluate whether the agent's intended use of that information is authorized. Detecting that a response contains a customer's address (content filtering) is fundamentally different from evaluating whether the agent should be issuing a refund to that customer (action authorization).
This distinction — between payload-level content filtering and semantic action-level governance — defines the gap between current AI guardrails and the AI control plane that production agents need.
Multi-Agent Systems
AI architectures where multiple specialized agents collaborate to accomplish complex tasks that exceed the capability of any single agent. Each agent has a defined role, access to specific tools, and a bounded capability scope. A coordinator (often called an orchestrator or router agent) decomposes tasks and delegates sub-tasks to specialists.
Architecture patterns: Sequential pipeline (Agent A produces output that Agent B consumes — simple, linear, but inflexible). Hierarchical delegation (a manager agent decomposes tasks and delegates to worker agents, which may further delegate — mirrors organizational hierarchy). Debate/consensus (multiple agents independently assess a situation and negotiate a consensus — improves accuracy for complex judgments). Reactive mesh (agents communicate through shared state or message passing, reacting to events and each other's actions — most flexible but hardest to govern).
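The hierarchical delegation pattern can be made concrete with a minimal sketch — a manager that decomposes a goal into capability-tagged sub-tasks and routes each to the matching specialist. The class names, the toy decomposition, and the capability-based routing are all illustrative, not any specific framework's API:

```python
class Specialist:
    """A worker agent with a single declared capability."""
    def __init__(self, name, capability, handler):
        self.name = name
        self.capability = capability
        self.handler = handler

    def run(self, task):
        return self.handler(task)

class Manager:
    """Decomposes a goal and delegates each sub-task by required capability."""
    def __init__(self, specialists):
        self.by_capability = {s.capability: s for s in specialists}

    def decompose(self, goal):
        # Toy decomposition: each step names the capability it requires.
        return [("retrieve", f"history for {goal}"),
                ("analyze", f"sentiment for {goal}")]

    def run(self, goal):
        results = []
        for capability, task in self.decompose(goal):
            specialist = self.by_capability[capability]
            results.append((specialist.name, specialist.run(task)))
        return results

agents = [
    Specialist("crm-reader", "retrieve", lambda t: f"retrieved: {t}"),
    Specialist("sentiment", "analyze", lambda t: f"analyzed: {t}"),
]
plan = Manager(agents).run("complaint #42")
```

Even at this toy scale the governance point is visible: the manager's effective access is the union of every specialist it can delegate to.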
Multi-agent systems grew 327% over a four-month period, per Databricks data, making them the fastest-growing AI deployment pattern. For data infrastructure, multi-agent systems compound every governance challenge: each agent may need different data access privileges; the composition of individually safe agent actions can produce unsafe outcomes (the confused deputy problem at multi-agent scale); tracing a decision through a chain of agent interactions requires semantic observability across the entire chain; and the total data access footprint of a multi-agent system (the union of all agents' data needs) may be far larger than any individual agent's access — creating a scope-of-access governance problem.
RAG (Retrieval-Augmented Generation)
An AI architecture pattern that combines information retrieval with LLM text generation. Instead of relying solely on the LLM's training data (which is static and potentially outdated), RAG first retrieves relevant documents from a knowledge base, then generates a response grounded in the retrieved context.
The RAG pipeline: Query processing (transform the user's question into a retrieval query — possibly rewriting, expanding, or decomposing it). Retrieval (search the knowledge base using vector similarity, keyword matching, or hybrid approaches to find relevant documents). Reranking (score and reorder retrieved documents by relevance using a cross-encoder model — more accurate than embedding similarity alone but computationally expensive). Generation (pass the retrieved documents as context to the LLM along with the original question, generating a response grounded in the provided documents).
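The retrieval and generation stages above can be sketched end to end with toy components — a bag-of-words "embedding," brute-force cosine-similarity retrieval, and a template that grounds the prompt in the retrieved documents. A production pipeline would use learned embeddings, an ANN index, and a cross-encoder reranker; everything here is a hypothetical stand-in:

```python
import math

def embed(text, vocab):
    """Toy embedding: word counts over a fixed vocabulary."""
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, vocab, k=2):
    """Rank all documents by similarity to the query; return the top k."""
    qv = embed(query, vocab)
    ranked = sorted(docs, key=lambda d: cosine(qv, embed(d, vocab)), reverse=True)
    return ranked[:k]

def build_prompt(query, contexts):
    """Ground the generation step in the retrieved documents."""
    context = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{context}\nQuestion: {query}"

vocab = ["refund", "policy", "shipping", "delay", "invoice"]
docs = [
    "refund policy allows returns within 30 days",
    "shipping delay notices are sent by email",
    "invoice templates for enterprise accounts",
]
top = retrieve("what is the refund policy", docs, vocab)
prompt = build_prompt("what is the refund policy", top)
```

The `build_prompt` step is where grounding happens: the LLM is instructed to answer from the retrieved context rather than from its parametric memory.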
RAG is the dominant architecture for enterprise AI because it addresses three fundamental limitations of standalone LLMs: Hallucination (grounding the LLM in actual documents reduces fabrication). Proprietary knowledge (retrieval accesses organizational data not in the LLM's training set). Currency (the knowledge base can be updated continuously, while the LLM's training data is static).
For data infrastructure, RAG creates requirements at every layer: Vector storage for document embeddings (requiring specialized indexes). Document management for source documents (parsing, chunking, versioning). Governance spanning from source document access controls through embedding generation to retrieval results and generated output — a single RAG query may access documents with different sensitivity levels, and the generated response must respect the most restrictive governance requirement of any retrieved document. Quality monitoring across the full pipeline: retrieval precision (did we find the right documents?), retrieval recall (did we miss important documents?), generation faithfulness (does the answer match the retrieved context?), and generation relevance (does the answer address the question?).
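The two retrieval-quality metrics named above have standard definitions — precision is the fraction of retrieved documents that are relevant, recall the fraction of relevant documents that were retrieved. A minimal sketch, with illustrative document IDs:

```python
def retrieval_precision(retrieved, relevant):
    """Of what we retrieved, how much was relevant?"""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def retrieval_recall(retrieved, relevant):
    """Of what was relevant, how much did we retrieve?"""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)

retrieved = ["doc1", "doc2", "doc3"]   # what the retriever returned
relevant = ["doc1", "doc4"]            # ground-truth relevant set
p = retrieval_precision(retrieved, relevant)  # 1 of 3 retrieved is relevant
r = retrieval_recall(retrieved, relevant)     # 1 of 2 relevant was found
```

Faithfulness and relevance, by contrast, have no closed-form computation — in practice they are scored by human raters or LLM judges.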
Semantic Observability
A monitoring approach that captures not just what an AI system did, but why it did it — the reasoning chain from data access through inference to action, with enough structured context to debug, audit, and improve the system's behavior.
Traditional observability records events: "Agent X queried table Y at time T." Semantic observability records meaning: "Agent X queried table Y to retrieve customer 12345's interaction history because it was evaluating a refund request for order #67890. It retrieved 15 interactions, identified 3 relevant complaints about delayed shipping, determined the refund was justified based on the company's 30-day satisfaction policy, and proposed a $250 refund."
MLflow Tracing is the closest production implementation. It captures hierarchical span trees with typed spans (AGENT, LLM, TOOL, RETRIEVER, EMBEDDING, RERANKER, PARSER) and allows attaching typed assessments (human feedback, LLM-judge scores) directly to traces. Its OpenTelemetry foundation enables integration with existing observability infrastructure (Jaeger, Zipkin, Datadog).
The gap between semantic observability and traditional logging mirrors the gap between debugging AI systems and debugging traditional software. When a SQL query returns wrong results, you examine the query plan. When an AI agent takes a wrong action, you need to examine the reasoning chain — which data was retrieved, how it was interpreted, what alternatives were considered, and what policy evaluation led to the action. This level of introspection requires structured, typed, hierarchical tracing — not flat log lines.
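A structured, hierarchical trace of the kind described above can be sketched as a typed span tree. The span-type names echo MLflow Tracing's vocabulary, but the data structure itself — `intent` and `outcome` fields in particular — is an illustrative design, not any tool's schema:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step in an agent's reasoning chain, with the why and the result."""
    span_type: str   # e.g. AGENT, RETRIEVER, LLM, TOOL
    name: str
    intent: str      # why this step ran
    outcome: str     # what it concluded
    children: list = field(default_factory=list)

    def add(self, child):
        self.children.append(child)
        return child

# Build the refund-evaluation trace from the example above.
root = Span("AGENT", "refund-agent",
            "evaluate refund request for order #67890", "")
root.add(Span("RETRIEVER", "crm.history",
              "retrieve customer 12345's interaction history",
              "15 interactions, 3 relevant complaints about delayed shipping"))
root.outcome = "refund justified under 30-day satisfaction policy; proposed $250"

def depth(span):
    """Depth of the span tree — flat logs would always report 1."""
    return 1 + max((depth(c) for c in span.children), default=0)
```

The point of the structure is queryability: an auditor can ask "which retrievals fed this decision?" by walking the tree, which flat log lines cannot answer.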
Tool Use (Function Calling)
The capability of AI models — particularly LLMs — to invoke external tools, APIs, and functions as part of their reasoning process. Instead of producing only text output, a model equipped with tool use can: query a database, call a REST API, execute code, search the web, perform calculations, or invoke other models — extending its capabilities beyond what's encoded in its parameters.
The mechanism: the model receives a description of available tools (name, description, parameter schema) alongside the user's request. When the model determines that a tool would help, it generates a structured function call (tool name + arguments in JSON). The calling system executes the function, returns the result to the model, and the model incorporates the result into its reasoning and response. This loop can repeat — a single interaction may involve multiple sequential tool calls.
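The loop described above can be sketched with a stub in place of the model. Here `fake_model` is a hypothetical stand-in that always requests one tool call and then produces a final answer — in a real system that decision comes from an LLM's function-calling API, and the structured call would be validated against the tool's parameter schema:

```python
import json

TOOLS = {}

def tool(fn):
    """Register a function as a callable tool."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def lookup_order(order_id):
    # Hypothetical data source standing in for a warehouse query.
    return {"order_id": order_id, "status": "delayed"}

def fake_model(messages):
    """Stub model: request a tool call, then summarize the tool's result."""
    if messages[-1]["role"] == "user":
        return {"tool": "lookup_order", "arguments": {"order_id": "67890"}}
    result = json.loads(messages[-1]["content"])
    return {"final": f"Order status: {result['status']}"}

def run(user_message):
    """The tool-use loop: model proposes, runtime executes, result feeds back."""
    messages = [{"role": "user", "content": user_message}]
    while True:
        step = fake_model(messages)
        if "final" in step:
            return step["final"]
        result = TOOLS[step["tool"]](**step["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
```

Note that the runtime, not the model, executes the call — which is exactly where the governance hook described below attaches.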
Tool use is the fundamental mechanism through which AI agents interact with data infrastructure. An agent "reading customer data" is invoking a tool that queries the warehouse. An agent "updating a CRM record" is invoking a tool that writes to an operational system. An agent "checking data quality" is invoking a tool that runs a validation function. Every interaction between an agent and data infrastructure is, mechanically, a tool call.
For data infrastructure governance, tool use creates the action surface that must be governed. Each tool call is a potential data access, modification, or external interaction that needs authorization. The propose → authorize → execute pattern applies at the tool-call level: the agent proposes a function call, the governance layer evaluates whether that specific call with those specific parameters is authorized in the current context, and the system executes (or denies) the call.
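The propose → authorize → execute pattern can be sketched as a thin wrapper around tool execution. The policy rules here (a refund limit, an allow-list) are illustrative placeholders — real policy evaluation would consult the governance layer with full context:

```python
def authorize(call, context):
    """Evaluate this specific call, with these specific parameters, in context."""
    if call["tool"] == "issue_refund":
        # Parameter-level policy: the amount matters, not just the tool name.
        return call["arguments"]["amount"] <= context["refund_limit"]
    return call["tool"] in context["allowed_tools"]

def execute(call, tools, context):
    """Execute only if authorized; otherwise deny with an auditable error."""
    if not authorize(call, context):
        raise PermissionError(f"denied: {call['tool']}({call['arguments']})")
    return tools[call["tool"]](**call["arguments"])

tools = {"issue_refund": lambda amount: f"refunded ${amount}"}
context = {"refund_limit": 100, "allowed_tools": set()}

ok = execute({"tool": "issue_refund", "arguments": {"amount": 50}}, tools, context)
# The same tool with amount=500 would be denied under this policy.
```

The key property is that authorization sees the concrete arguments, not just the tool name — "issue a refund" and "issue a $50,000 refund" are different decisions.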
Transformer Architecture
The neural network architecture — introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google — that underlies virtually all modern large language models and increasingly dominates computer vision and multi-modal AI. Transformers replaced recurrent neural networks (RNNs) as the dominant sequence modeling architecture by enabling parallel processing of input sequences.
The key innovation is the self-attention mechanism: for each position in the input sequence, the model computes attention weights over all other positions, determining how much each position should influence the representation of the current position. This enables the model to capture long-range dependencies (relationships between words far apart in the text) that RNNs struggled with.
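Scaled dot-product self-attention for a single head can be written in a few lines of plain Python, making the mechanism concrete: each position's output is a weighted average of all positions, with weights from a softmax over dot products. This toy version uses the inputs directly as queries, keys, and values; real transformers apply learned Q/K/V projections and run many heads in parallel:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(x):
    """x: list of d-dimensional vectors, one per sequence position."""
    d = len(x[0])
    out = []
    for q in x:  # every position attends over every position: O(n^2)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x))
                    for j in range(d)])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(seq)  # one output vector per input position
```

The nested loop over all position pairs is the quadratic cost discussed below — it is visible directly in the code.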
The computational profile of transformers drives infrastructure requirements: Matrix multiplications dominate computation, benefiting from GPU/TPU parallelism. Attention computation scales quadratically with sequence length in standard implementations (O(n²) for sequence length n), though efficient attention variants (FlashAttention, sparse attention, linear attention) reduce this. KV-cache during inference requires memory proportional to sequence length × number of layers × model dimension — for large models with long contexts, this can consume tens of gigabytes per concurrent request.
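The KV-cache formula above supports a quick back-of-envelope calculation. The model shape used here (80 layers, 8192 hidden dimension) is a hypothetical 70B-class configuration, and the sketch ignores grouped-query attention, which real models use to shrink the cache considerably:

```python
def kv_cache_bytes(seq_len, n_layers, hidden_dim, bytes_per_value=2):
    """Naive KV-cache size: keys and values (2x), per layer, per position."""
    return 2 * n_layers * seq_len * hidden_dim * bytes_per_value

# 32K-token context, 80 layers, 8192 hidden dim, fp16 (2 bytes/value):
gb = kv_cache_bytes(32_768, 80, 8192, 2) / 2**30  # -> 80.0 GiB
```

This is per concurrent request — which is why serving engines like vLLM invest so heavily in paged, shared KV-cache memory management.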
For data infrastructure, the transformer architecture matters because it defines the computational profile that AI workloads present: GPU-intensive, memory-bandwidth-bound during inference, with costs proportional to token count. Understanding this profile is essential for infrastructure planning — serving an agent that processes 100K tokens of context per request has fundamentally different infrastructure requirements than serving a BI dashboard that executes SQL queries.
Vector Database / Vector Index
A storage system optimized for storing, indexing, and searching high-dimensional vector embeddings using approximate nearest neighbor algorithms. Vector databases are the storage infrastructure that makes semantic search, recommendation, and RAG pipelines operationally feasible at scale.
Dedicated vector databases: Pinecone (fully managed, serverless), Weaviate (open-source, supports hybrid search combining vector and keyword), Qdrant (open-source, Rust-based, emphasizing performance), Milvus (open-source, distributed, targeting billion-scale collections), Chroma (open-source, designed for simplicity in development). These provide specialized vector operations: filtered search (combine vector similarity with metadata filters), multi-vector search, vector updates, and specialized indexing.
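Filtered search — the combination of vector similarity with a metadata predicate mentioned above — can be sketched with brute force. Production systems use approximate nearest neighbor indexes (HNSW, IVF) instead of scanning, and must decide whether to filter before or after the index lookup; everything here is a toy illustration:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def filtered_search(query_vec, items, predicate, k=1):
    """items: (id, vector, metadata) triples. Filter by metadata, rank by similarity."""
    candidates = [(i, v) for i, v, m in items if predicate(m)]
    ranked = sorted(candidates,
                    key=lambda iv: cosine(query_vec, iv[1]), reverse=True)
    return [i for i, _ in ranked[:k]]

items = [
    ("doc-a", [1.0, 0.0], {"team": "support"}),
    ("doc-b", [0.9, 0.1], {"team": "finance"}),  # similar, but filtered out
    ("doc-c", [0.0, 1.0], {"team": "support"}),
]
hits = filtered_search([1.0, 0.0], items, lambda m: m["team"] == "support", k=1)
```

Note that `doc-b` is the second-most-similar vector but is excluded by the metadata filter — the same mechanism a governed system would use to enforce access controls on retrieval.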
Platform-integrated vector search: Snowflake Cortex Search (semantic search over document collections with SQL integration), Databricks Vector Search Index (auto-syncing vector index on Delta tables with Unity Catalog governance), PostgreSQL pgvector (extension adding vector operations to the world's most popular open-source database).
The architectural tension: should vector search be a separate system (dedicated vector DB) or integrated into the primary data platform? Separate systems offer specialized performance, richer vector operations, and purpose-built scaling. Integrated systems offer unified governance (vector and structured data under the same access controls), simplified operations (one platform to manage), and cross-modal queries (JOIN vector search results with structured tables in a single SQL statement).
For AI agents that need to combine vector similarity search with structured data filtering, governance enforcement, and lineage tracking — all within a single governed operation — integration is architecturally superior. But current platform integrations are shallow: Snowflake's Cortex Search and Databricks' Vector Search are separate service objects, not deeply integrated into the SQL query planner or governance model. The "Universal Data Surface" proposed in the AI-native infrastructure architecture would make vector search a native query capability with full optimizer, governance, and lineage integration.
Part of The Sciencer Wiki. Maintained by The Sciencer Company.