Architecture Overview

This page gives a bird’s-eye view of ProvSQL’s internals: how the extension is loaded, how its components are organized, and how data flows from an SQL query to a provenance evaluation result. For a detailed walkthrough of the query rewriting pipeline, see Query Rewriting Pipeline.

Extension Lifecycle

ProvSQL is a PostgreSQL shared-library extension. Because it installs a planner hook, the library must be loaded at server start via the shared_preload_libraries configuration variable; it cannot be loaded on demand.

When PostgreSQL starts, it calls _PG_init, which:

  1. Registers six GUC (Grand Unified Configuration) variables:

    • provsql.active – enable/disable provenance tracking (default: on).

    • provsql.where_provenance – enable where-provenance (default: off).

    • provsql.update_provenance – track provenance through DML statements (default: off).

    • provsql.verbose_level – verbosity for debug messages (0–100, default: 0).

    • provsql.aggtoken_text_as_uuid – when on, agg_token_out emits the underlying provenance UUID instead of the default "value (*)" display string (default: off).

    • provsql.tool_search_path – colon-separated directories prepended to PATH when invoking external tools (d4, c2d, minic2d, dsharp, weightmc, graph-easy); see external_tool.cpp.

  2. Installs the planner hook (provsql_planner) by saving the previous hook in prev_planner and replacing planner_hook.

  3. Installs the ProcessUtility hook (provsql_ProcessUtility) by saving the previous hook in prev_ProcessUtility. Used by the CTAS / SELECT INTO / matview lineage propagation path (see TID / BID Propagation Through Derived Relations).

  4. Installs shared-memory hooks for inter-process coordination (see Memory Management).

  5. Installs executor hooks (ExecutorStart / ExecutorEnd) that maintain a nesting depth counter so the provsql_classify_query NOTICE fires only on the user’s outermost statement (the rewriter’s PL/pgSQL helpers replan internally, which would otherwise produce spurious extra NOTICEs).

  6. Launches the mmap background worker that manages persistent circuit storage.

When the server shuts down, _PG_fini restores the previous planner, ProcessUtility, executor, and shared-memory hooks.

Component Map

ProvSQL is a mixed C/C++ codebase. The PostgreSQL interface layer is written in C (required by the extension API); complex data structures and algorithms are in C++.

C files (PostgreSQL interface layer):

Planner hook and query rewriting

  • provsql.c – planner hook, ProcessUtility hook (CTAS / SELECT INTO / matview lineage), executor hooks, and the bulk of the query rewriting logic (~3700 lines).

  • safe_query.c / safe_query.h – safe-query rewriter for hierarchical CQs (Dalvi & Suciu 2012), gated on provsql.boolean_provenance ; includes the FD-aware extensions (constant-selection elimination, PK FDs, deterministic-relation transparency, PK-unifiable / disjoint-constant self-joins) and the propagation normalisation pre-passes (INNER / CROSS JoinExpr flattening, simple-subquery inlining). See Safe-Query Rewriter.

  • classify_query.c / classify_query.h – query-time TID / BID / OPAQUE classifier exposed through the provsql.classify_top_level GUC.

Utilities and shared state

Memory-mapped circuit store

SQL-callable functions

PostgreSQL version compatibility

C++ files (data structures and algorithms):

Circuit representation

Persistent storage and in-memory reconstruction

Semiring evaluation

Aggregation

Probability, Shapley, knowledge compilation

Export and visualization

C++ utilities

Data Flow

The end-to-end flow of a query through ProvSQL:

digraph dataflow {
  rankdir=LR;
  node [shape=box, fontname="sans-serif", fontsize=11];
  edge [fontsize=9, fontname="sans-serif"];

  sql [label="SQL query", shape=ellipse];
  planner [label="provsql_planner"];
  rewrite [label="process_query\n(rewriting)"];
  exec [label="PostgreSQL\nexecutor"];
  circuit [label="Circuit\n(mmap storage)"];
  eval [label="Semiring\nevaluation"];
  result [label="Query result\n+ provenance", shape=ellipse];

  sql -> planner [label="Query tree"];
  planner -> rewrite [label="has provenance?"];
  rewrite -> exec [label="rewritten query"];
  exec -> circuit [label="UUID tokens\n(gate creation)"];
  exec -> result [label="tuples + UUIDs"];
  circuit -> eval [label="circuit DAG"];
  eval -> result [label="semiring values\nor probabilities"];
}
  1. The user submits an SQL query. PostgreSQL parses it into a Query tree and calls the planner.

  2. provsql_planner intercepts the call. If the query touches provenance-tracked tables (detected by has_provenance), it calls process_query to rewrite it.

  3. The rewritten query carries an extra UUID expression in its target list. When the executor evaluates the query, it calls ProvSQL’s SQL-level functions (provenance_times, provenance_plus, etc.) to construct circuit gates. These calls route through the mmap worker to persist the circuit.

  4. Each result tuple comes back with a UUID identifying the root gate of its provenance sub-circuit.

  5. To evaluate provenance, the user calls functions like provenance_evaluate, probability_evaluate, or a compiled semiring evaluator (e.g., sr_boolean). These retrieve the circuit from mmap, build an in-memory GenericCircuit or BooleanCircuit, and traverse the DAG applying semiring operations.

The OID Cache: constants_t

PostgreSQL identifies types, functions, and operators by their Object Identifiers (OIDs). ProvSQL needs to reference its own types and functions when constructing rewritten query trees, so it caches their OIDs in a constants_t structure.

Key fields:

  • Type OIDs: OID_TYPE_UUID, OID_TYPE_AGG_TOKEN, OID_TYPE_GATE_TYPE, and standard types (BOOL, INT, FLOAT, VARCHAR).

  • Function OIDs: OID_FUNCTION_PROVENANCE_PLUS, OID_FUNCTION_PROVENANCE_TIMES, OID_FUNCTION_PROVENANCE_MONUS, OID_FUNCTION_PROVENANCE_DELTA, OID_FUNCTION_PROVENANCE_AGGREGATE, OID_FUNCTION_PROVENANCE_SEMIMOD, etc.

  • Gate-type mapping: GATE_TYPE_TO_OID[nb_gate_types] maps each gate_type enum value to the OID of the corresponding provenance_gate enum member in PostgreSQL.

  • Status flag: ok is true if the OIDs were loaded successfully (false if the extension is not installed in the current database).

The cache is populated by get_constants, which looks up OIDs in the system catalogs on first call and stores them per-database. Subsequent calls return the cached values without catalog access.

Gate Types

The provenance circuit is a directed acyclic graph (DAG) whose nodes are gates. Each gate has a type from the gate_type enum defined in provsql_utils.h:

Gate type

Meaning

gate_input

Leaf gate representing a base-table tuple.

gate_plus

Semiring addition (⊕): duplicate elimination, UNION.

gate_times

Semiring multiplication (⊗): joins, cross products.

gate_monus

M-semiring monus (⊖): EXCEPT.

gate_project

Projection gate (where-provenance).

gate_eq

Equijoin gate (where-provenance).

gate_zero

Semiring additive identity (0).

gate_one

Semiring multiplicative identity (1).

gate_agg

Aggregation operator.

gate_semimod

Semimodule scalar multiplication (for aggregation).

gate_delta

Delta operator (δ-semiring).

gate_value

Scalar constant value. The extra blob encodes the literal in text form; the integer mode (parsed by extract_constant_C) is used in HAVING sub-circuits and the float8 mode (parsed by extract_constant_double) is used to lift numeric constants into the continuous random-variable surface.

gate_mulinput

Multivalued input (for Boolean probability).

gate_cmp

Comparison gate used in HAVING sub-circuits (<, =, etc.).

gate_update

Update-provenance gate.

gate_rv

Continuous random-variable leaf. The extra blob encodes the distribution kind and parameters (normal:μ,σ, uniform:a,b, exponential:λ, erlang:k,λ).

gate_arith

N-ary arithmetic over scalar children. The operator tag (provsql_arith_op: PLUS / TIMES / MINUS / DIV / NEG) is stored in info1.

gate_mixture

Probabilistic mixture of scalar random-variable roots gated by a Bernoulli weight. The wire vector is [p, x, y] for a Bernoulli mixture or [key, mul_1, …, mul_n] for a categorical block.

gate_assumed_boolean

Transparent single-child marker wrapping a per-row root, added by the safe-query rewriter to record that the rewrite assumed a Boolean semiring. Inert (identity) for every evaluator; see Semiring Evaluation.

gate_annotation

Transparent single-child marker carrying an extra payload (the inversion-free certificate on a per-row root, or a per-input order key on an input). Unlike every other gate its UUID folds in extra (so two annotations over the same child with different extra are distinct gates); inert (identity) for every evaluator. See The Inversion-Free UCQ(OBDD) Path.

The three random-variable gate types (gate_rv, gate_arith, gate_mixture) and the two transparent marker gates (gate_assumed_boolean, gate_annotation) are appended to the enum before gate_invalid, with no renumbering of older values. See Continuous Distributions for the full architecture of the continuous-distribution surface.

Edges (wires) connect parent gates to their children, forming the provenance formula for each query result tuple.