Architecture Overview
This page gives a bird’s-eye view of ProvSQL’s internals: how the extension is loaded, how its components are organized, and how data flows from an SQL query to a provenance evaluation result. For a detailed walkthrough of the query rewriting pipeline, see Query Rewriting Pipeline.
Extension Lifecycle
ProvSQL is a PostgreSQL shared-library extension. Because it installs
a planner hook, the library must be loaded at server start via the
shared_preload_libraries configuration variable; it cannot be loaded
on demand.
When PostgreSQL starts, it calls _PG_init, which:
Registers six GUC (Grand Unified Configuration) variables:
provsql.active– enable/disable provenance tracking (default: on).provsql.where_provenance– enable where-provenance (default: off).provsql.update_provenance– track provenance through DML statements (default: off).provsql.verbose_level– verbosity for debug messages (0–100, default: 0).provsql.aggtoken_text_as_uuid– when on,agg_token_outemits the underlying provenance UUID instead of the default"value (*)"display string (default: off).provsql.tool_search_path– colon-separated directories prepended toPATHwhen invoking external tools (d4, c2d, minic2d, dsharp, weightmc, graph-easy); seeexternal_tool.cpp.
Installs the planner hook (
provsql_planner) by saving the previous hook inprev_plannerand replacingplanner_hook.Installs the ProcessUtility hook (
provsql_ProcessUtility) by saving the previous hook inprev_ProcessUtility. Used by the CTAS /SELECT INTO/ matview lineage propagation path (see TID / BID Propagation Through Derived Relations).Installs shared-memory hooks for inter-process coordination (see Memory Management).
Installs executor hooks (
ExecutorStart/ExecutorEnd) that maintain a nesting depth counter so theprovsql_classify_queryNOTICE fires only on the user’s outermost statement (the rewriter’s PL/pgSQL helpers replan internally, which would otherwise produce spurious extra NOTICEs).Launches the mmap background worker that manages persistent circuit storage.
When the server shuts down, _PG_fini restores the previous
planner, ProcessUtility, executor, and shared-memory hooks.
Component Map
ProvSQL is a mixed C/C++ codebase. The PostgreSQL interface layer is written in C (required by the extension API); complex data structures and algorithms are in C++.
C files (PostgreSQL interface layer):
Planner hook and query rewriting
provsql.c– planner hook, ProcessUtility hook (CTAS /SELECT INTO/ matview lineage), executor hooks, and the bulk of the query rewriting logic (~3700 lines).safe_query.c/safe_query.h– safe-query rewriter for hierarchical CQs (Dalvi & Suciu 2012), gated onprovsql.boolean_provenance; includes the FD-aware extensions (constant-selection elimination, PK FDs, deterministic-relation transparency, PK-unifiable / disjoint-constant self-joins) and the propagation normalisation pre-passes (INNER/CROSSJoinExpr flattening, simple-subquery inlining). See Safe-Query Rewriter.classify_query.c/classify_query.h– query-time TID / BID / OPAQUE classifier exposed through theprovsql.classify_top_levelGUC.
Utilities and shared state
provsql_utils.c/provsql_utils.h– OID cache (get_constants), type helpers, gate-type enum.provsql_error.h–provsql_error/_warning/_notice/_logmacros.c_cpp_compatibility.h– small shims for mixing C and C++ sources.
Memory-mapped circuit store
provsql_mmap.c/provsql_mmap.h– background worker and IPC primitives.provsql_shmem.c/provsql_shmem.h– shared-memory segment setup.
SQL-callable functions
provenance.c– error stub for theprovenanceSQL function (reached only when a query bypasses the planner hook).provenance_evaluate.c– SQL-level semiring evaluation (user-definedplus/times/… functions).aggregation_evaluate.c– aggregate evaluation entry point.agg_token.c/agg_token.h– theagg_tokencomposite type (UUID + running value).
PostgreSQL version compatibility
compatibility.c/compatibility.h– shims for cross-version PostgreSQL API differences.
C++ files (data structures and algorithms):
Circuit representation
Circuit.h/Circuit.hpp– template base class parameterised by gate type; inherited by all circuit variants.GenericCircuit.h/GenericCircuit.hpp/GenericCircuit.cpp– semiring-agnostic in-memory circuit.BooleanCircuit.h/BooleanCircuit.cpp– Boolean circuit used for knowledge compilation and probability evaluation.WhereCircuit.h/WhereCircuit.cpp– where-provenance circuit.DotCircuit.h/DotCircuit.cpp– GraphViz DOT export of circuits.
Persistent storage and in-memory reconstruction
MMappedCircuit.h/MMappedCircuit.cpp– mmap-backed persistent circuit store.CircuitFromMMap.h/CircuitFromMMap.cpp– reads the mmap store to build in-memoryGenericCircuit/BooleanCircuitinstances.CircuitCache.h/CircuitCache.cpp/circuit_cache.h– per-session gate cache.MMappedUUIDHashTable.h/MMappedUUIDHashTable.cpp– open-addressing hash table keyed by UUID, stored in mmap.MMappedVector.h/MMappedVector.hpp–std::vector-like container over an mmap region.MMappedTableInfo.h– the per-relationProvenanceTableInforecord (TID / BID / OPAQUE kind, multi-column BID block-key, base-ancestor set) ; see Per-Table Provenance Metadata.
Semiring evaluation
semiring/*.h– header-only semiring implementations (Boolean, BoolExpr, Counting, Formula, IntervalUnion, Lukasiewicz, Tropical, Viterbi, Which, Why).provenance_evaluate_compiled.cpp/provenance_evaluate_compiled.hpp– dispatcher for compiled semirings.having_semantics.cpp/having_semantics.hpp– pre-evaluation ofHAVINGsub-circuits before the main semiring traversal.
Aggregation
Aggregation.h/Aggregation.cpp– aggregate operator enum, accumulator interface, and built-in accumulators (see Aggregation Provenance).
Probability, Shapley, knowledge compilation
probability_evaluate.cpp– probability-method dispatcher (see Probability Evaluation).dDNNF.h/dDNNF.cpp– d-DNNF data structure and linear-time probability evaluation.dDNNFTreeDecompositionBuilder.h/dDNNFTreeDecompositionBuilder.cpp– constructs a d-DNNF from a tree decomposition.TreeDecomposition.h/TreeDecomposition.cpp– tree decomposition via min-fill elimination.TreeDecompositionKnowledgeCompiler.cpp– the standalonetdkcbinary built bymake tdkc.provsql_migrate_mmap.cpp– the standaloneprovsql_migrate_mmapbinary built bymake provsql_migrate_mmap(migrates pre-1.3.0 flat mmap files to the per-database layout).shapley.cpp– Shapley and Banzhaf value computation.
Export and visualization
view_circuit.cpp– SQLview_circuitfunction (renders a DOT graph viagraph-easy).to_prov.cpp– PROV-XML export.where_provenance.cpp– SQL where-provenance output function.
C++ utilities
provsql_utils_cpp.h/provsql_utils_cpp.cpp– C++ counterparts toprovsql_utils.h, including UUID string conversion.subset.cpp/subset.hpp– subset enumeration used by HAVING evaluation.Graph.h– lightweight graph helpers used by the tree decomposition code.PermutationStrategy.h– pluggable vertex-ordering heuristic for tree decomposition.flat_map.hpp/flat_set.hpp– contiguous associative containers used inside hot loops.
Data Flow
The end-to-end flow of a query through ProvSQL:
![digraph dataflow {
rankdir=LR;
node [shape=box, fontname="sans-serif", fontsize=11];
edge [fontsize=9, fontname="sans-serif"];
sql [label="SQL query", shape=ellipse];
planner [label="provsql_planner"];
rewrite [label="process_query\n(rewriting)"];
exec [label="PostgreSQL\nexecutor"];
circuit [label="Circuit\n(mmap storage)"];
eval [label="Semiring\nevaluation"];
result [label="Query result\n+ provenance", shape=ellipse];
sql -> planner [label="Query tree"];
planner -> rewrite [label="has provenance?"];
rewrite -> exec [label="rewritten query"];
exec -> circuit [label="UUID tokens\n(gate creation)"];
exec -> result [label="tuples + UUIDs"];
circuit -> eval [label="circuit DAG"];
eval -> result [label="semiring values\nor probabilities"];
}](../_images/graphviz-4b353c065500fd24fcb3df9a9e912f3109bd2ea5.png)
The user submits an SQL query. PostgreSQL parses it into a
Querytree and calls the planner.provsql_plannerintercepts the call. If the query touches provenance-tracked tables (detected byhas_provenance), it callsprocess_queryto rewrite it.The rewritten query carries an extra UUID expression in its target list. When the executor evaluates the query, it calls ProvSQL’s SQL-level functions (
provenance_times,provenance_plus, etc.) to construct circuit gates. These calls route through the mmap worker to persist the circuit.Each result tuple comes back with a UUID identifying the root gate of its provenance sub-circuit.
To evaluate provenance, the user calls functions like
provenance_evaluate,probability_evaluate, or a compiled semiring evaluator (e.g.,sr_boolean). These retrieve the circuit from mmap, build an in-memoryGenericCircuitorBooleanCircuit, and traverse the DAG applying semiring operations.
The OID Cache: constants_t
PostgreSQL identifies types, functions, and operators by their Object
Identifiers (OIDs). ProvSQL needs to reference its own types and
functions when constructing rewritten query trees, so it caches their
OIDs in a constants_t structure.
Key fields:
Type OIDs:
OID_TYPE_UUID,OID_TYPE_AGG_TOKEN,OID_TYPE_GATE_TYPE, and standard types (BOOL,INT,FLOAT,VARCHAR).Function OIDs:
OID_FUNCTION_PROVENANCE_PLUS,OID_FUNCTION_PROVENANCE_TIMES,OID_FUNCTION_PROVENANCE_MONUS,OID_FUNCTION_PROVENANCE_DELTA,OID_FUNCTION_PROVENANCE_AGGREGATE,OID_FUNCTION_PROVENANCE_SEMIMOD, etc.Gate-type mapping:
GATE_TYPE_TO_OID[nb_gate_types]maps eachgate_typeenum value to the OID of the correspondingprovenance_gateenum member in PostgreSQL.Status flag:
okistrueif the OIDs were loaded successfully (falseif the extension is not installed in the current database).
The cache is populated by get_constants, which looks up OIDs in
the system catalogs on first call and stores them per-database.
Subsequent calls return the cached values without catalog access.
Gate Types
The provenance circuit is a directed acyclic graph (DAG) whose nodes are
gates. Each gate has a type from the gate_type enum
defined in provsql_utils.h:
Gate type |
Meaning |
|---|---|
|
Leaf gate representing a base-table tuple. |
|
Semiring addition (⊕): duplicate elimination, UNION. |
|
Semiring multiplication (⊗): joins, cross products. |
|
M-semiring monus (⊖): EXCEPT. |
|
Projection gate (where-provenance). |
|
Equijoin gate (where-provenance). |
|
Semiring additive identity (0). |
|
Semiring multiplicative identity (1). |
|
Aggregation operator. |
|
Semimodule scalar multiplication (for aggregation). |
|
Delta operator (δ-semiring). |
|
Scalar constant value. The |
|
Multivalued input (for Boolean probability). |
|
Comparison gate used in HAVING sub-circuits ( |
|
Update-provenance gate. |
|
Continuous random-variable leaf. The |
|
|
|
Probabilistic mixture of scalar random-variable roots gated by
a Bernoulli weight. The wire vector is |
|
Transparent single-child marker wrapping a per-row root, added by the safe-query rewriter to record that the rewrite assumed a Boolean semiring. Inert (identity) for every evaluator; see Semiring Evaluation. |
|
Transparent single-child marker carrying an |
The three random-variable gate types (gate_rv, gate_arith,
gate_mixture) and the two transparent marker gates
(gate_assumed_boolean, gate_annotation) are appended to the enum
before gate_invalid, with no renumbering of older values. See
Continuous Distributions for the full architecture of the
continuous-distribution surface.
Edges (wires) connect parent gates to their children, forming the provenance formula for each query result tuple.