Query Rewriting Pipeline
This page is a detailed walkthrough of provsql.c, the core of
ProvSQL. It describes how every SQL query is intercepted and rewritten
to propagate provenance tokens.
For the high-level architecture and data-flow overview, see Architecture Overview.
Entry Point: provsql_planner
PostgreSQL’s planner hook allows extensions to intercept every query
before planning. provsql_planner is installed by
_PG_init and is called for every query:
INSERT ... SELECT(CMD_INSERT): delegates toprocess_insert_selectto propagate provenance from the sourceSELECTinto the target table’sprovsqlcolumn.SELECT(CMD_SELECT): checks whether any relation in the range table carries aprovsqlcolumn (has_provenance). If so, callsprocess_queryto rewrite the query tree.After rewriting, the (possibly modified) query is passed to the previous planner hook or
standard_planner.
When provsql.verbose_level >= 20 (PostgreSQL 15+), the full query
text is logged before and after rewriting. At level ≥ 40, the time
spent in the rewriter is logged.
process_query: The Main Rewriting Function
process_query is the heart of ProvSQL. It receives a
Query tree, rewrites it to carry provenance, and returns the
modified tree. The function is recursive: subqueries and CTEs are
processed by re-entering process_query.
The function proceeds in the following order:
Step 0: Early Exit
If the query has no FROM clause (q->rtable == NULL), there is
nothing to track – return immediately.
Step 1: CTE Inlining
Non-recursive common table expressions (WITH clauses) are
inlined as subqueries in the range table via inline_ctes.
This converts RTE_CTE entries to RTE_SUBQUERY so that the
subsequent recursive processing can track provenance through them.
Recursive CTEs are not supported and raise an error.
Inlining happens before set-operation handling because UNION /
EXCEPT branches may reference CTEs.
Step 2: Strip Existing Provenance Columns
remove_provenance_attributes_select scans the target list and
removes any existing provsql UUID columns (which might be inherited
from base tables or subqueries). These are stripped so that the
rewriter can append a single, freshly computed provenance expression at
the end. Matching GROUP BY / ORDER BY references and
set-operation column lists are also adjusted.
Step 3: Set-Operation Handling
If the query has setOperations (UNION, EXCEPT):
Non-ALL variants (
UNION,EXCEPTwithoutALL):rewrite_non_all_into_external_group_bywraps the set operation in a new outer query withGROUP BYon all columns. This implements duplicate elimination as provenance addition (⊕). The function then re-entersprocess_queryon the wrapper.UNION ALL: each branch is processed independently. The provenance tokens from different branches are combined with ⊕ (provenance_plus) byprocess_set_operation_union.EXCEPT ALL:transform_except_into_joinrewritesA EXCEPT ALL Bas aLEFT JOINwith aprovenance_monus(⊖) gate, plus a filter removing zero-provenance tuples.INTERSECTis not supported (raises an error).
Step 4: Aggregate DISTINCT Rewrite
If the query has aggregates with DISTINCT (e.g.,
COUNT(DISTINCT x)), rewrite_agg_distinct performs a
structural rewrite: the DISTINCT inside the aggregate is moved to
an inner subquery with GROUP BY, and the outer query aggregates
over the deduplicated results. The function returns a new query tree,
and process_query re-enters on it.
Step 5: Discovery – get_provenance_attributes
get_provenance_attributes walks every entry in the range table
and collects Var nodes pointing to provenance columns:
RTE_RELATION (base table): scans column names for one called
provsqlof type UUID.RTE_SUBQUERY: recursively calls
process_queryon the subquery. If the subquery’s rewritten target list contains aprovsqlcolumn, aVarpointing to it is added to the parent’s provenance list. OuterVarattribute numbers are adjusted to account for any columns removed from the inner query.RTE_JOIN: for inner, left, right, and full joins, nothing is collected directly – the underlying base-table RTEs supply the tokens. Semi-joins and anti-joins raise an error.
RTE_FUNCTION: if the function returns a single UUID column named
provsql, it is collected.RTE_VALUES / RTE_GROUP: no provenance to collect.
If no provenance attributes are found, the query is returned unmodified.
Step 6: Unsupported Feature Checks
Before proceeding, the function checks for:
Sublinks (
EXISTS,IN, scalar subqueries): not supported.DISTINCT ON: not supported.DISTINCT(plain): converted toGROUP BYviatransform_distinct_into_group_by.GROUPING SETS/CUBE/ROLLUP: not supported (except the trivialGROUP BY ()).
Step 7: Build Column Map
build_column_map creates a per-RTE mapping from column
attribute numbers to global column identifiers. This is used by
where-provenance to record which columns participate in equijoin
conditions and projections.
Step 8: Aggregation Rewriting
If the query has aggregates, replace_aggregations_by_provenance_aggregate
walks the target list and replaces each standard aggregate (e.g.,
SUM, COUNT) with a provenance_aggregate call that wraps
the original aggregate result and the provenance of the aggregated
rows. The provenance of grouped rows is combined with ⊕
(provenance_plus) via array_agg + provenance_plus.
After aggregation rewriting:
migrate_aggtoken_quals_to_havingmoves anyWHEREcomparisons on aggregate results toHAVING, because aggregate-typed values can only be filtered after grouping.insert_agg_token_castsinserts type casts foragg_tokenvalues used in arithmetic or window functions.
See Aggregation Provenance for the semantics of the agg /
semimod / value gates produced here, and for the exact
shape of the provenance_aggregate call that replaces each
Aggref.
Step 9: Expression Building – make_provenance_expression
make_provenance_expression combines the collected provenance
Var nodes into a single expression tree:
Combining tokens from multiple tables (the ⊗ / ⊕ / ⊖ choice):
SR_TIMES(default forSELECT ... FROM ... WHERE): wraps the list of provenance tokens inprovenance_times(ARRAY[...])to create a multiplication gate.SR_PLUS(UNION ALL): uses the single provenance token from the union directly (each branch already has its own token).SR_MONUS(EXCEPT ALL): wraps the two tokens inprovenance_monus(left, right).
If a single table is in the FROM clause, no combining function is
needed – the token is used as-is.
GROUP BY / aggregation: When group_by_rewrite or
aggregation is true, the combined token is wrapped in
provenance_plus(array_agg(token)) – this sums the provenance of
all tuples in each group.
Delta gate: For queries with aggregation but no HAVING
clause, a provenance_delta gate is added. This implements the
δ-semiring operator that normalizes aggregate provenance.
HAVING: When a HAVING clause is present,
having_Expr_to_provenance_cmp translates the HAVING
predicate into a provenance_cmp gate tree. The original
havingQual is removed from the query (its semantics are now
captured in the provenance circuit).
Where-provenance (when provsql.where_provenance is enabled):
Equijoin gates:
add_eq_from_Quals_to_ExprscansJOIN ... ONconditions andWHEREequalities, wrapping the provenance expression inprovenance_eqgates that record which column pairs were compared.Projection gates: if the output columns are a strict subset of the input columns (or are reordered), a
provenance_projectgate is added with an integer array recording the column mapping.
Step 10: Splicing – add_to_select
add_to_select appends the provenance expression to the
query’s target list as a new column named provsql. If the query
has GROUP BY, the column is added to the groupClause as well.
Step 11: Replace provenance() Calls
replace_provenance_function_by_expression walks the target
list looking for calls to the provenance() SQL function. Each
occurrence is replaced with the computed provenance expression, so
that SELECT provenance() FROM ... returns the actual provenance
token.
Rewriting Rules and Formal Semantics
The rewriting implemented by provsql.c realises the
rewriting rules (R1)–(R5) from the ICDE 2026 paper
[Sen et al., 2026], which is the authoritative reference for
the formal semantics and correctness results. The rules are stated
over an extended relational algebra on annotated relations;
provsql.c has to reproduce them on PostgreSQL Query
trees, so the mapping below is approximate rather than literal –
the pipeline phases described earlier on this page interleave the
rules with PostgreSQL-specific bookkeeping (range-table surgery,
target-list rewriting, HAVING handling, where-provenance…).
Rule |
Algebra operator |
Concrete realisation in |
|---|---|---|
(R1) |
Projection \(\Pi\) |
The provenance token column is kept on the target list.
|
(R2) |
Cross product / join \(\times\) |
The |
(R3) |
Duplicate elimination \(\varepsilon\) |
|
(R4) |
Multiset difference \(-\) |
|
(R5) |
Aggregation \(\gamma\) |
|
Two algebra operators are deliberately not rewritten, matching the paper:
Selection \(\sigma\) –
WHEREclauses pass through unchanged because selection does not affect provenance (with where-provenance enabled, the rewriter additionally wraps the result inprovenance_eq/provenance_projectgates, but that is orthogonal to R1–R5).Multiset sum \(\uplus\) –
UNION ALLis left alone byprocess_set_operation_union; the tokens from each branch flow through independently.
Formal Verification
The correctness of rules (R1) and (R2) – projection and cross
product – has a machine-checked proof in the ProvSQL Lean 4
library. The main theorem is
Query.rewriting_valid
in the Provenance.QueryRewriting module: for every
supported relational-algebra query q and every annotated
database d, evaluating q against the annotated semantics
and then projecting to the composite (tuple + annotation)
representation yields the same result as evaluating the
rewritten query against the composite database. This is the
formal analogue of the “the rewritten query produces the same
provenance as the annotated semantics” correctness statement
from the ICDE paper, for the partial fragment proved.