Where-Provenance
Where-provenance is the finest-grained form of provenance ProvSQL tracks: for every output cell, it records the set of source table cells that the value was copied from. This is in contrast to the semiring (why-provenance and friends) world, which works at the tuple level: a semiring annotation tells you which input rows contributed to an output row, but says nothing about where each individual value came from. See Where-Provenance for the user-facing description.
The persistent circuit store does not change when
provsql.where_provenance is enabled: it is still the same DAG
of gates held in the mmap backend (see Memory Management). What
changes is that the rewriter emits two additional gate types
(project and eq) and – on the C++ side – where-provenance
queries use a different in-memory reconstruction class
(WhereCircuit) whose evaluator implements column-wise
semantics, instead of the GenericCircuit /
BooleanCircuit classes used by the semiring evaluators.
Why a Separate Circuit Class?
A semiring takes a value \(v\) and combines values with
plus / times / monus, but the result is just another
value of the same carrier type. Where-provenance is fundamentally
different: the result is a vector of locator sets, one entry per
output column. Plus and times must be promoted to act
column-wise, and two new operators (project and eq) appear
that have no semiring counterpart at all. Rather than shoehorn
this into the semiring interface, ProvSQL uses a dedicated
WhereCircuit class whose evaluator implements column-wise
semantics directly. The underlying gate DAG that
WhereCircuit walks is a sub-view of the same persistent
mmap store that semiring evaluation reads from – the two views
just pay attention to different gate types.
The WhereCircuit Data Model
A WhereCircuit is the in-memory reconstruction of a
sub-DAG of the mmap store specialised for where-provenance
evaluation. Every gate has one of the following types (declared
as WhereGate in WhereCircuit.h):
IN – a leaf representing one source tuple. Carries the origin table name, the tuple’s UUID, and the number of columns.
TIMES – conjunction of children’s locator vectors, concatenated column-wise. Used for joins / cross products.
PLUS – disjunction of children’s locator vectors, unioned column-wise. Used for
UNIONand duplicate elimination. All children must have the same number of columns.PROJECT – restricts and reorders the columns of its single child. Carries an integer vector
positions[i] = jmeaning “output column i takes its locators from input column j” (or 0 for “no source”, e.g. an expression that is not a bareVar).EQ – merges the locator sets of two columns of its single child. Used for equijoins: when
A.x = B.yis part of the join condition, the EQ gate makes the output column forA.xcarry the locators of bothA.xandB.y.
A WhereCircuit is reconstructed in memory by
where_provenance.cpp from the persistent mmap store every
time the SQL function where_provenance(token) is called.
Evaluation: Locators
The output of evaluation is a std::vector<std::set<Locator>>,
one entry per output column. A Locator (in WhereCircuit.h)
is a triple (table, tuple_uuid, column_position) – the address of
exactly one cell in one base table. The interpretation of each gate
type is straightforward:
IN \((\mathit{table},\mathit{uuid},n)\) returns a vector of length \(n\), with cell \(i\) set to \(\{(\mathit{table},\mathit{uuid},i)\}\).
TIMES concatenates the vectors of its children (column-wise concatenation). This is the right operation because joins place the columns of the two operands side by side.
PLUS takes the union of locator sets, column by column. Children must agree on the number of columns.
PROJECT with
positions = [p_1, ..., p_k]returns a vector of length \(k\) whose \(i\)-th entry is the child’s \(p_i\)-th column (or \(\emptyset\) if \(p_i = 0\)).EQ with positions \((i, j)\) evaluates the child to vector \(v\), then sets \(v_i \leftarrow v_i \cup v_j\) and \(v_j \leftarrow v_j \cup v_i\).
The output of where_provenance(token) is rendered as
{[loc;loc;...],[loc;...],...} – one bracket-pair per output
column, locators inside separated by semicolons.
Building the Circuit During Query Rewriting
When provsql.where_provenance is on,
make_provenance_expression builds the where-provenance
fragment in three phases on top of the regular semiring expression
(see Query Rewriting Pipeline for the surrounding context):
Column map.
build_column_mapwalks the range table and assigns a global integer position to every output column of every base RTE. The provsql column itself is mapped to-1; join RTEs (which redirect to base columns) are mapped to0. The result is a 2-D arraycolumns[rte][att]used by the next two phases.EQ gates from join conditions.
add_eq_from_Quals_to_Exprwalks theONclauses of theFROMjointree and the top-levelWHEREclause looking for equality predicates of the formVar = Var. For each one,add_eq_from_OpExpr_to_Exprlooks up both columns in the column map and wraps the current provenance expression in aprovenance_eq(expr, col1, col2)call – creating an EQ gate that records which two columns were compared.PROJECT gate from the SELECT list. After EQ gates have been added,
make_provenance_expressionwalks the target list one more time, building an integer array of column positions that theSELECTclause keeps and in what order. If that ordering differs from the identity ([1, 2, 3, ...]covering every column of every input), the expression is wrapped in aprovenance_project(expr, positions...)call. Output columns that are expressions rather than bareVars get position0– they have no source cell.
The result is a single token that, when evaluated by
where_provenance, walks the circuit, applies the gate semantics
above, and returns the per-column locator sets.
Sub-Circuit Materialization
To evaluate a where-provenance token, the C function
where_provenance.cpp doesn’t traverse the persistent mmap
store directly; instead it calls the SQL helper
provsql.sub_circuit_for_where(token), which walks the circuit
recursively from the root and returns one row per gate, including
the metadata each gate type needs (table name and column count for
IN, position pair for EQ, projection vector for PROJECT, etc.). The
C function then rebuilds a WhereCircuit from those rows and
calls evaluate.
The reason for this round-trip is that where-provenance gates carry structured annotations (column lists, position pairs, table OIDs) that the generic gate inspection functions used by the semiring evaluator cannot return in one shot. The SQL helper joins all the relevant catalogs and returns everything the C reconstructor needs.
Why-Provenance vs Where-Provenance
It is worth being explicit about how the two flavours of provenance differ at the level of the data structures the developer guide describes:
Why-provenance (semirings) |
Where-provenance |
|
|---|---|---|
Granularity |
Per output tuple |
Per output cell |
Output type |
A semiring value (Boolean, integer, formula…) |
|
In-memory class |
||
Gate types used |
|
|
Extra gates emitted |
Always (when ProvSQL is active) |
Only when |
Read by |
Where-provenance does not interact with aggregation, HAVING,
EXCEPT, or set operations beyond UNION ALL – features
covered by the semiring side – because cell-level lineage has no
agreed semantics under those operators. Enabling
provsql.where_provenance does not turn off the semiring side;
the same persistent circuit then contains both the regular
semiring gates and the extra project / eq gates, and each
gate type is interpreted by the evaluator that cares about it.