Probability Evaluation
ProvSQL computes probabilities by reducing a provenance circuit to Boolean form and then dispatching to one of several evaluation methods. This page explains the dispatch architecture, gives the background on the central data structures (d-DNNF, Tseytin encoding, weighted model counting, tree decomposition), and ends with a step-by-step guide for adding a new method. See Probabilities for the user-facing description of the existing methods.
The continuous-random-variable surface layers an analytical / hybrid path on top of this Boolean machinery; the architecture of that layer is documented separately in Continuous Distributions. The sections below cross-link to the relevant arms of the hybrid evaluator and the conditional inference path.
Architecture
The entry point is the SQL function probability_evaluate,
which calls provenance_evaluate_compiled on the C++ side.
That function builds a BooleanCircuit object from the
persistent circuit store and then calls
probability_evaluate_internal (in
probability_evaluate.cpp).
probability_evaluate_internal receives the method name as
a string and dispatches via a chain of if / else if branches.
There is no registration mechanism – methods are hardcoded in
the dispatcher.
Background: d-DNNF, Tseytin, Knowledge Compilation
Computing the probability that a Boolean formula evaluates to true
when its variables are assigned independently is -hard
in general, but tractable for structured representations. The
two structures ProvSQL exploits are d-DNNF and tree
decomposition, both of which give linear-time probability
evaluation in the size of the structure. The methods that ship
with ProvSQL all reduce to one or the other.
d-DNNF
A deterministic decomposable negation normal form (d-DNNF) is a
Boolean circuit built from AND, OR, NOT, and variable
leaves, satisfying two structural properties:
Decomposability: for every
ANDgate, the variable sets of its children are pairwise disjoint. This means the children represent independent events, and the probability of the AND is the product of the children’s probabilities.Determinism: for every
ORgate, the children represent mutually exclusive events. This means the probability of the OR is the sum of the children’s probabilities – no inclusion- exclusion correction is needed.
Together these two properties make a single bottom-up traversal
sufficient to compute the probability:
dDNNF::probabilityEvaluation does exactly that. The
implementation in dDNNF.cpp uses an explicit stack
instead of recursion to avoid blowing the call stack on very deep
circuits, and memoises intermediate results so that shared
sub-circuits are evaluated only once.
A general Boolean formula is not a d-DNNF. Producing a d-DNNF
from an arbitrary formula – knowledge compilation – is the
expensive part; once you have it, evaluation is cheap. The
compilation and tree-decomposition branches of the
dispatcher both end in a dDNNF object that
dDNNF::probabilityEvaluation then walks.
Tseytin Encoding
External knowledge compilers (d4, c2d, dsharp,
minic2d) and the weightmc model counter all consume
Boolean formulas in DIMACS CNF format. Producing CNF from a
ProvSQL Boolean circuit is the job of
BooleanCircuit::TseytinCNF (in BooleanCircuit.cpp),
whose string output each caller writes into its own
@c provsql::ScopedTempDir before invoking the tool.
The Tseytin transformation introduces one fresh variable per
internal gate of the circuit, then writes a small set of clauses
that encode the gate’s semantics. For an AND gate
, that’s one
binary clause
for every child, plus
one big clause
.
OR is dual. NOT becomes two two-literal clauses. A unit
clause forcing the root variable true is added at the end.
For weighted model counting (and the d4 compiler when built with
weight support), Tseytin also emits one w line per leaf
variable giving the probability of the corresponding ProvSQL
input gate – so the SAT-side of the pipeline knows the weights to
multiply through.
The output is dumped to a temporary file under /tmp;
BooleanCircuit::compilation then invokes the chosen
compiler with that file and reads the result back. The invocation
goes through run_external_tool (external_tool.cpp),
which honours the provsql.tool_search_path GUC by prepending
its value to PATH for the duration of the call. The tool runs
via /bin/sh -c in its own process group: while it runs the
backend polls for a pending cancel, and on statement_timeout /
pg_cancel_backend it SIGKILLs the whole group (so a tool
that ignores SIGINT or forks a worker into another process group,
as KCBox/Panini does, is still stopped) and then raises the interrupt
via CHECK_FOR_INTERRUPTS. Before composing the command line, the same call site pre-flights
the binary with find_external_tool, so a missing tool
fails with an actionable error rather than letting the shell return
exit 127. After the call, the wait status is decoded by
format_external_tool_status to distinguish “not found”,
“killed by signal”, and “ran and exited nonzero”. The same trio
is used by BooleanCircuit::wmcCount for the weighted model
counters and by DotCircuit::render for graph-easy.
Knowledge Compilers and the NNF Format
All four supported external compilers (d4, c2d,
dsharp, minic2d) produce a d-DNNF in the NNF text
format – a line-oriented representation where each line is one
node:
L lit– a leaf literal (positive or negative).A k c1 c2 ...– an AND ofchildren, given by their node indices.
O k c1 c2 ...– an OR ofchildren.
Modern d4 also emits a few extra node kinds (a / o / f
/ t for constants, and a decision-tree variant); the parser
in BooleanCircuit::compilation handles both the legacy
and the d4-extended dialects. The result is a
BooleanCircuit (with the d-DNNF invariants) that gets
returned to the caller and walked by
dDNNF::probabilityEvaluation.
The Panini compiler from KCBox ships with five target-language
modes (OBDD, OBDD[AND], Decision-DNNF, R2-D2,
CCDD) selected by the --lang flag. ProvSQL exposes the
first three; the R2-D2 and CCDD languages are rejected
upstream because both emit K (kernelize) nodes encoding
literal-equivalence constraints over a shared kernel variable,
breaking the decomposability invariant of the resulting d-DNNF.
Panini’s output is not the NNF text format but a CDD-style
DOT-like syntax; the panini-* registry records run the same
generic compile path as the other compilers but tag their output
panini-dd, so BooleanCircuit::compilation reads them
back with BooleanCircuit::parsePaniniDD instead of the NNF
parser. It translates each C / D line into a decomposable AND
and each v ? t : f decision into an explicit OR-of-AND-NOT
structure over the corresponding input gate. A K node, if seen,
raises an explicit error.
After any external-compiler call, dDNNF::simplify runs a
single canonical pass over the result: empty-constant folding,
short-circuiting on opposite-type empty children, and single-child
AND / OR collapse. The same pass is run on the in-process
tree-decomposition route and on BooleanCircuit::interpretAsDD,
so callers see a structurally canonical d-DNNF regardless of which
backend produced it.
Helper Surfaces (Studio and SQL Introspection)
Four small SQL helpers expose intermediate pipeline artifacts to the user and to Studio:
tseytin_cnf.cpp(tseytin_cnf) returns the DIMACS CNF that the external compilers would otherwise receive.tree_decomposition_dot.cpp(tree_decomposition_dot) renders the min-fill tree decomposition as GraphViz DOT, with the treewidth and a per-bag input map embedded as comment lines so consumers can resolve a bag variable back to its provenance UUID.compile_to_ddnnf_dot.cpp(compile_to_ddnnf_dot) goes one step further: it runs the requested compiler (or the meta-routestree-decomposition,interpret-as-dd,default– all dispatched throughBooleanCircuit::makeDD) and returns the resulting d-DNNF itself as DOT.tool_available.cpp(tool_available) is a thin wrapper aroundfind_external_toolso a SQL client can check whether a given tool resolves on the backend’s PATH plusprovsql.tool_search_path– used by Studio to grey out unselectable compilers in the picker.
None of these helpers participate in the probability dispatcher; they are purely introspection surfaces sharing the same Tseytin / NNF / tree-decomposition primitives as the production methods.
Weighted Model Counting
BooleanCircuit::wmcCount drives every weighted model
counter through one registry-selected path: it looks the named tool
up in the external-tool registry (or, with no tool named, picks the
highest-preference counter whose binary resolves on PATH), writes
the weighted CNF in the
dialect the record’s parser implies, runs the record’s command
template, and reads the count back the same way. Two conventions
are understood: MCC-2024 weighted DIMACS with a c s exact result
line (ganak, sharpsat-td, dpmc), and the WeightMC
approximate counter’s own dialect, whose "delta;epsilon"
precision argument is turned into a --pivotAC value controlling
how many random hash constraints it samples. Unlike a knowledge
compiler none of these produce a d-DNNF; each returns a single
probability the function parses as a double.
Tree Decomposition
The tree-decomposition path is ProvSQL’s “no external tool” route to a d-DNNF. Conceptually, a tree decomposition of a Boolean circuit is a tree of bags (sets of variables) such that every constraint of the circuit is captured by at least one bag, and the bags containing each variable form a connected subtree. The treewidth is one less than the size of the largest bag; the smaller it is, the more amenable the formula is to dynamic programming.
TreeDecomposition.cpp builds a tree decomposition of the
circuit’s primal graph using a min-fill elimination heuristic,
then normalises it (TreeDecomposition::makeFriendly)
so that every bag has at
most two children and every leaf bag introduces exactly one
variable. dDNNFTreeDecompositionBuilder.cpp then walks
the bag tree bottom-up, enumerating per-bag truth assignments and
gluing them into a d-DNNF whose decomposability and determinism
follow from the bag-cover structure of the decomposition. The
worst-case cost is ,
which is why ProvSQL caps the treewidth at
TreeDecomposition::MAX_TREEWIDTH (currently 10) and
falls back to compilation with d4 when that bound is
exceeded.
Both the min-fill elimination loop in the
TreeDecomposition constructor and the bottom-up d-DNNF
construction in
dDNNFTreeDecompositionBuilder::builddDNNF call
CHECK_FOR_INTERRUPTS in their hot loops so that
statement_timeout and pg_cancel_backend interrupt the
build promptly when the heuristic struggles on circuits close to
MAX_TREEWIDTH. The macro is conditionally compiled to a
no-op in the standalone tdkc binary via a TDKC guard.
Currently Supported Methods
Method string |
Implementation |
|---|---|
|
|
|
|
|
|
|
|
|
Backward-compatible alias for |
|
Builds a |
|
|
|
Fallback chain: try |
The branches for "compilation", "tree-decomposition", and
the default all funnel through BooleanCircuit::makeDD,
which dispatches further on the d-DNNF construction strategy.
The external-compiler choice inside compilation resolves the
named tool against the external-tool registry, which supplies its
executable, command template and output parser. Once
a dDNNF has been produced, probability evaluation is a
single linear-time pass
(dDNNF::probabilityEvaluation), because the d-DNNF
structure guarantees decomposability and determinism.
Cmp-Probability Pre-Passes
Before the methods above run, probability_evaluate.cpp walks
the circuit through a chain of pre-passes that resolve specific
gate_cmp shapes to a Bernoulli gate_input carrying a
closed-form probability. Resolving a cmp here shrinks the
circuit fed to the downstream method ; in the best case the whole
HAVING comparator collapses to a single leaf, bypassing DNF
construction entirely.
The chain (in order) :
runRangeCheck(also runs at load time whenprovsql.simplify_on_loadis on) : support-interval propagation throughgate_arithand decision of everygate_cmpdecidable from the support alone. Universal across semirings, so it lives both at load time and insideprobability_evaluate.runHybridDecomposer(gated byprovsql.hybrid_evaluation) : base-RV-footprint partitioning + per-island marginalisation for continuous-RV cmps (see the hybrid section below).runAnalyticEvaluator: closed-form CDF for trivial RV cmp shapes (singleton baregate_rvvsgate_value, or two bare normals). Probability-specific (the resultinggate_inputcarries a numeric probability with no semiring meaning), so it runs here and not at load time.runCountCmpEvaluator(gated byprovsql.cmp_probability_evaluation, hidden diagnostic default on) : recognises HAVINGgate_cmp(gate_agg(COUNT, semimod children), gate_value(C))and replaces the cmp with a Bernoulli carrying the Poisson-binomial CDFPr(B op C)over the per-row contributor marginals. Each semimod’s K child is that row’s contributor sub-circuit – a singlegate_input, or (for a join) atimes/plus/monusof several leaves; a small read-once recursion (contributorProb) computes its probability. Soundness condition : every structural gate inside a contributor (input/times/plus/monus) hasref_count == 1– a single check that makes the contributors’ leaf sets pairwise disjoint, unshared with the rest of the circuit, and read-once, so the Poisson-binomial trials are independent (plusref_count(gate_agg) == 1, catching multi-cmp HAVING over a shared COUNT). The DP dispatches on the smaller side ofC(lower tail directly, or upper tail via inverted Bernoullis) forO(N x min(C, N - C))total cost per cmp. Seesrc/CountCmpEvaluator.{h,cpp}.
Adding another closed-form cmp resolver (MIN / MAX / SUM, future
discrete-RV distributions…) follows the same shape : a
runXxxEvaluator function that walks gate_cmp gates, checks
shape + independence, computes the probability, calls
GenericCircuit::resolveCmpToBernoulli. Gate it on
provsql.cmp_probability_evaluation so all such evaluators
share one diagnostic switch.
Block-Independent Databases and Multivalued Inputs
By default, add_provenance associates one input gate
per tuple (created lazily on first reference), so each row of a
provenance-tracked base table is an independent Bernoulli variable.
That is the tuple-independent probabilistic database (TID) model.
ProvSQL additionally supports the strictly more general block-independent database (BID) model, in which input tuples are partitioned into blocks:
tuples within a block are pairwise disjoint – at most one of them is present in any possible world;
blocks are independent;
each tuple of a block has its own probability, with the per-block sum
; the residual
is the probability that no tuple from the block is present (the “null outcome”).
A TID is the special case where each block has exactly one tuple. BIDs are the natural circuit-level model for tables with key uncertainty: “exactly one of these rows is the real row, we don’t know which, and here are the weights”.
The gate_mulinput Gate
ProvSQL represents each BID block in the persistent circuit by a
group of gate_mulinput gates that share a common child, an
input gate acting as the block key. Each mulinput gate
corresponds to one alternative of the block and carries its own
probability (set with set_prob). mulinput gates
are not first-class leaves of the provenance DAG: semiring
evaluators do not know how to interpret them and will refuse
any circuit that contains one, and the probability pipeline
handles them only after rewriting the blocks into standard
Boolean gates – as described below.
The canonical way to create such gates from SQL is
repair_key, which takes a table with duplicate key
values, allocates one fresh input gate per key group, and
turns each member of the group into a mulinput whose child
is that block key. When no probabilities are attached,
repair_key defaults them to a uniform distribution over the
block members.
Rewriting Blocks into Independent Booleans
Most probability-evaluation algorithms require a purely Boolean
circuit: AND, OR, NOT, and independent Bernoulli leaves. A BID
block is not directly such a structure – its elements are
mutually exclusive, not independent.
BooleanCircuit::rewriteMultivaluedGates (in
BooleanCircuit.cpp) reduces every block to an
equivalent Boolean subcircuit by introducing
fresh independent Bernoulli variables per block of size
whose joint distribution reproduces the original
discrete weights.
The construction is divide-and-conquer. Given a block with
alternatives carrying cumulative probabilities
, the recursive helper
BooleanCircuit::rewriteMultivaluedGatesRec splits the
range at the midpoint
, creates one fresh
input gate g with
probability
– the conditional probability of landing in the left half –
and recurses twice: the left half gets g pushed onto its
prefix, the right half gets NOT g. At a leaf
(), the
mulinput gate is
rewritten into the AND of the accumulated prefix, so its truth
value becomes the conjunction of the fresh-variable decisions
that lead to it. If the block’s probabilities do not sum to 1,
the outer call wraps the whole construction in one more fresh
input of probability to carry the “none of them”
residual.
After rewriting, the block’s mulinput gates have been turned
into ordinary AND gates over fresh independent Boolean
inputs, and the circuit is ready for any TID-based probability
method. The dispatcher in
probability_evaluate_internal calls
BooleanCircuit::rewriteMultivaluedGates lazily: the
"independent" method handles mulinput gates natively and
runs on the raw circuit; every other method falls through to the
rewrite first. This is the pivot point referenced in
Step-by-Step: Adding a New Probability Evaluation Method below.
Shapley and Banzhaf Values
ProvSQL also exposes expected Shapley values and expected
Banzhaf values, which quantify the individual contribution of
each input tuple to the truth of a provenance circuit. The
user-facing interface is described in Shapley and Banzhaf Values;
this section covers the implementation in shapley.cpp
and dDNNF.cpp.
Expected Shapley values are #P-hard in general but become polynomial-time computable when the provenance is represented as a decomposable and deterministic (d-D) Boolean circuit – essentially a d-DNNF. The algorithm ProvSQL uses is Algorithm 1 of Karmakar, Monet, Senellart, and Bressan (PODS 2024, [Karmakar et al., 2024]), specialised to the two coefficient functions that define the Shapley and Banzhaf scores. Both scores are computed in expectation: the random subset of variables is drawn according to the per-variable probabilities of the circuit, and when no probabilities have been set, each defaults to 1 and the computation collapses to the standard deterministic Shapley / Banzhaf value.
Entry Point
shapley / banzhaf (and their set-returning
variants shapley_all_vars /
banzhaf_all_vars) are thin wrappers that unpack their
arguments and call shapley_internal in
shapley.cpp. That helper performs the following
sequence:
Build a
BooleanCircuitfrom the persistent store viagetBooleanCircuit.Build a
dDNNFby callingBooleanCircuit::makeDD. This is the same d-DNNF construction used for ordinary probability evaluation, and obeys the samemethod/argsconventions.dDNNF::makeSmooth– ensure that every OR gate’s children mention the same variable set. The paper’s algorithm assumes a smooth d-DNNF.For Shapley (but not Banzhaf):
dDNNF::makeGatesBinaryon AND – binarise AND gates so each has at most two children. Together, the previous two steps turn the d-DNNF into a tight d-D circuit in the paper’s sense.Call
dDNNF::shapleyordDNNF::banzhafon the target variable’s gate.
The Shapley Recurrence
The paper’s algorithm conditions the circuit on the target
variable being fixed to true (call the result )
and to
false (call it ), computes a pair of
per-gate arrays on each conditioned circuit, and combines them
into the final score. ProvSQL’s
dDNNF::shapley
mirrors that structure:
double dDNNF::shapley(gate_t var) const {
auto cond_pos = condition(var, true); // C_1
auto cond_neg = condition(var, false); // C_0
auto alpha_pos = cond_pos.shapley_alpha();
auto alpha_neg = cond_neg.shapley_alpha();
double result = 0.;
for (size_t k = ...; k < alpha_pos.size(); ++k)
for (size_t l = 0; l <= k; ++l) {
double pos = alpha_pos[k][l];
double neg = alpha_neg[k][l];
result += (pos - neg) / comb(k, l) / (k + 1);
}
result *= getProb(var);
return result;
}
dDNNF::condition returns a copy of the circuit in which
the target input gate has been replaced by an AND /
OR-with-no-children acting as the constant true /
false respectively. The private helper
dDNNF::shapley_alpha then performs a single bottom-up pass
computing a two-dimensional array
(called
result[g] in the code) at
every gate , where
is the number of variables
under
in the current cofactor and
is the
number of them that are positively assigned. The recurrences
follow the
IN / NOT / OR / AND cases of
Algorithm 1 of the paper:
At a leaf, the array encodes the Bernoulli distribution of that single variable.
At an OR gate, the arrays of the children are summed coordinatewise (valid because the d-DNNF is smooth, so all children have the same variable set).
At a binary AND gate, the arrays are convolved via a double sum over
pairs – the decomposability of AND makes this the Cauchy product of two independent distributions. This convolution is the reason AND gates have to be binarised before the algorithm runs.
A standalone bottom-up pass (
dDNNF::shapley_delta) precomputes thepolynomials, which the algorithm uses at
NOTgates to turn negation into a coefficient flip.
The final score is
,
where
comes from
and
from
, and
is the Shapley coefficient –
i.e.exactly the formula implemented above. The overall
complexity is
arithmetic operations,
dominated by the double-sum convolution at AND gates over the
-sized arrays.
The if (isProbabilistic()) guards inside
dDNNF::shapley_alpha and dDNNF::shapley_delta
short-circuit the polynomials to a single
top-level coefficient when all input probabilities are 1, so that
the same code path computes classical (deterministic) Shapley
values without paying the expected-score overhead.
Banzhaf
The expected Banzhaf value admits a much simpler formula [Karmakar et al., 2024]:
where can be computed in a single
linear pass over a smooth d-D circuit without binarising AND
gates.
dDNNF::banzhaf runs
dDNNF::banzhaf_internal on
the two conditioned circuits and
and
returns the difference times
; the overall
complexity is
, one factor of
less than Shapley. This is why
shapley_internal skips
the dDNNF::makeGatesBinary call in the Banzhaf branch.
Hybrid Evaluation for Continuous Distributions
When the circuit being evaluated contains continuous gates
(gate_rv, gate_arith, gate_mixture),
a hybrid evaluator runs before the Boolean dispatch above. Its
job is to fold every sub-circuit that has a closed-form analytical
answer into a Bernoulli leaf so the resulting circuit is a normal
Boolean circuit ready for any of the Boolean methods.
The hybrid evaluator has three passes:
Peephole pruning (
runRangeCheck): support intervals propagate throughgate_arith, everygate_cmpis tested against the propagated interval, and every comparator decidable from the support alone collapses to a Bernoulligate_inputwith probability0or1.Family-closure simplifier (
runHybridSimplifier): linear combinations of independent normals fold into a single normal; sums of i.i.d. exponentials with the same rate fold into an Erlang; identity / single-child arith gates and semiring identities collapse.Island decomposition (
runHybridDecomposer): the remaining cmps are partitioned by base-RV footprints into islands; single-cmp islands marginalise viarunAnalyticEvaluator’s closed-form CDF; multi-cmp islands with shared base RVs go through the joint table.
See Continuous Distributions for the full simplifier rule set and the island-decomposition algorithm.
Conditional Evaluation
expected / variance / moment /
central_moment / support / rv_sample /
rv_histogram all accept an optional prov uuid argument
that conditions the moment, sample, or histogram on the provenance
event prov. When prov resolves to anything other than
gate_one, evaluation routes through the joint-circuit
loader getJointCircuit (MMappedCircuit.cpp),
which performs a multi-rooted BFS over the union of the reachable
gates from both input and prov so shared gate_rv
leaves between the two are loaded into a single
GenericCircuit and consequently couple correctly in the
Monte Carlo sampler’s rv_cache_. The closed-form
truncated-distribution table is exhaustive for Normal (Mills
ratio), Uniform (intersected support), and Exponential
(memorylessness); other shapes fall back to MC rejection sampling
at provsql.rv_mc_samples budget. See
Continuous Distributions for depth.
The Inversion-Free UCQ(OBDD) Path
The 'inversion-free' method (and the default-chain rung that follows
independent) evaluates the inversion-free class of Jha and
Suciu [Jha and Suciu, 2011]: hierarchical, tuple-independent
queries – including self-joins – whose lineage admits a polynomial-size
OBDD. On these the generic 'tree-decomposition' / compilation
fallbacks can blow up (the lineage is not low-treewidth), yet a
structured d-DNNF built over a query-derived variable order stays
linear in the lineage.
This path is a sibling of the Safe-Query Rewriter, and the two are complementary:
The safe-query rewriter (
provsql.boolean_provenanceon) restructures the query so the planner emits a read-once circuit, whichindependentthen evaluates almost for free. It applies only to the read-once (safe) class and changes the produced circuit.The inversion-free path leaves the lineage intact and evaluates the naive circuit – which, even for a safe query, is generally not read-once (e.g.
q(x) :- B(x), A(x,y)yields⋁_y (B(x) ∧ A(x,y)), repeatingB(x)), soindependentrejects it. It also covers the strictly larger inversion-free-but-not-read-once self-join class. It is decoupled fromboolean_provenanceand gated on its own GUC,provsql.inversion_free(on by default).
The pipeline has four stages.
- Detection (
src/safe_query.c) detect_inversion_freechecks the four preconditions (hierarchical, strictly tuple-independent atoms, positional consistency, acyclic precedence) and, on success, builds aSafeCertrecipe describing the query-derived (Prop. 4.5) variable order. It reuses the candidate gate and union-find machinery of the safe-query rewriter but is not gated onboolean_provenance:process_queryruns it on the lineage query wheneverprovsql.inversion_freeis on, after (and only when) the read-once rewrite did not already fire.A non-tracked base relation (no
provsqlcolumn and no metadata entry) is deterministic: it contributes only probability-1 tuples and anchors no provenance variable, so the detector erases it from the root, positional, precedence and marker passes while keeping its join equalities (it still filters the cross product). This mirrors the read-once path’s dissociation transparency, with the same soundness guards (a plain table, not a matview / foreign table / partitioned parent / inheritance child), and only enlarges the certified class.- Flattening pre-pass (
src/provsql.c) build_inversion_free_ctxruns the detector on a flattened copy of the lineage query so that SPJ subqueries and views are recognised.flatten_spj_subqueriesinlines every non-lateral SPJ subquery slot (no aggregation, grouping,DISTINCT, set operation, sublink, CTE orLIMIT; flatRangeTblRefFROMover base relations; target list all plain baseVars) into its base atoms – substituting the parent’s column references, pulling the subqueryWHEREup and rebuilding a flatFROM– and recurses, so a view-over-view or nested derived table collapses to base atoms first. A view referenced k times inlines to k copies of its base atoms: a structured self-join the inversion-free path handles natively. On PostgreSQL 18 the syntheticRTE_GROUPof aGROUP BYquery is stripped from the copy first. The original query is left intact; only transparent markers and a root certificate are added.- Certificate and per-input markers (
src/safe_query_cert.{h,c}) The recipe and the order are carried into the circuit on transparent
gate_annotationgates (see Architecture Overview):the serialised
SafeCertis stamped on the per-row root as aC-prefixedextrapayload;each certified atom’s input is wrapped (via
annotate) in an annotation carrying aK-prefixed order key(root, sec, factor)(SafeCertKey), emitted by the planner (build_inversion_free_markerinsrc/provsql.c) via theinversion_free_keySQL function. An atom binding only the head class is root-only (no secondary column); a relation whose occurrences span two or more secondary classes is the shared self-join guard (factor = SAFE_CERT_GUARD_FACTOR).The
rootandsecclass values are carried as length-prefixed value text (the column type’s I/O output), so the key works for any scalar key column –text(including spaces / colons),uuid,date,numeric… – not just integers. The builder uses them only for grouping (equal text ⇒ same block / tile) and a consistent total order, both of which any injective type rendering satisfies.
For a view inlined by the flattening pre-pass the markers wrap the base inputs inside the subquery, threaded down through the recursive rewrite by a per-query
InvFreeMarkerCtxcontext tree (the certificate stays on the parent’s per-row root).Both markers are inert at evaluation: the annotation gate is identity for every evaluator, so a query carrying them evaluates identically whether or not the analysis ran.
- Structured d-DNNF builder (
src/StructuredDNNF.{h,cpp}) StructuredDNNFBuildercompiles the monotone lineage top-down into a ProvSQLdDNNF: it expands the circuit to a canonical DNF and recurses with decomposable AND at independence points (variable-disjoint factors) and deterministic OR at Shannon decisions on the supplied variable order, threading a false-sink through OR-chains and sharing equal sub-d-DNNFs through a component cache. The order affects only the d-DNNF size, never correctness, so the builder is sound on any monotone lineage; the Prop. 4.5 order is what keeps it polynomial on the certified class. Multivalued (BID) andNOTgates are out of scope and rejected with aCircuitException.- Dispatch (
src/probability_evaluate.cpp) collect_inversion_free_keyswalks the circuit for theK-marker annotations and maps each wrapped input to itsInputKey;inversion_free_rankflattens those keys into a total rank (root value, then secondary value, then guard-before-payload, then factor) for the order-only builder. The explicit'inversion-free'method requires the certificate and errors without it; the default chain takes this rung only when a certificate is present andprovsql.inversion_freeis on, afterindependentand before tree-decomposition, catchingCircuitExceptionto fall through.
Shapes the analysis does not model cause detection to decline (no
certificate): a BID/gate_mulinput atom, a subquery the flattening
pre-pass cannot inline (an aggregating view, a set-operation / UNION
view, a correlated or LATERAL subquery), or a flattened conjunction
that is genuinely non-hierarchical (the H-query R(x),S(x,y),T(y)). A
malformed C-prefixed payload fails to parse and is treated as an inert
annotation. In every case evaluation falls back to the normal chain and
stays correct. These declines – and the positive cases (self-joins,
non-integer key columns, deterministic-relation filters, single- and
multi-relation SPJ views, views-over-views) – are covered by
test/sql/safe_query_inversion_free.sql.
Step-by-Step: Adding a New Probability Evaluation Method
The work is almost entirely in two files. Pick a short, descriptive
method string – it is the value the user passes to
probability_evaluate.
Declare the method on
BooleanCircuit.h:double myMethod(gate_t g, const std::string &args) const;
Implement it in
BooleanCircuit.cpp. The method receives the root gate and the user-suppliedargsstring (may be empty) and must return a probability in. Check
provsql_interruptedperiodically if the computation is long so that the user can cancel withCtrl-C:double BooleanCircuit::myMethod(gate_t g, const std::string &args) const { // Parse args if needed. // Run the algorithm, respecting provsql_interrupted. // Return the probability. }
Add a dispatch branch in
probability_evaluate_internalinprobability_evaluate.cpp. The exact location depends on the method’s characteristics:If the algorithm requires a Boolean circuit (no multivalued inputs – see Block-Independent Databases and Multivalued Inputs), add the branch after the call to
BooleanCircuit::rewriteMultivaluedGates. That is the case for most approximate methods.If the algorithm operates directly on the raw circuit (like
independent), add it beforeBooleanCircuit::rewriteMultivaluedGates.If the algorithm produces a d-DNNF and you want it to benefit from the linear-time d-DNNF evaluator, add it to
BooleanCircuit::makeDDinstead and route your dispatch branch throughBooleanCircuit::makeDD.
Example for an approximate method that takes a numeric argument:
} else if(method == "mymethod") { int param; try { param = std::stoi(args); } catch(const std::invalid_argument &) { provsql_error("mymethod requires a numeric argument"); } result = c.myMethod(gate, param); }
(Optional) Extend the default fallback chain. If the method is a good universal choice, update
BooleanCircuit::makeDDand/or the default branch inprobability_evaluate_internalto try it before falling back tocompilationwithd4.Add a regression test under
test/sql/and register it intest/schedule.common. Follow the skip-if-missing pattern from the other external-tool tests (see Testing) if the new method depends on an external binary.Update the user documentation in Probabilities and add a row for the new method to the “Currently supported methods” table above.