Probabilities

ProvSQL can compute the probability that a query answer holds in a probabilistic database [Green and Tannen, 2006] – a database where each provenance token has an independent probability, which is in turned used to determine probabilities of existence of specific tuples. The same machinery also handles continuous random variables at the value level (see Continuous Distributions): a column of type random_variable carries a distribution per row, and filter predicates on that column lift into the row’s provenance circuit. RV-bearing queries route through Monte Carlo by default, with the hybrid evaluator routing closed-form sub-circuits to the analytical path.

Setting Probabilities

Assign a probability to each input tuple’s provenance token using set_prob:

SELECT set_prob(provenance(), 0.8) FROM mytable WHERE id = 1;

Or in bulk, from a column of the table itself:

SELECT set_prob(provenance(), reliability) FROM sightings;

Probabilities must be in the range [0, 1].

To read back a stored probability with get_prob:

SELECT get_prob(provenance()) FROM mytable;

Computing Query Probabilities

Use probability_evaluate to evaluate the probability that a query result holds, given the assigned input probabilities:

SELECT person,
       probability_evaluate(provenance()) AS prob
FROM suspects;

The function accepts an optional second argument specifying the computation method, and an optional third argument for method-specific parameters.

ProvSQL Studio’s evaluation strip exposes probability_evaluate interactively, with method and arguments selectors.

Computation Methods

'independent'

Exact computation assuming all input tokens are mutually independent. Fails with an error if the circuit is not independent:

SELECT probability_evaluate(provenance(), 'independent') FROM suspects;

'possible-worlds'

Exact computation by exhaustive enumeration of all possible worlds. Exponential in the number of provenance tokens; practical only for small circuits:

SELECT probability_evaluate(provenance(), 'possible-worlds') FROM suspects;

'monte-carlo'

Approximate computation by random sampling. The third argument sets the number of samples (default: 1000):

SELECT probability_evaluate(provenance(), 'monte-carlo', '10000')
FROM suspects;

'tree-decomposition'

Exact computation via a tree decomposition of the Boolean circuit [Amarilli et al., 2020]. Built-in; no external tool required. Fails if the treewidth exceeds the maximum supported value:

SELECT probability_evaluate(provenance(), 'tree-decomposition')
FROM suspects;

'compilation'

Exact computation by first compiling the circuit to a d-DNNF using an external tool, then evaluating the d-DNNF. The third argument names the tool: 'd4' (by default), 'c2d', 'dsharp', or 'minic2d':

SELECT probability_evaluate(provenance(), 'compilation', 'd4')
FROM suspects;

The tool must be installed and accessible in the PostgreSQL server’s PATH, or in a directory listed in the provsql.tool_search_path GUC (see Configuration Reference).

'weightmc'

Approximate weighted model counting using the external weightmc tool:

SELECT probability_evaluate(provenance(), 'weightmc')
FROM suspects;

Same PATH / provsql.tool_search_path considerations as 'compilation'.

Default strategy (no second argument)

ProvSQL tries each method in order until one succeeds:

Independent evaluation – used if the circuit is independent.
Tree decomposition – used if the treewidth is within the supported limit.
Compilation with d4 – used as a final fallback; requires d4 to be installed.

Expected Values of Aggregates

For aggregate queries over a probabilistic table, the expected function computes the expected value of the aggregate result. It supports COUNT, SUM, MIN, and MAX:

SELECT dept,
       expected(COUNT(*)) AS expected_count,
       expected(SUM(salary)) AS expected_salary
FROM employees
GROUP BY dept;

An optional second argument specifies a provenance condition for computing a conditional expectation E[aggregate | condition]. For instance, to compute the expected count within each group conditioned on the group existing (i.e., its provenance being true):

SELECT dept,
       expected(COUNT(*), provenance()) AS conditional_count
FROM employees
GROUP BY dept;

Without the second argument, the expectation is unconditional. With it, the result is normalized by the probability of the condition.

HAVING with Probabilities

HAVING clauses are partially supported in the probabilistic setting. The following aggregate functions in HAVING are handled: COUNT, SUM, AVG, MIN, MAX:

SELECT dept, probability_evaluate(provenance())
FROM employees
GROUP BY dept
HAVING COUNT(*) > 2;

Independent Tuples and Block-Independent Databases

ProvSQL assumes all input provenance tokens are independent. Since by default, provenance tokens are assigned fresh to each tuple on base tables, correlation between tuples is not modelled. If you need correlated probabilities, model them explicitly by coding the correlations with queries, the resulting tables will have correlated tuples.

A common case of correlated data is a block-independent database, where tuples are grouped into mutually-exclusive blocks (exactly one tuple per block is assumed to be true). repair_key restructures the provenance circuit to enforce this mutual exclusivity: it takes a table and a key attribute, and rewrites each group of tuples sharing the same key value into independent, mutually-exclusive alternatives.

CREATE TABLE weather(context VARCHAR, weather VARCHAR, ground VARCHAR,
                     p FLOAT);
INSERT INTO weather VALUES
  ('day1', 'rain',    'wet', 0.35),
  ('day1', 'rain',    'dry', 0.05),
  ('day1', 'no rain', 'wet', 0.10),
  ('day1', 'no rain', 'dry', 0.50);

-- Make tuples with the same context mutually exclusive
SELECT repair_key('weather', 'context');

-- Assign probabilities and evaluate
SELECT set_prob(provenance(), p) FROM weather;

SELECT ground,
       ROUND(probability_evaluate(provenance())::numeric, 3) AS prob
FROM (SELECT ground FROM weather GROUP BY ground) t;

Boolean-Provenance Optimisations

Probability evaluation routes through the Boolean-circuit pipeline (getBooleanCircuit then one of the methods above). Two optimisations exploit Boolean-specific structure to make this faster, sometimes by orders of magnitude.

Safe-query rewriting (`provsql.boolean_provenance`)

When the GUC provsql.boolean_provenance is on (off by default), the planner recognises the safe hierarchical conjunctive-query subclass of Dalvi-Suciu [Dalvi and Suciu, 2012] and rewrites such queries with per-atom DISTINCT projections so that the resulting provenance circuit is read-once. A read-once circuit can be probability-evaluated in linear time by the 'independent' method, instead of falling through to 'tree-decomposition' or external compilation.

The rewriter recognises self-join-free hierarchical conjunctive queries over TID or BID base tables, plus a number of extensions that recover safety for query shapes the raw hierarchical criterion would reject (FD-aware reductions driven by primary keys / NOT-NULL UNIQUE constraints, constant selections, transparent deterministic relations, certain self-joins, UCQs with disjoint branches, …); see Safe-Query Rewriter in the developer documentation for the full set. Queries outside the recognised class are passed through unchanged: the GUC enables an opt-in shortcut, never a different result.

SET provsql.boolean_provenance = on;

SELECT person, ROUND(probability_evaluate(provenance())::numeric, 4)
FROM suspects, witnesses
WHERE suspects.case_id = witnesses.case_id;

Trade-off. The rewriter tags the root gate so that semiring evaluators incompatible with Boolean rewriting refuse to run on the result (see Semiring Evaluation for the compatibility list). In practice this means: turn the GUC on for probability-heavy workloads on hierarchical CQs, turn it off (or re-evaluate in a fresh session) before running sr_counting, sr_how, sr_why on the same circuit.

HAVING-COUNT closed-form shortcut

For queries of the form GROUP BY g HAVING COUNT(*) op c (where op is one of >=, >, <=, <, =, <>) ProvSQL recognises the HAVING comparator as a Poisson-binomial CDF over the per-row Bernoulli indicators and computes its probability directly, in O(N × min(C, N-C)) per group, where N is the per-group row count. This replaces the binomial-coefficient-sized DNF that the general HAVING evaluator would otherwise construct, so HAVING-COUNT probability queries that previously hit 'tree-decomposition' or 'compilation' now resolve in milliseconds.

The shortcut fires automatically when its soundness preconditions are met: each per-row provenance must be a single gate_input leaf and the group-level aggregate must not be shared with any other comparator. It is transparent to the user; no GUC needs to be set. Queries outside the supported shape (HAVING-SUM, HAVING-MIN/MAX, multi-cmp HAVING such as COUNT(*) >= a AND COUNT(*) <= b, or non-trivial per-row provenance from joins) fall through to the general HAVING path.

Continuous Random Variables

The discrete-Bernoulli setting above can be combined with a continuous tier: columns of type random_variable carry distributions (Normal, Uniform, Exponential, Erlang, Categorical, Mixture) rather than scalars, and WHERE predicates on these columns are rewritten into conditioning events on the row’s provenance. Evaluation routes through Monte Carlo by default, with a hybrid evaluator falling back to analytical closed forms where applicable (RangeCheck for support-decidable comparators, exact CDFs for single-distribution $\mathrm{gate\_cmp}$ , family-closure simplification for linear combinations of normals, …). See Continuous Distributions for the full surface.