Conditioning

ProvSQL can compute conditional probabilities, distributions, and expectations: the answer to a query given that some other event is known to hold. The single entry point is the conditioning operator |, read as “given”: A | B means “A given B”. It works identically over the three carriers ProvSQL tracks – discrete events, continuous random_variable values, and probabilistic agg_token aggregates – and it is exact and correlation-aware: the conditional is computed from the shared provenance circuit, so events that overlap are accounted for without any independence assumption or hand-written inclusion–exclusion.

This chapter documents the operator and its variants. For the underlying probability machinery see Probabilities, for the continuous surface see Continuous Distributions, and for a worked, end-to-end tour see Case Study: ProvSQL as a Probability Calculator.

What conditioning means

For two Boolean events A and B, A | B denotes the conditional probability in the textbook sense, Bayes’ rule:

$\Pr(A \mid B) = \frac{\Pr(A \wedge B)}{\Pr(B)}.$

ProvSQL realises this by building a terminal conditioned gate over the two provenance tokens, whose probability evaluators read as exactly that ratio. Because gates are addressed by content, a base tuple shared between A and B is literally the same input gate in both circuits, so the joint $\Pr(A \wedge B)$ is computed over the real overlap – the conditional is correct even when A and B are correlated.

Two conventions follow from the definition:

Conditioning on a certain or absent event is a no-op. A | B returns A unchanged when B is the certain event or is NULL ( $\Pr(A \mid \text{true}) = \Pr(A)$ ).
Nested conditioning folds as a sequential Bayesian update: (A | B) | C is the same as A | (B ∧ C). The conditioned gate never nests; it stays one level deep with the evidence accumulated.

The operator family

| comes in a binary and a unary (prefix) form, each accepting either a provenance token or a Boolean predicate on the right.

Binary `value | evidence` – conditioning a value

The binary operator conditions one value – a discrete event token, a random_variable, or an agg_token – on an evidence event, and returns a new conditioned value of the same carrier. The right operand is either a provenance token,

-- P(disease | positive) over a shared screening model
SELECT probability_evaluate(disease | positive)
FROM (SELECT (SELECT provenance() FROM screening WHERE disease  GROUP BY ()) AS disease,
             (SELECT provenance() FROM screening WHERE positive GROUP BY ()) AS positive) e;

or a Boolean predicate built from random_variable / agg_token comparisons, which the planner lifts into an evidence gate for you:

-- a biomarker conditioned on exceeding a referral threshold
WITH r AS (SELECT normal(20, 5) AS x)
SELECT expected(x | (x > 25)) FROM r;

The evidence may also compare the value against another random variable. Conditioning on such an RV-vs-RV comparison is evaluated by a one-dimensional quadrature over the independent operands – exact for uniforms:

-- E[X | X > Y] = 2/3 for two i.i.d. U(0,1); exact with MC disabled
SET provsql.rv_mc_samples = 0;
WITH r AS (SELECT uniform(0,1) AS x, uniform(0,1) AS y)
SELECT expected(x | (x > y)) FROM r;

The dual question – the probability a comparison holds rather than a conditioned moment – is probability over the predicate: probability(x > y) for two i.i.d. uniforms is exactly 0.5. See Continuous Distributions for the comparison-event surface.

The evidence side may itself be a comparison rather than a pre-built token, so a conditional probability over two comparison events is written directly: probability((A) | (B)) is the correlation-aware $\Pr(A \wedge B) / \Pr(B)$ . For example, with x ~ Normal(1500, 400) and the nested events {x >= 2000} ⊂ {x >= 1000}:

WITH r AS (SELECT normal(1500, 400) AS x)
SELECT probability((x >= 2000) | (x >= 1000)) FROM r;  -- 0.1181

Both comparisons share the x leaf, so the joint keeps the dependence (it is not Pr(A)·Pr(B)); comparisons against constants on one distribution are resolved analytically and stay exact even at provsql.rv_mc_samples = 0.

The same holds when a shared random variable is compared against other random variables rather than constants – e.g. conditioning on which of several variables is largest. Both the probability and the moment sides are exact (no Monte Carlo):

WITH r AS (SELECT uniform(0,1) x, uniform(0,1) y, uniform(0,1) z)
SELECT probability((x > y) | (x > z))          AS pr,      -- 0.6667
       expected(x | (x > y AND x > z))         AS cond_e,  -- 0.75
       variance(x | (x > y AND x > z))         AS cond_v   -- 0.0375
FROM r;

The event x > y AND x > z says x is the largest of the three, so x conditioned on it is beta(3,1) (mean 3/4, variance 3/80). ProvSQL evaluates these by marginalising each independent partner variable analytically, leaving a one-dimensional integral over the shared variable – exact for the uniform case and high-accuracy otherwise.

The result of value | evidence is terminal: a conditioned value may only be conditioned further, never combined into a larger plus / times / monus / aggregate gate.

On the token carrier (plain uuid values), the function spelling of the binary operator is cond (target, evidence), interchangeable with target | evidence; the random_variable and agg_token carriers back their | with random_variable_cond and agg_token_cond respectively.

Unary `| evidence` – conditioning a whole tuple

Written as a term in the SELECT list (with no left operand), | evidence is a whole-tuple directive: it conditions the output provenance of every row of the query on the evidence, then is stripped from the visible projection. The function spelling is given (evidence), the same as the prefix | evidence.

-- condition every output row on "obesity is present", in one directive
SELECT factor, probability_evaluate(provenance()) AS p_given_obesity
FROM (
  SELECT factor, | (SELECT provenance() FROM risk WHERE factor = 'obesity')
  FROM risk
) s
ORDER BY factor;

Here the | (...) term contributes no column; instead each surviving row’s provenance becomes provenance() | <evidence>. Over the correlation model of case study 8 this returns 1 for obesity itself (an event given itself is certain) and each independent factor’s prior unchanged (insulin_resistance 0.6, high_salt 0.7).

The evidence is evaluated per output row and may reference the row’s own columns, so each tuple can be conditioned on its own evidence – the typical use being a correlated sub-select:

-- each patient's row conditioned on that same patient testing positive
SELECT p.id, p.name,
       | (SELECT provenance() FROM tests t
          WHERE t.patient_id = p.id AND t.result = 'positive')
FROM patients p;

The unary predicate form | (predicate) works the same way, with the evidence written as a Boolean combination of random_variable / agg_token comparisons.

Note

The unary given / | directive conditions the whole-tuple existence event (the row’s provenance token), not the value of any random_variable or agg_token column you happen to select. To condition a value – to truncate a distribution, say – use the binary value | evidence form. The unary directive is only accepted in a plain per-row SELECT; it is rejected on an aggregated, grouped, DISTINCT, or set-operation (UNION / EXCEPT …) query, where the individual tokens should be conditioned with the binary | instead.

Negating an event – `! event`

The prefix ! operator (function spelling provenance_not (event)) is the complement of a Boolean provenance event: !x holds in exactly the worlds where x does not, so probability_evaluate(!x) is $1 - \Pr(x)$ . Unlike the conditioned gate, ! is an ordinary m-semiring expression – Boolean negation, $\mathbb{1} \ominus x$ underneath – so it composes freely under times / plus; the one thing it refuses is a conditioned (terminal) token.

Its natural use with conditioning is a denial constraint: restricting a query to the worlds where some forbidden pattern does not occur. The violation event W is just an ordinary query – no hand-built gates – aggregated to a single token with provenance() ... GROUP BY (), and the query is conditioned on its negation, Q | !W:

-- P(booking 1 is present | no two overlapping bookings of the same room)
WITH w AS (
  SELECT provenance() AS violation        -- W = "some overlapping pair exists"
  FROM bookings a JOIN bookings b
    ON a.id < b.id AND a.room = b.room
       AND a.lo < b.hi AND b.lo < a.hi
  GROUP BY ())
SELECT probability_evaluate(
         (SELECT provenance() FROM bookings WHERE id = 1) | !w.violation)
FROM w;

The constraint can be any query: a forbidden pattern expressed as a query becomes a denial constraint by conditioning on the negation of “the pattern occurs”, so !W is the event “no violation” and Q | !W restricts Q to exactly the worlds the constraint admits. This is the MarkoViews construction [Jha and Suciu, 2012] – conditioning a probabilistic database on the event that no constraint is violated. ! is also useful on its own, wherever the complement of an event is wanted (probability_evaluate(!a), a AND NOT b as times(a, !b)).

The three carriers

Discrete events

A Boolean provenance token stands for the event “this row exists”. Conditioning one token on another yields a conditional probability:

-- base-rate fallacy: a positive screening test, P(disease | positive)
SELECT probability_evaluate(disease | positive) AS disease_given_pos
FROM (SELECT (SELECT provenance() FROM screening WHERE disease  GROUP BY ()) AS disease,
             (SELECT provenance() FROM screening WHERE positive GROUP BY ()) AS positive) e;
--  ≈ 0.1538

Random variables

Conditioning a random_variable produces another random_variable – the conditional distribution – whose mean, variance, and range you read with expected, variance, and support:

WITH r AS (SELECT normal(20, 5) AS x)
SELECT expected(x | (x > 25))     AS cond_mean,      -- ≈ 27.6
       variance(x | (x > 25))     AS cond_variance,  -- ≈ 5  (was 25)
       (support(x | (x > 25))).lo AS lowest_value    -- 25  (truncated floor)
FROM r;

Conditioning on a threshold predicate truncates the distribution and renormalises it; the result is a value in its own right that you can select, store, or hand onward. See Continuous Distributions for the closed-form truncation table (Normal, Uniform, Exponential) and the Monte-Carlo fallback for other shapes.

Probabilistic aggregates

When the rows being aggregated are themselves uncertain, the total is an agg_token. Conditioning it on an observation gives a conditional expectation:

-- expected regional total, given that the high-count day really happened
SELECT expected(total | (SELECT provenance() FROM cases WHERE n = 4))
FROM casesum WHERE region = 'North';

The aggregate-specific spelling expected (aggregate, condition) (see Probabilities) is the same operation; the | operator is the uniform way to write it across all three carriers.

In ProvSQL Studio

Studio’s evaluation strip exposes conditioning interactively: the Condition on input takes an evidence provenance UUID, auto-presetting to a clicked row’s own provenance, with an adjacent Conditioned by badge that lights up while the result is being conditioned on it – click the badge to toggle the conditioning off (an unconditional result) and back on. Distribution profiles, moments, and probabilities all honour it, so the truncated histogram of a conditioned random_variable and the conditional mean of an agg_token are visible in the canvas. See ProvSQL Studio for the panel and case study 6 for it in use.