Probability distributions over Boolean variables #
This file defines the intensional probability semantics underlying ProvSQL's probabilistic query evaluation, following Section IV-D of Sen, Maniu & Senellart, ProvSQL: A General System for Keeping Track of the Provenance and Probability of Data.
Given a finite set X of Boolean variables and an assignment Pr : X → ℚ
of probabilities (with values in [0, 1]), we extend Pr to:
- a probability distribution over valuations
v : X → Bool, assuming the variables are independent:Pr(v) = ∏_{v(x)=⊤} Pr(x) · ∏_{v(x)=⊥} (1 - Pr(x)); - a probability of a Boolean function
f : BoolFunc X, defined as the sum ofPr(v)over satisfying valuations:Pr(f) = ∑_{v ⊨ f} Pr(v).
This is the foundation needed to state and prove Theorem 12 of the paper
(intensional probabilistic query evaluation correctness): for any
non-aggregation query q, any BoolFunc X-instance Î and any tuple t,
Pr(t ∈ q(Î)) = Pr(⋁_{(t,α) ∈ ⟪q⟫^Î} α). The theorem itself is stated below
and the algebraic-bridge lemmas are developed; the full proof depends on a
“Bool-annotated semantics tracks plain semantics on the boolean-true support”
result that is left as future work.
Main definitions #
ProbAssignment X– a probability assignment to each variable, bundled with0 ≤ Pr(x) ≤ 1.ProbAssignment.valProb–Pr(v)for a single valuationv : X → Bool.ProbAssignment.funcProb–Pr(f)for a Boolean functionf : BoolFunc X.
Main results #
ProbAssignment.valProb_nonneg,valProb_le_one,sum_valProb_eq_one– basic properties of the valuation distribution.ProbAssignment.funcProb_zero,funcProb_one,funcProb_nonneg,funcProb_le_one– basic properties ofPr(f).ProbAssignment.funcProb_congr– pointwise-equal Boolean functions have equal probabilities.
References #
- Sen, Maniu & Senellart (Section IV-D)
A probability assignment to a finite set X of Boolean variables: each
variable is assigned a rational probability in [0, 1].
- prob : X → ℚ
The probability assigned to each variable.
Probabilities are non-negative.
Probabilities are at most
1.
Instances For
Probability of a single valuation v : X → Bool, under the independence
assumption: Pr(v) = ∏_{v(x)=⊤} Pr(x) · ∏_{v(x)=⊥} (1 - Pr(x)).
Instances For
The valuations form a probability distribution: ∑ v, Pr(v) = 1.
Pr(f) ≤ ∑_v Pr(v) = 1.
Pr(0) = 0: the constant-false function has probability zero.
Pr(1) = 1: the constant-true function has probability one.
Pointwise-equal Boolean functions have equal probabilities.
Reformulation: Pr(f) as a sum over the satisfying valuations.
For finite X, equality of Boolean functions (X → Bool) → Bool is
decidable in principle (since the function space is finite). We expose the
classical decidability instance so that Query.evaluateAnnotated, which
requires [DecidableEq K], can be invoked for K = BoolFunc X.
Equations
Relation T n is a def for Multiset (Tuple T n), so the standard
Membership instance does not propagate automatically. Re-expose it so that
t ∈ q.evaluate d typechecks. (We deliberately do not use this instance
inside the body of randomWorld, which returns a bare Multiset so that
Multiset.mem_map rewrites apply cleanly.)
Equations
Membership in AnnotatedRelation (a def-wrapped Multiset).
Equations
Random worlds and the disjunctive tuple annotation #
We now move toward Theorem 12 of the paper. Two pieces of infrastructure are
needed: the random world of a BoolFunc X-annotated relation under a
valuation v : X → Bool (the plain relation containing exactly the data
parts of the annotated tuples whose annotation evaluates to true at v),
and the disjunctive tuple annotation ⋁_{(t,α) ∈ r} α (a single Boolean
function summarising all the ways t can appear in r).
The disjunctive tuple annotation tupleAnnotation r t = ⋁_{(t,α) ∈ r} α:
the OR over the annotations of all annotated tuples in r whose data part
equals t. In the m-semiring BoolFunc X, multiset sum is pointwise OR, so
this is exactly the disjunction the paper writes inside Pr(·) on the
right-hand side of Theorem 12.
Equations
- tupleAnnotation r t = (Multiset.map Prod.snd (Multiset.filter (fun (p : AnnotatedTuple T (BoolFunc X) n) => p.1 = t) r)).sum
Instances For
The random world of a BoolFunc X-annotated relation under a valuation
v: the plain relation consisting of the data parts of the annotated tuples
whose annotation evaluates to true at v. The return type is a bare
Multiset (Tuple T n) (rather than Relation T n) so that Multiset lemmas
apply without an extra unfolding step.
Equations
- randomWorld v r = Multiset.map Prod.fst (Multiset.filter (fun (p : AnnotatedTuple T (BoolFunc X) n) => p.2 v = true) r)
Instances For
Pointwise random world of a BoolFunc X-annotated database: each
annotated relation is replaced by its plain random-world projection.
Equations
Instances For
Pointwise meaning of tupleAnnotation #
(tupleAnnotation r t)(v) = true iff some annotated tuple (t, α) ∈ r has
α(v) = true. This is the connection between the disjunction-on-the-right
of Theorem 12 and the random-world picture on the left.
Pointwise reading of tupleAnnotation. (tupleAnnotation r t)(v) = true
exactly when the random world at v of r contains t.
Marginal probability and the statement of Theorem 12 #
The marginal probability Pr(t ∈ q(Î)) is defined as the sum over valuations
v of Pr(v) indexed by whether t appears in q.evaluate (Î.randomWorld v).
This is the standard “intensional” definition: enumerate possible worlds,
weight each by its probability, and accumulate the indicator that the query
output contains t.
The paper writes the same thing as ∑_J [t ∈ ⟦q⟧(J)] · Pr(J) over
sub-instances J ⊆ Î. The two sums agree because, for each valuation v,
the unique J whose characteristic Boolean function Φ_J(Î) is satisfied at
v is exactly J(v) = { (u, α) ∈ Î | α(v) = true }, whose data side is
Î.randomWorld v.
Marginal probability that the tuple t appears in the output of q
when evaluated on a random world of Î.
Equations
- P.marginalProb q Î t = ∑ v : X → Bool, if t ∈ q.evaluate (AnnotatedDatabase.randomWorld v Î) then P.valProb v else 0
Instances For
Random worlds commute with annotated query evaluation #
The structural heart of Theorem 12 is the following commutation: for any
non-aggregation query q, taking the random world of the annotated query
result gives the same multiset as evaluating q on the plain random-world
database.
randomWorld v (evaluateAnnotated q Î) = q.evaluate (Î.randomWorld v)
Once this holds, Theorem 12 follows by summing Pr(v) weighted by the
matching indicators over v, using tupleAnnotation_apply_eq_true_iff on
the right-hand side and the definition of marginalProb on the left.
Helper lemmas: random-world commutes with Multiset operations #
randomWorld is additive on relations: filtering and projecting the
data side commutes with multiset sum.
Filtering the data side commutes with randomWorld v.
Mapping the data side commutes with randomWorld v. Proved by
Multiset.induction_on, with all Multiset.filter / Multiset.map lemmas
called with named (p := ...) / explicit-type arguments so Lean's HOU does
not pick a wrong decomposition and so the underlying Lex-unfolded carrier
type matches between goal and rewrite.
Random world commutes with find #
Diff annotation helper #
For the Diff case of the structural commutation theorem we need to
characterise when the annotation subtracted from r₁'s entries evaluates to
false at the valuation v: this happens exactly when the data tuple is
not in the random world of r₂.
The Diff subtraction-annotation evaluates to false at v iff the
data tuple is absent from the random world of the subtracted relation.
The structural commutation theorem #
Random-world projection commutes with annotated query evaluation: for any
non-aggregation query q, taking the random world v of the annotated
result is the same as evaluating q on the random-world database. The proof
is a structural induction, with the Prod, Dedup, and Diff cases still
to be discharged.
Theorem 12 (Sen, Maniu & Senellart, Section IV-D).
For any non-aggregation query q, any BoolFunc X-annotated database Î
and any tuple t, the marginal probability that t appears in the random
output of q equals the probability of the disjunctive tuple annotation
of t in the annotated query result ⟪q⟫^Î.
This is the formal justification for ProvSQL's intensional approach to
probabilistic query evaluation: instead of enumerating exponentially-many
possible worlds, evaluate the query once over BoolFunc X-annotations and
take the probability of the resulting Boolean function.
The proof reduces to (a) tupleAnnotation_apply_eq_true_iff, the pointwise
reading of the disjunctive annotation, and (b) randomWorld_evaluateAnnotated,
the commutation of plain query evaluation with random-world projection.
Corollary 13: probability via the plain rewritten query #
Theorem 12 expresses the marginal probability Pr(t ∈ q(Î)) as the
probability of the disjunctive tuple annotation of t in the annotated query
result ⟪q⟫^Î. Combining it with the rewriting-correctness theorem
Query.rewriting_valid (Theorem 10 of Sen, Maniu & Senellart,
rules R1–R5) gives the same identity using the plain rewritten query
q̂ = q.rewriting hq evaluated on the composite-encoded database
Î.toComposite. This is the form ProvSQL actually runs against PostgreSQL.
The corollary statement requires [HasAltLinearOrder (BoolFunc X)] purely so
that Î.toComposite : Database (T ⊕ BoolFunc X) typechecks (via the
ValueType (T ⊕ K) instance in Provenance.Util.ValueType); any
noncomputable linear order on BoolFunc X will do.
Corollary 13 (Sen, Maniu & Senellart, Section IV-D).
For any non-aggregation query q, any BoolFunc X-annotated database Î
and any tuple t, the marginal probability that t appears in the random
output of q equals the probability of the disjunctive annotation of t
in the result of evaluating the plain rewritten query q̂ on the
composite-encoded database.
Combines theorem_12 and Query.rewriting_valid.