Documentation

Provenance.Probability

Probability distributions over Boolean variables #

This file defines the intensional probability semantics underlying ProvSQL's probabilistic query evaluation, following Section IV-D of Sen, Maniu & Senellart, ProvSQL: A General System for Keeping Track of the Provenance and Probability of Data.

Given a finite set X of Boolean variables and an assignment Pr : X → ℚ of probabilities (with values in [0, 1]), we extend Pr to:

This is the foundation needed to state and prove Theorem 12 of the paper (intensional probabilistic query evaluation correctness): for any non-aggregation query q, any BoolFunc X-instance Î and any tuple t, Pr(t ∈ q(Î)) = Pr(⋁_{(t,α) ∈ ⟪q⟫^Î} α). The theorem itself is stated below and the algebraic-bridge lemmas are developed; the full proof depends on a “Bool-annotated semantics tracks plain semantics on the boolean-true support” result that is left as future work.

Main definitions #

Main results #

References #

structure ProbAssignment (X : Type) :

A probability assignment to a finite set X of Boolean variables: each variable is assigned a rational probability in [0, 1].

  • prob : X

    The probability assigned to each variable.

  • prob_nonneg (x : X) : 0 self.prob x

    Probabilities are non-negative.

  • prob_le_one (x : X) : self.prob x 1

    Probabilities are at most 1.

Instances For
    def ProbAssignment.valProb {X : Type} [Fintype X] (P : ProbAssignment X) (v : XBool) :

    Probability of a single valuation v : X → Bool, under the independence assumption: Pr(v) = ∏_{v(x)=⊤} Pr(x) · ∏_{v(x)=⊥} (1 - Pr(x)).

    Equations
    Instances For
      theorem ProbAssignment.valProb_nonneg {X : Type} [Fintype X] (P : ProbAssignment X) (v : XBool) :
      0 P.valProb v
      theorem ProbAssignment.valProb_le_one {X : Type} [Fintype X] (P : ProbAssignment X) (v : XBool) :
      P.valProb v 1
      theorem ProbAssignment.sum_valProb_eq_one {X : Type} [Fintype X] [DecidableEq X] (P : ProbAssignment X) :
      v : XBool, P.valProb v = 1

      The valuations form a probability distribution: ∑ v, Pr(v) = 1.

      Probability of a Boolean function: Pr(f) = ∑_{v ⊨ f} Pr(v).

      Equations
      Instances For

        Pr(f) ≤ ∑_v Pr(v) = 1.

        Pr(0) = 0: the constant-false function has probability zero.

        Pr(1) = 1: the constant-true function has probability one.

        theorem ProbAssignment.funcProb_congr {X : Type} [Fintype X] [DecidableEq X] (P : ProbAssignment X) {f g : BoolFunc X} (h : ∀ (v : XBool), f v = g v) :

        Pointwise-equal Boolean functions have equal probabilities.

        theorem ProbAssignment.funcProb_eq_filter_sum {X : Type} [Fintype X] [DecidableEq X] (P : ProbAssignment X) (f : BoolFunc X) :
        P.funcProb f = v : XBool with f v = true, P.valProb v

        Reformulation: Pr(f) as a sum over the satisfying valuations.

        @[implicit_reducible]
        noncomputable instance instDecidableEqBoolFunc {X : Type} :

        For finite X, equality of Boolean functions (X → Bool) → Bool is decidable in principle (since the function space is finite). We expose the classical decidability instance so that Query.evaluateAnnotated, which requires [DecidableEq K], can be invoked for K = BoolFunc X.

        Equations
        @[implicit_reducible]
        instance instMembershipRelation {T : Type} {n : } :

        Relation T n is a def for Multiset (Tuple T n), so the standard Membership instance does not propagate automatically. Re-expose it so that t ∈ q.evaluate d typechecks. (We deliberately do not use this instance inside the body of randomWorld, which returns a bare Multiset so that Multiset.mem_map rewrites apply cleanly.)

        Equations
        @[implicit_reducible]
        instance instDecidableMemRelation {T : Type} [ValueType T] {n : } (t : Tuple T n) (r : Relation T n) :

        Decidability of t ∈ q.evaluate d (with [ValueType T]).

        Equations

        Random worlds and the disjunctive tuple annotation #

        We now move toward Theorem 12 of the paper. Two pieces of infrastructure are needed: the random world of a BoolFunc X-annotated relation under a valuation v : X → Bool (the plain relation containing exactly the data parts of the annotated tuples whose annotation evaluates to true at v), and the disjunctive tuple annotation ⋁_{(t,α) ∈ r} α (a single Boolean function summarising all the ways t can appear in r).

        def tupleAnnotation {X T : Type} [ValueType T] {n : } (r : AnnotatedRelation T (BoolFunc X) n) (t : Tuple T n) :

        The disjunctive tuple annotation tupleAnnotation r t = ⋁_{(t,α) ∈ r} α: the OR over the annotations of all annotated tuples in r whose data part equals t. In the m-semiring BoolFunc X, multiset sum is pointwise OR, so this is exactly the disjunction the paper writes inside Pr(·) on the right-hand side of Theorem 12.

        Equations
        Instances For
          def randomWorld {X T : Type} {n : } (v : XBool) (r : AnnotatedRelation T (BoolFunc X) n) :

          The random world of a BoolFunc X-annotated relation under a valuation v: the plain relation consisting of the data parts of the annotated tuples whose annotation evaluates to true at v. The return type is a bare Multiset (Tuple T n) (rather than Relation T n) so that Multiset lemmas apply without an extra unfolding step.

          Equations
          Instances For

            Pointwise random world of a BoolFunc X-annotated database: each annotated relation is replaced by its plain random-world projection.

            Equations
            Instances For

              Pointwise meaning of tupleAnnotation #

              (tupleAnnotation r t)(v) = true iff some annotated tuple (t, α) ∈ r has α(v) = true. This is the connection between the disjunction-on-the-right of Theorem 12 and the random-world picture on the left.

              theorem tupleAnnotation_apply_eq_true_iff {X T : Type} [ValueType T] {n : } (r : AnnotatedRelation T (BoolFunc X) n) (t : Tuple T n) (v : XBool) :

              Pointwise reading of tupleAnnotation. (tupleAnnotation r t)(v) = true exactly when the random world at v of r contains t.

              Marginal probability and the statement of Theorem 12 #

              The marginal probability Pr(t ∈ q(Î)) is defined as the sum over valuations v of Pr(v) indexed by whether t appears in q.evaluate (Î.randomWorld v). This is the standard “intensional” definition: enumerate possible worlds, weight each by its probability, and accumulate the indicator that the query output contains t.

              The paper writes the same thing as ∑_J [t ∈ ⟦q⟧(J)] · Pr(J) over sub-instances J ⊆ Î. The two sums agree because, for each valuation v, the unique J whose characteristic Boolean function Φ_J(Î) is satisfied at v is exactly J(v) = { (u, α) ∈ Î | α(v) = true }, whose data side is Î.randomWorld v.

              noncomputable def ProbAssignment.marginalProb {X : Type} [Fintype X] [DecidableEq X] {T : Type} [ValueType T] (P : ProbAssignment X) {n : } (q : Query T n) (Î : AnnotatedDatabase T (BoolFunc X)) (t : Tuple T n) :

              Marginal probability that the tuple t appears in the output of q when evaluated on a random world of Î.

              Equations
              Instances For

                Random worlds commute with annotated query evaluation #

                The structural heart of Theorem 12 is the following commutation: for any non-aggregation query q, taking the random world of the annotated query result gives the same multiset as evaluating q on the plain random-world database.

                  randomWorld v (evaluateAnnotated q Î)  =  q.evaluate (Î.randomWorld v)
                

                Once this holds, Theorem 12 follows by summing Pr(v) weighted by the matching indicators over v, using tupleAnnotation_apply_eq_true_iff on the right-hand side and the definition of marginalProb on the left.

                Helper lemmas: random-world commutes with Multiset operations #

                @[simp]
                theorem randomWorld_zero {X T : Type} {n : } (v : XBool) :
                theorem randomWorld_add {X T : Type} {n : } (v : XBool) (r₁ r₂ : AnnotatedRelation T (BoolFunc X) n) :
                randomWorld v (r₁ + r₂) = randomWorld v r₁ + randomWorld v r₂

                randomWorld is additive on relations: filtering and projecting the data side commutes with multiset sum.

                theorem randomWorld_filter_data {X T : Type} {n : } (v : XBool) (φ : Tuple T nProp) [DecidablePred φ] (r : AnnotatedRelation T (BoolFunc X) n) :
                Multiset.filter φ (randomWorld v r) = randomWorld v (Multiset.filter (fun (p : Tuple T n × BoolFunc X) => φ p.1) r)

                Filtering the data side commutes with randomWorld v.

                theorem randomWorld_map_data {X T : Type} {n m : } (v : XBool) (f : Tuple T nTuple T m) (r : AnnotatedRelation T (BoolFunc X) n) :
                Multiset.map f (randomWorld v r) = randomWorld v (Multiset.map (fun (p : AnnotatedTuple T (BoolFunc X) n) => (f p.1, p.2)) r)

                Mapping the data side commutes with randomWorld v. Proved by Multiset.induction_on, with all Multiset.filter / Multiset.map lemmas called with named (p := ...) / explicit-type arguments so Lean's HOU does not pick a wrong decomposition and so the underlying Lex-unfolded carrier type matches between goal and rewrite.

                Random world commutes with find #

                Diff annotation helper #

                For the Diff case of the structural commutation theorem we need to characterise when the annotation subtracted from r₁'s entries evaluates to false at the valuation v: this happens exactly when the data tuple is not in the random world of r₂.

                theorem diff_annotation_eq_false_iff {X T : Type} [ValueType T] {n : } (v : XBool) (r₂ : AnnotatedRelation T (BoolFunc X) n) (u : Tuple T n) :
                (Option.map Prod.snd (List.find? (fun (q : Tuple T n × BoolFunc X) => decide (q.1 = u)) (groupByKey r₂))).getD 0 v = false urandomWorld v r₂

                The Diff subtraction-annotation evaluates to false at v iff the data tuple is absent from the random world of the subtracted relation.

                The structural commutation theorem #

                Random-world projection commutes with annotated query evaluation: for any non-aggregation query q, taking the random world v of the annotated result is the same as evaluating q on the random-world database. The proof is a structural induction, with the Prod, Dedup, and Diff cases still to be discharged.

                theorem randomWorld_evaluateAnnotated {X T : Type} [ValueType T] {n : } (q : Query T n) (hq : q.noAgg) (Î : AnnotatedDatabase T (BoolFunc X)) (v : XBool) :
                theorem ProbAssignment.theorem_12 {X : Type} [Fintype X] [DecidableEq X] {T : Type} [ValueType T] (P : ProbAssignment X) {n : } (q : Query T n) (hq : q.noAgg) (Î : AnnotatedDatabase T (BoolFunc X)) (t : Tuple T n) :

                Theorem 12 (Sen, Maniu & Senellart, Section IV-D). For any non-aggregation query q, any BoolFunc X-annotated database Î and any tuple t, the marginal probability that t appears in the random output of q equals the probability of the disjunctive tuple annotation of t in the annotated query result ⟪q⟫^Î.

                This is the formal justification for ProvSQL's intensional approach to probabilistic query evaluation: instead of enumerating exponentially-many possible worlds, evaluate the query once over BoolFunc X-annotations and take the probability of the resulting Boolean function.

                The proof reduces to (a) tupleAnnotation_apply_eq_true_iff, the pointwise reading of the disjunctive annotation, and (b) randomWorld_evaluateAnnotated, the commutation of plain query evaluation with random-world projection.

                Corollary 13: probability via the plain rewritten query #

                Theorem 12 expresses the marginal probability Pr(t ∈ q(Î)) as the probability of the disjunctive tuple annotation of t in the annotated query result ⟪q⟫^Î. Combining it with the rewriting-correctness theorem Query.rewriting_valid (Theorem 10 of Sen, Maniu & Senellart, rules R1–R5) gives the same identity using the plain rewritten query q̂ = q.rewriting hq evaluated on the composite-encoded database Î.toComposite. This is the form ProvSQL actually runs against PostgreSQL.

                The corollary statement requires [HasAltLinearOrder (BoolFunc X)] purely so that Î.toComposite : Database (T ⊕ BoolFunc X) typechecks (via the ValueType (T ⊕ K) instance in Provenance.Util.ValueType); any noncomputable linear order on BoolFunc X will do.

                Corollary 13 (Sen, Maniu & Senellart, Section IV-D). For any non-aggregation query q, any BoolFunc X-annotated database Î and any tuple t, the marginal probability that t appears in the random output of q equals the probability of the disjunctive annotation of t in the result of evaluating the plain rewritten query on the composite-encoded database.

                Combines theorem_12 and Query.rewriting_valid.