Provenance in databases #
This Lean 4 library provides formal definitions and proofs relevant for provenance in databases, following the semiring framework of Green, Karvounarakis & Tannen and Green & Tannen.
One of the goals of this library is to provide a formal, machine-checked semantics for the provenance-aware relational database system ProvSQL described in Sen, Maniu & Senellart.
Contents #
Core theory
Provenance.SemiringWithMonus– definition of a semiring with monus (m-semiring), the algebraic structure underlying annotated database semantics, together with general theorems about itProvenance.Database– tuples, relations, and plain databasesProvenance.Query– relational algebra (select, project, join, union, difference…)Provenance.AnnotatedDatabase– databases annotated with values in an m-semiringKProvenance.QueryAnnotatedDatabase– semantics of relational algebra over annotated databases via m-semiring operationsProvenance.QueryAnnotatedDatabaseHom– evaluation commutes with m-semiring homomorphisms (Green, Karvounarakis & Tannen, Proposition 3.5; Geerts & Poggi, Proposition 1)Provenance.QueryRewriting– alternative query evaluation by rewriting plain queries onT ⊕ K; implements rules (R1)–(R5) of Sen, Maniu & Senellart; correctness proof partially formalisedProvenance.KSemiModule– theKSemiModule K Mtypeclass (left action of aCommSemiringWithMonus Kon a commutative monoidM) and the freeK-tensor data structureKTensor K M, used to interpret the aggregation operator onK-annotated relations (Amsterdamer, Deutch & Tannen)Provenance.QueryAggregation– annotated semantics of the aggregation operator:Query.evaluateAggSumproduces one row per group with per-aggregated -column(value, K-tensor)annotations, matching Definition 7 of Sen, Maniu & Senellart. First-cut scope: top-level aggregation only,AggFunc.sumonly.Provenance.QueryAggregationHom– hom commutation forevaluateAggSum: pushing the database forward along aSemiringWithMonusHom h : K → K'is the same as pushing the K-tensor coefficients of the aggregated result forward alongh.toRingHom.Provenance.QueryEvaluateInVKHom– unified hom commutation for the Definition 7 annotated semantics:Query.evaluateAnnotatedFull_homsubsumes bothQuery.evaluateAnnotated_hom(non-Agg) andQuery.evaluateAggSum_hom(Agg) into a single theorem on the lifted output, exploiting the newSemiringWithMonusHom.map_deltafield for the δ-applied row-annotation column.Provenance.LiftedTK– the value typeLiftedTK T KextendingT ⊕ Kwith K-tensor monomials, used as the V_K interpretation domain for the rewriting of aggregate queries.Provenance.QueryEvaluateInVK–Query.evaluateInVK, the V_K-aware evaluator that interprets a rewritten query (in particular the (R5) aggregation rewriting) inLiftedTK-valued tuples; together withQuery.evaluateAnnotatedFull(the unified Definition 7 annotated semantics) andQuery.rewritingFull(the unified rewriting of (R1)–(R5) inProvenance.QueryRewriting), this realises the single rewriting correctness theoremQuery.rewriting_valid_full. R1–R3 are proven by reduction toQuery.rewriting_valid(the legacy non-Agg theorem, retained as a helper); R5 is proven directly using theTerm.castToAnnotatedTuple_evalInVK,Term.evalInVK_index_last,LiftedTK.fold_add_ann,LiftedTK.fold_add_ktensor_nonempty, andKTensor.sum_map_embedhelpers; R4 carries over the parkedsorrys fromQuery.rewriting_valid.Provenance.Having– algebraic identities behindHAVING (count)aggregate provenance: include/exclude recurrences for the JOIN and possible-world expressions, and the upward-expansion boundProvenance.Probability– intensional probabilistic query evaluation: probability distribution over Boolean valuations, probability of aBoolFunc X, and the statement of Theorem 12 of Sen, Maniu & Senellart reducingPr(t ∈ q(Î))toPr(⋁_{(t,α) ∈ ⟪q⟫^Î} α); the proof is reduced to a single structural commutation lemmarandomWorld_evaluateAnnotatedProvenance.Circuit– Boolean circuits with structural predicates and two recursive bottom-up probability evaluators: the read-once evaluator with the inclusion-exclusion correction at OR gates (Circuit.prob), and the d-D evaluator with direct summation at OR gates under decomposability + determinism (Circuit.probDD). Both evaluators are proved correct against the sum-over-valuations semantics (Sen, Maniu & Senellart, Section V-D step 1).Provenance.Tseitin– the Tseitin CNF transformation encoding a circuit as an equisatisfiable CNF overX ⊕ Circuit X. Provides syntacticLiteral/Clause/CNFtypes, the Tseitin encoder, and the bidirectional equisatisfiability theoremCircuit.tseitin_equisat(Sen, Maniu & Senellart, Section V-D step 3, before the knowledge compiler is invoked).
Algorithms
Provenance.Algorithms.CompOp– shared comparison-operator type used by the HAVING enumeration algorithmsProvenance.Algorithms.CountEnum– enumeration of valid possible worlds forHAVING count op Cpredicates: definitions ofcombinations,addExact, andcountEnum, together with the correctness theoremcountEnum_correct(Sen, Maniu & Senellart, Algorithm 2)Provenance.Algorithms.SumDP– subset-sum enumeration of valid possible worlds forHAVING sum(t) op Cpredicates: definition ofsumExactandsumDP, together with the correctness theoremsumDP_correct(Sen, Maniu & Senellart, Algorithm 1)
Concrete m-semirings (Provenance.Semirings.*)
Provenance.Semirings.Bool– the Boolean m-semiring𝔹Provenance.Semirings.BoolFunc– the Boolean-function m-semiring𝔹[X]Provenance.Semirings.Why– the Why[X] m-semiring (sets of witness sets)Provenance.Semirings.Which– the Which[X] m-semiring (lineage / Lin[X])Provenance.Semirings.How– the ℕ[X] m-semiring of multivariate polynomials; the universal provenance semiringProvenance.Semirings.Nat– the counting m-semiringℕProvenance.Semirings.Tropical– the tropical m-semiring (min-plus) overℕ ∪ {∞},ℚ ∪ {∞}, orℝ ∪ {∞}; theℚinstance is also used as a counterexample showing that the absorptive hypothesis ofHaving.F_eq_Sis genuinely required (idempotent +⊗-over-⊖distributive is not enough)Provenance.Semirings.Viterbi– the Viterbi m-semiring (max-times) over[0,1]Provenance.Semirings.MinMax– the min-max semiring over any bounded linear order (security / access control semiring and dual fuzzy semiring)Provenance.Semirings.Lukasiewicz– the Łukasiewicz (fuzzy logic) m-semiring overℚ ∩ [0,1]Provenance.Semirings.Interval,Provenance.Semirings.IntervalUnion– intervals and finite unions of intervals over a dense linear order, used for temporal databases
See Provenance.Example for an example annotated database computation.