Query evaluation by rewriting #
This file provides an alternative approach to evaluating queries on annotated databases:
instead of directly interpreting operators over annotated tuples, a query on T is
rewritten into a query on T ⊕ K that operates on plain tuples whose values encode
both data and provenance.
The rewriting implemented here realises rules (R1)–(R5) from Sen, Maniu & Senellart, ProvSQL: A General System for Keeping Track of the Provenance and Probability of Data.
A correctness proof that Query.rewriting agrees with Query.evaluateAnnotated is
fully formalised for rules (R1)–(R4): each operator is machine-checked end-to-end.
The Diff case splits into an unmatched_eq half (proved via the semijoin
identity Multiset.semijoin_proj_eq_filter, after bridging the
LinearOrder.toDecidableEq vs instDecidableEqSum mismatch on the inner dedup
via Query.rewriting_valid_diff_inner_dd_inst) and a matched_eq half (proved
via the keyed-projection semijoin Multiset.semijoin_keyed_proj_eq_filter, after
substituting the inner aggregation with the closed-form
Query.evaluate_agg_rewriting_eq). The (R5) aggregation correctness lives in
Provenance.QueryEvaluateInVK as Query.rewriting_valid_full, with its V_K
interpretation; the syntactic (R5) rewriting itself is in this file as
Query.rewritingAgg.
References #
Equations
- One or more equations did not get rendered due to their size.
- (Query.Rel n s).rewriting hq_2 = Query.Rel (n + 1) s
- (Π ts q_2).rewriting hq_2 = (Π fun (k : Fin (n + 1)) => if h : ↑k < n then (ts ⟨↑k, h⟩).castToAnnotatedTuple else #(Fin.last q_2.arity)) (q_2.rewriting ⋯)
- (σ φ q_2).rewriting hq_2 = σ φ.castToAnnotatedTuple (q_2.rewriting ⋯)
- (q₁ ⊎ q₂).rewriting hq_2 = (q₁.rewriting ⋯ ⊎ q₂.rewriting ⋯)
- (ε q_2).rewriting hq_2 = Query.Agg (fun (k : Fin n) => Fin.castLE ⋯ k) ![#(Fin.last n)] ![AggFunc.sum] (q_2.rewriting ⋯)
- (Query.Agg a a_1 a_2 a_3).rewriting hq_2 = False.elim hq_2
Instances For
Tuple.cast-flavoured variant of rewriting_append_left. Both Tuple.cast's and ▸'s
Eq.rec motives must syntactically agree for rw to fire on Lean v4.29; this version
matches the motive produced by Tuple.cast.
Tuple.cast-flavoured variant of rewriting_append_right.
Helper lemmas for the Dedup case of rewriting_valid #
Folding addFn over a multiset of Sum.inr k values in T⊕K reduces to the
Multiset.sum in K, wrapped in Sum.inr.
Filtering ar.toComposite by "first-n columns match Sum.inl ∘ v" and projecting to the
last column yields the Sum.inr-wrapped annotations of the matching entries of ar.
The dedup of the first-n projection of ar.toComposite is the Sum.inl-image of the dedup
of the first-projection of ar.
Auxiliary: key set of groupByKey ar equals first-projection keys of ar.
Auxiliary: if (v, w) ∈ (groupByKey ar).val, then w is the semiring-sum of annotations
of entries in ar with key v.
groupByKey ar, as a multiset, equals the dedup of the first-projection of ar, with each
key paired with the semiring-sum of annotations sharing that key.
Helper lemmas for the Diff case of rewriting_valid #
Multiset.dedup only depends on the DecidableEq instance up to subsingleton equality.
Multiset.filter only depends on the DecidablePred instance up to subsingleton equality.
Folded Filter.And over a mapped list is equivalent to the universal conjunction.
The folded join condition (#k == #(k+n+1)) for k ∈ List.range n evaluates true iff the
tuple's values at indices ofNat k and ofNat (k+n+1) agree for every k < n.
Semiring-sum over the filter, via groupByKey.find?-based lookup.
Subtraction distributes over Sum.inr in T⊕K.
Non-dedup form of dedup_toComposite_proj_first_n: the first-n projection of
ar.toComposite is the Sum.inl-lift of the first-projection of ar.
Sum.inl-lift of tuples is injective.
Filtering by "not a member of an injective image" pulls through the map.
Helper: the data part Tuple.fromComposite and AnnotatedTuple.toComposite agree on data.
The annotation part of p.toComposite is Sum.inr p.2.
Roundtrip: Tuple.fromComposite ∘ AnnotatedTuple.toComposite = id. The
composite encoding loses no information: peeling the data columns and the
annotation column back out reconstructs the original annotated tuple.
Pushforward version of Tuple.fromComposite_toComposite: mapping
Tuple.fromComposite over a composite-encoded annotated relation recovers
the original annotated relation.
Reduction of the inner Dedup ∘ Diff ∘ Proj block of the Diff rewriting:
deduping the difference of first-n projections of AR₁.toComposite and AR₂.toComposite
yields the Sum.inl-lift of the deduped "unmatched-keys" filter over the data part.
Stated using Fin.castLE (function form) and dot notation (.dedup) so the LHS
pattern matches what simp only [evaluate] produces in the Diff case of
rewriting_valid.
Relation.cast rewrites to a Multiset.map of Tuple.cast.
selFilter on Tuple.cast h (Fin.append p q) characterises the first-n
projection equality between p and q.
Arity-(2n+2) analogue of cast_append_at_ofNat_left: reading
Tuple.cast h (Fin.append p q) at index Fin.ofNat _ k.val (for k : Fin n)
returns p (k.castLE (Nat.le_succ n)). Here q : Tuple α (n+1) (rather than
Tuple α n).
Arity-(2n+2) analogue of cast_append_at_ofNat_right: reading
Tuple.cast h (Fin.append p q) at index Fin.ofNat _ (k.val+n+1) (for
k : Fin n) returns q (k.castLE (Nat.le_succ n)).
Arity-(2n+2) projection helper: reading Tuple.cast h (Fin.append p q) at index
k.castLE _ (for k : Fin (n+1)) returns p k. This is the analogue of
proj_outer_cast_append_eq_fst for the 2n+2 case (i.e., q : Tuple α (n+1)).
Arity-(2n+2) analogue of selFilter_cast_append_iff: the join condition
on Tuple.cast h (Fin.append p q) with q : Tuple (T⊕K) (n+1) characterises
equality of the first-n projections of p and q.
Filter pushes through AnnotatedRelation.toComposite via the
Tuple.fromComposite ∘ AnnotatedTuple.toComposite = id roundtrip:
filtering before taking the composite encoding equals filtering the composite
encoding by the same predicate composed with Tuple.fromComposite.
Semijoin reduction. Given multisets r : Multiset α and s : Multiset β and
a key function g : α → β, with s Nodup, the projection-after-filter of the
cartesian product (keeping pairs whose g-image matches) coincides with filtering
r to those a whose g a belongs to s. This is the multiset version of the
relational semijoin and is the structural identity behind the unmatched_eq
half of the Diff-case rewriting correctness.
Keyed-projection semijoin. Generalizes Multiset.semijoin_proj_eq_filter in two
directions: the right multiset is the image S.map val of a Nodup keyset S under a
value function val : β → γ, and the projection is an arbitrary mk : α → γ → δ rather
than Prod.fst. The compatibility hypothesis h_val asserts that key_s ∘ val is the
identity on S (i.e., val reconstructs an element whose key_s-image is the original
key). The semijoin then reduces to filtering r by key_r a ∈ S and projecting through
mk a (val (key_r a)) (the unique matching γ-value). This is the structural identity
behind the matched_eq half of the Diff-case rewriting correctness.
The Agg of q.rewriting (the inner aggregation used in both the Dedup and
Diff rewritings) evaluates to a map over the deduped data-projection of the
inner annotated relation, with each row paired (via AnnotatedTuple.toComposite)
with the semiring sum of the matching annotations.
Instance-polymorphic restatement of Query.rewriting_valid_diff_inner_dd.
Inside the Diff case of rewriting_valid, Lean's instance synthesis picks
inconsistent DecidableEq (T⊕K) instances at different positions in the goal:
the inner Multiset.dedup is elaborated with LinearOrder.toDecidableEq (via
ValueType (T⊕K)), while the surrounding Multiset.filter's decidableMem
uses instDecidableEqSum. This wrapper accepts both as explicit parameters and
bridges to the canonical helper via Subsingleton.elim.
(R5) Rewriting of top-level aggregation #
The aggregation rewriting rule (R5) of Sen, Maniu & Senellart, ProvSQL:
γ_{i₁,…,iₘ}[t₁ : f₁, …, tₙ : fₙ](q)is rewritten toγ_{i₁,…,iₘ}[t₁ * #(k+1) : f̂₁, …, tₙ * #(k+1) : f̂ₙ, #(k+1) : δ(⊕)](q).
Unlike (R1)–(R4), which keep the rewriting target in Query (T ⊕ K) and
the standard evaluate semantics, (R5)'s rewritten query is interpreted
in the K-semimodule V_K – the per-column term t_j * #(k+1) evaluates
to a K-tensor monomial α ⊗ v_j, not to a plain T ⊕ K value. The
companion evaluator Query.evaluateInVK (in
Provenance.QueryEvaluateInVK) carries that interpretation.
Query.rewritingAgg here implements the rewriting syntactically as a
Query (T ⊕ K). Its semantic correctness – the analogue of rewriting_valid
stating that ⟪Agg ...⟫_Î matches evaluateInVK (rewritingAgg ...) Î.toComposite
– is proved as the (R5) case of Query.rewriting_valid_full (in
Provenance.QueryEvaluateInVK), packaged together with the (R1)–(R4)
correctness. The R4 sorries in Query.rewriting_valid for the diff
case are carried over there as the only remaining gap.
(R5) Top-level aggregation rewriting. Produces a plain Query (T ⊕ K)
representing γ_{is}[t_j * #(k+1) : f̂_j, #(k+1) : δ(⊕)](q.rewriting).
The inner query q is required to be noAgg (the ICDE paper imposes
aggregation-at-root); q.rewriting handles its (R1)–(R4) rewriting and
the resulting query operates on tuples of arity m+1 (the original m
data columns plus one annotation column). The output Agg has n₁
grouping columns, n₂+1 aggregated columns (the original n₂ plus the
δ(⊕) annotation column at the end), and arity n₁ + n₂ + 1.
Equations
- One or more equations did not get rendered due to their size.
Instances For
Unified rewriting for queries with at-most-top-level aggregation #
A query is well-formed for rewriting if it has no aggregation at all (corresponding to the (R1)–(R4) cases) or if it is a top-level aggregation whose inner query has no aggregation (the (R5) case). This is the structural precondition imposed by the ICDE paper, which restricts the aggregation operator to the root of a query plan.
Equations
- (Query.Agg a a_1 a_2 q_inner).wellFormed = q_inner.noAgg
- x✝.wellFormed = x✝.noAgg
Instances For
Unified rewriting: dispatches between the (R1)–(R4) rewriting for non-aggregating queries and the (R5) rewriting for top-level aggregations. The single function realises the rewriting rules (R1)–(R5) of Sen, Maniu & Senellart together.
Equations
- (Query.Agg is ts as q_inner).rewritingFull h = Query.rewritingAgg is ts as q_inner h
- (Query.Rel x s).rewritingFull h = (Query.Rel x s).rewriting h
- (Π ts q').rewritingFull h = (Π ts q').rewriting h
- (σ φ q').rewritingFull h = (σ φ q').rewriting h
- (q₁ × q₂).rewritingFull h = (q₁ × q₂).rewriting h
- (q₁ ⊎ q₂).rewritingFull h = (q₁ ⊎ q₂).rewriting h
- (ε q').rewritingFull h = (ε q').rewriting h
- (q₁ - q₂).rewritingFull h = (q₁ - q₂).rewriting h