Publications

Papers about ProvSQL

These papers describe ProvSQL itself or the theoretical foundations on which it is built, including its data model, query rewriting architecture, knowledge compilation approach, and the semiring provenance theory that motivates its design.

2026

Sen, A., Maniu, S., and Senellart, P. 2026. ProvSQL: A General System for Keeping Track of the Provenance and Probability of Data. Proc. IEEE 42nd International Conference on Data Engineering (ICDE), Montréal, Canada, May 2026.
The primary reference for the ProvSQL system. Presents the complete data model (multiset-based relational algebra with semiring provenance), the circuit-based provenance representation stored in memory-mapped files, and the full query evaluation architecture. Includes TPC-H-inspired benchmarks comparing ProvSQL to GProM and MayBMS, demonstrating competitive performance and broader SQL coverage.

2025

Widiaatmaja, A.A., Djeffal, B., Dandekar, A., and Senellart, P. 2025. Demonstration of ProvSQL Update Provenance through Temporal Databases. Proceedings of the Provenance Week 2025, PW’25, Berlin, Germany, June 22-27, 2025, ACM, 71–76.
Extends ProvSQL to track provenance for data modification operations (INSERT, UPDATE, DELETE) by introducing monus gates for deleted tuples and times gates for inserted ones. Demonstrates the utility of update provenance by implementing temporal database features – time travel, history tracking, and undo – using the union-of-intervals m-semiring.
Yunus, F., Karmakar, P., Senellart, P., Abdessalem, T., and Bressan, S. 2025. Using A Probabilistic Database in an Image Retrieval Application. Proceedings 28th International Conference on Extending Database Technology, EDBT 2025, Barcelona, Spain, March 25-28, 2025, OpenProceedings.org, 1106–1109.
Demonstrates ProvSQL as the probabilistic database backend for an image retrieval application. Images are indexed by uncertain feature vectors; queries return ranked results with tuple-independent probabilities computed via ProvSQL’s knowledge compilation pipeline. Shows that ProvSQL’s provenance-based probability management integrates naturally with a real-world multimedia retrieval use case.

2024

Senellart, P. 2024. On the Impact of Provenance Semiring Theory on the Design of a Provenance-Aware Database System. The Provenance of Elegance in Computation - Essays Dedicated to Val Tannen, Tannen’s Festschrift, University of Pennsylvania, Philadelphia, PA, USA, May 24-25, 2024, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 9:1–9:10.
A reflective essay on how the provenance semiring framework of Green, Karvounarakis, and Tannen directly shaped ProvSQL’s design. Discusses where theory translates cleanly into implementation (e.g., the universality of the polynomial semiring motivating UUID-based annotations), where SQL’s multiset semantics and aggregation required deviations, and where ProvSQL’s development still lags behind theory.
Karmakar, P., Monet, M., Senellart, P., and Bressan, S. 2024. Expected Shapley-Like Scores of Boolean Functions: Complexity and Applications to Probabilistic Databases. Proc. ACM Manag. Data 2, 2, 92.
Studies the complexity of expected Shapley and Banzhaf values for Boolean functions in probabilistic database settings, showing that their computation is interreducible in polynomial time with probabilistic query evaluation. Designs a polynomial-time algorithm for Boolean functions represented as d-DNNF circuits. Implements and experimentally validates this algorithm within the ProvSQL system, enabling Shapley value computation directly from provenance circuits.

2020

Amarilli, A., Capelli, F., Monet, M., and Senellart, P. 2020. Connecting Knowledge Compilation Classes and Width Parameters. Theory Comput. Syst. 64, 5, 861–914.
Establishes formal connections between knowledge compilation classes (OBDDs, d-DNNFs, SDDs…) and graph width parameters (treewidth, pathwidth, cliquewidth). Section 5.1 of this paper provides the algorithm directly implemented in ProvSQL’s internal tree-decomposition-based knowledge compiler (tdkc), which converts a provenance circuit into a d-DNNF for tractable probability evaluation.

2019

Maniu, S., Senellart, P., and Jog, S. 2019. An Experimental Study of the Treewidth of Real-World Graph Data. 22nd International Conference on Database Theory, ICDT 2019, March 26-28, 2019, Lisbon, Portugal, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 12:1–12:18.
An experimental study of treewidth lower and upper bounds on real-world graph data. ProvSQL follows its findings in choosing the treewidth heuristics of its bounded-treewidth machinery: the degeneracy lower bound (cheap to compute, used to abort a hopeless decomposition early) and the min-fill upper bound (whose elimination ordering builds the tree decomposition along which the knowledge compiler and the recursive-reachability route run).
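The two heuristics named above can be sketched in a few lines of Python. This is an illustrative sketch only, not ProvSQL's implementation: the function names `degeneracy` and `min_fill_width` and the adjacency-dictionary input format are assumptions made for the example.

```python
from itertools import combinations


def degeneracy(graph):
    """Degeneracy of an undirected graph, a lower bound on its treewidth.

    Repeatedly removes a minimum-degree vertex and records the largest
    degree seen at removal time.
    """
    adj = {v: set(ns) for v, ns in graph.items()}
    best = 0
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))
        best = max(best, len(adj[v]))
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return best


def fill_in(adj, v):
    """Number of edges that eliminating v would add between its neighbours."""
    return sum(1 for a, b in combinations(adj[v], 2) if b not in adj[a])


def min_fill_width(graph):
    """Width of the min-fill elimination ordering, an upper bound on treewidth.

    Returns (width, ordering). Each eliminated vertex, together with its
    neighbours at elimination time, forms one bag of a tree decomposition.
    """
    adj = {v: set(ns) for v, ns in graph.items()}
    width, order = 0, []
    while adj:
        v = min(adj, key=lambda u: fill_in(adj, u))
        neighbours = adj[v]
        width = max(width, len(neighbours))
        # Turn the neighbourhood of v into a clique, then remove v.
        for a, b in combinations(neighbours, 2):
            adj[a].add(b)
            adj[b].add(a)
        for u in neighbours:
            adj[u].discard(v)
        del adj[v]
        order.append(v)
    return width, order


# A 4-cycle has treewidth 2, and both bounds meet it.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
lower = degeneracy(cycle)
upper, ordering = min_fill_width(cycle)
```

When the lower bound already exceeds the width a caller is willing to handle, the more expensive elimination ordering need not be computed at all.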
Senellart, P. 2019. Provenance in Databases: Principles and Applications. Reasoning Web. Explainable Artificial Intelligence - 15th International Summer School 2019, Bolzano, Italy, September 20-24, 2019, Tutorial Lectures, Springer, 104–109.
A concise tutorial on database provenance covering Boolean provenance, semiring provenance, and key applications including probabilistic databases, view maintenance, and query explanation. Also surveys provenance beyond the relational setting (XML, graph, triple-store databases). Serves as an accessible introduction to the theoretical foundations underlying ProvSQL.

2018

Senellart, P., Jachiet, L., Maniu, S., and Ramusat, Y. 2018. ProvSQL: Provenance and Probability Management in PostgreSQL. Proc. VLDB Endow. 11, 12, 2034–2037.
The original demonstration paper introducing ProvSQL. Describes the core design: transparent query rewriting via a PostgreSQL planner hook, the provenance term algebra circuit as a uniform representation for semiring provenance, where-provenance, and m-semiring provenance, and knowledge compilation to d-DNNF for probabilistic query evaluation.

2017

Senellart, P. 2017. Provenance and Probabilities in Relational Databases. SIGMOD Rec. 46, 4, 5–15.
An accessible overview of provenance and probabilistic query evaluation in relational databases, covering the semiring framework, knowledge compilation to d-DNNF, and the connection to probabilistic databases. Serves as a concise introduction to the theoretical and practical ideas behind ProvSQL.

2016

Amarilli, A. 2016. Leveraging the Structure of Uncertain Data. https://theses.hal.science/tel-01345836.
Chapter 4 (cc/pcc-instances and joint width) is the theoretical basis of ProvSQL’s joint-width UCQ compiler: a fixed conjunctive query that is #P-hard under the Dalvi–Suciu dichotomy is still evaluated exactly in time linear in the data whenever the joint treewidth of the data and its correlation structure is bounded (Thm. 4.2.7, Prop. 4.2.11).

2015

Amarilli, A., Bourhis, P., and Senellart, P. 2015. Provenance Circuits for Trees and Treelike Instances. Automata, Languages, and Programming - 42nd International Colloquium, ICALP 2015, Kyoto, Japan, July 6-10, 2015, Proceedings, Part II, Springer, 56–68.
The provenance refinement of Courcelle’s theorem: for a fixed MSO query over an instance of bounded treewidth, a provenance circuit of treewidth independent of the instance size can be built in linear time along a tree decomposition of the data. ProvSQL’s recursive-reachability route implements this construction for two-terminal network reliability: the s-t reachability query is compiled along a tree decomposition of the edge relation’s graph into a d-D (a deterministic, decomposable circuit, not in negation normal form) of linear size, giving exact linear-time evaluation of a #P-hard problem on bounded-treewidth probabilistic graphs.

Foundational Works

These papers establish the theoretical framework on which ProvSQL is built, from provenance semirings and their extensions to the knowledge-compilation and probabilistic-database results its evaluation engine relies on.

2025

Lai, Y., Meel, K.S., and Yap, R.H.C. 2025. Panini: An Efficient and Flexible Knowledge Compiler. Computer Aided Verification - 37th International Conference, CAV 2025, Zagreb, Croatia, July 23-25, 2025, Proceedings, Part III, Springer, 92–105.
Introduces Panini, the knowledge-compilation tool of the KCBox toolbox developed by the Meel group. It compiles CNF formulas into one of five target languages: OBDD, OBDD[AND], Decision-DNNF, R2-D2 (a restricted Decision-DNNF), or CCDD. ProvSQL exposes the first three through the ’panini-obdd’ / ’panini-obdd-and’ / ’panini-decdnnf’ compiler options of probability_evaluate and the Studio compile dropdown; R2-D2 and CCDD are intentionally omitted because both emit kernelize nodes that break decomposability. Panini’s output is translated to standard d-DNNF form for probability evaluation.

2021

Lagniez, J.-M. and Marquis, P. 2021. About Caching in D4 2.0. Workshop on Counting and Sampling 2021.
Workshop note from the d4 authors describing d4 2.0 (a.k.a. d4v2): a rewrite of the d4 Decision-DNNF compiler with improved caching, a library-first architecture and tree-decomposition-guided branching heuristics by default. ProvSQL exposes d4v2 as the ’d4v2’ compiler option via probability_evaluate(..., ’compilation’, ’d4v2’) and the Studio compile dropdown.
Korhonen, T. and Järvisalo, M. 2021. Integrating Tree Decompositions into Decision Heuristics of Propositional Model Counters (Short Paper). 27th International Conference on Principles and Practice of Constraint Programming, CP 2021, Montpellier, France (Virtual Conference), October 25-29, 2021, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 8:1–8:11.
Introduces SharpSAT-TD: an exact (weighted) model counter built on SharpSAT that uses tree-decomposition-guided decision heuristics. Conceptually parallel to ProvSQL’s in-process tree-decomposition d-DNNF builder, but packaged as a stand-alone, state-of-the-art exact counter. ProvSQL invokes it via probability_evaluate(..., ’wmc’, ’sharpsat-td’), which requires the sharpsat-td and flow_cutter_pace17 binaries on PATH.

2020

Bourhis, P., Deutch, D., and Moskovitch, Y. 2020. Equivalence-Invariant Algebraic Provenance for Hyperplane Update Queries. Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, ACM, 415–429.
Introduces an algebraic provenance framework for hyperplane update queries (INSERT, DELETE, UPDATE via linear constraints), defining a semiring that tracks provenance through data-modification operations in an equivalence-invariant manner. This work provides the theoretical foundation for ProvSQL’s update provenance support.
Dudek, J.M., Phan, V.H.N., and Vardi, M.Y. 2020. DPMC: Weighted Model Counting by Dynamic Programming on Project-Join Trees. Principles and Practice of Constraint Programming - 26th International Conference, CP 2020, Louvain-la-Neuve, Belgium, September 7-11, 2020, Proceedings, Springer, 211–230.
Introduces DPMC: an exact weighted model counter that runs as a two-stage pipeline. A planner (htb) produces a project-join tree from the CNF; an executor (dmc) traverses the tree using Algebraic Decision Diagrams (ADDs) to compute the count. Algorithmically distinct from the search-based exact counters (Ganak / SharpSAT-TD), giving the benchmark table a third axis. ProvSQL exposes it via probability_evaluate(..., ’wmc’, ’dpmc’).

2019

Sharma, S., Roy, S., Soos, M., and Meel, K.S. 2019. GANAK: A Scalable Probabilistic Exact Model Counter. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, ijcai.org, 1169–1176.
Introduces Ganak, an exact (probabilistically-cached) model counter that won every track of the 2024 and 2025 Model Counting Competitions. Ganak reads weighted DIMACS in the MCC 2024 format (c p weight <lit> <w> 0 lines) and reports the count on c s exact ... lines. ProvSQL invokes Ganak via probability_evaluate(..., ’wmc’, ’ganak’), passing it the Tseytin-encoded CNF with one weight line appended for each input gate.

2017

Lagniez, J.-M. and Marquis, P. 2017. An Improved Decision-DNNF Compiler. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, ijcai.org, 667–673.
Introduces d4, a top-down CNF-to-Decision-DNNF compiler with dynamic hypergraph decomposition, component caching and conflict-driven backtracking. The paper unifies the descriptions of c2d, Dsharp and d4 as Decision-DNNF compilers and reports significant size and runtime improvements over its predecessors. ProvSQL uses d4 as the default external compiler behind probability_evaluate(..., ’compilation’) and as the final fallback of its compilation pipeline.
Lai, Y., Liu, D., and Yin, M. 2017. New Canonical Representations by Augmenting OBDDs with Conjunctive Decomposition. J. Artif. Intell. Res. 58, 453–521.
Introduces OBDD[AND] (and its smooth variant): augmenting ordered binary decision diagrams with explicit decomposable conjunction nodes. OBDDs are the canonical structured BDD class in the knowledge-compilation map; this paper defines a strict super-language that retains many of OBDDs’ tractability properties while admitting more compact circuits. ProvSQL exposes Panini’s OBDD and OBDD[AND] target languages via the ’panini-obdd’ / ’panini-obdd-and’ compiler options.

2015

Gatterbauer, W. and Suciu, D. 2015. Approximate Lifted Inference with Probabilistic Databases. Proc. VLDB Endow. 8, 5, 629–640.
Develops the dissociation framework for probabilistic query evaluation: dissociating tuples in an atom (replacing a single shared input with multiple independent copies) gives an upper bound on the query’s probability, and dissociating in a deterministic relation leaves the probability unchanged. The latter observation grounds ProvSQL’s deterministic-relation transparency pass: a relation whose rows carry no provenance can be made transparent for atom-set analysis without affecting soundness. The journal version (VLDB J. 26(1):31-59, 2017, doi:10.1007/s00778-016-0434-5) extends the framework with propagation.
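Both observations can be checked on a toy lineage formula. The sketch below is only an illustration under stated assumptions: the formula (x ∧ y) ∨ (x ∧ z), the variable names, and the brute-force `probability` helper are made up for the example and are not taken from the paper or from ProvSQL.

```python
from itertools import product


def probability(formula, probs):
    """Exact probability of a Boolean formula over independent variables,
    computed by enumerating all possible worlds (exponential, toy sizes only)."""
    names = sorted(probs)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if formula(world):
            weight = 1.0
            for name in names:
                weight *= probs[name] if world[name] else 1.0 - probs[name]
            total += weight
    return total


def original(w):
    # The input x is shared by both conjuncts.
    return (w["x"] and w["y"]) or (w["x"] and w["z"])


def dissociated(w):
    # x has been replaced by two independent copies x1 and x2.
    return (w["x1"] and w["y"]) or (w["x2"] and w["z"])


def compare(p_x, p_y=0.5, p_z=0.5):
    """Return (exact probability, probability after dissociating x)."""
    exact = probability(original, {"x": p_x, "y": p_y, "z": p_z})
    bound = probability(dissociated, {"x1": p_x, "x2": p_x, "y": p_y, "z": p_z})
    return exact, bound


uncertain = compare(0.5)      # (0.375, 0.4375): dissociation over-estimates
deterministic = compare(1.0)  # (0.75, 0.75): nothing changes when x is certain
```

The second call is the case the transparency pass relies on: when the dissociated input is certain, the copies are all true in every possible world, so the formula's probability is unchanged.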
Oztok, U. and Darwiche, A. 2015. A Top-Down Compiler for Sentential Decision Diagrams. Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, AAAI Press, 3141–3148.
Introduces miniC2D, a top-down CNF-to-SDD compiler. The paper isolates a subclass of SDDs called Decision-SDDs that admit a top-down compilation strategy analogous to that of c2d / Dsharp. ProvSQL invokes miniC2D as the structural-target external compiler in its probability evaluation pipeline.

2014

Deutch, D., Milo, T., Roy, S., and Tannen, V. 2014. Circuits for Datalog Provenance. Proc. 17th International Conference on Database Theory (ICDT), Athens, Greece, March 24-28, 2014, OpenProceedings.org, 201–212.
Shows how to represent provenance of Datalog queries compactly as arithmetic circuits rather than as (potentially exponential) polynomials or sets. Establishes circuit-based provenance as a practical representation for recursive queries, with efficient evaluation algorithms. ProvSQL’s circuit-based provenance representation for all gate types is directly inspired by this work, and it is the basis of ProvSQL’s absorptive provenance scheme: under provsql.provenance = ’absorptive’ a recursive query over cyclic data stops at the absorptive value fixpoint of this circuit semantics, where longer (cycle-revisiting) derivations are absorbed, instead of failing to converge.

2013

Souihli, A. and Senellart, P. 2013. Optimizing approximations of DNF query lineage in probabilistic XML. 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013, IEEE Computer Society, 721–732.
The ProApproX query processor: rather than committing to a single confidence-computation algorithm, it maintains a portfolio of exact and approximate methods (naive, Monte-Carlo / Karp-Luby style samplers, compilation) and a cost model that picks, per lineage formula and per requested (ε, δ) tolerance, the cheapest method expected to meet the bound. This portfolio + cost-model principle is the design behind ProvSQL’s probability-method catalog and its cost-based chooser: tolerance grants (exact / relative / additive) nest, and the chooser returns the cheapest admissible member.

2012

Dalvi, N.N. and Suciu, D. 2012. The Dichotomy of Probabilistic Inference for Unions of Conjunctive Queries. J. ACM 59, 6, 30:1–30:87.
Establishes the complexity dichotomy for probabilistic query evaluation over tuple-independent databases: every union of conjunctive queries is either in PTIME (computable by a “safe plan” using extensional probability propagation) or #P-hard. The safe class includes hierarchical CQs, which admit a read-once rewriting. ProvSQL’s safe-query rewriter (the provsql.boolean_provenance optimisation) implements the read-once rewriting for self-join-free hierarchical CQs and UCQs derived from this theory; circuits it emits are evaluated in linear time by the independent-probability method.
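To see why a read-once formula can be evaluated in linear time, consider the following sketch. It is an illustration only: the nested-tuple encoding, the function name `independent_probability`, and the example instance are assumptions, not ProvSQL's circuit format or API.

```python
def independent_probability(node, probs):
    """Probability of a read-once formula whose variables are independent.

    A node is ("var", name), ("and", child, ...) or ("or", child, ...).
    Because every variable occurs at most once, the children of any node
    mention disjoint variables and are therefore independent, so one
    bottom-up pass over the formula is enough.
    """
    kind = node[0]
    if kind == "var":
        return probs[node[1]]
    children = [independent_probability(child, probs) for child in node[1:]]
    if kind == "and":
        result = 1.0
        for p in children:
            result *= p
        return result
    if kind == "or":
        none_true = 1.0
        for p in children:
            none_true *= 1.0 - p
        return 1.0 - none_true
    raise ValueError(f"unknown node kind: {kind!r}")


# Lineage of the hierarchical query q :- R(x), S(x, y) over a hypothetical
# instance R = {a1, a2}, S = {(a1, b1), (a1, b2), (a2, b1)}, factored by x:
#   (r1 AND (s11 OR s12)) OR (r2 AND s21)
lineage = (
    "or",
    ("and", ("var", "r1"), ("or", ("var", "s11"), ("var", "s12"))),
    ("and", ("var", "r2"), ("var", "s21")),
)
probs = {name: 0.5 for name in ("r1", "r2", "s11", "s12", "s21")}
answer = independent_probability(lineage, probs)  # 0.53125
```

The same pass gives wrong answers on a formula that repeats a variable, which is why the rewriting into read-once form has to happen first.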
Jha, A.K. and Suciu, D. 2012. Probabilistic Databases with MarkoViews. Proc. VLDB Endow. 5, 11, 1160–1171.
Represents tuple correlations as MarkoViews – UCQ views that attach a weight to each output tuple (a weight above 1 makes the contributing tuples positively correlated, below 1 negatively, 1 independent, and 0 a hard constraint) – and reduces query evaluation on such a database to ordinary tuple-independent evaluation plus a single conditioning step (their Theorem 1): the answer probability is the original query conditioned on the event that no constraint is violated. This is the discrete precedent for ProvSQL’s conditioning operator: a denial constraint becomes a violation query W, and conditioning a query on its non-occurrence (Q | !W) is exactly the MarkoViews reduction.
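The conditioning step on its own amounts to P(Q | ¬W) = P(Q ∧ ¬W) / P(¬W). The sketch below illustrates just that step by brute force on a two-tuple toy database; it does not model MarkoView weights, and the helper names `probability` and `conditioned` are hypothetical rather than ProvSQL functions.

```python
from itertools import product


def probability(event, probs):
    """Exact probability of an event over independent Boolean variables,
    by enumerating all possible worlds (exponential, toy sizes only)."""
    names = sorted(probs)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if event(world):
            weight = 1.0
            for name in names:
                weight *= probs[name] if world[name] else 1.0 - probs[name]
            total += weight
    return total


def conditioned(query, violation, probs):
    """P(query | no violation) = P(query and not violation) / P(not violation)."""
    p_ok = probability(lambda w: not violation(w), probs)
    if p_ok == 0.0:
        raise ValueError("the constraint is violated in every possible world")
    p_joint = probability(lambda w: query(w) and not violation(w), probs)
    return p_joint / p_ok


# Two independent tuples a and b, each present with probability 0.5, and a
# denial constraint stating that they may not both be present.
probs = {"a": 0.5, "b": 0.5}
result = conditioned(
    query=lambda w: w["a"],
    violation=lambda w: w["a"] and w["b"],
    probs=probs,
)
# P(a and not b) / P(not (a and b)) = 0.25 / 0.75 = 1/3
```

Conditioning lowers the probability of a from 0.5 to 1/3 because the constraint removes one of the two worlds in which a is present.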
Muise, C.J., McIlraith, S.A., Beck, J.C., and Hsu, E.I. 2012. Dsharp: Fast d-DNNF Compilation with sharpSAT. Advances in Artificial Intelligence - 25th Canadian Conference on Artificial Intelligence, Canadian AI 2012, Toronto, ON, Canada, May 28-30, 2012. Proceedings, Springer, 356–361.
The Dsharp compiler: a top-down DPLL-style CNF to Decision-DNNF compiler built on the sharpSAT model counter, with component caching as the key efficiency mechanism. ProvSQL exposes Dsharp as one of the external compilers behind probability_evaluate(..., ’compilation’, ’dsharp’).

2011

Amsterdamer, Y., Deutch, D., and Tannen, V. 2011. Provenance for aggregate queries. Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011, June 12-16, 2011, Athens, Greece, ACM, 153–164.
Extends semiring provenance to SQL aggregate queries (COUNT, SUM, AVG, etc.), which fall outside the positive relational algebra covered by Green et al. Introduces a semimodule structure over semirings to handle aggregation provenance. ProvSQL’s support for provenance of GROUP BY queries and aggregation gates is grounded in this work.
Suciu, D., Olteanu, D., Ré, C., and Koch, C. 2011. Probabilistic Databases. Morgan & Claypool Publishers.
Textbook treatment of probabilistic databases. Chapter 4 covers the safe-plan framework and the FD-aware extensions (constant-selection FDs, primary-key FDs, deterministic-relation transparency, the FD-closure on the union-find) that ProvSQL’s safe-query rewriter implements. The textbook is the most accessible reference for the framework’s correctness arguments.
Jha, A.K. and Suciu, D. 2011. Knowledge Compilation Meets Database Theory: Compiling Queries to Decision Diagrams. Database Theory - ICDT 2011, 14th International Conference, Uppsala, Sweden, March 21-24, 2011, Proceedings, ACM, 162–173.
Maps the lineage of a UCQ to four compilation targets of strictly increasing power – read-once, OBDD, FBDD, d-DNNF – and shows that over UCQ (unlike self-join-free CQ, where Olteanu-Huang collapse them all to read-once) they form a strict hierarchy. Gives exact syntactic characterisations of the first two: UCQ(OBDD) is precisely the inversion-free queries (with a linear-size OBDD construction using a query-derived variable order), and UCQ(RO) the inversion-free queries in which every relation symbol occurs at most once. Relevant to ProvSQL because it explains why a knowledge compiler working on the materialized circuit alone cannot in general recover the tractability a query-level analysis exposes.
Darwiche, A. 2011. SDD: A New Canonical Representation of Propositional Knowledge Bases. IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011, IJCAI/AAAI, 819–826.
Introduces Sentential Decision Diagrams (SDDs), a canonical representation of propositional knowledge bases structured by a fixed v-tree. SDDs generalise OBDDs, support polynomial Apply, and underlie ProvSQL’s miniC2D-based compilation route. The Decision-SDD variant on which miniC2D is built is a structural specialisation of this language.

2010

Geerts, F. and Poggi, A. 2010. On database query languages for K-relations. J. Appl. Log. 8, 2, 173–185.
Introduces m-semirings (semirings equipped with a monus operator ⊖) to extend the provenance semiring framework to the full relational algebra including set difference and negation. Defines the corresponding notion of K-relations for m-semirings. ProvSQL’s handling of EXCEPT queries and its monus gate type are direct implementations of this framework.
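As a small illustration of the monus operator, the sketch below instantiates it in two familiar m-semirings and uses it to annotate a difference of K-relations, where each tuple t of R − S carries R(t) ⊖ S(t). The dictionary encoding of relations and the function names are assumptions made for the example; this is not how ProvSQL stores annotations.

```python
def monus_bool(a, b):
    """Monus in the Boolean m-semiring: a AND NOT b."""
    return a and not b


def monus_nat(a, b):
    """Monus in the natural numbers (bag semantics): truncated subtraction."""
    return max(0, a - b)


def difference(r, s, monus, zero):
    """Difference of two K-relations given as {tuple: annotation} dicts.

    Each tuple t of r is annotated with r[t] monus s[t], where tuples
    missing from s count as annotated with zero. Tuples whose resulting
    annotation is zero are dropped.
    """
    result = {}
    for t, annotation in r.items():
        value = monus(annotation, s.get(t, zero))
        if value != zero:
            result[t] = value
    return result


# Bag semantics: annotations are multiplicities, as in EXCEPT ALL.
bag = difference({"alice": 3, "bob": 1}, {"alice": 1, "bob": 2}, monus_nat, 0)
# bag == {"alice": 2}

# Set semantics: annotations are Booleans, as in EXCEPT.
boolean = difference({"alice": True, "bob": True}, {"bob": True}, monus_bool, False)
# boolean == {"alice": True}
```

In both cases a ⊖ b is the smallest c such that a ≤ b + c in the semiring's natural order, which is the defining property of monus.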
Olteanu, D., Huang, J., and Koch, C. 2010. Approximate confidence computation in probabilistic databases. Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA, IEEE Computer Society, 145–156.
Introduces the d-tree, an anytime decomposition for computing interval bounds on the probability of a propositional lineage formula: each node is an independent-or, independent-and, or Shannon-expansion of its sub-formula, and a cheap per-leaf bound (the “independent” heuristic of Fig. 3) is tightened by recursing into the widest interval until the additive or relative target is met. Bounds are valid at every node (Prop. 5.1/5.4), so the recursion can stop early with a certified interval. ProvSQL’s “d-tree” probability method implements this over the materialised circuit’s monotone-DNF lineage, memoised over the shared DAG so bounded-treewidth lineage stays polynomial; it serves the high-treewidth exact corner and the deterministic / low-δ approximate paths.

2009

Ré, C. and Suciu, D. 2009. The trichotomy of HAVING queries on a probabilistic database. VLDB J. 18, 5, 1091–1116.
Classifies the complexity of evaluating a HAVING predicate aggregate(y) θ k over a tuple-independent probabilistic database, for aggregate among EXISTS, MIN, MAX, COUNT, SUM, AVG, COUNT(DISTINCT) and θ a comparison. Exact evaluation is complement-symmetric and depends only on a per-aggregate “α-safety” plan property (equal to skeleton-safety for MIN/MAX/COUNT): PTIME if α-safe, else #P-hard. Approximation is direction-asymmetric (a relative-error FPTRAS for p is not one for 1 − p), yielding the trichotomy safe / apx-safe (an FPTRAS exists) / hazardous (no FPTRAS). The conference version (DBPL 2007) covers the exact dichotomy; this journal version adds the approximation results. ProvSQL’s closed-form HAVING-COUNT / MIN-MAX / SUM probability evaluators realise the α-safe (PTIME-exact) corner for independent contributors, and the skeleton-safety detector exposes the α-safety axis. See the probability-evaluation developer documentation for the per-(aggregate, θ) table.

2007

Green, T.J., Karvounarakis, G., and Tannen, V. 2007. Provenance semirings. Proceedings of the Twenty-Sixth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 11-13, 2007, Beijing, China, ACM, 31–40.
The foundational paper introducing provenance semirings. Defines annotated relations over commutative semirings and shows that the standard relational algebra operators correspond to semiring operations, unifying many existing provenance formalisms (why-provenance, lineage, trio, bag semantics) as instances of a single algebraic framework. ProvSQL’s core data model is a direct implementation of this theory.
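The unifying idea can be illustrated with a short sketch: one provenance expression, evaluated in different commutative semirings, yields different kinds of answers. The `Semiring` class, the nested-tuple encoding of expressions and the function `evaluate` are assumptions made for this example and do not reflect ProvSQL's internal representation.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

K = TypeVar("K")


@dataclass(frozen=True)
class Semiring(Generic[K]):
    """A commutative semiring given by its two constants and two operations."""
    zero: K
    one: K
    plus: Callable[[K, K], K]
    times: Callable[[K, K], K]


def evaluate(expr, semiring, valuation):
    """Evaluate a provenance expression in a semiring.

    An expression is ("var", name), ("plus", left, right) or
    ("times", left, right); valuation maps variable names to semiring values.
    """
    kind = expr[0]
    if kind == "var":
        return valuation[expr[1]]
    left = evaluate(expr[1], semiring, valuation)
    right = evaluate(expr[2], semiring, valuation)
    if kind == "plus":
        return semiring.plus(left, right)
    if kind == "times":
        return semiring.times(left, right)
    raise ValueError(f"unknown expression kind: {kind!r}")


BOOLEAN = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
COUNTING = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)

# Provenance of an output tuple obtained either by joining input tuples
# t1 and t2 (times) or directly from input tuple t3 (plus): t1 * t2 + t3.
expr = ("plus", ("times", ("var", "t1"), ("var", "t2")), ("var", "t3"))

# Counting semiring: multiplicity of the output under bag semantics.
multiplicity = evaluate(expr, COUNTING, {"t1": 2, "t2": 3, "t3": 1})  # 7

# Boolean semiring: is the output still present if t2 and t3 are deleted?
present = evaluate(expr, BOOLEAN, {"t1": True, "t2": False, "t3": False})  # False
```

Joins and unions map to times and plus respectively, so the expression is computed once and specialised afterwards by the choice of semiring and valuation.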
Dalvi, N.N. and Suciu, D. 2007. Efficient query evaluation on probabilistic databases. VLDB J. 16, 4, 523–544.
Introduces the safe-plan framework for probabilistic query evaluation: a syntactic characterisation of conjunctive queries whose probability is computable in PTIME by an extensional plan, together with the induced set of functional dependencies Γ_p(q) that lets the framework recognise more queries as safe. Constant selections induce empty-determinant FDs (used by ProvSQL’s constant-selection elimination) and primary keys induce schema FDs (used by the PK-FD pass). The textbook treatment is in Suciu, Olteanu, Ré and Koch 2011, chapter 4. ProvSQL’s FD-aware safe-query rewriter implements the project-safety condition of this paper, with the union-find-based hierarchicality check operating on the FD closure.

2006

Green, T.J. and Tannen, V. 2006. Models for Incomplete and Probabilistic Information. Current Trends in Database Technology - EDBT 2006, EDBT 2006 Workshops, Munich, Germany, March 26-31, 2006, Revised Selected Papers, Springer, 278–296.
Proposes a unified model for incomplete and probabilistic databases based on c-tables annotated with provenance expressions over a Boolean semiring. Shows how probabilistic query evaluation reduces to evaluating the provenance expression of a result tuple over a probability distribution. This connection between provenance and probabilistic databases is the theoretical basis for ProvSQL’s probability computation features.

2004

Darwiche, A. 2004. New Advances in Compiling CNF into Decomposable Negation Normal Form. Proceedings of the 16th European Conference on Artificial Intelligence, ECAI’2004, including Prestigious Applications of Intelligent Systems, PAIS 2004, Valencia, Spain, August 22-27, 2004, IOS Press, 328–332.
Introduces the c2d compiler, which translates CNF formulas into deterministic decomposable NNF (specifically, Decision-DNNF) under the guidance of a decomposition tree (dtree). ProvSQL invokes c2d as one of the external knowledge compilers behind probability_evaluate(..., ’compilation’, ’c2d’) and behind the Studio “Compile” feature.

2001

Buneman, P., Khanna, S., and Tan, W.C. 2001. Why and Where: A Characterization of Data Provenance. Database Theory - ICDT 2001, 8th International Conference, London, UK, January 4-6, 2001, Proceedings, Springer, 316–330.
The paper that introduced the concepts of why-provenance and where-provenance. Why-provenance tracks which source tuples contribute to a result; where-provenance tracks which specific attribute values (locations in the source) a result value was copied from. ProvSQL implements where-provenance via project and eq gates that record the column-level origin of each output value.
