Publications
Papers about ProvSQL
These papers describe ProvSQL itself or the theoretical foundations on which it is built, including its data model, query rewriting architecture, knowledge compilation approach, and the semiring provenance theory that motivates its design.
2026
-
Sen, A., Maniu, S., and Senellart, P. 2026. ProvSQL: A General System for Keeping Track of the Provenance
and Probability of Data. Proc. IEEE 42nd International Conference on Data Engineering (ICDE),
Montréal, Canada, May 2026.
PDF
Slides
Poster
BibTeX
The primary reference for the ProvSQL system. Presents the complete data model (multiset-based relational algebra with semiring provenance), the circuit-based provenance representation stored in memory-mapped files, and the full query evaluation architecture. Includes TPC-H-inspired benchmarks comparing ProvSQL to GProM and MayBMS, demonstrating competitive performance and broader SQL coverage.
2025
-
Widiaatmaja, A.A., Djeffal, B., Dandekar, A., and Senellart, P. 2025. Demonstration of ProvSQL Update Provenance through Temporal Databases. Proceedings of the Provenance Week 2025, PW’25, Berlin, Germany,
June 22-27, 2025, ACM, 71–76.
PDF
Poster
DOI
BibTeX
Extends ProvSQL to track provenance for data modification operations (INSERT, UPDATE, DELETE) by introducing monus gates for deleted tuples and times gates for inserted ones. Demonstrates the utility of update provenance by implementing temporal database features—time travel, history tracking, and undo—using the union-of-intervals m-semiring.
-
Yunus, F., Karmakar, P., Senellart, P., Abdessalem, T., and Bressan, S. 2025. Using A Probabilistic Database in an Image Retrieval Application. Proceedings 28th International Conference on Extending Database Technology,
EDBT 2025, Barcelona, Spain, March 25-28, 2025, OpenProceedings.org, 1106–1109.
PDF
Video
DOI
BibTeX
Demonstrates ProvSQL as the probabilistic database backend for an image retrieval application. Images are indexed by uncertain feature vectors; queries return ranked results with tuple-independent probabilities computed via ProvSQL’s knowledge compilation pipeline. Shows that ProvSQL’s provenance-based probability management integrates naturally with a real-world multimedia retrieval use case.
2024
-
Senellart, P. 2024. On the Impact of Provenance Semiring Theory on the Design of a
Provenance-Aware Database System. The Provenance of Elegance in Computation - Essays Dedicated to Val
Tannen, Tannen’s Festschrift, University of Pennsylvania, Philadelphia,
PA, USA, May 24-25, 2024, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 9:1–9:10.
PDF
Slides
DOI
BibTeX
A reflective essay on how the provenance semiring framework of Green, Karvounarakis, and Tannen directly shaped ProvSQL’s design. Discusses where theory translates cleanly into implementation (e.g., the universality of the polynomial semiring motivating UUID-based annotations), where SQL’s multiset semantics and aggregation required deviations, and where ProvSQL’s development still lags behind theory.
-
Karmakar, P., Monet, M., Senellart, P., and Bressan, S. 2024. Expected Shapley-Like Scores of Boolean Functions: Complexity and
Applications to Probabilistic Databases. Proc. ACM Manag. Data 2, 2, 92.
PDF
Slides
Poster
DOI
BibTeX
Studies the complexity of expected Shapley and Banzhaf values for Boolean functions in probabilistic database settings, showing that their computation is interreducible in polynomial time with probabilistic query evaluation. Designs a polynomial-time algorithm for Boolean functions represented as d-DNNF circuits. Implements and experimentally validates this algorithm within the ProvSQL system, enabling Shapley value computation directly from provenance circuits.
2020
-
Amarilli, A., Capelli, F., Monet, M., and Senellart, P. 2020. Connecting Knowledge Compilation Classes and Width Parameters. Theory Comput. Syst. 64, 5, 861–914.
PDF
DOI
BibTeX
Establishes formal connections between knowledge compilation classes (OBDDs, d-DNNFs, SDDs, ...) and graph width parameters (treewidth, pathwidth, cliquewidth). Section 5.1 of this paper provides the algorithm directly implemented in ProvSQL’s internal tree-decomposition-based knowledge compiler (tdkc), which converts a provenance circuit into a d-DNNF for tractable probability evaluation.
2019
-
Senellart, P. 2019. Provenance in Databases: Principles and Applications. Reasoning Web. Explainable Artificial Intelligence - 15th International
Summer School 2019, Bolzano, Italy, September 20-24, 2019, Tutorial
Lectures, Springer, 104–109.
PDF
Slides
DOI
BibTeX
A concise tutorial on database provenance covering Boolean provenance, semiring provenance, and key applications including probabilistic databases, view maintenance, and query explanation. Also surveys provenance beyond the relational setting (XML, graph, triple-store databases). Serves as an accessible introduction to the theoretical foundations underlying ProvSQL.
2018
-
Senellart, P., Jachiet, L., Maniu, S., and Ramusat, Y. 2018. ProvSQL: Provenance and Probability Management in PostgreSQL. Proc. VLDB Endow. 11, 12, 2034–2037.
PDF
Poster
Video
DOI
BibTeX
The original demonstration paper introducing ProvSQL. Describes the core design: transparent query rewriting via a PostgreSQL planner hook, the provenance term algebra circuit as a uniform representation for semiring provenance, where-provenance, and m-semiring provenance, and knowledge compilation to d-DNNF for probabilistic query evaluation.
2017
-
Senellart, P. 2017. Provenance and Probabilities in Relational Databases. SIGMOD Rec. 46, 4, 5–15.
PDF
DOI
BibTeX
An accessible overview of provenance and probabilistic query evaluation in relational databases, covering the semiring framework, knowledge compilation to d-DNNF, and the connection to probabilistic databases. Serves as a concise introduction to the theoretical and practical ideas behind ProvSQL.
Foundational Works
These papers establish the theoretical framework on which ProvSQL is built: why- and where-provenance, provenance semirings, aggregate provenance, m-semirings for non-monotone queries, the connection between provenance and probabilistic databases, circuit-based provenance representations, and algebraic provenance for update queries.
2020
-
Bourhis, P., Deutch, D., and Moskovitch, Y. 2020. Equivalence-Invariant Algebraic Provenance for Hyperplane Update Queries. Proceedings of the 2020 International Conference on Management of
Data, SIGMOD Conference 2020, online conference [Portland, OR, USA],
June 14-19, 2020, ACM, 415–429.
DOI
BibTeX
Introduces an algebraic provenance framework for hyperplane update queries (INSERT, DELETE, UPDATE via linear constraints), defining a semiring that tracks provenance through data-modification operations in an equivalence-invariant manner. This work provides the theoretical foundation for ProvSQL’s update provenance support.
2015
-
Gatterbauer, W. and Suciu, D. 2015. Approximate Lifted Inference with Probabilistic Databases. Proc. VLDB Endow. 8, 5, 629–640.
DOI
BibTeX
Develops the dissociation framework for probabilistic query evaluation: dissociating tuples in an atom (replacing a single shared input with multiple independent copies) gives an upper bound on the query’s probability, and dissociating in a deterministic relation leaves the probability unchanged. The latter observation grounds ProvSQL’s deterministic-relation transparency pass: a relation whose rows carry no provenance can be made transparent for atom-set analysis without affecting soundness. The journal version (VLDB J. 26(1):31-59, 2017, doi:10.1007/s00778-016-0434-5) extends the framework with propagation.
2014
-
Deutch, D., Milo, T., Roy, S., and Tannen, V. 2014. Circuits for Datalog Provenance. Proc. 17th International Conference on Database Theory (ICDT),
Athens, Greece, March 24-28, 2014, OpenProceedings.org, 201–212.
DOI
BibTeX
Shows how to represent provenance of Datalog queries compactly as arithmetic circuits rather than as (potentially exponential) polynomials or sets. Establishes circuit-based provenance as a practical representation for recursive queries, with efficient evaluation algorithms. ProvSQL’s circuit-based provenance representation for all gate types is directly inspired by this work.
2012
-
Dalvi, N.N. and Suciu, D. 2012. The Dichotomy of Probabilistic Inference for Unions of Conjunctive
Queries. J. ACM 59, 6, 30:1–30:87.
DOI
BibTeX
Establishes the complexity dichotomy for probabilistic query evaluation over tuple-independent databases: every union of conjunctive queries is either in PTIME (computable by a "safe plan" using extensional probability propagation) or #P-hard. The safe class includes hierarchical CQs, which admit a read-once rewriting. ProvSQL’s safe-query rewriter (the provsql.boolean_provenance optimisation) implements the read-once rewriting for self-join-free hierarchical CQs and UCQs derived from this theory; circuits it emits are evaluated in linear time by the independent-probability method.
2011
-
Amsterdamer, Y., Deutch, D., and Tannen, V. 2011. Provenance for aggregate queries. Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on
Principles of Database Systems, PODS 2011, June 12-16, 2011, Athens,
Greece, ACM, 153–164.
DOI
BibTeX
Extends semiring provenance to SQL aggregate queries (COUNT, SUM, AVG, etc.), which fall outside the positive relational algebra covered by Green et al. Introduces a semimodule structure over semirings to handle aggregation provenance. ProvSQL’s support for provenance of GROUP BY queries and aggregation gates is grounded in this work.
-
Suciu, D., Olteanu, D., Ré, C., and Koch, C. 2011. Probabilistic Databases. Morgan & Claypool Publishers.
DOI
BibTeX
Textbook treatment of probabilistic databases. Chapter 4 covers the safe-plan framework and the FD-aware extensions (constant-selection FDs, primary-key FDs, deterministic-relation transparency, the FD-closure on the union-find) that ProvSQL’s safe-query rewriter implements. The textbook is the most accessible reference for the framework’s correctness arguments.
2010
-
Geerts, F. and Poggi, A. 2010. On database query languages for K-relations. J. Appl. Log. 8, 2, 173–185.
DOI
BibTeX
Introduces m-semirings (semirings equipped with a monus operator ⊖) to extend the provenance semiring framework to the full relational algebra including set difference and negation. Defines the corresponding notion of K-relations for m-semirings. ProvSQL’s handling of EXCEPT queries and its monus gate type are direct implementations of this framework.
2007
-
Green, T.J., Karvounarakis, G., and Tannen, V. 2007. Provenance semirings. Proceedings of the Twenty-Sixth ACM SIGACT-SIGMOD-SIGART Symposium
on Principles of Database Systems, June 11-13, 2007, Beijing, China, ACM, 31–40.
DOI
BibTeX
The foundational paper introducing provenance semirings. Defines annotated relations over commutative semirings and shows that the standard relational algebra operators correspond to semiring operations, unifying many existing provenance formalisms (why-provenance, lineage, trio, bag semantics) as instances of a single algebraic framework. ProvSQL’s core data model is a direct implementation of this theory.
-
Dalvi, N.N. and Suciu, D. 2007. Efficient query evaluation on probabilistic databases. VLDB J. 16, 4, 523–544.
DOI
BibTeX
Introduces the safe-plan framework for probabilistic query evaluation: a syntactic characterisation of conjunctive queries whose probability is computable in PTIME by an extensional plan, together with the induced functional dependencies set Gamma_p(q) that lets the framework recognise more queries as safe. Constant selections induce empty-determinant FDs (used by ProvSQL’s constant-selection elimination) and primary keys induce schema FDs (used by the PK-FD pass). The textbook treatment is in Suciu, Olteanu, Re and Koch 2011, chapter 4. ProvSQL’s FD-aware safe-query rewriter implements the project-safety condition of this paper, with the union-find-based hierarchicality check operating on the FD closure.
2006
-
Green, T.J. and Tannen, V. 2006. Models for Incomplete and Probabilistic Information. Current Trends in Database Technology - EDBT 2006, EDBT 2006 Workshops,
Munich, Germany, March 26-31, 2006, Revised Selected Papers, Springer, 278–296.
DOI
BibTeX
Proposes a unified model for incomplete and probabilistic databases based on c-tables annotated with provenance expressions over a Boolean semiring. Shows how probabilistic query evaluation reduces to evaluating the provenance expression of a result tuple over a probability distribution. This connection between provenance and probabilistic databases is the theoretical basis for ProvSQL’s probability computation features.
2001
-
Buneman, P., Khanna, S., and Tan, W.C. 2001. Why and Where: A Characterization of Data Provenance. Database Theory - ICDT 2001, 8th International Conference, London,
UK, January 4-6, 2001, Proceedings, Springer, 316–330.
DOI
BibTeX
The paper that introduced the concepts of why-provenance and where-provenance. Why-provenance tracks which source tuples contribute to a result; where-provenance tracks which specific attribute values (locations in the source) a result value was copied from. ProvSQL implements where-provenance via project and eq gates that record the column-level origin of each output value.