Aggregation and Grouping
ProvSQL supports provenance tracking for GROUP BY queries and aggregate
functions [Amsterdamer et al., 2011]. The semantics follow a
semimodule model: aggregation is treated as a scalar multiplication of
provenance values.
GROUP BY Queries
When a query includes a GROUP BY clause, each output group receives an
agg gate in the provenance circuit. The children of this gate are the
provenance tokens of all input tuples that contributed to the group:
SELECT dept, COUNT(*), provenance()
FROM employees
GROUP BY dept;
The resulting provenance token encodes which input tuples were combined to produce each aggregate value.
SELECT DISTINCT
SELECT DISTINCT is modelled as a GROUP BY on all selected columns.
Each distinct output row gets a provenance token that captures all the
duplicate source rows that were merged:
SELECT DISTINCT dept, provenance()
FROM employees;
Aggregate Functions
The aggregate functions COUNT, SUM, MIN, MAX, and AVG
are all supported over provenance-tracked tables.
HAVING
Simple HAVING clauses are supported:
SELECT dept, COUNT(*) AS n, provenance()
FROM employees
GROUP BY dept
HAVING COUNT(*) > 2;
Complex HAVING conditions that involve provenance-tracked aggregates
(e.g., a HAVING on the result of a computation over an aggregate)
are not fully supported and may produce incorrect results or an error.
Grouping Sets
GROUPING SETS, CUBE, and ROLLUP are not supported.