Aggregation and Grouping
ProvSQL supports provenance tracking for GROUP BY queries and aggregate
functions [Amsterdamer et al., 2011]. The semantics follow a
semimodule model: aggregation is treated as a scalar multiplication of
provenance values.
GROUP BY Queries
When a query includes a GROUP BY clause, each output group receives an
agg gate in the provenance circuit. The children of this gate are the
provenance tokens of all input tuples that contributed to the group:
SELECT dept, COUNT(*), provenance()
FROM employees
GROUP BY dept;
The resulting provenance token encodes which input tuples were combined to produce each aggregate value.
SELECT DISTINCT
SELECT DISTINCT is modelled as a GROUP BY on all selected columns.
Each distinct output row gets a provenance token that captures all the
duplicate source rows that were merged:
SELECT DISTINCT dept, provenance()
FROM employees;
Aggregate Functions
The aggregate functions COUNT, SUM, MIN, MAX, and AVG
are all supported over provenance-tracked tables.
Arithmetic on Aggregate Results
Arithmetic, explicit casts, window functions, and other expressions
(COALESCE, GREATEST, etc.) on aggregate results are supported,
both in the same query and over subquery results:
SELECT dept, COUNT(*) * 10 FROM employees GROUP BY dept;
SELECT dept, SUM(salary) + 1000 FROM employees GROUP BY dept;
SELECT dept, string_agg(name, ', ') || ' (team)' FROM employees GROUP BY dept;
SELECT cnt::numeric FROM (SELECT COUNT(*) AS cnt FROM employees GROUP BY dept) t;
SELECT dept, cnt, SUM(cnt) OVER () FROM (SELECT dept, COUNT(*) AS cnt FROM employees GROUP BY dept) t;
SELECT dept, COALESCE(cnt, 0) FROM (SELECT dept, COUNT(*) AS cnt FROM employees GROUP BY dept) t;
SELECT dept, GREATEST(cnt, 3) FROM (SELECT dept, COUNT(*) AS cnt FROM employees GROUP BY dept) t;
When such an operation is performed, the aggregate result is cast from
its internal agg_token representation back to the original aggregate
return type (e.g., bigint for COUNT, numeric for AVG).
A warning is emitted to indicate that the provenance information is lost
in the conversion. The provenance of the aggregate group itself is still
tracked in the provsql column.
Random-Variable Aggregates
When the aggregated column has type random_variable
(see Continuous Distributions), three aggregates lift
the standard arithmetic aggregates to the distribution algebra:
sum, avg, and
product. Each returns a random_variable
rather than a scalar. See Aggregation Over Random Variables for the
semantics, empty-group identities, and worked examples.
HAVING
Simple HAVING clauses are supported:
SELECT dept, COUNT(*) AS n, provenance()
FROM employees
GROUP BY dept
HAVING COUNT(*) > 2;
HAVING clauses whose outcome is a deterministic scalar are also
supported, including conditions that wrap a random_variable
aggregate in a moment function such as
HAVING expected(avg(measurement)) > 20 (see
Continuous Distributions): the predicate is evaluated by
PostgreSQL on the surviving groups while ProvSQL still tracks the
per-group provenance.
Complex HAVING conditions that build a non-trivial expression on
top of an agg_token aggregate result (e.g., arithmetic across
multiple aggregates) are not fully supported and may produce
incorrect results or an error.
The choose Aggregate
The choose aggregate picks an arbitrary non-NULL value from a group.
It is particularly useful for modelling mutually exclusive choices
in a probabilistic setting: the provenance of the chosen value records
which input tuple was selected, enabling correct probability computation
over the choice.
SELECT city, choose(position) AS sample_position
FROM employees
GROUP BY city;
Grouping Sets
GROUPING SETS, CUBE, and ROLLUP are not supported.